26. Pandas DataFrame

By Bernd Klein. Last modified: 21 Feb 2024.

On this page ➤

Playing Pandas

The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.

Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data typye, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.

Here's a brief introduction to key concepts related to DataFrames:

Rows and Columns: A DataFrame consists of rows and columns. Rows are indexed by default starting from 0, while columns have labels that identify them. The labels are typically strings but can also be integers.
Data Types: Each column in a DataFrame can have its own data type. Common data types include integers, floats, strings, datetime objects, and categorical data.
Indexing and Selection: DataFrames support various methods for indexing and selecting data, such as indexing by label, integer-location based indexing, boolean indexing, and more.
Operations: DataFrames support a wide range of operations, including arithmetic operations, element-wise operations, aggregation functions (e.g., mean, sum), and statistical operations.
Missing Data Handling: DataFrames provide methods for handling missing data, such as dropping missing values or filling them with a specified value.
Merging and Concatenation: DataFrames can be merged or concatenated to combine data from different sources, either by rows or columns.
Grouping and Aggregation: DataFrames support grouping operations, allowing you to group data based on one or more columns and perform aggregation functions on each group.
Visualization: DataFrames can be easily visualized using built-in plotting functions in pandas or by integrating with other visualization libraries like Matplotlib or Seaborn.

Overall, DataFrames are a powerful tool for data analysis and manipulation, providing a flexible and intuitive way to work with structured data in Python.

Connection between DataFrames and Series Objects

There is a close connection between the DataFrames and the Series of Pandas. A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame.

We will demonstrate this in the following example.

We define the following three Series:

import pandas as pd

years = range(2014, 2018)

shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)

What happens, if we concatenate these "shop" Series? Pandas provides a concat function for this purpose:

pd.concat([shop1, shop2, shop3])

OUTPUT:

2014    2409.14
2015    2941.01
2016    3496.83
2017    3119.55
2014    1203.45
2015    3441.62
2016    3007.83
2017    3619.53
2014    3412.12
2015    3491.16
2016    3457.19
2017    1963.10
dtype: float64

This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1":

shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df

	0	1	2
2014	2409.14	1203.45	3412.12
2015	2941.01	3441.62	3491.16
2016	3496.83	3007.83	3457.19
2017	3119.55	3619.53	1963.10

In this example, each column in shops_df represents a shop, and each cell contains the corresponding value from the original shop Series object. Each column is still a Pandas Series object, with its own index and values, whereas the whole structure is now a DataFrame object:

print(type(shops_df))
print(type(shops_df[0]))

OUTPUT:

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Let's do some fine sanding by giving names to the columns:

cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities 
print(shops_df)

# alternative way: give names to series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"

print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)

OUTPUT:

       Zürich  Winterthur  Freiburg
2014  2409.14     1203.45   3412.12
2015  2941.01     3441.62   3491.16
2016  3496.83     3007.83   3457.19
2017  3119.55     3619.53   1963.10
------
       Zürich  Winterthur  Freiburg
2014  2409.14     1203.45   3412.12
2015  2941.01     3441.62   3491.16
2016  3496.83     3007.83   3457.19
2017  3119.55     3619.53   1963.10

This was nice, but what kind of data type is our result?

print(type(shops_df))

OUTPUT:

<class 'pandas.core.frame.DataFrame'>

This means, we can arrange or concat Series into DataFrames!

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

DataFrames from Dictionaries

A DataFrame has a row and column index; it's like a dict of Series with a common index.

cities = {"name": ["London", "Berlin", "Madrid", "Rome", 
                   "Paris", "Vienna", "Bucharest", "Hamburg", 
                   "Budapest", "Warsaw", "Barcelona", 
                   "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 1803425, 1760433,
                         1754000, 1740119, 1602386, 1493900,
                         1350680],
          "country": ["England", "Germany", "Spain", "Italy",
                      "France", "Austria", "Romania", 
                      "Germany", "Hungary", "Poland", "Spain",
                      "Germany", "Italy"]}

city_frame = pd.DataFrame(cities)
city_frame

	name	population	country
0	London	8615246	England
1	Berlin	3562166	Germany
2	Madrid	3165235	Spain
3	Rome	2874038	Italy
4	Paris	2273305	France
5	Vienna	1805681	Austria
6	Bucharest	1803425	Romania
7	Hamburg	1760433	Germany
8	Budapest	1754000	Hungary
9	Warsaw	1740119	Poland
10	Barcelona	1602386	Spain
11	Munich	1493900	Germany
12	Milan	1350680	Italy

Retrieving the Column Names

It's possible to get the names of the columns as a list:

city_frame.columns.values

OUTPUT:

array(['name', 'population', 'country'], dtype=object)

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Python Intensive Course

10 Mar 2025 to 14 Mar 2025
07 Apr 2025 to 11 Apr 2025
23 Jun 2025 to 27 Jun 2025
28 Jul 2025 to 01 Aug 2025

Data Analysis with Python

12 Mar 2025 to 14 Mar 2025
09 Apr 2025 to 11 Apr 2025
04 Jun 2025 to 06 Jun 2025
30 Jul 2025 to 01 Aug 2025

Efficient Data Analysis with Pandas

10 Mar 2025 to 11 Mar 2025
07 Apr 2025 to 08 Apr 2025
02 Jun 2025 to 03 Jun 2025
23 Jun 2025 to 24 Jun 2025
28 Jul 2025 to 29 Jul 2025

Python Text Processing Course

09 Apr 2025 to 11 Apr 2025
04 Jun 2025 to 06 Jun 2025

Overview of Python training courses

Machine Learning with Python training courses

Custom Index

We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a custom index to the DataFrame object:

ordinals = ["first", "second", "third", "fourth",
            "fifth", "sixth", "seventh", "eigth",
            "ninth", "tenth", "eleventh", "twelvth",
            "thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame

	name	population	country
first	London	8615246	England
second	Berlin	3562166	Germany
third	Madrid	3165235	Spain
fourth	Rome	2874038	Italy
fifth	Paris	2273305	France
sixth	Vienna	1805681	Austria
seventh	Bucharest	1803425	Romania
eigth	Hamburg	1760433	Germany
ninth	Budapest	1754000	Hungary
tenth	Warsaw	1740119	Poland
eleventh	Barcelona	1602386	Spain
twelvth	Munich	1493900	Germany
thirteenth	Milan	1350680	Italy

Rearranging the Order of Columns

We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This makes also sure that we will have a defined ordering of our columns, if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:

city_frame = pd.DataFrame(cities,
                          columns=["name", 
                                   "country", 
                                   "population"])
city_frame

	name	country	population
0	London	England	8615246
1	Berlin	Germany	3562166
2	Madrid	Spain	3165235
3	Rome	Italy	2874038
4	Paris	France	2273305
5	Vienna	Austria	1805681
6	Bucharest	Romania	1803425
7	Hamburg	Germany	1760433
8	Budapest	Hungary	1754000
9	Warsaw	Poland	1740119
10	Barcelona	Spain	1602386
11	Munich	Germany	1493900
12	Milan	Italy	1350680

We change both the column order and the ordering of the index with the function reindex with the following code:

city_frame.reindex(index=[0, 2, 4, 6,  8, 10, 12, 1, 3, 5, 7, 9, 11], 
                   columns=['country', 'name', 'population'])

	country	name	population
0	England	London	8615246
2	Spain	Madrid	3165235
4	France	Paris	2273305
6	Romania	Bucharest	1803425
8	Hungary	Budapest	1754000
10	Spain	Barcelona	1602386
12	Italy	Milan	1350680
1	Germany	Berlin	3562166
3	Italy	Rome	2874038
5	Austria	Vienna	1805681
7	Germany	Hamburg	1760433
9	Poland	Warsaw	1740119
11	Germany	Munich	1493900

Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions

(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)

We will rename the columns of our DataFrame into Romanian names in the following example. We set the parameter inplace to True so that our DataFrame will be changed instead of returning a new DataFrame, if inplace is set to False, which is the default!

city_frame.rename(columns={"name":"Soyadı", 
                           "country":"Ülke", 
                           "population":"Nüfus"},
                 inplace=True)
city_frame

	Soyadı	Ülke	Nüfus
0	London	England	8615246
1	Berlin	Germany	3562166
2	Madrid	Spain	3165235
3	Rome	Italy	2874038
4	Paris	France	2273305
5	Vienna	Austria	1805681
6	Bucharest	Romania	1803425
7	Hamburg	Germany	1760433
8	Budapest	Hungary	1754000
9	Warsaw	Poland	1740119
10	Barcelona	Spain	1602386
11	Munich	Germany	1493900
12	Milan	Italy	1350680

Existing Column as the Index of a DataFrame

We want to create a more useful index in the following example. We will use the country name as the index, i.e. the list value associated to the key "country" of our cities dictionary:

city_frame = pd.DataFrame(cities,
                          columns=["name", "population"],
                          index=cities["country"])

city_frame

	name	population
England	London	8615246
Germany	Berlin	3562166
Spain	Madrid	3165235
Italy	Rome	2874038
France	Paris	2273305
Austria	Vienna	1805681
Romania	Bucharest	1803425
Germany	Hamburg	1760433
Hungary	Budapest	1754000
Poland	Warsaw	1740119
Spain	Barcelona	1602386
Germany	Munich	1493900
Italy	Milan	1350680

Alternatively, we can change an existing DataFrame. We can us the method set_index to turn a column into an index. "set_index" does not work in-place, it returns a new data frame with the chosen column as the index:

city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)

OUTPUT:

              name  population
country                       
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680

We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:

city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)

OUTPUT:

              name  population
country                       
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

Accessing Rows via Index Values

So far we have accessed DataFrames via the columns. It is often necessary to select certain rows via the index names. We will demonstrate now, how we can access rows from DataFrames via the locators 'loc' and 'iloc'. We will not cover 'ix' because it is deprecated and will be removed in the future.

We select all the German cities in the following example by using 'loc'. The result is a DataFrame:

city_frame = pd.DataFrame(cities, 
                          columns=("name", "population"), 
                          index=cities["country"])
print(city_frame.loc["Germany"])

OUTPUT:

            name  population
Germany   Berlin     3562166
Germany  Hamburg     1760433
Germany   Munich     1493900

It is also possible to simultaneously extracting rows by chosen more than on index labels. To do this we use a list of indices:

print(city_frame.loc[["Germany", "France"]])

OUTPUT:

            name  population
Germany   Berlin     3562166
Germany  Hamburg     1760433
Germany   Munich     1493900
France     Paris     2273305

We will also need to select pandas DataFrame rows based on conditions, which are applied to column values. We can use the operators '>', '=', '=', '<=', '!=' for this purpose. We select all cities with a population of more than two million in the following example:

condition = city_frame.population>2000000
condition

OUTPUT:

England     True
Germany     True
Spain       True
Italy       True
France      True
Austria    False
Romania    False
Germany    False
Hungary    False
Poland     False
Spain      False
Germany    False
Italy      False
Name: population, dtype: bool

We can use this Boolean DataFrame condition with locto finally create the selection:

print(city_frame.loc[condition])

OUTPUT:

           name  population
England  London     8615246
Germany  Berlin     3562166
Spain    Madrid     3165235
Italy      Rome     2874038
France    Paris     2273305

It is also possible to logically combine more than one condition with & and |:

condition1 = (city_frame.population>1500000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 & condition2])

OUTPUT:

            name  population
Italy       Rome     2874038
Germany  Hamburg     1760433

We use a logical or | in the following example to see all cities of the Pandas DataFrame, where either the city name contains the letter 'm' or the population number is greater than three million:

condition1 = (city_frame.population>3000000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 | condition2])

OUTPUT:

            name  population
England   London     8615246
Germany   Berlin     3562166
Spain     Madrid     3165235
Italy       Rome     2874038
Germany  Hamburg     1760433

Adding Rows to a DataFrame

milan = ['Milan', 1399860]
city_frame.iloc[-1] = milan
city_frame.loc['Switzerland'] = ['Zurich', 415215]
city_frame

	name	population
England	London	8615246
Germany	Berlin	3562166
Spain	Madrid	3165235
Italy	Rome	2874038
France	Paris	2273305
Austria	Vienna	1805681
Romania	Bucharest	1803425
Germany	Hamburg	1760433
Hungary	Budapest	1754000
Poland	Warsaw	1740119
Spain	Barcelona	1602386
Germany	Munich	1493900
Italy	Milan	1399860
Switzerland	Zurich	415215

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

Accessing Rows by Position

The iloc method of a Pandas DataFrame object can be used to select rows and columns by number, i.e. in the order that they appear in the data frame. iloc allows selections of the rows, as if they were numbered by integers 0, 1, 2, ....

We demonstrate this in the following example:

df = city_frame.iloc[3]
print(df)

OUTPUT:

name             Rome
population    2874038
Name: Italy, dtype: object

To get a DataFrame with selected rows by numbers, we use a list of integers. We can see that we can change the order of the rows and we are also able to select rows multiple times:

df = city_frame.iloc[[3, 2, 0, 5, 0]]
print(df)

OUTPUT:

           name  population
Italy      Rome     2874038
Spain    Madrid     3165235
England  London     8615246
Austria  Vienna     1805681
England  London     8615246

Sum and Cumulative Sum

The DataFrame object of Pandas provides a method to sum both columns and rows. Before we will explain the usage of the sum method, we will create a new DataFrame object on which we will apply our examples. We will start by creating an empty DataFrame without columns but an index. We populate this DataFrame by adding columns with random values:

import numpy as np

years = range(2014, 2019)
cities = ["Zürich", "Freiburg", "München", "Konstanz", "Saarbrücken"]
shops = pd.DataFrame(index=years)
for city in cities:
    shops.insert(loc=len(shops.columns),
                 column=city,
                 value=(np.random.uniform(0.7, 1, (5,)) * 1000).round(2))
    
print(shops)

OUTPUT:

      Zürich  Freiburg  München  Konstanz  Saarbrücken
2014  734.01    791.84   813.07    806.50       913.84
2015  991.69    906.50   939.86    832.32       951.74
2016  945.20    766.72   956.92    744.36       800.71
2017  824.21    856.26   745.96    790.43       956.18
2018  917.76    923.75   732.05    992.13       879.36

Let's apply sum to the DataFrame shops:

shops.sum()

OUTPUT:

Zürich         4412.87
Freiburg       4245.07
München        4187.86
Konstanz       4165.74
Saarbrücken    4501.83
dtype: float64

We can see that it summed up all the columns of our DataFrame. What about calculating the sum of the rows? We can do this by using the axis parameter of sum.

shops.sum(axis=1)

OUTPUT:

2014    4059.26
2015    4622.11
2016    4213.91
2017    4173.04
2018    4445.05
dtype: float64

You only want to the the sums for the first, third and the last column and for all the years:

s = shops.iloc[:, [0, 2, -1]]
print(s)
print("and now the sum:")
print(s.sum())

OUTPUT:

      Zürich  München  Saarbrücken
2014  734.01   813.07       913.84
2015  991.69   939.86       951.74
2016  945.20   956.92       800.71
2017  824.21   745.96       956.18
2018  917.76   732.05       879.36
and now the sum:
Zürich         4412.87
München        4187.86
Saarbrücken    4501.83
dtype: float64

Of course, you could have also have achieved it in the following way, if the column names are known:

shops[["Zürich", "München", "Saarbrücken"]].sum()

OUTPUT:

Zürich         4412.87
München        4187.86
Saarbrücken    4501.83
dtype: float64

We can use "cumsum" to calculate the cumulative sum over the years:

x = shops.cumsum()
print(x)

OUTPUT:

       Zürich  Freiburg  München  Konstanz  Saarbrücken
2014   734.01    791.84   813.07    806.50       913.84
2015  1725.70   1698.34  1752.93   1638.82      1865.58
2016  2670.90   2465.06  2709.85   2383.18      2666.29
2017  3495.11   3321.32  3455.81   3173.61      3622.47
2018  4412.87   4245.07  4187.86   4165.74      4501.83

Using the keyword parameter axis with the value 1, we can build the cumulative sum over the rows:

x = shops.cumsum(axis=1)
print(x)

OUTPUT:

      Zürich  Freiburg  München  Konstanz  Saarbrücken
2014  734.01   1525.85  2338.92   3145.42      4059.26
2015  991.69   1898.19  2838.05   3670.37      4622.11
2016  945.20   1711.92  2668.84   3413.20      4213.91
2017  824.21   1680.47  2426.43   3216.86      4173.04
2018  917.76   1841.51  2573.56   3565.69      4445.05

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

Assigning New Columns

We will assign a new column to a DataFrame. We will add the cumulative population sum as a new column with the name "cum_population":

cities = {"name": ["London", "Berlin", "Madrid", "Rome", 
                   "Paris", "Vienna", "Bucharest", "Hamburg", 
                   "Budapest", "Warsaw", "Barcelona", 
                   "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 1803425, 1760433,
                         1754000, 1740119, 1602386, 1493900,
                         1350680],
          "country": ["England", "Germany", "Spain", "Italy",
                      "France", "Austria", "Romania", 
                      "Germany", "Hungary", "Poland", "Spain",
                      "Germany", "Italy"]}

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "population",
                                   "cum_population"],
                          index=cities["name"])

city_frame

	country	population	cum_population
London	England	8615246	NaN
Berlin	Germany	3562166	NaN
Madrid	Spain	3165235	NaN
Rome	Italy	2874038	NaN
Paris	France	2273305	NaN
Vienna	Austria	1805681	NaN
Bucharest	Romania	1803425	NaN
Hamburg	Germany	1760433	NaN
Budapest	Hungary	1754000	NaN
Warsaw	Poland	1740119	NaN
Barcelona	Spain	1602386	NaN
Munich	Germany	1493900	NaN
Milan	Italy	1350680	NaN

We can see that the column "cum_population" is set to Nan, as we haven't provided any data for it.

We will assign now the cumulative sums to this column:

city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame

	country	population	cum_population
London	England	8615246	8615246
Berlin	Germany	3562166	12177412
Madrid	Spain	3165235	15342647
Rome	Italy	2874038	18216685
Paris	France	2273305	20489990
Vienna	Austria	1805681	22295671
Bucharest	Romania	1803425	24099096
Hamburg	Germany	1760433	25859529
Budapest	Hungary	1754000	27613529
Warsaw	Poland	1740119	29353648
Barcelona	Spain	1602386	30956034
Munich	Germany	1493900	32449934
Milan	Italy	1350680	33800614

We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN:

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "area",
                                   "population"],
                          index=cities["name"])
print(city_frame)

OUTPUT:

           country area  population
London     England  NaN     8615246
Berlin     Germany  NaN     3562166
Madrid       Spain  NaN     3165235
Rome         Italy  NaN     2874038
Paris       France  NaN     2273305
Vienna     Austria  NaN     1805681
Bucharest  Romania  NaN     1803425
Hamburg    Germany  NaN     1760433
Budapest   Hungary  NaN     1754000
Warsaw      Poland  NaN     1740119
Barcelona    Spain  NaN     1602386
Munich     Germany  NaN     1493900
Milan        Italy  NaN     1350680

Accessing the Columns of a DataFrame

There are two ways to access a column of a DataFrame. The result is in both cases a Series:

# in a dictionary-like way:
print(city_frame["population"])

OUTPUT:

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

# as an attribute
print(city_frame.population)

OUTPUT:

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

print(type(city_frame.population))

OUTPUT:

<class 'pandas.core.series.Series'>

city_frame.population

OUTPUT:

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

From the previous example, we can see that we have not copied the population column. "p" is a view on the data of city_frame.

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

Assigning New Values to a Column

The column area is still not defined. We can set all elements of the column to the same value:

city_frame["area"] = 1572
print(city_frame)

OUTPUT:

           country  area  population
London     England  1572     8615246
Berlin     Germany  1572     3562166
Madrid       Spain  1572     3165235
Rome         Italy  1572     2874038
Paris       France  1572     2273305
Vienna     Austria  1572     1805681
Bucharest  Romania  1572     1803425
Hamburg    Germany  1572     1760433
Budapest   Hungary  1572     1754000
Warsaw      Poland  1572     1740119
Barcelona    Spain  1572     1602386
Munich     Germany  1572     1493900
Milan        Italy  1572     1350680

In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs to have the same length as the number of rows in our DataFrame.

# area in square km:
area = [1572, 891.85, 605.77, 1285, 
        105.4, 414.6, 228, 755, 
        525.2, 517, 101.9, 310.4, 
        181.8]
# area could have been designed as a list, a Series, an array or a scalar   

city_frame["area"] = area
print(city_frame)

OUTPUT:

           country     area  population
London     England  1572.00     8615246
Berlin     Germany   891.85     3562166
Madrid       Spain   605.77     3165235
Rome         Italy  1285.00     2874038
Paris       France   105.40     2273305
Vienna     Austria   414.60     1805681
Bucharest  Romania   228.00     1803425
Hamburg    Germany   755.00     1760433
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Barcelona    Spain   101.90     1602386
Munich     Germany   310.40     1493900
Milan        Italy   181.80     1350680

Sorting DataFrames

Let's sort our DataFrame according to the city area:

city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)

OUTPUT:

           country     area  population
London     England  1572.00     8615246
Rome         Italy  1285.00     2874038
Berlin     Germany   891.85     3562166
Hamburg    Germany   755.00     1760433
Madrid       Spain   605.77     3165235
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Vienna     Austria   414.60     1805681
Munich     Germany   310.40     1493900
Bucharest  Romania   228.00     1803425
Milan        Italy   181.80     1350680
Paris       France   105.40     2273305
Barcelona    Spain   101.90     1602386

Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well:

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "area",
                                   "population"],
                          index=cities["name"])

some_areas = pd.Series([1572, 755, 181.8], 
                    index=['London', 'Hamburg', 'Milan'])

city_frame['area'] = some_areas
print(city_frame)

OUTPUT:

           country    area  population
London     England  1572.0     8615246
Berlin     Germany     NaN     3562166
Madrid       Spain     NaN     3165235
Rome         Italy     NaN     2874038
Paris       France     NaN     2273305
Vienna     Austria     NaN     1805681
Bucharest  Romania     NaN     1803425
Hamburg    Germany   755.0     1760433
Budapest   Hungary     NaN     1754000
Warsaw      Poland     NaN     1740119
Barcelona    Spain     NaN     1602386
Munich     Germany     NaN     1493900
Milan        Italy   181.8     1350680

insert(self, loc, column, value, allow_duplicates=False)`

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "population"],
                          index=cities["name"])


idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame

	country	area	population
London	England	1572.00	8615246
Berlin	Germany	891.85	3562166
Madrid	Spain	605.77	3165235
Rome	Italy	1285.00	2874038
Paris	France	105.40	2273305
Vienna	Austria	414.60	1805681
Bucharest	Romania	228.00	1803425
Hamburg	Germany	755.00	1760433
Budapest	Hungary	525.20	1754000
Warsaw	Poland	517.00	1740119
Barcelona	Spain	101.90	1602386
Munich	Germany	310.40	1493900
Milan	Italy	181.80	1350680

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Overview of Python training courses

Machine Learning with Python training courses

Creating a DataFrame by appending Rows

import pandas as pd
from numpy.random import randint

df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
for i in range(5):
     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
    
df

	lib	qty1	qty2
0	name0	1	4
1	name1	6	5
2	name2	4	2
3	name3	1	9
4	name4	3	1

DataFrame from Nested Dictionaries

A nested dictionary of dicts can be passed to a DataFrame as well. The indices of the outer dictionary are taken as the the columns and the inner keys. i.e. the keys of the nested dictionaries, are used as the row indices:

growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
          "Germany": {"2010": 4.1, "2011": 3.6, "2012":	0.4, "2013": 0.1},
          "France": {"2010":2.0,  "2011":2.1, "2012": 0.3, "2013": 0.3},
          "Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013":	-3.3},
          "Italy": {"2010":1.7, "2011":	0.6, "2012":-2.3, "2013":-1.9}
          }

growth_frame = pd.DataFrame(growth)
growth_frame

	Switzerland	Germany	France	Greece	Italy
2010	3.0	4.1	2.0	-5.4	1.7
2011	1.8	3.6	2.1	-8.9	0.6
2012	1.1	0.4	0.3	-6.6	-2.3
2013	1.9	0.1	0.3	-3.3	-1.9

You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:

growth_frame.T

	2010	2011	2012	2013
Switzerland	3.0	1.8	1.1	1.9
Germany	4.1	3.6	0.4	0.1
France	2.0	2.1	0.3	0.3
Greece	-5.4	-8.9	-6.6	-3.3
Italy	1.7	0.6	-2.3	-1.9

growth_frame = growth_frame.T

growth_frame2 = growth_frame.reindex(["Switzerland", 
                                      "Italy", 
                                      "Germany", 
                                      "Greece"])
print(growth_frame2)

OUTPUT:

             2010  2011  2012  2013
Switzerland   3.0   1.8   1.1   1.9
Italy         1.7   0.6  -2.3  -1.9
Germany       4.1   3.6   0.4   0.1
Greece       -5.4  -8.9  -6.6  -3.3

Filling a DataFrame with random values:

import numpy as np

names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
         "April", "May", "June",
         "July", "August", "September",
         "October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2),
                  columns=names,
                  index=index)
df

	Frank	Eve	Stella	Guido	Lara
January	-1312.08	232.99	1686.19	-1182.34	-1916.60
February	676.24	-1226.14	-1175.01	-159.25	-855.56
March	2044.31	1143.64	281.80	273.50	-493.31
April	-568.10	-407.86	-1020.66	136.64	-259.10
May	-1292.68	-1025.80	-8.08	-281.49	-173.00
June	966.28	723.29	88.75	57.01	-745.23
July	-2067.34	-33.33	-1190.31	-816.72	368.75
August	114.27	-295.51	299.64	1186.75	-873.99
September	1040.65	291.92	-501.46	-1465.73	187.93
October	129.06	955.29	-732.68	-1507.39	2333.80
November	1490.33	-658.89	12.76	-620.20	-381.90
December	-33.15	-1147.61	-1057.52	-3.90	-785.99

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Python Intensive Course

10 Mar 2025 to 14 Mar 2025
07 Apr 2025 to 11 Apr 2025
23 Jun 2025 to 27 Jun 2025
28 Jul 2025 to 01 Aug 2025

Data Analysis with Python

12 Mar 2025 to 14 Mar 2025
09 Apr 2025 to 11 Apr 2025
04 Jun 2025 to 06 Jun 2025
30 Jul 2025 to 01 Aug 2025

Efficient Data Analysis with Pandas

10 Mar 2025 to 11 Mar 2025
07 Apr 2025 to 08 Apr 2025
02 Jun 2025 to 03 Jun 2025
23 Jun 2025 to 24 Jun 2025
28 Jul 2025 to 29 Jul 2025

Python Text Processing Course

09 Apr 2025 to 11 Apr 2025
04 Jun 2025 to 06 Jun 2025

Overview of Python training courses

Machine Learning with Python training courses