26. Pandas DataFrame
By Bernd Klein. Last modified: 21 Feb 2024.
The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.
Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data typye, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
Here's a brief introduction to key concepts related to DataFrames:
-
Rows and Columns: A DataFrame consists of rows and columns. Rows are indexed by default starting from 0, while columns have labels that identify them. The labels are typically strings but can also be integers.
-
Data Types: Each column in a DataFrame can have its own data type. Common data types include integers, floats, strings, datetime objects, and categorical data.
-
Indexing and Selection: DataFrames support various methods for indexing and selecting data, such as indexing by label, integer-location based indexing, boolean indexing, and more.
-
Operations: DataFrames support a wide range of operations, including arithmetic operations, element-wise operations, aggregation functions (e.g., mean, sum), and statistical operations.
-
Missing Data Handling: DataFrames provide methods for handling missing data, such as dropping missing values or filling them with a specified value.
-
Merging and Concatenation: DataFrames can be merged or concatenated to combine data from different sources, either by rows or columns.
-
Grouping and Aggregation: DataFrames support grouping operations, allowing you to group data based on one or more columns and perform aggregation functions on each group.
-
Visualization: DataFrames can be easily visualized using built-in plotting functions in pandas or by integrating with other visualization libraries like Matplotlib or Seaborn.
Overall, DataFrames are a powerful tool for data analysis and manipulation, providing a flexible and intuitive way to work with structured data in Python.
Connection between DataFrames and Series Objects
There is a close connection between the DataFrames and the Series of Pandas. A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame.
We will demonstrate this in the following example.
We define the following three Series:
import pandas as pd
years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)
What happens, if we concatenate these "shop" Series? Pandas provides a concat function for this purpose:
pd.concat([shop1, shop2, shop3])
OUTPUT:
2014 2409.14 2015 2941.01 2016 3496.83 2017 3119.55 2014 1203.45 2015 3441.62 2016 3007.83 2017 3619.53 2014 3412.12 2015 3491.16 2016 3457.19 2017 1963.10 dtype: float64
This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1":
shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df
0 | 1 | 2 | |
---|---|---|---|
2014 | 2409.14 | 1203.45 | 3412.12 |
2015 | 2941.01 | 3441.62 | 3491.16 |
2016 | 3496.83 | 3007.83 | 3457.19 |
2017 | 3119.55 | 3619.53 | 1963.10 |
In this example, each column in shops_df represents a shop, and each cell contains the corresponding value from the original shop Series object. Each column is still a Pandas Series object, with its own index and values, whereas the whole structure is now a DataFrame object:
print(type(shops_df))
print(type(shops_df[0]))
OUTPUT:
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>
Let's do some fine sanding by giving names to the columns:
cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities
print(shops_df)
# alternative way: give names to series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
OUTPUT:
Zürich Winterthur Freiburg 2014 2409.14 1203.45 3412.12 2015 2941.01 3441.62 3491.16 2016 3496.83 3007.83 3457.19 2017 3119.55 3619.53 1963.10 ------ Zürich Winterthur Freiburg 2014 2409.14 1203.45 3412.12 2015 2941.01 3441.62 3491.16 2016 3496.83 3007.83 3457.19 2017 3119.55 3619.53 1963.10
This was nice, but what kind of data type is our result?
print(type(shops_df))
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
This means, we can arrange or concat Series into DataFrames!
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
DataFrames from Dictionaries
A DataFrame has a row and column index; it's like a dict of Series with a common index.
cities = {"name": ["London", "Berlin", "Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235, 2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain", "Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame
name | population | country | |
---|---|---|---|
0 | London | 8615246 | England |
1 | Berlin | 3562166 | Germany |
2 | Madrid | 3165235 | Spain |
3 | Rome | 2874038 | Italy |
4 | Paris | 2273305 | France |
5 | Vienna | 1805681 | Austria |
6 | Bucharest | 1803425 | Romania |
7 | Hamburg | 1760433 | Germany |
8 | Budapest | 1754000 | Hungary |
9 | Warsaw | 1740119 | Poland |
10 | Barcelona | 1602386 | Spain |
11 | Munich | 1493900 | Germany |
12 | Milan | 1350680 | Italy |
Retrieving the Column Names
It's possible to get the names of the columns as a list:
city_frame.columns.values
OUTPUT:
array(['name', 'population', 'country'], dtype=object)
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Upcoming online Courses
Custom Index
We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a custom index to the DataFrame object:
ordinals = ["first", "second", "third", "fourth",
"fifth", "sixth", "seventh", "eigth",
"ninth", "tenth", "eleventh", "twelvth",
"thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame
name | population | country | |
---|---|---|---|
first | London | 8615246 | England |
second | Berlin | 3562166 | Germany |
third | Madrid | 3165235 | Spain |
fourth | Rome | 2874038 | Italy |
fifth | Paris | 2273305 | France |
sixth | Vienna | 1805681 | Austria |
seventh | Bucharest | 1803425 | Romania |
eigth | Hamburg | 1760433 | Germany |
ninth | Budapest | 1754000 | Hungary |
tenth | Warsaw | 1740119 | Poland |
eleventh | Barcelona | 1602386 | Spain |
twelvth | Munich | 1493900 | Germany |
thirteenth | Milan | 1350680 | Italy |
Rearranging the Order of Columns
We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This makes also sure that we will have a defined ordering of our columns, if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:
city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"population"])
city_frame
name | country | population | |
---|---|---|---|
0 | London | England | 8615246 |
1 | Berlin | Germany | 3562166 |
2 | Madrid | Spain | 3165235 |
3 | Rome | Italy | 2874038 |
4 | Paris | France | 2273305 |
5 | Vienna | Austria | 1805681 |
6 | Bucharest | Romania | 1803425 |
7 | Hamburg | Germany | 1760433 |
8 | Budapest | Hungary | 1754000 |
9 | Warsaw | Poland | 1740119 |
10 | Barcelona | Spain | 1602386 |
11 | Munich | Germany | 1493900 |
12 | Milan | Italy | 1350680 |
We change both the column order and the ordering of the index with the function reindex
with the following code:
city_frame.reindex(index=[0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11],
columns=['country', 'name', 'population'])
country | name | population | |
---|---|---|---|
0 | England | London | 8615246 |
2 | Spain | Madrid | 3165235 |
4 | France | Paris | 2273305 |
6 | Romania | Bucharest | 1803425 |
8 | Hungary | Budapest | 1754000 |
10 | Spain | Barcelona | 1602386 |
12 | Italy | Milan | 1350680 |
1 | Germany | Berlin | 3562166 |
3 | Italy | Rome | 2874038 |
5 | Austria | Vienna | 1805681 |
7 | Germany | Hamburg | 1760433 |
9 | Poland | Warsaw | 1740119 |
11 | Germany | Munich | 1493900 |
Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
We will rename the columns of our DataFrame into Romanian names in the following example. We set the parameter inplace to True so that our DataFrame will be changed instead of returning a new DataFrame, if inplace is set to False, which is the default!
city_frame.rename(columns={"name":"Soyadı",
"country":"Ülke",
"population":"Nüfus"},
inplace=True)
city_frame
Soyadı | Ülke | Nüfus | |
---|---|---|---|
0 | London | England | 8615246 |
1 | Berlin | Germany | 3562166 |
2 | Madrid | Spain | 3165235 |
3 | Rome | Italy | 2874038 |
4 | Paris | France | 2273305 |
5 | Vienna | Austria | 1805681 |
6 | Bucharest | Romania | 1803425 |
7 | Hamburg | Germany | 1760433 |
8 | Budapest | Hungary | 1754000 |
9 | Warsaw | Poland | 1740119 |
10 | Barcelona | Spain | 1602386 |
11 | Munich | Germany | 1493900 |
12 | Milan | Italy | 1350680 |
Existing Column as the Index of a DataFrame
We want to create a more useful index in the following example. We will use the country name as the index, i.e. the list value associated to the key "country" of our cities dictionary:
city_frame = pd.DataFrame(cities,
columns=["name", "population"],
index=cities["country"])
city_frame
name | population | |
---|---|---|
England | London | 8615246 |
Germany | Berlin | 3562166 |
Spain | Madrid | 3165235 |
Italy | Rome | 2874038 |
France | Paris | 2273305 |
Austria | Vienna | 1805681 |
Romania | Bucharest | 1803425 |
Germany | Hamburg | 1760433 |
Hungary | Budapest | 1754000 |
Poland | Warsaw | 1740119 |
Spain | Barcelona | 1602386 |
Germany | Munich | 1493900 |
Italy | Milan | 1350680 |
Alternatively, we can change an existing DataFrame. We can us the method set_index to turn a column into an index. "set_index" does not work in-place, it returns a new data frame with the chosen column as the index:
city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
OUTPUT:
name population country England London 8615246 Germany Berlin 3562166 Spain Madrid 3165235 Italy Rome 2874038 France Paris 2273305 Austria Vienna 1805681 Romania Bucharest 1803425 Germany Hamburg 1760433 Hungary Budapest 1754000 Poland Warsaw 1740119 Spain Barcelona 1602386 Germany Munich 1493900 Italy Milan 1350680
We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
OUTPUT:
name population country England London 8615246 Germany Berlin 3562166 Spain Madrid 3165235 Italy Rome 2874038 France Paris 2273305 Austria Vienna 1805681 Romania Bucharest 1803425 Germany Hamburg 1760433 Hungary Budapest 1754000 Poland Warsaw 1740119 Spain Barcelona 1602386 Germany Munich 1493900 Italy Milan 1350680
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Accessing Rows via Index Values
So far we have accessed DataFrames via the columns. It is often necessary to select certain rows via the index names. We will demonstrate now, how we can access rows from DataFrames via the locators 'loc' and 'iloc'. We will not cover 'ix' because it is deprecated and will be removed in the future.
We select all the German cities in the following example by using 'loc'. The result is a DataFrame:
city_frame = pd.DataFrame(cities,
columns=("name", "population"),
index=cities["country"])
print(city_frame.loc["Germany"])
OUTPUT:
name population Germany Berlin 3562166 Germany Hamburg 1760433 Germany Munich 1493900
It is also possible to simultaneously extracting rows by chosen more than on index labels. To do this we use a list of indices:
print(city_frame.loc[["Germany", "France"]])
OUTPUT:
name population Germany Berlin 3562166 Germany Hamburg 1760433 Germany Munich 1493900 France Paris 2273305
We will also need to select pandas DataFrame rows based on conditions, which are applied to column values. We can use the operators '>', '=', '=', '<=', '!=' for this purpose. We select all cities with a population of more than two million in the following example:
condition = city_frame.population>2000000
condition
OUTPUT:
England True Germany True Spain True Italy True France True Austria False Romania False Germany False Hungary False Poland False Spain False Germany False Italy False Name: population, dtype: bool
We can use this Boolean DataFrame condition
with loc
to finally create the selection:
print(city_frame.loc[condition])
OUTPUT:
name population England London 8615246 Germany Berlin 3562166 Spain Madrid 3165235 Italy Rome 2874038 France Paris 2273305
It is also possible to logically combine more than one condition with &
and |
:
condition1 = (city_frame.population>1500000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 & condition2])
OUTPUT:
name population Italy Rome 2874038 Germany Hamburg 1760433
We use a logical or |
in the following example to see all cities of the Pandas DataFrame, where either the city name contains the letter 'm' or the population number is greater than three million:
condition1 = (city_frame.population>3000000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 | condition2])
OUTPUT:
name population England London 8615246 Germany Berlin 3562166 Spain Madrid 3165235 Italy Rome 2874038 Germany Hamburg 1760433
Adding Rows to a DataFrame
milan = ['Milan', 1399860]
city_frame.iloc[-1] = milan
city_frame.loc['Switzerland'] = ['Zurich', 415215]
city_frame
name | population | |
---|---|---|
England | London | 8615246 |
Germany | Berlin | 3562166 |
Spain | Madrid | 3165235 |
Italy | Rome | 2874038 |
France | Paris | 2273305 |
Austria | Vienna | 1805681 |
Romania | Bucharest | 1803425 |
Germany | Hamburg | 1760433 |
Hungary | Budapest | 1754000 |
Poland | Warsaw | 1740119 |
Spain | Barcelona | 1602386 |
Germany | Munich | 1493900 |
Italy | Milan | 1399860 |
Switzerland | Zurich | 415215 |
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Accessing Rows by Position
The iloc
method of a Pandas DataFrame object can be used to select rows and columns by number, i.e. in the order that they appear in the data frame. iloc
allows selections of the rows, as if they were numbered by integers 0
, 1
, 2
, ....
We demonstrate this in the following example:
df = city_frame.iloc[3]
print(df)
OUTPUT:
name Rome population 2874038 Name: Italy, dtype: object
To get a DataFrame with selected rows by numbers, we use a list of integers. We can see that we can change the order of the rows and we are also able to select rows multiple times:
df = city_frame.iloc[[3, 2, 0, 5, 0]]
print(df)
OUTPUT:
name population Italy Rome 2874038 Spain Madrid 3165235 England London 8615246 Austria Vienna 1805681 England London 8615246
Sum and Cumulative Sum
The DataFrame object of Pandas provides a method to sum both columns and rows. Before we will explain the usage of the sum method, we will create a new DataFrame object on which we will apply our examples. We will start by creating an empty DataFrame without columns but an index. We populate this DataFrame by adding columns with random values:
import numpy as np
years = range(2014, 2019)
cities = ["Zürich", "Freiburg", "München", "Konstanz", "Saarbrücken"]
shops = pd.DataFrame(index=years)
for city in cities:
shops.insert(loc=len(shops.columns),
column=city,
value=(np.random.uniform(0.7, 1, (5,)) * 1000).round(2))
print(shops)
OUTPUT:
Zürich Freiburg München Konstanz Saarbrücken 2014 734.01 791.84 813.07 806.50 913.84 2015 991.69 906.50 939.86 832.32 951.74 2016 945.20 766.72 956.92 744.36 800.71 2017 824.21 856.26 745.96 790.43 956.18 2018 917.76 923.75 732.05 992.13 879.36
Let's apply sum
to the DataFrame shops
:
shops.sum()
OUTPUT:
Zürich 4412.87 Freiburg 4245.07 München 4187.86 Konstanz 4165.74 Saarbrücken 4501.83 dtype: float64
We can see that it summed up all the columns of our DataFrame. What about calculating the sum of the rows? We can do this by using the axis
parameter of sum
.
shops.sum(axis=1)
OUTPUT:
2014 4059.26 2015 4622.11 2016 4213.91 2017 4173.04 2018 4445.05 dtype: float64
You only want to the the sums for the first, third and the last column and for all the years:
s = shops.iloc[:, [0, 2, -1]]
print(s)
print("and now the sum:")
print(s.sum())
OUTPUT:
Zürich München Saarbrücken 2014 734.01 813.07 913.84 2015 991.69 939.86 951.74 2016 945.20 956.92 800.71 2017 824.21 745.96 956.18 2018 917.76 732.05 879.36 and now the sum: Zürich 4412.87 München 4187.86 Saarbrücken 4501.83 dtype: float64
Of course, you could have also have achieved it in the following way, if the column names are known:
shops[["Zürich", "München", "Saarbrücken"]].sum()
OUTPUT:
Zürich 4412.87 München 4187.86 Saarbrücken 4501.83 dtype: float64
We can use "cumsum" to calculate the cumulative sum over the years:
x = shops.cumsum()
print(x)
OUTPUT:
Zürich Freiburg München Konstanz Saarbrücken 2014 734.01 791.84 813.07 806.50 913.84 2015 1725.70 1698.34 1752.93 1638.82 1865.58 2016 2670.90 2465.06 2709.85 2383.18 2666.29 2017 3495.11 3321.32 3455.81 3173.61 3622.47 2018 4412.87 4245.07 4187.86 4165.74 4501.83
Using the keyword parameter axis
with the value 1, we can build the cumulative sum over the rows:
x = shops.cumsum(axis=1)
print(x)
OUTPUT:
Zürich Freiburg München Konstanz Saarbrücken 2014 734.01 1525.85 2338.92 3145.42 4059.26 2015 991.69 1898.19 2838.05 3670.37 4622.11 2016 945.20 1711.92 2668.84 3413.20 4213.91 2017 824.21 1680.47 2426.43 3216.86 4173.04 2018 917.76 1841.51 2573.56 3565.69 4445.05
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Assigning New Columns
We will assign a new column to a DataFrame. We will add the cumulative population sum as a new column with the name "cum_population":
cities = {"name": ["London", "Berlin", "Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235, 2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain", "Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities,
columns=["country",
"population",
"cum_population"],
index=cities["name"])
city_frame
country | population | cum_population | |
---|---|---|---|
London | England | 8615246 | NaN |
Berlin | Germany | 3562166 | NaN |
Madrid | Spain | 3165235 | NaN |
Rome | Italy | 2874038 | NaN |
Paris | France | 2273305 | NaN |
Vienna | Austria | 1805681 | NaN |
Bucharest | Romania | 1803425 | NaN |
Hamburg | Germany | 1760433 | NaN |
Budapest | Hungary | 1754000 | NaN |
Warsaw | Poland | 1740119 | NaN |
Barcelona | Spain | 1602386 | NaN |
Munich | Germany | 1493900 | NaN |
Milan | Italy | 1350680 | NaN |
We can see that the column "cum_population" is set to Nan, as we haven't provided any data for it.
We will assign now the cumulative sums to this column:
city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
country | population | cum_population | |
---|---|---|---|
London | England | 8615246 | 8615246 |
Berlin | Germany | 3562166 | 12177412 |
Madrid | Spain | 3165235 | 15342647 |
Rome | Italy | 2874038 | 18216685 |
Paris | France | 2273305 | 20489990 |
Vienna | Austria | 1805681 | 22295671 |
Bucharest | Romania | 1803425 | 24099096 |
Hamburg | Germany | 1760433 | 25859529 |
Budapest | Hungary | 1754000 | 27613529 |
Warsaw | Poland | 1740119 | 29353648 |
Barcelona | Spain | 1602386 | 30956034 |
Munich | Germany | 1493900 | 32449934 |
Milan | Italy | 1350680 | 33800614 |
We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
print(city_frame)
OUTPUT:
country area population London England NaN 8615246 Berlin Germany NaN 3562166 Madrid Spain NaN 3165235 Rome Italy NaN 2874038 Paris France NaN 2273305 Vienna Austria NaN 1805681 Bucharest Romania NaN 1803425 Hamburg Germany NaN 1760433 Budapest Hungary NaN 1754000 Warsaw Poland NaN 1740119 Barcelona Spain NaN 1602386 Munich Germany NaN 1493900 Milan Italy NaN 1350680
Accessing the Columns of a DataFrame
There are two ways to access a column of a DataFrame. The result is in both cases a Series:
# in a dictionary-like way:
print(city_frame["population"])
OUTPUT:
London 8615246 Berlin 3562166 Madrid 3165235 Rome 2874038 Paris 2273305 Vienna 1805681 Bucharest 1803425 Hamburg 1760433 Budapest 1754000 Warsaw 1740119 Barcelona 1602386 Munich 1493900 Milan 1350680 Name: population, dtype: int64
# as an attribute
print(city_frame.population)
OUTPUT:
London 8615246 Berlin 3562166 Madrid 3165235 Rome 2874038 Paris 2273305 Vienna 1805681 Bucharest 1803425 Hamburg 1760433 Budapest 1754000 Warsaw 1740119 Barcelona 1602386 Munich 1493900 Milan 1350680 Name: population, dtype: int64
print(type(city_frame.population))
OUTPUT:
<class 'pandas.core.series.Series'>
city_frame.population
OUTPUT:
London 8615246 Berlin 3562166 Madrid 3165235 Rome 2874038 Paris 2273305 Vienna 1805681 Bucharest 1803425 Hamburg 1760433 Budapest 1754000 Warsaw 1740119 Barcelona 1602386 Munich 1493900 Milan 1350680 Name: population, dtype: int64
From the previous example, we can see that we have not copied the population column. "p" is a view on the data of city_frame.
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Assigning New Values to a Column
The column area is still not defined. We can set all elements of the column to the same value:
city_frame["area"] = 1572
print(city_frame)
OUTPUT:
country area population London England 1572 8615246 Berlin Germany 1572 3562166 Madrid Spain 1572 3165235 Rome Italy 1572 2874038 Paris France 1572 2273305 Vienna Austria 1572 1805681 Bucharest Romania 1572 1803425 Hamburg Germany 1572 1760433 Budapest Hungary 1572 1754000 Warsaw Poland 1572 1740119 Barcelona Spain 1572 1602386 Munich Germany 1572 1493900 Milan Italy 1572 1350680
In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs to have the same length as the number of rows in our DataFrame.
# area in square km:
area = [1572, 891.85, 605.77, 1285,
105.4, 414.6, 228, 755,
525.2, 517, 101.9, 310.4,
181.8]
# area could have been designed as a list, a Series, an array or a scalar
city_frame["area"] = area
print(city_frame)
OUTPUT:
country area population London England 1572.00 8615246 Berlin Germany 891.85 3562166 Madrid Spain 605.77 3165235 Rome Italy 1285.00 2874038 Paris France 105.40 2273305 Vienna Austria 414.60 1805681 Bucharest Romania 228.00 1803425 Hamburg Germany 755.00 1760433 Budapest Hungary 525.20 1754000 Warsaw Poland 517.00 1740119 Barcelona Spain 101.90 1602386 Munich Germany 310.40 1493900 Milan Italy 181.80 1350680
Sorting DataFrames
Let's sort our DataFrame according to the city area:
city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
OUTPUT:
country area population London England 1572.00 8615246 Rome Italy 1285.00 2874038 Berlin Germany 891.85 3562166 Hamburg Germany 755.00 1760433 Madrid Spain 605.77 3165235 Budapest Hungary 525.20 1754000 Warsaw Poland 517.00 1740119 Vienna Austria 414.60 1805681 Munich Germany 310.40 1493900 Bucharest Romania 228.00 1803425 Milan Italy 181.80 1350680 Paris France 105.40 2273305 Barcelona Spain 101.90 1602386
Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
OUTPUT:
country area population London England 1572.0 8615246 Berlin Germany NaN 3562166 Madrid Spain NaN 3165235 Rome Italy NaN 2874038 Paris France NaN 2273305 Vienna Austria NaN 1805681 Bucharest Romania NaN 1803425 Hamburg Germany 755.0 1760433 Budapest Hungary NaN 1754000 Warsaw Poland NaN 1740119 Barcelona Spain NaN 1602386 Munich Germany NaN 1493900 Milan Italy 181.8 1350680
insert(self, loc, column, value, allow_duplicates=False)`
city_frame = pd.DataFrame(cities,
columns=["country",
"population"],
index=cities["name"])
idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame
country | area | population | |
---|---|---|---|
London | England | 1572.00 | 8615246 |
Berlin | Germany | 891.85 | 3562166 |
Madrid | Spain | 605.77 | 3165235 |
Rome | Italy | 1285.00 | 2874038 |
Paris | France | 105.40 | 2273305 |
Vienna | Austria | 414.60 | 1805681 |
Bucharest | Romania | 228.00 | 1803425 |
Hamburg | Germany | 755.00 | 1760433 |
Budapest | Hungary | 525.20 | 1754000 |
Warsaw | Poland | 517.00 | 1740119 |
Barcelona | Spain | 101.90 | 1602386 |
Munich | Germany | 310.40 | 1493900 |
Milan | Italy | 181.80 | 1350680 |
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Creating a DataFrame by appending Rows
import pandas as pd
from numpy.random import randint
df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
for i in range(5):
df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
df
lib | qty1 | qty2 | |
---|---|---|---|
0 | name0 | 1 | 4 |
1 | name1 | 6 | 5 |
2 | name2 | 4 | 2 |
3 | name3 | 1 | 9 |
4 | name4 | 3 | 1 |
DataFrame from Nested Dictionaries
A nested dictionary of dicts can be passed to a DataFrame as well. The indices of the outer dictionary are taken as the the columns and the inner keys. i.e. the keys of the nested dictionaries, are used as the row indices:
growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
"Germany": {"2010": 4.1, "2011": 3.6, "2012": 0.4, "2013": 0.1},
"France": {"2010":2.0, "2011":2.1, "2012": 0.3, "2013": 0.3},
"Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013": -3.3},
"Italy": {"2010":1.7, "2011": 0.6, "2012":-2.3, "2013":-1.9}
}
growth_frame = pd.DataFrame(growth)
growth_frame
Switzerland | Germany | France | Greece | Italy | |
---|---|---|---|---|---|
2010 | 3.0 | 4.1 | 2.0 | -5.4 | 1.7 |
2011 | 1.8 | 3.6 | 2.1 | -8.9 | 0.6 |
2012 | 1.1 | 0.4 | 0.3 | -6.6 | -2.3 |
2013 | 1.9 | 0.1 | 0.3 | -3.3 | -1.9 |
You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:
growth_frame.T
2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|
Switzerland | 3.0 | 1.8 | 1.1 | 1.9 |
Germany | 4.1 | 3.6 | 0.4 | 0.1 |
France | 2.0 | 2.1 | 0.3 | 0.3 |
Greece | -5.4 | -8.9 | -6.6 | -3.3 |
Italy | 1.7 | 0.6 | -2.3 | -1.9 |
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
"Italy",
"Germany",
"Greece"])
print(growth_frame2)
OUTPUT:
2010 2011 2012 2013 Switzerland 3.0 1.8 1.1 1.9 Italy 1.7 0.6 -2.3 -1.9 Germany 4.1 3.6 0.4 0.1 Greece -5.4 -8.9 -6.6 -3.3
Filling a DataFrame with random values:
import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
"April", "May", "June",
"July", "August", "September",
"October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2),
columns=names,
index=index)
df
Frank | Eve | Stella | Guido | Lara | |
---|---|---|---|---|---|
January | -1312.08 | 232.99 | 1686.19 | -1182.34 | -1916.60 |
February | 676.24 | -1226.14 | -1175.01 | -159.25 | -855.56 |
March | 2044.31 | 1143.64 | 281.80 | 273.50 | -493.31 |
April | -568.10 | -407.86 | -1020.66 | 136.64 | -259.10 |
May | -1292.68 | -1025.80 | -8.08 | -281.49 | -173.00 |
June | 966.28 | 723.29 | 88.75 | 57.01 | -745.23 |
July | -2067.34 | -33.33 | -1190.31 | -816.72 | 368.75 |
August | 114.27 | -295.51 | 299.64 | 1186.75 | -873.99 |
September | 1040.65 | 291.92 | -501.46 | -1465.73 | 187.93 |
October | 129.06 | 955.29 | -732.68 | -1507.39 | 2333.80 |
November | 1490.33 | -658.89 | 12.76 | -620.20 | -381.90 |
December | -33.15 | -1147.61 | -1057.52 | -3.90 | -785.99 |
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Upcoming online Courses