30. Pandas Groupby Example
By Bernd Klein. Last modified: 21 Feb 2024.
In this part of our Python course we would like to give you a detailed example of how to use groupby. We will use the data file donations.txt.
The first few lines of this data file look like this:
firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142
Heinz-Joachim,Wagner,Stuttgart,Engineer,109300,1592
The data was artificially created with the Faker module, so any possible correspondence with real, existing persons is purely coincidental and not intended. Each line contains the first name and the surname of a fictitious person, plus the person's city, the job position (one of politician, student, engineer, musician and manager), the yearly income and the sum of donations per year.
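The original generator script is not part of this page, but the following is a minimal sketch of how such a file could be produced with Faker. The output file name donations_fake.txt, the city list, the income range and the donation rate are our own assumptions for illustration:

import csv
import random
from faker import Faker

fake = Faker('de_DE')    # German locale for first names and surnames
jobs = ['Politician', 'Student', 'Engineer', 'Musician', 'Manager']
cities = ['Berlin', 'Freiburg', 'Hamburg', 'Karlsruhe', 'Konstanz', 'Köln', 'Stuttgart']

with open('donations_fake.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['firstname', 'surname', 'city', 'job', 'income', 'donations'])
    for _ in range(100):
        income = random.randrange(10000, 400000, 100)            # assumed income range
        donations = round(income * random.uniform(0.005, 0.02))  # assumed 0.5 % to 2 % donation rate
        writer.writerow([fake.first_name(), fake.last_name(),
                         random.choice(cities), random.choice(jobs),
                         income, donations])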
We first read in the data file, which is in CSV format despite its .txt extension:
import pandas as pd
fname = '../data/donations.txt'
data = pd.read_csv(fname, usecols=[2, 3, 4, 5])
data[:10]
|   | city      | job        | income | donations |
|---|-----------|------------|--------|-----------|
| 0 | Karlsruhe | Politician | 244400 | 2512      |
| 1 | Freiburg  | Student    | 16800  | 336       |
| 2 | Hamburg   | Engineer   | 116900 | 1479      |
| 3 | Köln      | Musician   | 57700  | 1142      |
| 4 | Stuttgart | Engineer   | 109300 | 1592      |
| 5 | Hamburg   | Student    | 12500  | 250       |
| 6 | Freiburg  | Engineer   | 128700 | 1984      |
| 7 | Stuttgart | Politician | 161300 | 822       |
| 8 | Freiburg  | Engineer   | 129000 | 2159      |
| 9 | Stuttgart | Musician   | 108800 | 1516      |
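A side note on read_csv: instead of column positions, usecols also accepts column labels, so the same selection can be written by name:

data = pd.read_csv(fname, usecols=['city', 'job', 'income', 'donations'])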
First, let's look at how much each occupational group earned and donated in total. We can do this by using the Pandas groupby method together with sum:
data_sum = data[['job', 'income', 'donations']].groupby(['job']).sum()
data_sum
| job        | income   | donations |
|------------|----------|-----------|
| Engineer   | 2067200  | 25564     |
| Manager    | 12862600 | 87475     |
| Musician   | 1448700  | 24376     |
| Politician | 4118300  | 30758     |
| Student    | 372900   | 7458      |
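If you want the aggregated columns to carry more descriptive names right away, a possible alternative (assuming pandas 0.25 or newer) is named aggregation with agg; the variable data_sum2 is only introduced for this example:

data_sum2 = data.groupby('job').agg(total_income=('income', 'sum'),
                                    total_donations=('donations', 'sum'))
data_sum2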
By using sort_values on the donations column, we can sort this DataFrame:
data_sum.sort_values(by='donations')
| job        | income   | donations |
|------------|----------|-----------|
| Student    | 372900   | 7458      |
| Musician   | 1448700  | 24376     |
| Engineer   | 2067200  | 25564     |
| Politician | 4118300  | 30758     |
| Manager    | 12862600 | 87475     |
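If you prefer the biggest donors at the top, sort in descending order instead:

data_sum.sort_values(by='donations', ascending=False)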
Let's look at the following three people from our data file:
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Who is the most generous of them? The politician Janett Schwital, one might say, because 2512 euros is nearly eight times as much as the student Daniele Segebahn has donated and nearly twice as much as the engineer Kirstin Klapp has donated.
Most people, however, would argue that we have to look at the donations in relation to the incomes. Then we get the following picture:
The politician donated about 1 % of her income, the engineer about 1.3 % and the student 2 %. So we could say that Daniele is the most generous one in relation to her income!
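A quick check of these percentages with the values from the three lines above:

for name, income, donations in [('Janett (Politician)', 244400, 2512),
                                ('Daniele (Student)', 16800, 336),
                                ('Kirstin (Engineer)', 116900, 1479)]:
    print(f'{name}: {donations * 100 / income:.2f} % of the income donated')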
Let's do the same thing on the previous DataFrame:
data_sum['relative'] = data_sum.donations * 100 / data_sum.income
data_sum.sort_values(by='relative')
| job        | income   | donations | relative |
|------------|----------|-----------|----------|
| Manager    | 12862600 | 87475     | 0.680072 |
| Politician | 4118300  | 30758     | 0.746862 |
| Engineer   | 2067200  | 25564     | 1.236649 |
| Musician   | 1448700  | 24376     | 1.682612 |
| Student    | 372900   | 7458      | 2.000000 |
So we see that the managers and the politicians are very stingy and the students are extremely generous. We are not sure whether we can generalize this by saying: the less people have, the more generous they are. You shouldn't forget that the whole data set is fake!
Now we want to count the number of people in each job group for each city. Using count together with groupby in the following way is helpful but not nice: the numbers in the columns 'income' and 'donations' are the counts we want, but the column names are misleading and we don't need the same result twice:
x = data.groupby(['city', 'job']).count()
x
| city      | job        | income | donations |
|-----------|------------|--------|-----------|
| Berlin    | Engineer   | 3      | 3         |
|           | Manager    | 3      | 3         |
|           | Musician   | 2      | 2         |
|           | Politician | 1      | 1         |
|           | Student    | 2      | 2         |
| Freiburg  | Engineer   | 3      | 3         |
|           | Manager    | 5      | 5         |
|           | Musician   | 3      | 3         |
|           | Politician | 3      | 3         |
|           | Student    | 5      | 5         |
| Hamburg   | Engineer   | 4      | 4         |
|           | Musician   | 4      | 4         |
|           | Politician | 1      | 1         |
|           | Student    | 2      | 2         |
| Karlsruhe | Engineer   | 1      | 1         |
|           | Manager    | 3      | 3         |
|           | Politician | 7      | 7         |
|           | Student    | 2      | 2         |
| Konstanz  | Engineer   | 2      | 2         |
|           | Manager    | 5      | 5         |
|           | Musician   | 3      | 3         |
|           | Politician | 4      | 4         |
|           | Student    | 3      | 3         |
| Köln      | Engineer   | 1      | 1         |
|           | Manager    | 3      | 3         |
|           | Musician   | 2      | 2         |
|           | Politician | 1      | 1         |
|           | Student    | 3      | 3         |
| Stuttgart | Engineer   | 4      | 4         |
|           | Manager    | 4      | 4         |
|           | Musician   | 4      | 4         |
|           | Politician | 3      | 3         |
|           | Student    | 4      | 4         |
We could do the following to get a nicer result, but this approach is clumsy:
people_job_city = data.groupby(['city', 'job']).count()
people_job_city.drop(columns=['income'], inplace=True)
people_job_city.rename(columns={'donations': 'number of people'})
| city      | job        | number of people |
|-----------|------------|------------------|
| Berlin    | Engineer   | 3                |
|           | Manager    | 3                |
|           | Musician   | 2                |
|           | Politician | 1                |
|           | Student    | 2                |
| Freiburg  | Engineer   | 3                |
|           | Manager    | 5                |
|           | Musician   | 3                |
|           | Politician | 3                |
|           | Student    | 5                |
| Hamburg   | Engineer   | 4                |
|           | Musician   | 4                |
|           | Politician | 1                |
|           | Student    | 2                |
| Karlsruhe | Engineer   | 1                |
|           | Manager    | 3                |
|           | Politician | 7                |
|           | Student    | 2                |
| Konstanz  | Engineer   | 2                |
|           | Manager    | 5                |
|           | Musician   | 3                |
|           | Politician | 4                |
|           | Student    | 3                |
| Köln      | Engineer   | 1                |
|           | Manager    | 3                |
|           | Musician   | 2                |
|           | Politician | 1                |
|           | Student    | 3                |
| Stuttgart | Engineer   | 4                |
|           | Manager    | 4                |
|           | Musician   | 4                |
|           | Politician | 3                |
|           | Student    | 4                |
There is a better way to solve this with the groupby method of Pandas. In the following solution, groupby groups the DataFrame by the specified columns (in this case, "city" and "job"), and the size method returns the number of rows in each group, i.e. the count for each combination of city and job. Note that the result is a Series object, not a DataFrame:
result = data[['city', 'job']].groupby(['city', 'job']).size()
result
OUTPUT:
city       job
Berlin     Engineer      3
           Manager       3
           Musician      2
           Politician    1
           Student       2
Freiburg   Engineer      3
           Manager       5
           Musician      3
           Politician    3
           Student       5
Hamburg    Engineer      4
           Musician      4
           Politician    1
           Student       2
Karlsruhe  Engineer      1
           Manager       3
           Politician    7
           Student       2
Konstanz   Engineer      2
           Manager       5
           Musician      3
           Politician    4
           Student       3
Köln       Engineer      1
           Manager       3
           Musician      2
           Politician    1
           Student       3
Stuttgart  Engineer      4
           Manager       4
           Musician      4
           Politician    3
           Student       4
dtype: int64
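If you would rather have a DataFrame with a descriptive column name than a plain Series, the result can be converted, for example with reset_index or unstack; the column name 'number of people' is our choice:

# Series -> DataFrame with a named count column
result.reset_index(name='number of people')

# alternatively: the counts as a city x job table
result.unstack(fill_value=0)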
More interesting than the previous result are the total donations, and especially the donations in relation to the income, for each job and city:
city_job_data = data.groupby(['city', 'job']).sum()
city_job_data[:12] # the first 12 lines
| city     | job        | income  | donations |
|----------|------------|---------|-----------|
| Berlin   | Engineer   | 334300  | 3732      |
|          | Manager    | 1916000 | 16290     |
|          | Musician   | 135500  | 2644      |
|          | Politician | 246700  | 1103      |
|          | Student    | 31200   | 624       |
| Freiburg | Engineer   | 358400  | 5431      |
|          | Manager    | 2423700 | 17811     |
|          | Musician   | 273800  | 4640      |
|          | Politician | 489300  | 4352      |
|          | Student    | 87500   | 1750      |
| Hamburg  | Engineer   | 448400  | 5390      |
|          | Musician   | 358800  | 5359      |
Again, we prefer to see the donations in relation to the income:
city_job_data['rel_donations'] = (city_job_data['donations'] * 100 / city_job_data['income']).round(2)
city_job_data.drop(['income', 'donations'], axis=1, inplace=True)
city_job_data[:12]
| city     | job        | rel_donations |
|----------|------------|---------------|
| Berlin   | Engineer   | 1.12          |
|          | Manager    | 0.85          |
|          | Musician   | 1.95          |
|          | Politician | 0.45          |
|          | Student    | 2.00          |
| Freiburg | Engineer   | 1.52          |
|          | Manager    | 0.73          |
|          | Musician   | 1.69          |
|          | Politician | 0.89          |
|          | Student    | 2.00          |
| Hamburg  | Engineer   | 1.20          |
|          | Musician   | 1.49          |
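To see directly which city-job combinations donate the largest share of their income, we can sort the new column in descending order, or use idxmax to get just the top entry as a (city, job) tuple:

# combinations with the highest share of income donated first
city_job_data.sort_values(by='rel_donations', ascending=False).head()

# the single most generous combination
city_job_data['rel_donations'].idxmax()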
Maybe you now want to know in which city the most generous people live. Let's look at the full data set again:
data
|     | city      | job        | income | donations |
|-----|-----------|------------|--------|-----------|
| 0   | Karlsruhe | Politician | 244400 | 2512      |
| 1   | Freiburg  | Student    | 16800  | 336       |
| 2   | Hamburg   | Engineer   | 116900 | 1479      |
| 3   | Köln      | Musician   | 57700  | 1142      |
| 4   | Stuttgart | Engineer   | 109300 | 1592      |
| ... | ...       | ...        | ...    | ...       |
| 95  | Stuttgart | Manager    | 364300 | 1487      |
| 96  | Stuttgart | Student    | 12800  | 256       |
| 97  | Köln      | Engineer   | 119300 | 1308      |
| 98  | Karlsruhe | Politician | 295600 | 3364      |
| 99  | Freiburg  | Engineer   | 100700 | 1288      |

100 rows × 4 columns
cities_donations = data[['city', 'income', 'donations']].groupby(['city']).sum()
cities_donations['relative'] = cities_donations['donations'] * 100 / cities_donations['income']
cities_donations.sort_values(by='relative')
| city      | income  | donations | relative |
|-----------|---------|-----------|----------|
| Karlsruhe | 3457600 | 22265     | 0.643944 |
| Konstanz  | 4165500 | 29386     | 0.705462 |
| Köln      | 2369500 | 21158     | 0.892931 |
| Stuttgart | 3522900 | 31891     | 0.905249 |
| Berlin    | 2663700 | 24393     | 0.915756 |
| Freiburg  | 3632700 | 33984     | 0.935503 |
| Hamburg   | 1057800 | 12554     | 1.186803 |
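Since the table is sorted in ascending order, the last row answers the question; programmatically, idxmax gives it directly. In our fake data this is Hamburg:

# the city with the highest share of income donated
cities_donations['relative'].idxmax()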