30. Pandas Groupby Example
By Bernd Klein. Last modified: 21 Feb 2024.
In this part of our Python course we would like to give you a detailed example of how to use groupby. We will use the data file donations.txt.
The first few lines of this data file look like this:
firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142
Heinz-Joachim,Wagner,Stuttgart,Engineer,109300,1592
The data was artificially created with the Faker module, so any possible correspondence with real, existing persons is purely coincidental and not intended. Each line contains the first name and the surname of a fictitious person, plus the person's city, the job position (one of politician, student, engineer, musician and manager), the yearly income and the sum of donations per year.
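The original generator script is not part of this page, but the following is a minimal sketch of how such a file could be produced with Faker. The output file name donations_fake.txt, the city list, the income range and the donation rate are our own assumptions for illustration:

import csv
import random
from faker import Faker

fake = Faker('de_DE')    # German locale for first names and surnames
jobs = ['Politician', 'Student', 'Engineer', 'Musician', 'Manager']
cities = ['Berlin', 'Freiburg', 'Hamburg', 'Karlsruhe', 'Konstanz', 'Köln', 'Stuttgart']

with open('donations_fake.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['firstname', 'surname', 'city', 'job', 'income', 'donations'])
    for _ in range(100):
        income = random.randrange(10000, 400000, 100)            # assumed income range
        donations = round(income * random.uniform(0.005, 0.02))  # assumed 0.5 % to 2 % donation rate
        writer.writerow([fake.first_name(), fake.last_name(),
                         random.choice(cities), random.choice(jobs),
                         income, donations])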
We first read in the data file, which is in CSV format despite its .txt extension:
import pandas as pd
fname = '../data/donations.txt'
data = pd.read_csv(fname, usecols=[2, 3, 4, 5])
data[:10]
|   | city      | job        | income | donations |
|---|-----------|------------|--------|-----------|
| 0 | Karlsruhe | Politician | 244400 | 2512      |
| 1 | Freiburg  | Student    | 16800  | 336       |
| 2 | Hamburg   | Engineer   | 116900 | 1479      |
| 3 | Köln      | Musician   | 57700  | 1142      |
| 4 | Stuttgart | Engineer   | 109300 | 1592      |
| 5 | Hamburg   | Student    | 12500  | 250       |
| 6 | Freiburg  | Engineer   | 128700 | 1984      |
| 7 | Stuttgart | Politician | 161300 | 822       |
| 8 | Freiburg  | Engineer   | 129000 | 2159      |
| 9 | Stuttgart | Musician   | 108800 | 1516      |
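A side note on read_csv: instead of column positions, usecols also accepts column labels, so the same selection can be written by name:

data = pd.read_csv(fname, usecols=['city', 'job', 'income', 'donations'])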
First, let's look at how much each occupational group earned and donated in total. We can do this by using the Pandas groupby method together with sum:
data_sum = data[['job', 'income', 'donations']].groupby(['job']).sum()
data_sum
| job        | income   | donations |
|------------|----------|-----------|
| Engineer   | 2067200  | 25564     |
| Manager    | 12862600 | 87475     |
| Musician   | 1448700  | 24376     |
| Politician | 4118300  | 30758     |
| Student    | 372900   | 7458      |
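If you want the aggregated columns to carry more descriptive names right away, a possible alternative (assuming pandas 0.25 or newer) is named aggregation with agg; the variable data_sum2 is only introduced for this example:

data_sum2 = data.groupby('job').agg(total_income=('income', 'sum'),
                                    total_donations=('donations', 'sum'))
data_sum2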
By using sort_values on the donations column, we can sort this DataFrame:
data_sum.sort_values(by='donations')
| job        | income   | donations |
|------------|----------|-----------|
| Student    | 372900   | 7458      |
| Musician   | 1448700  | 24376     |
| Engineer   | 2067200  | 25564     |
| Politician | 4118300  | 30758     |
| Manager    | 12862600 | 87475     |
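If you prefer the biggest donors at the top, sort in descending order instead:

data_sum.sort_values(by='donations', ascending=False)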
Let's look at the following three people from our data file:
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Who is the most generous of them? The politician Janett Schwital, one might say, because 2512 euros is nearly eight times as much as the student Daniele Segebahn has donated and nearly twice as much as the engineer Kirstin Klapp has donated.
Most people, however, would argue that we have to look at the donations in relation to the incomes. Then we get the following picture:
The politician donated about 1 % of her income, the engineer about 1.3 % and the student 2 %. So we could say that Daniele is the most generous one in relation to her income!
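A quick check of these percentages with the values from the three lines above:

for name, income, donations in [('Janett (Politician)', 244400, 2512),
                                ('Daniele (Student)', 16800, 336),
                                ('Kirstin (Engineer)', 116900, 1479)]:
    print(f'{name}: {donations * 100 / income:.2f} % of the income donated')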
Let's do the same thing on the previous DataFrame:
data_sum['relative'] = data_sum.donations * 100 / data_sum.income
data_sum.sort_values(by='relative')
| job        | income   | donations | relative |
|------------|----------|-----------|----------|
| Manager    | 12862600 | 87475     | 0.680072 |
| Politician | 4118300  | 30758     | 0.746862 |
| Engineer   | 2067200  | 25564     | 1.236649 |
| Musician   | 1448700  | 24376     | 1.682612 |
| Student    | 372900   | 7458      | 2.000000 |
So we see that the managers and the politicians are very stingy and the students are extremely generous. We are not sure whether we can generalize this by saying: the less people have, the more generous they are. You shouldn't forget that the whole data set is fake!
Now we want to count the number of people in each job group for each city. Using count together with groupby in the following way is helpful but not nice: the numbers in the columns 'income' and 'donations' are the counts we want, but the column names are misleading and we don't need the same result twice:
x = data.groupby(['city', 'job']).count()
x
| city      | job        | income | donations |
|-----------|------------|--------|-----------|
| Berlin    | Engineer   | 3      | 3         |
|           | Manager    | 3      | 3         |
|           | Musician   | 2      | 2         |
|           | Politician | 1      | 1         |
|           | Student    | 2      | 2         |
| Freiburg  | Engineer   | 3      | 3         |
|           | Manager    | 5      | 5         |
|           | Musician   | 3      | 3         |
|           | Politician | 3      | 3         |
|           | Student    | 5      | 5         |
| Hamburg   | Engineer   | 4      | 4         |
|           | Musician   | 4      | 4         |
|           | Politician | 1      | 1         |
|           | Student    | 2      | 2         |
| Karlsruhe | Engineer   | 1      | 1         |
|           | Manager    | 3      | 3         |
|           | Politician | 7      | 7         |
|           | Student    | 2      | 2         |
| Konstanz  | Engineer   | 2      | 2         |
|           | Manager    | 5      | 5         |
|           | Musician   | 3      | 3         |
|           | Politician | 4      | 4         |
|           | Student    | 3      | 3         |
| Köln      | Engineer   | 1      | 1         |
|           | Manager    | 3      | 3         |
|           | Musician   | 2      | 2         |
|           | Politician | 1      | 1         |
|           | Student    | 3      | 3         |
| Stuttgart | Engineer   | 4      | 4         |
|           | Manager    | 4      | 4         |
|           | Musician   | 4      | 4         |
|           | Politician | 3      | 3         |
|           | Student    | 4      | 4         |
We could do the following to get a nicer result, but this approach is clumsy:
people_job_city = data.groupby(['city', 'job']).count()
people_job_city.drop(columns=['income'], inplace=True)
people_job_city.rename(columns={'donations': 'number of people'})
| city      | job        | number of people |
|-----------|------------|------------------|
| Berlin    | Engineer   | 3                |
|           | Manager    | 3                |
|           | Musician   | 2                |
|           | Politician | 1                |
|           | Student    | 2                |
| Freiburg  | Engineer   | 3                |
|           | Manager    | 5                |
|           | Musician   | 3                |
|           | Politician | 3                |
|           | Student    | 5                |
| Hamburg   | Engineer   | 4                |
|           | Musician   | 4                |
|           | Politician | 1                |
|           | Student    | 2                |
| Karlsruhe | Engineer   | 1                |
|           | Manager    | 3                |
|           | Politician | 7                |
|           | Student    | 2                |
| Konstanz  | Engineer   | 2                |
|           | Manager    | 5                |
|           | Musician   | 3                |
|           | Politician | 4                |
|           | Student    | 3                |
| Köln      | Engineer   | 1                |
|           | Manager    | 3                |
|           | Musician   | 2                |
|           | Politician | 1                |
|           | Student    | 3                |
| Stuttgart | Engineer   | 4                |
|           | Manager    | 4                |
|           | Musician   | 4                |
|           | Politician | 3                |
|           | Student    | 4                |
There is a better way to solve this with the groupby method of Pandas. In the following solution, groupby groups the DataFrame by the specified columns (in this case, "city" and "job"), and the size method returns the number of rows in each group, i.e. the count for each combination of city and job. Note that the result is a Series object, not a DataFrame:
result = data[['city', 'job']].groupby(['city', 'job']).size()
result
OUTPUT:
city       job
Berlin     Engineer      3
           Manager       3
           Musician      2
           Politician    1
           Student       2
Freiburg   Engineer      3
           Manager       5
           Musician      3
           Politician    3
           Student       5
Hamburg    Engineer      4
           Musician      4
           Politician    1
           Student       2
Karlsruhe  Engineer      1
           Manager       3
           Politician    7
           Student       2
Konstanz   Engineer      2
           Manager       5
           Musician      3
           Politician    4
           Student       3
Köln       Engineer      1
           Manager       3
           Musician      2
           Politician    1
           Student       3
Stuttgart  Engineer      4
           Manager       4
           Musician      4
           Politician    3
           Student       4
dtype: int64
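If you would rather have a DataFrame with a descriptive column name than a plain Series, the result can be converted, for example with reset_index or unstack; the column name 'number of people' is our choice:

# Series -> DataFrame with a named count column
result.reset_index(name='number of people')

# alternatively: the counts as a city x job table
result.unstack(fill_value=0)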
More interesting than the previous result are the total donations, and especially the donations in relation to the income, for each job and city:
city_job_data = data.groupby(['city', 'job']).sum()
city_job_data[:12] # the first 12 lines
| city     | job        | income  | donations |
|----------|------------|---------|-----------|
| Berlin   | Engineer   | 334300  | 3732      |
|          | Manager    | 1916000 | 16290     |
|          | Musician   | 135500  | 2644      |
|          | Politician | 246700  | 1103      |
|          | Student    | 31200   | 624       |
| Freiburg | Engineer   | 358400  | 5431      |
|          | Manager    | 2423700 | 17811     |
|          | Musician   | 273800  | 4640      |
|          | Politician | 489300  | 4352      |
|          | Student    | 87500   | 1750      |
| Hamburg  | Engineer   | 448400  | 5390      |
|          | Musician   | 358800  | 5359      |
Again, we prefer to see the donations in relation to the income:
city_job_data['rel_donations'] = (city_job_data['donations'] * 100 / city_job_data['income']).round(2)
city_job_data.drop(['income', 'donations'], axis=1, inplace=True)
city_job_data[:12]
| city     | job        | rel_donations |
|----------|------------|---------------|
| Berlin   | Engineer   | 1.12          |
|          | Manager    | 0.85          |
|          | Musician   | 1.95          |
|          | Politician | 0.45          |
|          | Student    | 2.00          |
| Freiburg | Engineer   | 1.52          |
|          | Manager    | 0.73          |
|          | Musician   | 1.69          |
|          | Politician | 0.89          |
|          | Student    | 2.00          |
| Hamburg  | Engineer   | 1.20          |
|          | Musician   | 1.49          |
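To see directly which city-job combinations donate the largest share of their income, we can sort the new column in descending order, or use idxmax to get just the top entry as a (city, job) tuple:

# combinations with the highest share of income donated first
city_job_data.sort_values(by='rel_donations', ascending=False).head()

# the single most generous combination
city_job_data['rel_donations'].idxmax()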
Maybe you now want to know in which city the most generous people live. Let's look at the full data set again:
data
|     | city      | job        | income | donations |
|-----|-----------|------------|--------|-----------|
| 0   | Karlsruhe | Politician | 244400 | 2512      |
| 1   | Freiburg  | Student    | 16800  | 336       |
| 2   | Hamburg   | Engineer   | 116900 | 1479      |
| 3   | Köln      | Musician   | 57700  | 1142      |
| 4   | Stuttgart | Engineer   | 109300 | 1592      |
| ... | ...       | ...        | ...    | ...       |
| 95  | Stuttgart | Manager    | 364300 | 1487      |
| 96  | Stuttgart | Student    | 12800  | 256       |
| 97  | Köln      | Engineer   | 119300 | 1308      |
| 98  | Karlsruhe | Politician | 295600 | 3364      |
| 99  | Freiburg  | Engineer   | 100700 | 1288      |

100 rows × 4 columns
cities_donations = data[['city', 'income', 'donations']].groupby(['city']).sum()
cities_donations['relative'] = cities_donations['donations'] * 100 / cities_donations['income']
cities_donations.sort_values(by='relative')
| city      | income  | donations | relative |
|-----------|---------|-----------|----------|
| Karlsruhe | 3457600 | 22265     | 0.643944 |
| Konstanz  | 4165500 | 29386     | 0.705462 |
| Köln      | 2369500 | 21158     | 0.892931 |
| Stuttgart | 3522900 | 31891     | 0.905249 |
| Berlin    | 2663700 | 24393     | 0.915756 |
| Freiburg  | 3632700 | 33984     | 0.935503 |
| Hamburg   | 1057800 | 12554     | 1.186803 |
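Since the table is sorted in ascending order, the last row answers the question; programmatically, idxmax gives it directly. In our fake data this is Hamburg:

# the city with the highest share of income donated
cities_donations['relative'].idxmax()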