3. Numpy Data Objects, dtype

By Bernd Klein. Last modified: 24 Mar 2022.

dtype

The data type object 'dtype' is an instance of the numpy.dtype class. It can be created with numpy.dtype().

So far, our examples of numpy arrays have used only fundamental numeric data types like 'int' and 'float'. These numpy arrays contained solely homogeneous data. dtype objects can also be constructed from combinations of fundamental data types. With the aid of dtype we are able to create "Structured Arrays", often also called "Record Arrays". Structured arrays give us the ability to have a different data type per column. This is similar to the structure of Excel or CSV documents. It makes it possible to define data like the one in the following table with dtype:

Country	Population Density	Area	Population
Netherlands	393	41526	16,928,800
Belgium	337	30510	11,007,020
United Kingdom	256	243610	62,262,000
Germany	233	357021	81,799,600
Liechtenstein	205	160	32,842
Italy	192	301230	59,715,625
Switzerland	177	41290	7,301,994
Luxembourg	173	2586	512,000
France	111	547030	63,601,002
Austria	97	83858	8,169,929
Greece	81	131940	11,606,813
Ireland	65	70280	4,581,269
Sweden	20	449964	9,515,744
Finland	16	338424	5,410,233
Norway	13	385252	5,033,675

Before we start with a complex data structure like the previous table, we want to introduce dtype with a very simple example. We define an int16 data type and call this type i16. (We have to admit that this is not a nice name, but we use it only here!) The elements of the list 'lst' are converted to i16 values to create the two-dimensional array A. Note that the conversion does not round: the fractional part is simply cut off, so that 9.9 becomes 9 and -0.7 becomes 0.

import numpy as np

i16 = np.dtype(np.int16)
print(i16)

lst = [ [3.4, 8.7, 9.9], 
        [1.1, -7.8, -0.7],
        [4.1, 12.3, 4.8] ]

A = np.array(lst, dtype=i16)

print(A)

OUTPUT:

int16
[[ 3  8  9]
 [ 1 -7  0]
 [ 4 12  4]]

We introduced a new name for a basic data type in the previous example. This has nothing to do with the structured arrays, which we mentioned in the introduction of this chapter of our dtype tutorial.
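
The conversion to int16 in the example above cuts off the fractional part instead of rounding. The following lines are a small sketch of this behaviour. The names values, truncated and rounded are our own and are not used elsewhere in this chapter. If rounding is wanted, one possible approach is to round first with np.rint and to convert afterwards:

```python
import numpy as np

values = np.array([3.4, -7.8, -0.7, 12.3])

# converting floats to int16 drops the fractional part
# (truncation towards zero)
truncated = values.astype(np.int16)
print(truncated)

# rounding to the nearest integer first and converting afterwards
rounded = np.rint(values).astype(np.int16)
print(rounded)
```

With truncation -7.8 becomes -7, whereas rounding first turns it into -8.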

Structured Arrays

ndarrays are homogeneous data objects, i.e. all elements of an array have to be of the same data type. The data type dtype, on the other hand, allows us to define separate data types for each column.

Now we will take the first step towards implementing the table with European countries and the information on population, area and population density. We create a structured array with the 'density' column. The data type is defined as np.dtype([('density', np.int32)]). We assign this data type to the variable 'dt' for the sake of convenience. We use this data type in the array definition, in which we use the first three densities.

import numpy as np

dt = np.dtype([('density', np.int32)])

x = np.array([(393,), (337,), (256,)],
             dtype=dt)

print(x)

print("\nThe internal representation:")
print(repr(x))

OUTPUT:

[(393,) (337,) (256,)]

The internal representation:
array([(393,), (337,), (256,)], dtype=[('density', '<i4')])

We can access the content of the density column by indexing x with the key 'density'. It looks like accessing a dictionary in Python:

print(x['density'])

OUTPUT:

[393 337 256]
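
Accessing a column by its name does not copy the data. The following sketch, which reuses the array x from above, suggests that the result is a view on the structured array, so that changes made through it show up in x as well. The name densities is our own choice for this illustration:

```python
import numpy as np

dt = np.dtype([('density', np.int32)])
x = np.array([(393,), (337,), (256,)], dtype=dt)

# the field names of the data type are available as a tuple
print(dt.names)

# the column is an ordinary one-dimensional int32 array ...
densities = x['density']
print(densities.dtype, densities.shape)

# ... which shares its memory with x
densities[0] = 400
print(x)
```

If an independent column is needed, x['density'].copy() can be used instead.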

You may wonder why we used 'np.int32' in our definition while the internal representation shows '<i4'. In the dtype definition we can use the type directly (e.g. np.int32) or we can use a string (e.g. 'i4'). So we could have defined our dtype like this as well:

dt = np.dtype([('density', 'i4')])
x = np.array([(393,), (337,), (256,)],
             dtype=dt)
print(x)

OUTPUT:

[(393,) (337,) (256,)]
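
To see that the type object and the string code really describe the same data type, we can compare the resulting dtype objects. The following lines are only a small sketch; the selection of string codes in the loop is our own:

```python
import numpy as np

# the type object and the string code describe the same data type
print(np.dtype(np.int32) == np.dtype('i4'))
print(np.dtype(np.float64) == np.dtype('f8'))

# the number in the string code is the size in bytes
for code in ['i1', 'i2', 'i4', 'i8', 'f4', 'f8']:
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)
```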

The 'i' means integer and the 4 means 4 bytes. What about the less-than sign in front of i4 in the result? We could have written '<i4' in our definition as well. We can prefix a type with the '<' and '>' sign. '<' means that the encoding will be little-endian and '>' means that the encoding will be big-endian. No prefix means that we get the native byte ordering. We demonstrate this in the following by defining a double-precision floating-point number in various orderings:

# little-endian ordering
dt = np.dtype('<d')
print(dt.name, dt.byteorder, dt.itemsize)

# big-endian ordering
dt = np.dtype('>d')  
print(dt.name, dt.byteorder, dt.itemsize)

# native byte ordering
dt = np.dtype('d') 
print(dt.name, dt.byteorder, dt.itemsize)

OUTPUT:

float64 = 8
float64 > 8
float64 = 8

The equal character '=' stands for 'native byte ordering', which is determined by the hardware of the computer and not by the operating system. In our case this means 'little-endian', because our Linux computer runs on a little-endian processor.
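
What the byte ordering means in memory can be made visible with the method tobytes, which returns the raw bytes of an array. The following sketch stores the number 1 as a 4-byte integer in both orderings. The names little and big are our own:

```python
import sys
import numpy as np

little = np.array([1], dtype='<i4')
big = np.array([1], dtype='>i4')

# little-endian: the least significant byte comes first
print(little.tobytes())

# big-endian: the most significant byte comes first
print(big.tobytes())

# the stored value is the same in both cases
print(little[0] == big[0])

# the native byte ordering of the machine running this code
print(sys.byteorder)
```

The byte ordering only affects the layout in memory. Calculations with the two arrays give the same results.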

Another thing in our density array might be confusing. We defined the array with a list containing one-tuples. So you may ask yourself whether it is possible to use tuples and lists interchangeably. This is not possible. The tuples are used to define the records, in our case consisting solely of a density, and the list is the 'container' for the records; in other words, the lists are recursed upon. The tuples define the atomic elements of the structure and the lists the dimensions.
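
The different roles of tuples and lists can be sketched as follows. We assume that the rows come from some other source as lists, so they are turned into tuples before the array is created. The names rows, records and grid are hypothetical and only used for this illustration:

```python
import numpy as np

dt = np.dtype([('density', 'i4'), ('area', 'i4')])

# rows as they might come from another source, e.g. as lists
rows = [[393, 41526], [337, 30510]]

# each record has to be a tuple, so we convert the inner lists
records = np.array([tuple(row) for row in rows], dtype=dt)
print(records)
print(records.shape)

# nesting the lists adds dimensions, the tuples remain the records
grid = np.array([[(393, 41526), (337, 30510)],
                 [(256, 243610), (233, 357021)]], dtype=dt)
print(grid.shape)
print(grid['area'])
```

The array records is one-dimensional with two records, whereas grid is a two-dimensional array of records.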

Now we will add the country name, the area and the population number to our data type:

dt = np.dtype([('country', 'S20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])
population_table = np.array([
    ('Netherlands', 393, 41526, 16928800),
    ('Belgium', 337, 30510, 11007020),
    ('United Kingdom', 256, 243610, 62262000),
    ('Germany', 233, 357021, 81799600),
    ('Liechtenstein', 205, 160, 32842),
    ('Italy', 192, 301230, 59715625),
    ('Switzerland', 177, 41290, 7301994),
    ('Luxembourg', 173, 2586, 512000),
    ('France', 111, 547030, 63601002),
    ('Austria', 97, 83858, 8169929),
    ('Greece', 81, 131940, 11606813),
    ('Ireland', 65, 70280, 4581269),
    ('Sweden', 20, 449964, 9515744),
    ('Finland', 16, 338424, 5410233),
    ('Norway', 13, 385252, 5033675)],
    dtype=dt)
print(population_table[:4])

OUTPUT:

[(b'Netherlands', 393,  41526, 16928800)
 (b'Belgium', 337,  30510, 11007020)
 (b'United Kingdom', 256, 243610, 62262000)
 (b'Germany', 233, 357021, 81799600)]

We can access every column individually:

print(population_table['density'])
print(population_table['country'])
print(population_table['area'][2:5])

OUTPUT:

[393 337 256 233 205 192 177 173 111  97  81  65  20  16  13]
[b'Netherlands' b'Belgium' b'United Kingdom' b'Germany' b'Liechtenstein'
 b'Italy' b'Switzerland' b'Luxembourg' b'France' b'Austria' b'Greece'
 b'Ireland' b'Sweden' b'Finland' b'Norway']
[243610 357021    160]
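
Because every column is an ordinary numpy array, the columns can be used for selecting and sorting records. The following lines are a sketch with four rows of our table. The names table, dense and by_area are our own. np.sort with the parameter order is one possible way to sort by a field:

```python
import numpy as np

dt = np.dtype([('country', 'S20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
table = np.array([
    ('Netherlands', 393, 41526, 16928800),
    ('Belgium', 337, 30510, 11007020),
    ('Germany', 233, 357021, 81799600),
    ('Norway', 13, 385252, 5033675)],
    dtype=dt)

# a boolean mask built from one column selects whole records
dense = table[table['density'] > 200]
print(dense['country'])

# sorting the records by the values of the 'area' field
by_area = np.sort(table, order='area')
print(by_area['country'])

# a column can be aggregated like any other array
total_population = table['population'].sum()
print(total_population)
```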

Unicode Strings in Array

Some may have noticed that the strings in our previous array are prefixed with a lower case "b". This means that we have created byte strings with the definition "('country', 'S20')". To get Unicode strings we replace this with the definition "('country', 'U20')", where 'U' stands for Unicode and 20 is the maximum number of characters. We will redefine our population table now:

dt = np.dtype([('country', 'U20'), 
               ('density', 'i4'), 
               ('area', 'i4'), 
               ('population', 'i4')])
population_table = np.array([
    ('Netherlands', 393, 41526, 16928800),
    ('Belgium', 337, 30510, 11007020),
    ('United Kingdom', 256, 243610, 62262000),
    ('Germany', 233, 357021, 81799600),
    ('Liechtenstein', 205, 160, 32842),
    ('Italy', 192, 301230, 59715625),
    ('Switzerland', 177, 41290, 7301994),
    ('Luxembourg', 173, 2586, 512000),
    ('France', 111, 547030, 63601002),
    ('Austria', 97, 83858, 8169929),
    ('Greece', 81, 131940, 11606813),
    ('Ireland', 65, 70280, 4581269),
    ('Sweden', 20, 449964, 9515744),
    ('Finland', 16, 338424, 5410233),
    ('Norway', 13, 385252, 5033675)],
    dtype=dt)
print(population_table[:4])

OUTPUT:

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
 ('United Kingdom', 256, 243610, 62262000)
 ('Germany', 233, 357021, 81799600)]
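
The two string types differ in more than the prefix of the output. The following sketch compares a byte string type ('S20') with a Unicode string type ('U20') of the same length. It also shows that a string which is longer than the declared length is cut off without a warning. All names in this example are our own:

```python
import numpy as np

as_bytes = np.array(['Liechtenstein'], dtype='S20')
as_unicode = np.array(['Liechtenstein'], dtype='U20')

# 'S20' reserves 1 byte per character, 'U20' reserves 4 bytes
print(as_bytes.dtype.itemsize)
print(as_unicode.dtype.itemsize)

# a byte string can be decoded into an ordinary string
decoded = as_bytes[0].decode('utf-8')
print(decoded == as_unicode[0])

# strings longer than the declared length are truncated
too_short = np.array(['Liechtenstein'], dtype='U5')
print(too_short[0])
```

It is therefore advisable to choose the length generously enough for the longest expected entry.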

Input and Output of Structured Arrays

In most applications it will be necessary to save the data from a program to a file. We will write our previously created array to a file with the function savetxt. You will find a detailed introduction to this topic in our chapter Reading and Writing Data Files.

np.savetxt("population_table.csv",
           population_table,
           fmt="%s;%d;%d;%d",           
           delimiter=";")

It is highly probable that you will need to read in the previously written file at a later date. This can be achieved with the function genfromtxt.

dt = np.dtype([('country', 'U20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])

x = np.genfromtxt("population_table.csv",
               dtype=dt,
               delimiter=";")

print(x)

OUTPUT:

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
 ('United Kingdom', 256, 243610, 62262000)
 ('Germany', 233, 357021, 81799600)
 ('Liechtenstein', 205,    160,    32842) ('Italy', 192, 301230, 59715625)
 ('Switzerland', 177,  41290,  7301994)
 ('Luxembourg', 173,   2586,   512000) ('France', 111, 547030, 63601002)
 ('Austria',  97,  83858,  8169929) ('Greece',  81, 131940, 11606813)
 ('Ireland',  65,  70280,  4581269) ('Sweden',  20, 449964,  9515744)
 ('Finland',  16, 338424,  5410233) ('Norway',  13, 385252,  5033675)]
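
Writing and reading can be checked together in a small round trip. The following sketch writes two records to a temporary file, so that no file of the working directory is touched, and reads them back in again. The use of tempfile, the explicit encoding="utf-8" and the names table, lines and loaded are our own assumptions for this illustration:

```python
import os
import tempfile
import numpy as np

dt = np.dtype([('country', 'U20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
table = np.array([
    ('Netherlands', 393, 41526, 16928800),
    ('United Kingdom', 256, 243610, 62262000)],
    dtype=dt)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "population_table.csv")

    # one format specifier per field, separated by semicolons
    np.savetxt(path, table, fmt="%s;%d;%d;%d")

    # the file contains one line per record
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

    # reading the file back into a structured array
    loaded = np.genfromtxt(path, dtype=dt, delimiter=";",
                           encoding="utf-8")

print(lines)
print(loaded.tolist() == table.tolist())
```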

There is also a function "loadtxt", but it can be more difficult to use, because depending on the NumPy version it hands the strings over as byte strings!

To overcome this problem, we can use loadtxt with a converter function for the first column. The converter decodes the value only if it really is a byte string, so that it also works with versions which already deliver ordinary strings.

dt = np.dtype([('country', 'U20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])

x = np.loadtxt("population_table.csv",
               dtype=dt,
               converters={0: lambda s: s.decode('utf-8') if isinstance(s, bytes) else s},
               delimiter=";")

print(x)

OUTPUT:

[('Netherlands', 393,  41526, 16928800) ('Belgium', 337,  30510, 11007020)
 ('United Kingdom', 256, 243610, 62262000)
 ('Germany', 233, 357021, 81799600)
 ('Liechtenstein', 205,    160,    32842) ('Italy', 192, 301230, 59715625)
 ('Switzerland', 177,  41290,  7301994)
 ('Luxembourg', 173,   2586,   512000) ('France', 111, 547030, 63601002)
 ('Austria',  97,  83858,  8169929) ('Greece',  81, 131940, 11606813)
 ('Ireland',  65,  70280,  4581269) ('Sweden',  20, 449964,  9515744)
 ('Finland',  16, 338424,  5410233) ('Norway',  13, 385252,  5033675)]

Exercises:

Before you go on, you may want to take the time to do some exercises to deepen your understanding of what you have learned so far.

Exercise:

Define a structured array with two columns. The first column contains the product ID, which can be defined as an int32. The second column shall contain the price for the product. How can you print out the column with the product IDs, the first row and the price for the third article of this structured array?

Exercise:

Figure out a data type definition for time records with entries for hours, minutes and seconds.

Solutions:

Solution to the first exercise:

import numpy as np

mytype = [('productID', np.int32), ('price', np.float64)]

stock = np.array([(34765, 603.76), 
                  (45765, 439.93),
                  (99661, 344.19),
                  (12129, 129.39)], dtype=mytype)

# the first row has the index 0
print(stock[0])
print(stock["productID"])
print(stock[2]["price"])
print(stock)

OUTPUT:

(34765, 603.76)
[34765 45765 99661 12129]
344.19
[(34765, 603.76) (45765, 439.93) (99661, 344.19) (12129, 129.39)]

Solution to the second exercise:

A clock

time_type = np.dtype( [('h', int), ('min', int), ('sec', int)])

times = np.array([(11, 38, 5), 
                  (14, 56, 0),
                  (3, 9, 1)], dtype=time_type)
print(times)
print(times[0])
# reset the first time record:
times[0] = (11, 42, 17)
print(times[0])

OUTPUT:

[(11, 38, 5) (14, 56, 0) ( 3,  9, 1)]
(11, 38, 5)
(11, 42, 17)
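
Since the fields are ordinary integer arrays, we can calculate with them. As a small sketch, the following lines convert each time record into the number of seconds since midnight and pick the latest time. The names seconds and latest are our own:

```python
import numpy as np

time_type = np.dtype([('h', int), ('min', int), ('sec', int)])
times = np.array([(11, 38, 5),
                  (14, 56, 0),
                  (3, 9, 1)], dtype=time_type)

# seconds since midnight, calculated column-wise
seconds = times['h'] * 3600 + times['min'] * 60 + times['sec']
print(seconds)

# the record with the largest number of seconds is the latest time
latest = times[np.argmax(seconds)]
print(latest)
```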

A more Complex Example:

We will increase the complexity of our previous example by adding temperatures to the records.

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])

times = np.array( [((11, 42, 17), 20.8), ((13, 19, 3), 23.2) ], dtype=time_type)
print(times)
print(times['time'])
print(times['time']['h'])
print(times['temperature'])

OUTPUT:

[((11, 42, 17), 20.8) ((13, 19,  3), 23.2)]
[(11, 42, 17) (13, 19,  3)]
[11 13]
[20.8 23.2]
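
The nested data type can be inspected and used like the flat ones. The following sketch prints the field names of both levels, selects records by the temperature and changes a nested field in place. The name warm and the threshold of 21.0 degrees are our own choices for this illustration:

```python
import numpy as np

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])
times = np.array([((11, 42, 17), 20.8), ((13, 19, 3), 23.2)],
                 dtype=time_type)

# the field names of the outer and the inner level
print(time_type.names)
print(time_type['time'].names)

# a boolean mask on an outer field selects whole records
warm = times[times['temperature'] > 21.0]
print(warm['time'])

# field access returns views, so a nested field can be changed in place
times['time']['sec'][0] = 0
print(times[0])
```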

Exercise

This exercise should be closer to real-life examples. Usually, we have to create or get the data for our structured array from some database or file. We will use the list which we created in our chapter on file I/O, File Management. The list has been saved with the aid of pickle.dump in the file cities_and_times.pkl.

So the first task consists in unpickling our data:

import pickle

# open the file in binary mode; the with statement closes it again
with open("../data/cities_and_times.pkl", "rb") as fh:
    cities_and_times = pickle.load(fh)
print(cities_and_times[:30])

OUTPUT:

[('Amsterdam', 'Sun', (8, 52)), ('Anchorage', 'Sat', (23, 52)), ('Ankara', 'Sun', (10, 52)), ('Athens', 'Sun', (9, 52)), ('Atlanta', 'Sun', (2, 52)), ('Auckland', 'Sun', (20, 52)), ('Barcelona', 'Sun', (8, 52)), ('Beirut', 'Sun', (9, 52)), ('Berlin', 'Sun', (8, 52)), ('Boston', 'Sun', (2, 52)), ('Brasilia', 'Sun', (5, 52)), ('Brussels', 'Sun', (8, 52)), ('Bucharest', 'Sun', (9, 52)), ('Budapest', 'Sun', (8, 52)), ('Cairo', 'Sun', (9, 52)), ('Calgary', 'Sun', (1, 52)), ('Cape Town', 'Sun', (9, 52)), ('Casablanca', 'Sun', (7, 52)), ('Chicago', 'Sun', (1, 52)), ('Columbus', 'Sun', (2, 52)), ('Copenhagen', 'Sun', (8, 52)), ('Dallas', 'Sun', (1, 52)), ('Denver', 'Sun', (1, 52)), ('Detroit', 'Sun', (2, 52)), ('Dubai', 'Sun', (11, 52)), ('Dublin', 'Sun', (7, 52)), ('Edmonton', 'Sun', (1, 52)), ('Frankfurt', 'Sun', (8, 52)), ('Halifax', 'Sun', (3, 52)), ('Helsinki', 'Sun', (9, 52))]

Turning our data into a structured array:

time_type = np.dtype([('city', 'U30'), ('day', 'U3'), ('time', [('h', int), ('min', int)])])

times = np.array( cities_and_times , dtype=time_type)
print(times['time'])
print(times['city'])
x = times[27]
x[0]

OUTPUT:

[( 8, 52) (23, 52) (10, 52) ( 9, 52) ( 2, 52) (20, 52) ( 8, 52) ( 9, 52)
 ( 8, 52) ( 2, 52) ( 5, 52) ( 8, 52) ( 9, 52) ( 8, 52) ( 9, 52) ( 1, 52)
 ( 9, 52) ( 7, 52) ( 1, 52) ( 2, 52) ( 8, 52) ( 1, 52) ( 1, 52) ( 2, 52)
 (11, 52) ( 7, 52) ( 1, 52) ( 8, 52) ( 3, 52) ( 9, 52) ( 1, 52) ( 2, 52)
 (10, 52) ( 9, 52) ( 9, 52) (13, 37) (10, 52) ( 0, 52) ( 7, 52) ( 7, 52)
 ( 0, 52) ( 8, 52) (18, 52) ( 2, 52) ( 1, 52) ( 2, 52) (10, 52) ( 1, 52)
 ( 2, 52) ( 8, 52) ( 2, 52) ( 8, 52) ( 2, 52) ( 0, 52) ( 8, 52) ( 7, 52)
 (10, 52) ( 8, 52) ( 1, 52) ( 0, 52) ( 1, 52) ( 4, 52) ( 0, 52) (15, 52)
 (15, 52) ( 8, 52) (18, 52) ( 5, 52) (16, 52) ( 2, 52) ( 0, 52) ( 8, 52)
 ( 8, 52) ( 2, 52) ( 1, 52) ( 8, 52)]
['Amsterdam' 'Anchorage' 'Ankara' 'Athens' 'Atlanta' 'Auckland'
 'Barcelona' 'Beirut' 'Berlin' 'Boston' 'Brasilia' 'Brussels' 'Bucharest'
 'Budapest' 'Cairo' 'Calgary' 'Cape Town' 'Casablanca' 'Chicago'
 'Columbus' 'Copenhagen' 'Dallas' 'Denver' 'Detroit' 'Dubai' 'Dublin'
 'Edmonton' 'Frankfurt' 'Halifax' 'Helsinki' 'Houston' 'Indianapolis'
 'Istanbul' 'Jerusalem' 'Johannesburg' 'Kathmandu' 'Kuwait City'
 'Las Vegas' 'Lisbon' 'London' 'Los Angeles' 'Madrid' 'Melbourne' 'Miami'
 'Minneapolis' 'Montreal' 'Moscow' 'New Orleans' 'New York' 'Oslo'
 'Ottawa' 'Paris' 'Philadelphia' 'Phoenix' 'Prague' 'Reykjavik' 'Riyadh'
 'Rome' 'Salt Lake City' 'San Francisco' 'San Salvador' 'Santiago'
 'Seattle' 'Shanghai' 'Singapore' 'Stockholm' 'Sydney' 'São Paulo' 'Tokyo'
 'Toronto' 'Vancouver' 'Vienna' 'Warsaw' 'Washington DC' 'Winnipeg'
 'Zurich']
'Frankfurt'
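
Once the data is available as a structured array, we can query it by its fields. The following sketch does not need the pickle file. Instead it uses four of the entries shown in the output above. The names sample, saturday and evening as well as the limit of 18 hours are our own assumptions:

```python
import numpy as np

time_type = np.dtype([('city', 'U30'), ('day', 'U3'),
                      ('time', [('h', int), ('min', int)])])

# a few entries taken from the list printed above
sample = [('Amsterdam', 'Sun', (8, 52)),
          ('Anchorage', 'Sat', (23, 52)),
          ('Atlanta', 'Sun', (2, 52)),
          ('Auckland', 'Sun', (20, 52))]
times = np.array(sample, dtype=time_type)

# cities where it is still Saturday
saturday = times[times['day'] == 'Sat']
print(saturday['city'])

# cities where it is 18:00 or later
evening = times[times['time']['h'] >= 18]
print(evening['city'])
```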
