Structured arrays (aka “Record arrays”)

Structured Arrays (aka Record Arrays)

Introduction

Numpy provides powerful capabilities to create arrays of structs or records. These arrays permit one to manipulate the data by the structs or by fields of the struct. A simple example will show what is meant.:

>>> x = np.zeros((2,),dtype=('i4,f4,a10'))
>>> x[:] = [(1,2.,'Hello'),(2,3.,"World")]
>>> x
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
     dtype=[('f0', '>i4'), ('f1', '>f4'), ('f2', '|S10')])

Here we have created a one-dimensional array of length 2. Each element of this array is a record that contains three items, a 32-bit integer, a 32-bit float, and a string of length 10 or less. If we index this array at the second position we get the second record:

>>> x[1]
(2,3.,"World")

The interesting aspect is that we can reference the different fields of the array simply by indexing the array with the string representing the name of the field. In this case the fields have received the default names of ‘f0’, ‘f1’ and ‘f2’.

>>> y = x['f1']
>>> y
array([ 2.,  3.], dtype=float32)
>>> y[:] = 2*y
>>> y
array([ 4.,  6.], dtype=float32)
>>> x
array([(1, 4.0, 'Hello'), (2, 6.0, 'World')],
      dtype=[('f0', '>i4'), ('f1', '>f4'), ('f2', '|S10')])

In these examples, y is a simple float array consisting of the 2nd field in the record. But it is not a copy of the data in the structured array, instead it is a view. It shares exactly the same data. Thus when we updated this array by doubling its values, the structured array shows the corresponding values as doubled as well. Likewise, if one changes the record, the field view changes:

>>> x[1] = (-1,-1.,"Master")
>>> x
array([(1, 4.0, 'Hello'), (-1, -1.0, 'Master')],
      dtype=[('f0', '>i4'), ('f1', '>f4'), ('f2', '|S10')])
>>> y
array([ 4., -1.], dtype=float32)

Defining Structured Arrays

The definition of a structured array is all done through the dtype object. There are a lot of different ways one can define the fields of a record. Some of variants are there to provide backward compatibility with Numeric or numarray, or another module, and should not be used except for such purposes. These will be so noted. One defines records by specifying the structure by 4 general ways, using an argument (as supplied to a dtype function keyword or a dtype object constructor itself) in the form of a: 1) string, 2) tuple, 3) list, or 4) dictionary. Each of these will be briefly described.

1) String argument (as used in the above examples). In this case, the constructor is expecting a comma separated list of type specifiers, optionally with extra shape information. The type specifiers can take 4 different forms:

a) b1, i1, i2, i4, i8, u1, u2, u4, u8, f4, f8, c8, c16, a<n>
   (representing bytes, ints, unsigned ints, floats, complex and
    fixed length strings of specified byte lengths)
b) int8,...,uint8,...,float32, float64, complex64, complex128
   (this time with bit sizes)
c) older Numeric/numarray type specifications (e.g. Float32).
   Don't use these in new code!
d) Single character type specifiers (e.g H for unsigned short ints).
   Avoid using these unless you must. Details can be found in the
   Numpy book

These different styles can be mixed within the same string (but why would you want to do that?). Furthermore, each type specifier can be prefixed with a repetition number, or a shape. In these cases an array element is created, i.e., an array within a record. That array is still referred to as a single field. An example:

>>> x = np.zeros(3, dtype='3int8, float32, (2,3)float64')
>>> x
array([([0, 0, 0], 0.0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
       ([0, 0, 0], 0.0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
       ([0, 0, 0], 0.0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
      dtype=[('f0', '|i1', 3), ('f1', '>f4'), ('f2', '>f8', (2, 3))])

By using strings to define the record structure, it precludes being able to name the fields in the original definition. The names can be changed as shown later, however.

2) Tuple argument: The only relevant tuple case that applies to record structures is when a structure is mapped to an existing data type. This is done by pairing in a tuple, the existing data type with a matching dtype definition (using any of the variants being described here). As an example (using a definition using a list, so see 3) for further details):

>>> x = zeros(3, dtype=('i4',[('r','u1'), ('g','u1'), ('b','u1'), ('a','u1')]))
>>> x
array([0, 0, 0])
>>> x['r']
array([0, 0, 0], dtype=uint8)

In this case, an array is produced that looks and acts like a simple int32 array, but also has definitions for fields that use only one byte of the int32 (a bit like Fortran equivalencing).

3) List argument: In this case the record structure is defined with a list of tuples. Each tuple has 2 or 3 elements specifying: 1) The name of the field (‘’ is permitted), 2) the type of the field, and 3) the shape (optional). For example:

>>> x = np.zeros(3, dtype=[('x','f4'),('y',np.float32),('value','f4',(2,2))])
>>> x
array([(0.0, 0.0, [[0.0, 0.0], [0.0, 0.0]]),
       (0.0, 0.0, [[0.0, 0.0], [0.0, 0.0]]),
       (0.0, 0.0, [[0.0, 0.0], [0.0, 0.0]])],
      dtype=[('x', '>f4'), ('y', '>f4'), ('value', '>f4', (2, 2))])

4) Dictionary argument: two different forms are permitted. The first consists of a dictionary with two required keys (‘names’ and ‘formats’), each having an equal sized list of values. The format list contains any type/shape specifier allowed in other contexts. The names must be strings. There are two optional keys: ‘offsets’ and ‘titles’. Each must be a correspondingly matching list to the required two where offsets contain integer offsets for each field, and titles are objects containing metadata for each field (these do not have to be strings), where the value of None is permitted. As an example:

>>> x = np.zeros(3, dtype={'names':['col1', 'col2'], 'formats':['i4','f4']})
>>> x
array([(0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('col1', '>i4'), ('col2', '>f4')])

The other dictionary form permitted is a dictionary of name keys with tuple values specifying type, offset, and an optional title.

>>> x = np.zeros(3, dtype={'col1':('i1',0,'title 1'), 'col2':('f4',1,'title 2')})
array([(0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[(('title 1', 'col1'), '|i1'), (('title 2', 'col2'), '>f4')])

Accessing and modifying field names

The field names are an attribute of the dtype object defining the record structure. For the last example:

>>> x.dtype.names
('col1', 'col2')
>>> x.dtype.names = ('x', 'y')
>>> x
array([(0, 0.0), (0, 0.0), (0, 0.0)],
     dtype=[(('title 1', 'x'), '|i1'), (('title 2', 'y'), '>f4')])
>>> x.dtype.names = ('x', 'y', 'z') # wrong number of names
<type 'exceptions.ValueError'>: must replace all names at once with a sequence of length 2

Accessing field titles

The field titles provide a standard place to put associated info for fields. They do not have to be strings.

>>> x.dtype.fields['x'][2]
'title 1'