.. sectionauthor:: Pierre Gerard-Marchant

*********************************************
Importing data with :func:`~numpy.genfromtxt`
*********************************************

NumPy provides several functions to create arrays from tabular data.
We focus here on the :func:`~numpy.genfromtxt` function.

In a nutshell, :func:`~numpy.genfromtxt` runs two main loops. The first
loop converts each line of the file into a sequence of strings. The second
loop converts each string to the appropriate data type. This mechanism is
slower than a single loop, but gives more flexibility. In particular,
:func:`~numpy.genfromtxt` is able to take missing data into account, when
other faster and simpler functions like :func:`~numpy.loadtxt` cannot.

.. note::

   When giving examples, we will use the following conventions:

      >>> import numpy as np
      >>> from StringIO import StringIO



Defining the input
==================

The only mandatory argument of :func:`~numpy.genfromtxt` is the source of
the data. It can be a string corresponding to the name of a local or remote
file, or a file-like object with a :meth:`read` method (such as an actual
file or a :class:`StringIO.StringIO` object). If the argument is the URL of
a remote file, the file is automatically downloaded to the current
directory.

The input file can be a text file or an archive. Currently, the function
recognizes :class:`gzip` and :class:`bz2` (`bzip2`) archives. The type of
the archive is determined by examining the extension of the file: if the
filename ends with ``'.gz'``, a :class:`gzip` archive is expected; if it
ends with ``'bz2'``, a :class:`bzip2` archive is assumed.



Splitting the lines into columns
================================

The :keyword:`delimiter` argument
---------------------------------

Once the file is defined and open for reading, :func:`~numpy.genfromtxt`
splits each non-empty line into a sequence of strings. Empty or commented
lines are just skipped.
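
As a rough illustration of that skipping behaviour, here is a minimal sketch
written for Python 3, where ``StringIO`` is imported from the ``io`` module
rather than from the ``StringIO`` module used in the conventions of this
guide. The sample text and the variable names are made up for the example.

```python
from io import StringIO  # Python 3 location of StringIO (assumption: Python 3)

import numpy as np

# Made-up input: two data rows separated by a blank line and a comment line.
text = "1 2 3\n\n# this line is a comment\n4 5 6\n"

# Default settings: split on white space, '#' starts a comment.
arr = np.genfromtxt(StringIO(text))

# Only the two non-empty, non-comment lines produce rows.
print(arr.shape)
```

The blank line and the comment line contribute no rows, so the resulting
array has one row per remaining data line.
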
The :keyword:`delimiter` keyword is used to define how the splitting should
take place. Quite often, a single character marks the separation between
columns. For example, comma-separated files (CSV) use a comma (``,``) or a
semicolon (``;``) as delimiter.

   >>> data = "1, 2, 3\n4, 5, 6"
   >>> np.genfromtxt(StringIO(data), delimiter=",")
   array([[ 1.,  2.,  3.],
          [ 4.,  5.,  6.]])

Another common separator is ``"\t"``, the tabulation character. However, we
are not limited to a single character: any string will do. By default,
:func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line
is split along white spaces (including tabs) and that consecutive white
spaces are considered as a single white space.

Alternatively, we may be dealing with a fixed-width file, where columns are
defined as a given number of characters. In that case, we need to set
:keyword:`delimiter` to a single integer (if all the columns have the same
size) or to a sequence of integers (if columns can have different sizes).

   >>> data = "  1  2  3\n  4  5 67\n890123  4"
   >>> np.genfromtxt(StringIO(data), delimiter=3)
   array([[   1.,    2.,    3.],
          [   4.,    5.,   67.],
          [ 890.,  123.,    4.]])
   >>> data = "123456789\n   4  7 9\n   4567 9"
   >>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
   array([[ 1234.,   567.,    89.],
          [    4.,     7.,     9.],
          [    4.,   567.,     9.]])


The :keyword:`autostrip` argument
---------------------------------

By default, when a line is decomposed into a series of strings, the
individual entries are not stripped of leading or trailing white spaces.
This behavior can be overridden by setting the optional argument
:keyword:`autostrip` to a value of ``True``.

   >>> data = "1, abc , 2\n 3, xxx, 4"
   >>> # Without autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5")
   array([['1', ' abc ', ' 2'],
          ['3', ' xxx', ' 4']],
         dtype='|S5')
   >>> # With autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5", autostrip=True)
   array([['1', 'abc', '2'],
          ['3', 'xxx', '4']],
         dtype='|S5')


The :keyword:`comments` argument
--------------------------------

The optional argument :keyword:`comments` is used to define a character
string that marks the beginning of a comment. By default,
:func:`~numpy.genfromtxt` assumes ``comments='#'``. The comment marker may
occur anywhere on the line. Any character present after the comment
marker(s) is simply ignored.

   >>> data = """#
   ... # Skip me !
   ... # Skip me too !
   ... 1, 2
   ... 3, 4
   ... 5, 6 #This is the third line of the data
   ... 7, 8
   ... # And here comes the last line
   ... 9, 0
   ... """
   >>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
   array([[ 1.,  2.],
          [ 3.,  4.],
          [ 5.,  6.],
          [ 7.,  8.],
          [ 9.,  0.]])

.. note::

   There is one notable exception to this behavior: if the optional argument
   ``names=True``, the first commented line will be examined for names.



Skipping lines and choosing columns
===================================

The :keyword:`skip_header` and :keyword:`skip_footer` arguments
---------------------------------------------------------------

The presence of a header in the file can hinder data processing. In that
case, we need to use the :keyword:`skip_header` optional argument. The
value of this argument must be an integer which corresponds to the number
of lines to skip at the beginning of the file, before any other action is
performed. Similarly, we can skip the last ``n`` lines of the file by using
the :keyword:`skip_footer` argument and giving it a value of ``n``.

   >>> data = "\n".join(str(i) for i in range(10))
   >>> np.genfromtxt(StringIO(data))
   array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
   >>> np.genfromtxt(StringIO(data),
   ...               skip_header=3, skip_footer=5)
   array([ 3.,  4.])

By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines
are skipped.


The :keyword:`usecols` argument
-------------------------------

In some cases, we are not interested in all the columns of the data but
only a few of them. We can select which columns to import with the
:keyword:`usecols` argument. This argument accepts a single integer or a
sequence of integers corresponding to the indices of the columns to import.
Remember that by convention, the first column has an index of 0. Negative
integers count from the last column, as with regular Python negative
indexes.

For example, if we want to import only the first and the last columns, we
can use ``usecols=(0, -1)``:

   >>> data = "1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data), usecols=(0, -1))
   array([[ 1.,  3.],
          [ 4.,  6.]])

If the columns have names, we can also select which columns to import by
giving their name to the :keyword:`usecols` argument, either as a sequence
of strings or a comma-separated string.

   >>> data = "1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a", "c"))
   array([(1.0, 3.0), (4.0, 6.0)],
         dtype=[('a', '<f8'), ('c', '<f8')])
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a, c"))
   array([(1.0, 3.0), (4.0, 6.0)],
         dtype=[('a', '<f8'), ('c', '<f8')])



Setting the names
=================

The :keyword:`names` argument
-----------------------------

When dealing with tabular data, it is natural to associate a name with each
column. One possibility is to use an explicit structured dtype:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
   array([(1, 2, 3), (4, 5, 6)],
         dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

Another possibility is to use the :keyword:`names` keyword with a sequence
of strings or a comma-separated string:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, names="A, B, C")
   array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
         dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

The names can also be read from the data itself by setting ``names=True``.
They are then taken from the first line after the ``skip_header`` ones,
even if that line is commented out:

   >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, skip_header=1, names=True)
   array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
         dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

If a structured dtype is given as well, the names passed with
:keyword:`names` replace the field names of the dtype:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> ndtype=[('a',int), ('b', float), ('c', int)]
   >>> names = ["A", "B", "C"]
   >>> np.genfromtxt(data, names=names, dtype=ndtype)
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])


The :keyword:`defaultfmt` argument
----------------------------------

If no names are given but a structured dtype is expected, the fields
receive the default names ``f0``, ``f1`` and so forth:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int))
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we do not give enough names to match the length of the
dtype, the missing names are filled in with this default template:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), names="a")
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can change this default with the :keyword:`defaultfmt` argument, which
takes a format string:

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])


Validating names
----------------

Field names may need to be cleaned up before use, for instance to remove
invalid characters or to avoid reserved words.
:func:`~numpy.genfromtxt` accepts three optional arguments that give finer
control over the names:

   :keyword:`deletechars`
      Gives a string combining all the characters that must be deleted from
      the name. By default, invalid characters are
      ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``.
   :keyword:`excludelist`
      Gives a list of the names to exclude, such as ``return``, ``file``,
      ``print``... If one of the input names is part of this list, an
      underscore character (``'_'``) will be appended to it.
   :keyword:`case_sensitive`
      Whether the names should be case-sensitive (``case_sensitive=True``),
      converted to upper case (``case_sensitive=False`` or
      ``case_sensitive='upper'``) or to lower case
      (``case_sensitive='lower'``).



Tweaking the conversion
=======================

The :keyword:`converters` argument
----------------------------------

Usually, defining a dtype is sufficient to define how the sequence of
strings must be converted. However, some additional control may sometimes
be required.
For example, we may want to make sure that a date in a format
``YYYY/MM/DD`` is converted to a :class:`datetime` object, or that a string
like ``xx%`` is properly converted to a float between 0 and 1. In such
cases, we should define conversion functions with the :keyword:`converters`
argument.

The value of this argument is typically a dictionary with column indices or
column names as keys and conversion functions as values. These conversion
functions can either be actual functions or lambda functions. In any case,
they should accept only a string as input and output only a single element
of the wanted type.

In the following example, the second column is converted from a string
representing a percentage to a float between 0 and 1:

   >>> convertfunc = lambda x: float(x.strip("%"))/100.
   >>> data = "1, 2.3%, 45.\n6, 78.9%, 0"
   >>> names = ("i", "p", "n")
   >>> # General case .....
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
   array([(1.0, nan, 45.0), (6.0, nan, 0.0)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

By default, ``dtype=float``, so a float is expected for the second column.
The strings ``' 2.3%'`` and ``' 78.9%'`` cannot be converted to float, and
we end up with ``np.nan`` instead. With a converter, the values are parsed
as intended:

   >>> # Converted case ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
   ...               converters={1: convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same result is obtained by using the name of the second column
(``"p"``) as key instead of its index (1):

   >>> # Using a name for the converter ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
   ...               converters={"p": convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

Converters can also be used to provide a default for missing entries. In
the following example, the converter ``convert`` transforms a stripped
string into the corresponding float, or into -999 if the string is empty.
The string has to be stripped explicitly, as this is not done by default:

   >>> data = "1, , 3\n 4, 5, 6"
   >>> convert = lambda x: float(x.strip() or -999)
   >>> np.genfromtxt(StringIO(data), delimiter=",",
   ...               converters={1: convert})
   array([[   1., -999.,    3.],
          [   4.,    5.,    6.]])


Using missing and filling values
--------------------------------

Some entries may be missing in the dataset we are trying to import. In a
previous example, we used a converter to transform an empty string into a
float. However, user-defined converters may rapidly become cumbersome to
manage.

The :func:`~numpy.genfromtxt` function provides two other complementary
mechanisms: the :keyword:`missing_values` argument is used to recognize
missing data and a second argument, :keyword:`filling_values`, is used to
process these missing data.

:keyword:`missing_values`
-------------------------

By default, any empty string is marked as missing. We can also consider
more complex strings, such as ``"N/A"`` or ``"???"``, to represent missing
or invalid data. The :keyword:`missing_values` argument accepts three kinds
of values:

   a string or a comma-separated string
      This string will be used as the marker for missing data for all the
      columns.
   a sequence of strings
      In that case, each item is associated with a column, in order.
   a dictionary
      Values of the dictionary are strings or sequences of strings. The
      corresponding keys can be column indices (integers) or column names
      (strings). In addition, the special key ``None`` can be used to
      define a default applicable to all columns.


:keyword:`filling_values`
-------------------------

We know how to recognize missing data, but we still need to provide a value
for these missing entries. By default, this value is determined from the
expected dtype according to this table:

=============  ==============
Expected type  Default
=============  ==============
``bool``       ``False``
``int``        ``-1``
``float``      ``np.nan``
``complex``    ``np.nan+0j``
``string``     ``'???'``
=============  ==============

We can get finer control over the conversion of missing values with the
:keyword:`filling_values` optional argument. Like
:keyword:`missing_values`, this argument accepts different kinds of values:

   a single value
      This will be the default for all columns.
   a sequence of values
      Each entry will be the default for the corresponding column.
   a dictionary
      Each key can be a column index or a column name, and the
      corresponding value should be a single object. We can use the
      special key ``None`` to define a default for all columns.
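
The dictionary form of both arguments can be sketched as follows. This is a
minimal illustration rather than a reference: it assumes Python 3 (with
``StringIO`` imported from ``io``), uses made-up data and variable names, and
keeps the default ``dtype=float``.

```python
from io import StringIO  # Python 3 location of StringIO (assumption: Python 3)

import numpy as np

# Made-up data: "N/A" marks a gap in column 1, "???" a gap in column 2.
text = "1,N/A,3\n4,5,???\n"

filled = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    missing_values={1: "N/A", 2: "???"},  # one marker per column index
    filling_values={1: 0.0, 2: -999.0},   # replacement used for each of those columns
)

print(filled)
```

Columns that are not listed in the dictionaries keep the default behaviour
described in the table above.
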

In the following example, we suppose that the missing values are flagged
with ``"N/A"`` in the first column and by ``"???"`` in the third column. We
wish to transform these missing values to 0 if they occur in the first or
second column, and to -999 if they occur in the last column.

   >>> data = "N/A, 2, 3\n4, ,???"
   >>> kwargs = dict(delimiter=",",
   ...               dtype=int,
   ...               names="a,b,c",
   ...               missing_values={0:"N/A", 'b':" ", 2:"???"},
   ...               filling_values={0:0, 'b':0, 2:-999})
   >>> np.genfromtxt(StringIO(data), **kwargs)
   array([(0, 2, 3), (4, 0, -999)],
         dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])