agate.table

This module contains the Table object, which is the central data structure in agate. Tables are created by supplying row data, column names and instances of DataType subclasses to the constructor. Once instantiated, tables are immutable. This concept is central to agate. The data of the table may not be accessed or modified directly.

Various methods on the Table simulate “SQL-like” operations. For example, the Table.select() method reduces the table to only the specified columns. The Table.where() method reduces the table to only those rows that pass a truth test. And the Table.order_by() method sorts the rows in the table. In all of these cases the output is a new Table and the existing table remains unmodified.

Tables are not themselves iterable, but the columns of the table can be accessed via Table.columns and the rows via Table.rows. Both sequences can be accessed either by numeric index or by name. (In the case of rows, row names are optional.)

agate.table.allow_tableset_proxy(func)

Decorator to flag that a given Table method can be proxied as a TableSet method.

class agate.table.Table(rows, column_names=None, column_types=None, row_names=None, _is_fork=False)

A dataset consisting of rows and columns. Columns refer to “vertical” slices of data that must all be of the same type. Rows refer to “horizontal” slices of data that may (and usually do) contain mixed types.

The sequence of Column instances is retrieved via the Table.columns property. They may be accessed by either numeric index or by unique column name.

The sequence of Row instances is retrieved via the Table.rows property. They may be accessed by either numeric index or, if specified, by unique row name.

Parameters:
  • rows – The data as a sequence of any sequences: tuples, lists, etc. If any row has fewer values than the number of columns, it will be filled out with nulls. No row may have more values than the number of columns.
  • column_names – A sequence of string names for each column or None, in which case column names will be automatically assigned using letter_name().
  • column_types – A sequence of instances of DataType or an instance of TypeTester or None in which case a generic TypeTester will be used.
  • row_names – Specifies unique names for each row. This parameter is optional. If specified it may be 1) the name of a single column that contains a unique identifier for each row, 2) a key function that takes a Row and returns a unique identifier or 3) a sequence of unique identifiers of the same length as the sequence of rows. The uniqueness of resulting identifiers is not validated, so be certain the values you provide are truly unique.
  • _is_fork – Used internally to skip certain validation steps when data is propagated from an existing table. When True, rows are assumed to be Row instances, rather than raw data.
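
As an illustration of the constructor arguments above, here is a minimal sketch that builds a small table from raw row data. The column names, types and values are hypothetical:

import agate

rows = [
    ('Joe', 24),
    ('Jane', 31),
]

column_names = ['name', 'age']
column_types = [agate.Text(), agate.Number()]

# Passing explicit DataType instances bypasses type inference.
table = agate.Table(rows, column_names, column_types)
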
column_types

Get an ordered sequence of this table’s column types.

Returns:A tuple of DataType instances.
column_names

Get an ordered sequence of this table’s column names.

Returns:A tuple of strings.
row_names

Get an ordered sequence of this table’s row names.

Returns:A tuple of strings if this table has row names. Otherwise, None.
columns

Get this table’s columns.

Returns:MappedSequence
rows

Get this table’s rows.

Returns:MappedSequence
rename(column_names=None, row_names=None)

Creates a copy of this table with different column or row names.

Parameters:
  • column_names – New column names for the renamed table. May be either an array or a dictionary mapping existing column names to new names. If not specified, will use this table’s existing column names.
  • row_names – New row names for the renamed table. May be either an array or a dictionary mapping existing row names to new names. If not specified, will use this table’s existing row names.
classmethod from_csv(path, column_names=None, column_types=None, row_names=None, header=True, sniff_limit=0, **kwargs)

Create a new table from a CSV. This method uses agate’s builtin CSV reader, which supports unicode on both Python 2 and Python 3.

kwargs will be passed through to the CSV reader.

Parameters:
  • path – Filepath or file-like object from which to read CSV data.
  • column_names – See Table.__init__().
  • column_types – See Table.__init__().
  • row_names – See Table.__init__().
  • header – If True, the first row of the CSV is assumed to contain headers and will be skipped. If header and column_names are both specified then a row will be skipped, but column_names will be used.
  • sniff_limit – Limit CSV dialect sniffing to the specified number of bytes. Set to None to sniff the entire file. Defaults to 0 or no sniffing.
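
For example, a table could be loaded from a hypothetical file named data.csv like this (a sketch; with no column_types, a generic TypeTester infers the column types):

import agate

table = agate.Table.from_csv('data.csv')

# Additional kwargs, such as delimiter, are passed through to the CSV reader.
tsv_table = agate.Table.from_csv('data.tsv', delimiter='\t')
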
to_csv(path, **kwargs)

Write this table to a CSV. This method uses agate’s builtin CSV writer, which supports unicode on both Python 2 and Python 3.

kwargs will be passed through to the CSV writer.

Parameters:path – Filepath or file-like object to write to.
classmethod from_json(path, row_names=None, key=None, newline=False, column_types=None, **kwargs)

Create a new table from a JSON file. Once the JSON is deserialized, the resulting Python object is passed to Table.from_object(). See the documentation of that method for additional details.

If the file contains a top-level dictionary you may specify what property contains the row list using the key parameter.

kwargs will be passed through to json.load().

Parameters:
  • path – Filepath or file-like object from which to read JSON data.
  • row_names – See Table.__init__().
  • key – The key of the top-level dictionary that contains a list of row arrays.
  • newline – If True then the file will be parsed as “newline-delimited JSON”.
  • column_types – See Table.__init__().
classmethod from_object(obj, row_names=None, column_types=None)

Create a new table from a Python object with a structure that mirrors a deserialized JSON object. Its contents should be an array containing a dictionary for each “row”. Nested objects or lists will also be parsed. For example, this object:

{
    'one': {
        'a': 1,
        'b': 2,
        'c': 3
    },
    'two': [4, 5, 6],
    'three': 'd'
}

Would generate these columns and values:

{
    'one/a': 1,
    'one/b': 2,
    'one/c': 3,
    'two.0': 4,
    'two.1': 5,
    'two.2': 6,
    'three': 'd'
}

Column names and types will be inferred from the data. Not all rows are required to have the same keys. Missing elements will be filled in with null.

Parameters:
  • obj – A Python object (such as the result of deserializing JSON) from which to create the table.
  • row_names – See Table.__init__().
  • column_types – See Table.__init__().
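
A minimal sketch of calling this method on a list of row dictionaries (the data is hypothetical):

import agate

data = [
    {'name': 'Jane', 'age': 24},
    {'name': 'Jack'}  # missing keys are filled in with null
]

table = agate.Table.from_object(data)
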
to_json(path, key=None, newline=False, indent=None, **kwargs)

Write this table to a JSON file or file-like object.

kwargs will be passed through to the JSON encoder.

Parameters:
  • path – File path or file-like object to write to.
  • key – If specified, JSON will be output as a hash instead of a list. May be either the name of a column from this table containing unique values or a function that takes a row and returns a unique value.
  • newline – If True, output will be in the form of “newline-delimited JSON”.
  • indent – If specified, the number of spaces to indent the JSON for formatting.
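
For example, assuming an existing table with a column of unique names, it could be written both as a JSON list and as a hash keyed on that column (a sketch; the filenames and the name column are assumptions):

# Written as a JSON array of row objects.
table.to_json('rows.json', indent=4)

# Written as a hash keyed on the values of the "name" column.
table.to_json('rows_by_name.json', key='name')
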
select(key)

Create a new table with the same rows as this one, but only those columns in the key.

Parameters:key – Either the name of a column to include or a sequence of such names.
Returns:A new Table.
exclude(key)

Create a new table with the same rows as this one, but only columns not in the key.

Parameters:key – Either the name of a column to exclude or a sequence of such names.
Returns:A new Table.
where(test)

Create a new table with the rows from this table that pass a truth test.

Parameters:test (function) – A function that takes a Row and returns True if it should be included.
Returns:A new Table.
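
Because each of these methods returns a new Table, they can be chained. A sketch, assuming an existing table with hypothetical name and age columns:

# Keep only two columns, then keep only the rows passing the truth test.
adults = table.select(['name', 'age']).where(lambda row: row['age'] >= 18)
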
find(test)

Find the first row that passes a truth test.

Parameters:test (function) – A function that takes a Row and returns True if it matches.
Returns:A single Row if found, or None.
order_by(key, reverse=False)

Sort this table by the key. The key can be either a column name or a callable that returns a value to sort by.

Parameters:
  • key – Either the name of a column to sort by, a sequence of such names, or a function that takes a row and returns a value to sort by.
  • reverse – If True then sort in reverse (typically, descending) order.
Returns:

A new Table.

limit(start_or_stop=None, stop=None, step=None)

Filter data to a subset of all rows.

See also: Python’s slice().

Parameters:
  • start_or_stop – If this is the only argument given, the number of rows to include; otherwise, the index of the first row to include.
  • stop – The index of the last row to include.
  • step – The size of the jump between rows to include. (step=2 will return every other row.)
Returns:

A new Table.
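
A sketch of combining Table.order_by() and Table.limit(), assuming an existing table with a hypothetical age column:

# Sort descending by age, then keep the first five rows.
oldest_five = table.order_by('age', reverse=True).limit(5)

# Slice-style arguments: rows 0, 2, 4, ... up to (but not including) index 10.
every_other = table.limit(0, 10, 2)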

distinct(key=None)

Filter data to only rows that are unique.

Parameters:key – Either the name of a column to use to identify unique rows, a sequence of such column names, a function that takes a row and returns a value to identify unique rows, or None, in which case the entire row will be checked for uniqueness.
Returns:A new Table.
join(right_table, left_key, right_key=None, inner=False, require_match=False, columns=None)

Performs the equivalent of SQL’s “left outer join”, combining columns from this table and from right_table anywhere that the left_key and right_key are equivalent.

Where there is no match for left_key the left columns will be included with the right columns set to None unless the inner argument is specified. (See arguments for more.)

If left_key and right_key are column names, only the left columns will be included in the output table.

Column names from the right table which also exist in this table will be suffixed “2” in the new table.

Parameters:
  • right_table – The “right” table to join to.
  • left_key – Either the name of a column from this table to join on, a sequence of such column names, or a function that takes a row and returns a value to join on.
  • right_key – Either the name of a column from right_table to join on, a sequence of such column names, or a function that takes a row and returns a value to join on. If None then left_key will be used for both.
  • inner – Perform a SQL-style “inner join” instead of a left outer join. Rows which have no match for left_key will not be included in the output table.
  • require_match – If True, an exception will be raised if there is a left_key with no matching right_key.
  • columns – A sequence of column names from right_table to include in the final output table. Defaults to all columns not in right_key.
Returns:

A new Table.
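
A minimal sketch of a left outer join on a shared column; the table and column names are hypothetical:

# Rows from people with no matching id in salaries get None in the
# joined columns; pass inner=True to drop them instead.
joined = people.join(salaries, 'id')

# Join on differently named key columns.
joined = people.join(salaries, left_key='id', right_key='person_id')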

homogenize(key, compare_values, default_row=None)

Fills missing rows in a dataset with default values.

Determines which rows are missing by comparing the values in the given key columns with the expected compare_values.

Values not found in the table will be used to generate new rows with the given default_row.

default_row should be an array of values or an array-generating function. If not specified, the new rows will have None in all columns other than the key columns.

If it is an array of values, its length should be the row length minus the number of key columns, and the gap will be filled with the missing values.

If it is an array-generating function, the function should take an array of missing values for each new row and output a full row including those values.

Parameters:
  • key – Either a column name or a sequence of such names.
  • compare_values – An array of lists with combinations of values that should be present in at least one row in the table. A row is generated for each combination not found.
  • default_row – An array of values or a function to generate new rows. The length of the input array should be equal to the row length minus the number of key columns. The length of the array generated by the function should be the full row length.
Returns:

A new Table.
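
A sketch, assuming an existing two-column table with a hypothetical day key column and one value column:

# Ensure a row exists for each of days 1-3; missing days get None
# in the non-key column.
complete = table.homogenize('day', [[1], [2], [3]])

# Or supply the value for the remaining column of any generated rows.
complete = table.homogenize('day', [[1], [2], [3]], default_row=[0])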

classmethod merge(tables, row_names=None, column_names=None)

Merge an array of tables into a single table.

Row names will be lost, but new row names can be specified with the row_names argument.

It is possible to limit the columns included in the new Table with column_names argument. For example, to only include columns from a specific table, set column_names equal to table.column_names.

Parameters:
  • tables – A sequence of Table instances.
  • row_names – See Table for the usage of this parameter.
  • column_names – A sequence of column names to include in the new Table. If not specified, all distinct column names from tables are included.
Returns:

A new Table.
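
A sketch of merging several tables; the table names are hypothetical:

import agate

# Stack the rows of all three tables; columns are matched by name.
combined = agate.Table.merge([table_2014, table_2015, table_2016])

# Restrict the merged table to the columns of the first table.
combined = agate.Table.merge(
    [table_2014, table_2015, table_2016],
    column_names=table_2014.column_names
)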

pivot(key=None, pivot=None, aggregation=None, computation=None, default_value=<object object>, key_name=None)

Pivot reorganizes the data in a table by grouping the data, aggregating those groups, optionally applying a computation, and then organizing the groups into new rows and columns.

For example:

name   race   gender
Joe    white  male
Jane   black  female
Josh   black  male
Jim    asian  female

This table can be pivoted with key equal to “race” and pivot equal to “gender”. The default aggregation is Count. This would result in the following table.

race   male  female
white  1     0
black  1     1
asian  0     1

If one or more keys are specified then the resulting table will automatically have row_names set to those keys.

See also the related method Table.denormalize().

Parameters:
  • key – Either the name of a column from this table to group by, a sequence of such column names, a function that takes a row and returns a value to group by, or None, in which case there will be only a single row in the output table.
  • pivot – A column name whose unique values will become columns in the new table, or None in which case there will be a single value column in the output table.
  • aggregation

    An instance of an Aggregation to perform on each group of data in the pivot table. (Each cell is the result of an aggregation of the grouped data.)

    If not specified this defaults to Count with no arguments.

  • computation

    An optional Computation instance to be applied to the aggregated sequence of values before they are transposed into the pivot table.

    Use the class name of the aggregation as your column name argument when constructing your computation. (This is “Count” if using the default value for aggregation.)

  • default_value – Value to be used for missing values in the pivot table. Defaults to Decimal(0). If performing non-mathematical aggregations you may wish to set this to None.
  • key_name – A name for the key column in the output table. This is most useful when the provided key is a function. This argument is not valid when key is a sequence.
Returns:

A new Table.
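
The race/gender example above corresponds to a call like the first one below; the second shows a non-default aggregation and assumes a hypothetical age column:

import agate

# One row per race, one column per gender, cells counting the grouped rows.
counts = table.pivot('race', 'gender')

# Aggregate something other than row counts, e.g. the median age per race.
medians = table.pivot('race', aggregation=agate.Median('age'))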

normalize(key, properties, property_column='property', value_column='value', column_types=None)

Normalize a sequence of columns into two columns for field and value.

For example:

name  gender  race   age
Jane  female  black  24
Jack  male    white  35
Joe   male    black  28

can be normalized on columns ‘gender’, ‘race’ and ‘age’:

name  property  value
Jane  gender    female
Jane  race      black
Jane  age       24
...   ...       ...

This is the opposite of Table.denormalize().

Parameters:
  • key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them.
  • properties – A column name or a sequence of column names that should be converted to properties in the new table.
  • property_column – The name to use for the column containing the property names.
  • value_column – The name to use for the column containing the property values.
  • column_types – A sequence of two column types for the property and value column in that order or an instance of TypeTester. Defaults to a generic TypeTester.
Returns:

A new Table.
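
The example above corresponds to a call like this (a sketch using the column names shown):

# Keep "name" as-is; fold gender, race and age into property/value pairs.
long_table = table.normalize('name', ['gender', 'race', 'age'])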

denormalize(key=None, property_column='property', value_column='value', default_value=<object object>, column_types=None)

Denormalize a dataset so that unique values in a column become their own columns.

For example:

name  property  value
Jane  gender    female
Jane  race      black
Jane  age       24
...   ...       ...

Can be denormalized so that each unique value in the property column becomes a new column, with the corresponding entries of the value column supplying its values.

name  gender  race   age
Jane  female  black  24
Jack  male    white  35
Joe   male    black  28

If one or more keys are specified then the resulting table will automatically have row_names set to those keys.

This is the opposite of Table.normalize().

Parameters:
  • key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them. Or, None if there are no key columns.
  • property_column – The column whose values should become column names in the new table.
  • value_column – The column whose values should become the values of the new columns in the new table.
  • default_value – Value to be used for missing values in the pivot table. If not specified Decimal(0) will be used for aggregations that return Number data and None will be used for all others.
  • column_types – A sequence of column types with length equal to the number of unique values in property_column or an instance of TypeTester. Defaults to a generic TypeTester.
Returns:

A new Table.
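
The example above corresponds to a call like this (a sketch using the column names shown):

# Spread the property/value pairs back out into one column per property.
wide_table = long_table.denormalize('name', 'property', 'value')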

group_by(key, key_name=None, key_type=None)

Create a new Table for each unique value in key and return them as a TableSet. The key can be either a column name or a function that returns a value to group by.

Note that group names will always be coerced to strings, regardless of the format of the input column.

Parameters:
  • key – Either the name of a column from this table to group by, or a function that takes a row and returns a value to group by.
  • key_name – A name that describes the grouped properties. Defaults to the column name that was grouped on or “group” if grouping with a key function. See TableSet for more.
  • key_type – An instance of any subclass of DataType. If not provided it will default to Text.
Returns:

A TableSet mapping where the keys are unique values from the key and the values are new Table instances containing the grouped rows.
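
A sketch of grouping and then aggregating the resulting TableSet; the column names are hypothetical:

import agate

# One Table per unique value of "race", wrapped in a TableSet.
by_race = table.group_by('race')

# TableSet.aggregate applies each aggregation to every grouped table.
summary = by_race.aggregate([
    ('count', agate.Count()),
    ('median_age', agate.Median('age'))
])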

monkeypatch(patch_cls)

Dynamically add patch_cls as a base class of this class.

Parameters:patch_cls – The class to be added as a base class of this class.
aggregate(aggregations)

Aggregate data from the columns in this table by applying a sequence of Aggregation instances.

Parameters:aggregations – A single Aggregation instance or sequence of them.
Returns:If the input was a single Aggregation then a single result will be returned. If it was a sequence then a tuple of results will be returned.
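
A sketch of both forms of the return value, assuming a hypothetical age column:

import agate

# A single aggregation returns a single result.
mean_age = table.aggregate(agate.Mean('age'))

# A sequence of aggregations returns a tuple of results, in order.
low, high = table.aggregate([agate.Min('age'), agate.Max('age')])
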
compute(computations)

Compute new columns by applying one or more Computation instances to each row.

Parameters:computations – A sequence of pairs of new column names and Computation instances.
Returns:A new Table.
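
A sketch of adding a computed column with a Formula; the column names are hypothetical:

import agate

# Add a column derived from existing columns on each row.
with_ratio = table.compute([
    ('income_per_person', agate.Formula(agate.Number(),
        lambda row: row['income'] / row['population']))
])
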
bins(column_name, count=10, start=None, end=None)

Generates (approximately) evenly sized bins for the values in a column. Bins may not be perfectly even if the spread of the data does not divide evenly, but all values will always be included in some bin.

The resulting table will have two columns. The first will have the same name as the specified column, but will be of type Text. The second will be named Count and will be of type Number.

Parameters:
  • column_name – The name of the column to bin. Must be of type Number.
  • count – The number of bins to create. If not specified then each value will be counted as its own bin.
  • start – The minimum value to start the bins at. If not specified the minimum value in the column will be used.
  • end – The maximum value to end the bins at. If not specified the maximum value in the column will be used.
Returns:

A new Table.
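
A sketch of binning a column and charting the result, assuming a hypothetical age column:

# Ten evenly sized bins over the observed range of "age".
binned = table.bins('age')

# The binned table can be charted directly with print_bars.
binned.print_bars('age', 'Count')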

print_table(max_rows=None, max_columns=None, output=<open file '<stdout>', mode 'w'>, max_column_width=20, locale=None)

Print a well-formatted preview of this table to the console or any other output.

Parameters:
  • max_rows – The maximum number of rows to display before truncating the data.
  • max_columns – The maximum number of columns to display before truncating the data.
  • output – A file-like object to print to. Defaults to sys.stdout.
  • max_column_width – Truncate all columns to at most this width. The remainder will be replaced with ellipsis.
  • locale – Provide a locale you would like to be used to format the output. By default it will use the system’s setting.
print_html(max_rows=None, max_columns=None, output=<open file '<stdout>', mode 'w'>)

Print an HTML-formatted preview of this table to the console or any other output.

Parameters:
  • max_rows – The maximum number of rows to display before truncating the data.
  • max_columns – The maximum number of columns to display before truncating the data.
  • output – A file-like object to print to. Defaults to sys.stdout.
print_csv(**kwargs)

A shortcut for printing a CSV directly to the console. Effectively the same as passing sys.stdout to Table.to_csv().

kwargs will be passed on to Table.to_csv().

print_json(**kwargs)

A shortcut for printing JSON directly to the console. Effectively the same as passing sys.stdout to Table.to_json().

kwargs will be passed on to Table.to_json().

print_bars(label_column_name='group', value_column_name='Count', domain=None, width=120, output=<open file '<stdout>', mode 'w'>, printable=False)

Print a text-based bar chart of the columns named label_column_name and value_column_name.

Parameters:
  • label_column_name – The column containing the label values. Defaults to “group”, which is the default output of Table.pivot() or Table.bins().
  • value_column_name – The column containing the bar values. Defaults to “Count”, which is the default output of Table.pivot() or Table.bins().
  • domain – A 2-tuple containing the minimum and maximum values for the chart’s x-axis. The domain must be large enough to contain all values in the column.
  • width – The width, in characters, to use for the bar chart. Defaults to 120.
  • output – A file-like object to print to. Defaults to sys.stdout.
  • printable – If True, only printable characters will be output.
print_structure(output=<open file '<stdout>', mode 'w'>)

Print the column names and their respective types.

Parameters:output – The output used to print the structure of the Table. Defaults to sys.stdout.
Returns:

None