agate 1.6.0

agate is a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.

agate was previously known as journalism.

About agate

Why agate?

  • A readable and user-friendly API.
  • A complete set of SQL-like operations.
  • Unicode support everywhere.
  • Decimal precision everywhere.
  • Exhaustive user documentation.
  • Pluggable extensions that add SQL integration, Excel support, and more.
  • Designed with IPython, Jupyter and atom/hydrogen in mind.
  • Pure Python. No C dependencies to compile.
  • Exhaustive test coverage.
  • MIT licensed and free for all purposes.
  • Zealously zen.
  • Made with love.

Principles

agate is intended to fill a very particular programming niche. It should not be allowed to become as complex as numpy or pandas. Please bear in mind the following principles when considering a new feature:

  • Humans have less time than computers. Optimize for humans.
  • Most datasets are small. Don’t optimize for “big data”.
  • Text is data. It must always be a first-class citizen.
  • Python gets it right. Make it work like Python does.
  • Human lives are nasty, brutish and short. Make it easy.
  • Mutability leads to confusion. Processes that alter data must create new copies.
  • Extensions are the way. Don’t add it to core unless everybody needs it.

Installation

Users

To use agate, install it with pip:

pip install agate

Note

Need more speed? Upgrade to Python 3. It’s 3-5x faster than Python 2.

If you must use Python 2, you can pip install cdecimal for a performance boost.

Developers

If you are a developer who also wants to hack on agate, install it from git:

git clone git://github.com/onyxfish/agate.git
cd agate
mkvirtualenv agate

# If running Python 3 (strongly recommended for development)
pip install -r requirements-py3.txt

# If running Python 2
pip install -r requirements-py2.txt

python setup.py develop
tox

Note

To run the agate tests with coverage:

nosetests --with-coverage tests

Supported platforms

agate supports the following versions of Python:

  • Python 2.7
  • Python 3.3+
  • PyPy versions >= 4.0.0

It is tested primarily on OSX, but due to its minimal dependencies it should work perfectly on both Linux and Windows.

Note

IPython or Jupyter user? agate works great there too.

Tutorial

The agate tutorial is now available in new-and-improved Jupyter Notebook format.

Find it on GitHub!

Cookbook

Welcome to the agate cookbook, a source of how-to’s and use cases.

Creating tables

From data in memory

From a list of lists.

column_names = ['letter', 'number']
column_types = [agate.Text(), agate.Number()]

rows = [
    ('a', 1),
    ('b', 2),
    ('c', None)
]

table = agate.Table(rows, column_names, column_types)

From a list of dictionaries.

rows = [
    dict(letter='a', number=1),
    dict(letter='b', number=2),
    dict(letter='c', number=None)
]

table = agate.Table.from_object(rows)

From a CSV

By default, loading a table from a CSV will use agate’s builtin TypeTester to infer column types:

table = agate.Table.from_csv('filename.csv')

Override type inference

In some cases agate’s TypeTester may guess incorrectly. To override the type for some columns and use TypeTester for the rest, pass a dictionary to the column_types argument.

specified_types = {
    'column_name_one': agate.Text(),
    'column_name_two': agate.Number()
}

table = agate.Table.from_csv('filename.csv', column_types=specified_types)

Internally, this creates a generic TypeTester and passes your specified columns to its force argument.
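
For reference, a roughly equivalent explicit form constructs the TypeTester yourself with its force argument:

tester = agate.TypeTester(force={
    'column_name_one': agate.Text(),
    'column_name_two': agate.Number()
})

table = agate.Table.from_csv('filename.csv', column_types=tester)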

Limit type inference

For large datasets TypeTester may be unreasonably slow. In order to limit the amount of data it uses you can specify the limit argument. Note that if data after the limit invalidates the TypeTester’s inference you may get errors when the data is loaded.

tester = agate.TypeTester(limit=100)

table = agate.Table.from_csv('filename.csv', column_types=tester)

Manually specify columns

If you know the types of your data you may find it more efficient to manually specify the names and types of your columns. This also gives you an opportunity to rename columns when you load them.

text_type = agate.Text()
number_type = agate.Number()

column_names = ['city', 'area', 'population']
column_types = [text_type, number_type, number_type]

table = agate.Table.from_csv('population.csv', column_names, column_types)

Or, you can use this method to load data from a file that does not have a header row:

table = agate.Table.from_csv('population.csv', column_names, column_types, header=False)

From a unicode CSV

You don’t have to do anything special. It just works!
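
For example, a UTF-8 encoded CSV loads with the defaults (utf-8 is from_csv's default encoding):

table = agate.Table.from_csv('filename.csv')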

From a latin1 CSV

table = agate.Table.from_csv('census.csv', encoding='latin1')

From a semicolon delimited CSV

Normally, agate will automatically guess the delimiter of your CSV, but if that guess fails you can specify it manually:

table = agate.Table.from_csv('filename.csv', delimiter=';')

From a TSV (tab-delimited CSV)

This is the same as the previous example, but in this case we specify that the delimiter is a tab:

table = agate.Table.from_csv('filename.csv', delimiter='\t')

From JSON

table = agate.Table.from_json('filename.json')

From newline-delimited JSON

table = agate.Table.from_json('filename.json', newline=True)

From a SQL database

Use the agate-sql extension.

import agatesql

agatesql.patch()

table = agate.Table.from_sql('postgresql:///database', 'input_table')

From an Excel spreadsheet

Use the agate-excel extension. It supports both .xls and .xlsx files.

import agateexcel

agateexcel.patch()

table = agate.Table.from_xls('test.xls', sheet='data')

table2 = agate.Table.from_xlsx('test.xlsx', sheet='data')

From a DBF table

DBF is the file format used to hold tabular data for ArcGIS shapefiles. Use the agate-dbf extension.

import agatedbf

agatedbf.patch()

table = agate.Table.from_dbf('test.dbf')

From a remote file

Use the agate-remote extension.

import agateremote

agateremote.patch()

table = agate.Table.from_url('https://raw.githubusercontent.com/onyxfish/agate/master/examples/test.csv')

agate-remote also lets you create an Archive, which is a reference to a group of tables with a known path structure.

archive = agateremote.Archive('https://github.com/vincentarelbundock/Rdatasets/raw/master/csv/')

table = archive.get_table('sandwich/PublicSchools.csv')

Save a table

To a CSV

table.to_csv('filename.csv')

To JSON

table.to_json('filename.json')

To newline-delimited JSON

table.to_json('filename.json', newline=True)

To a SQL database

Use the agate-sql extension.

import agatesql

table.to_sql('postgresql:///database', 'output_table')

Remove columns

Include specific columns

Create a new table with only a specific set of columns:

include_columns = ['column_name_one', 'column_name_two']

new_table = table.select(include_columns)

Exclude specific columns

Create a new table without a specific set of columns:

exclude_columns = ['column_name_one', 'column_name_two']

new_table = table.exclude(exclude_columns)

Filter rows

By regex

You can use Python’s builtin re module to introduce a regular expression into a Table.where() query.

For example, here we find all states that start with “C”.

import re

new_table = table.where(lambda row: re.match('^C', row['state']))

This can also be useful for finding values that don’t match your expectations. For example, finding all values in the “phone number” column that don’t look like phone numbers:

new_table = table.where(lambda row: not re.match(r'\d{3}-\d{3}-\d{4}', row['phone']))

By glob

Hate regexes? You can use glob (fnmatch) syntax too!

from fnmatch import fnmatch

new_table = table.where(lambda row: fnmatch('C*', row['state']))

Values within a range

This snippet filters the dataset to incomes between 100,000 and 200,000.

new_table = table.where(lambda row: 100000 < row['income'] < 200000)

Dates within a range

This snippet filters the dataset to events during the summer of 2015:

import datetime

new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))

If you want to filter to events during the summer of any year:

new_table = table.where(lambda row: 6 <= row['date'].month <= 8)

Top N percent

To filter a dataset to the top 10 percent of values we first compute the percentiles for the column and then use the result in the Table.where() truth test:

percentiles = table.aggregate(agate.Percentiles('salary'))
top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])

Random sample

By combining a random sort with limiting, we can effectively get a random sample from a table.

import random

randomized = table.order_by(lambda row: random.random())
sampled = randomized.limit(10)

Ordered sample

We can also get an ordered sample by simply using the step parameter of the Table.limit() method to get every Nth row.

sampled = table.limit(step=10)

Distinct values

You can retrieve a distinct list of values in a column using Column.values_distinct() or Table.distinct().

Table.distinct() returns the entire row, so it’s necessary to chain a column access onto the result to get just the values.

columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)

new_table.columns['value'].values_distinct()
# or
new_table.distinct('value').columns['value'].values()

# Both return: (Decimal('1'), Decimal('2'), Decimal('5'))

Sort

Alphabetical

Order a table by the last_name column:

new_table = table.order_by('last_name')

Numerical

Order a table by the cost column:

new_table = table.order_by('cost')

By date

Order a table by the birth_date column:

new_table = table.order_by('birth_date')

Reverse order

The order of any sort can be reversed by using the reverse keyword:

new_table = table.order_by('birth_date', reverse=True)

Multiple columns

Because Python’s internal sorting works natively with sequences, we can implement multi-column sort by returning a tuple from the key function.

new_table = table.order_by(lambda row: (row['last_name'], row['first_name']))

This table will now be ordered by last_name, then first_name.

Random order

import random

new_table = table.order_by(lambda row: random.random())

Standardize names and values

Standardize row and column names

The Table.rename() method has arguments to convert row or column names to slugs and append unique identifiers to duplicate values.

Using an existing table object:

# Convert column names to unique slugs
new_table = table.rename(slug_columns=True)

# Convert row names to unique slugs
new_table = table.rename(slug_rows=True)

# Convert both column and row names to unique slugs
new_table = table.rename(slug_columns=True, slug_rows=True)

Standardize column values

agate has a Slug computation that can be used to standardize text column values as well. The computation has an option to append unique identifiers to duplicate values.

Using an existing table object:

# Convert the values in column 'title' to slugs
new_table = table.compute([
    ('title-slug', agate.Slug('title'))
])

# Convert the values in column 'title' to unique slugs
new_table = table.compute([
    ('title-slug', agate.Slug('title', ensure_unique=True))
])

Statistics

Common descriptive and aggregate statistics are included with the core agate library. For additional statistical methods beyond the scope of agate consider using the agate-stats extension or integrating with scipy.

Descriptive statistics

agate includes a full set of standard descriptive statistics that can be applied to any column containing Number data.

table.aggregate(agate.Sum('salary'))
table.aggregate(agate.Min('salary'))
table.aggregate(agate.Max('salary'))
table.aggregate(agate.Mean('salary'))
table.aggregate(agate.Median('salary'))
table.aggregate(agate.Mode('salary'))
table.aggregate(agate.Variance('salary'))
table.aggregate(agate.StDev('salary'))
table.aggregate(agate.MAD('salary'))

Or, get several at once:

table.aggregate([
    ('salary_min', agate.Min('salary')),
    ('salary_mean', agate.Mean('salary')),
    ('salary_max', agate.Max('salary'))
])

Aggregate statistics

You can also generate aggregate statistics for subsets of data (sometimes referred to as “rolling up”):

doctors = patients.group_by('doctor')
patient_ages = doctors.aggregate([
    ('patient_count', agate.Count()),
    ('age_mean', agate.Mean('age')),
    ('age_median', agate.Median('age'))
])

The resulting table will have four columns: doctor, patient_count, age_mean and age_median.

You can roll up by multiple columns by chaining agate’s Table.group_by() method.

doctors_by_state = patients.group_by('state').group_by('doctor')

Distribution by count (frequency)

Counting the number of each unique value in a column can be accomplished with the Table.pivot() method:

# Counts of a single column's values
table.pivot('doctor')

# Counts of all combinations of more than one column's values
table.pivot(['doctor', 'hospital'])

The resulting tables will have a column for each key column and another Count column counting the number of instances of each value.

Distribution by percent

Table.pivot() can also be used to calculate the distribution of values as a percentage of the total number:

# Percents of a single column's values
table.pivot('doctor', computation=agate.Percent('Count'))

# Percents of all combinations of more than one column's values
table.pivot(['doctor', 'hospital'], computation=agate.Percent('Count'))

The output table will be the same format as the previous example, except the value column will be named Percent.

Identify outliers

The agate-stats extension adds methods for finding outliers.

import agatestats

agatestats.patch()

outliers = table.stdev_outliers('salary', deviations=3, reject=False)

By specifying reject=True you can instead return a table including only those values not identified as outliers.

not_outliers = table.stdev_outliers('salary', deviations=3, reject=True)

The second, more robust, method for identifying outliers is by identifying values which are more than some number of “median absolute deviations” from the median (typically 3).

outliers = table.mad_outliers('salary', deviations=3, reject=False)

As with the first example, you can specify reject=True to exclude outliers in the resulting table.

Custom statistics

You can also generate custom aggregated statistics for your data by defining your own ‘summary’ aggregation. This might be especially useful for performing calculations unique to your data. Here’s a simple example:

# Create a custom summary aggregation with agate.Summary
# Input a column name, a return data type and a function to apply on the column
count_millionaires = agate.Summary('salary', agate.Number(), lambda r: sum(salary > 1000000 for salary in r.values()))

table.aggregate([
    ('count_millionaires', count_millionaires)
])

Your custom aggregation can be used to determine both descriptive and aggregate statistics shown above.

Compute new values

Change

new_table = table.compute([
    ('2000_change', agate.Change('2000', '2001')),
    ('2001_change', agate.Change('2001', '2002')),
    ('2002_change', agate.Change('2002', '2003'))
])

Or, better yet, compute the whole decade using a loop:

computations = []

for year in range(2000, 2010):
    change = agate.Change(str(year), str(year + 1))
    computations.append(('%i_change' % year, change))

new_table = table.compute(computations)

Percent

Calculate the percentage for each value in a column with Percent. By default, each value is divided by the sum of the column.

columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)

new_table = new_table.compute([
    ('percent', agate.Percent('value'))
])

new_table.print_table()

| value | percent |
| ----- | ------- |
|     1 |      10 |
|     2 |      20 |
|     2 |      20 |
|     5 |      50 |

Override the denominator with a keyword argument.

new_table = new_table.compute([
    ('percent', agate.Percent('value', 5))
])

new_table.print_table()

| value | percent |
| ----- | ------- |
|     1 |      20 |
|     2 |      40 |
|     2 |      40 |
|     5 |     100 |

Percent change

Want percent change instead of value change? Just swap out the Computation:

computations = []

for year in range(2000, 2010):
    change = agate.PercentChange(str(year), str(year + 1))
    computations.append(('%i_change' % year, change))

new_table = table.compute(computations)

Indexed/cumulative change

Need your change indexed to a starting year? Just fix the first argument:

computations = []

for year in range(2000, 2010):
    change = agate.Change('2000', str(year + 1))
    computations.append(('%i_change' % year, change))

new_table = table.compute(computations)

Of course you can also use PercentChange if you need percents rather than values.

Round to two decimal places

agate stores numerical values using Python’s decimal.Decimal type. This data type ensures numerical precision beyond what is supported by the native float type; however, because of this we cannot use Python’s builtin round() function. Instead we must use decimal.Decimal.quantize().

We can use Table.compute() to apply quantize() to generate a rounded column from an existing one:

from decimal import Decimal

number_type = agate.Number()

def round_price(row):
    return row['price'].quantize(Decimal('0.01'))

new_table = table.compute([
    ('price_rounded', agate.Formula(number_type, round_price))
])

To round to one decimal place you would simply change 0.01 to 0.1.

Difference between dates

Calculating the difference between dates (or dates and times) works exactly the same as it does for numbers:

new_table = table.compute([
    ('age_at_death', agate.Change('born', 'died'))
])

Levenshtein edit distance

The Levenshtein edit distance is a common measure of string similarity. It can be used, for instance, to check for typos between manually-entered names and a version that is known to be spelled correctly.

Implementing Levenshtein requires writing a custom Computation. To save ourselves building the whole thing from scratch, we will lean on the python-Levenshtein library for the actual algorithm.

import agate
from Levenshtein import distance

class LevenshteinDistance(agate.Computation):
    """
    Computes Levenshtein edit distance between the column and a given string.
    """
    def __init__(self, column_name, compare_string):
        self._column_name = column_name
        self._compare_string = compare_string

    def get_computed_data_type(self, table):
        """
        The return value is a numerical distance.
        """
        return agate.Number()

    def validate(self, table):
        """
        Verify the column is text.
        """
        column = table.columns[self._column_name]

        if not isinstance(column.data_type, agate.Text):
            raise agate.DataTypeError('Can only be applied to Text data.')

    def run(self, table):
        """
        Find the distance, returning null when the input column was null.
        """
        new_column = []

        for row in table.rows:
            val = row[self._column_name]

            if val is None:
                new_column.append(None)
            else:
                new_column.append(distance(val, self._compare_string))

        return new_column

This code can now be applied to any Table just as any other Computation would be:

new_table = table.compute([
    ('distance', LevenshteinDistance('column_name', 'string to compare'))
])

The resulting column will contain an integer measuring the edit distance between the value in the column and the comparison string.

USA Today Diversity Index

The USA Today Diversity Index is a widely cited method for evaluating the racial diversity of a given area. Using a custom Computation makes it simple to calculate.

Assuming that your data has a column for the total population, another for the population of each race and a final column for the hispanic population, you can implement the diversity index like this:

class USATodayDiversityIndex(agate.Computation):
    def get_computed_data_type(self, table):
        return agate.Number()

    def run(self, table):
        new_column = []

        for row in table.rows:
            race_squares = 0

            for race in ['white', 'black', 'asian', 'american_indian', 'pacific_islander']:
                race_squares += (row[race] / row['population']) ** 2

            hispanic_squares = (row['hispanic'] / row['population']) ** 2
            hispanic_squares += (1 - (row['hispanic'] / row['population'])) ** 2

            new_column.append((1 - (race_squares * hispanic_squares)) * 100)

        return new_column

We apply the diversity index like any other computation:

with_index = table.compute([
    ('diversity_index', USATodayDiversityIndex())
])

Simple Moving Average

A simple moving average is the average of some number of prior values in a series. It is typically used to smooth out variation in time series data.

The following custom Computation will compute a simple moving average. This example assumes your data is already sorted.

class SimpleMovingAverage(agate.Computation):
    """
    Computes the simple moving average of a column over some interval.
    """
    def __init__(self, column_name, interval):
        self._column_name = column_name
        self._interval = interval

    def get_computed_data_type(self, table):
        """
        The return value is a numerical average.
        """
        return agate.Number()

    def validate(self, table):
        """
        Verify the column is numerical.
        """
        column = table.columns[self._column_name]

        if not isinstance(column.data_type, agate.Number):
            raise agate.DataTypeError('Can only be applied to Number data.')

    def run(self, table):
        new_column = []

        for i, row in enumerate(table.rows):
            if i < self._interval:
                new_column.append(None)
            else:
                values = tuple(r[self._column_name] for r in table.rows[i - self._interval:i])

                if None in values:
                    new_column.append(None)
                else:
                    new_column.append(sum(values) / self._interval)

        return new_column

You would use the simple moving average like so:

with_average = table.compute([
    ('six_month_moving_average', SimpleMovingAverage('price', 6))
])

Dates and times

Specify a date format

By default agate will attempt to guess the format of a Date or DateTime column. In some cases, it may not be possible to automatically figure out the format of a date. In this case you can specify a datetime.datetime.strptime() formatting string to specify how the dates should be parsed. For example, if your dates were formatted as “15-03-15” (March 15th, 2015) then you could specify:

date_type = agate.Date('%d-%m-%y')

Another use for this feature is if you have a column that contains extraneous data. For instance, imagine that your column contains hours and minutes, but they are always zero. It would make more sense to load that data as type Date and ignore the extra time information:

date_type = agate.Date('%m/%d/%Y 00:00')

Specify a timezone

Timezones are hard. Under normal circumstances (no arguments specified), agate will not try to parse timezone information, nor will it apply a timezone to the datetime.datetime instances it creates. (They will be naive in Python parlance.) There are two ways to force timezone data into your agate columns.

The first is to use a format string, as shown above, and specify a pattern for timezone information:

datetime_type = agate.DateTime('%Y-%m-%d %H:%M:%S%z')

The second way is to specify a timezone as an argument to the type constructor:

import pytz

eastern = pytz.timezone('US/Eastern')
datetime_type = agate.DateTime(timezone=eastern)

In this case all datetimes that are processed will be set to the Eastern timezone. Note, the timezone will be set, not converted. You cannot use this method to convert your timezones from UTC to another timezone. To do that see Convert timezones.

Calculate a time difference

See Difference between dates.

Sort by date

See By date.

Convert timezones

If you load data from a spreadsheet in one timezone and you need to convert it to another, you can do this using a Formula. Your datetime column must have timezone data for the following example to work. See Specify a timezone.

import pytz

us_eastern = pytz.timezone('US/Eastern')

text_type = agate.Text()
datetime_type = agate.DateTime(timezone=us_eastern)

column_names = ['what', 'when']
column_types = [text_type, datetime_type]

table = agate.Table.from_csv('events.csv', column_names, column_types)

rome = pytz.timezone('Europe/Rome')
timezone_shifter = agate.Formula(datetime_type, lambda r: r['when'].astimezone(rome))

table = table.compute([
    ('when_in_rome', timezone_shifter)
])

Emulate SQL

agate’s command structure is very similar to SQL. The primary difference between agate and SQL is that commands like SELECT and WHERE explicitly create new tables. You can chain them together as you would with SQL, but be aware each command is actually creating a new table.

Note

All examples in this section use the PostgreSQL dialect for comparison.

If you want to read and write data from SQL, see From a SQL database.

SELECT

SQL:

SELECT state, total FROM table;

agate:

new_table = table.select(['state', 'total'])

WHERE

SQL:

SELECT * FROM table WHERE LOWER(state) = 'california';

agate:

new_table = table.where(lambda row: row['state'].lower() == 'california')

ORDER BY

SQL:

SELECT * FROM table ORDER BY total DESC;

agate:

new_table = table.order_by(lambda row: row['total'], reverse=True)

DISTINCT

SQL:

SELECT DISTINCT ON (state) * FROM table;

agate:

new_table = table.distinct('state')

Note

Unlike most SQL implementations, agate always returns the full row. Use Table.select() if you want to filter the columns first.
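
For example, to emulate SELECT DISTINCT state, chain the two methods:

new_table = table.select(['state']).distinct('state')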

INNER JOIN

SQL (two ways):

SELECT * FROM patient, doctor WHERE patient.doctor = doctor.id;

SELECT * FROM patient INNER JOIN doctor ON (patient.doctor = doctor.id);

agate:

joined = patients.join(doctors, 'doctor', 'id', inner=True)

LEFT OUTER JOIN

SQL:

SELECT * FROM patient LEFT OUTER JOIN doctor ON (patient.doctor = doctor.id);

agate:

joined = patients.join(doctors, 'doctor', 'id')

FULL OUTER JOIN

SQL:

SELECT * FROM patient FULL OUTER JOIN doctor ON (patient.doctor = doctor.id);

agate:

joined = patients.join(doctors, 'doctor', 'id', full_outer=True)

GROUP BY

agate’s Table.group_by() works slightly differently than SQL’s. It does not require an aggregate function. Instead, it returns a TableSet. To see how to perform the equivalent of a SQL aggregate, see below.

doctors = patients.group_by('doctor')

You can group by two or more columns by chaining the command.

doctors_by_state = patients.group_by('state').group_by('doctor')

HAVING

agate’s TableSet.having() works very similarly to SQL’s keyword of the same name.

doctors = patients.group_by('doctor')
popular_doctors = doctors.having([
    ('patient_count', agate.Count())
], lambda t: t['patient_count'] > 100)

This filters to only those doctors whose group includes more than 100 patients. You can add as many aggregations as you want to the list and each will be available, by name, in the test function you pass.

For example, here we filter to popular doctors with an average review of at least three stars:

doctors = patients.group_by('doctor')
popular_doctors = doctors.having([
    ('patient_count', agate.Count()),
    ('average_stars', agate.Mean('stars'))
], lambda t: t['patient_count'] > 100 and t['average_stars'] >= 3)

Chain commands together

SQL:

SELECT state, total FROM table WHERE LOWER(state) = 'california' ORDER BY total DESC;

agate:

new_table = table \
    .select(['state', 'total']) \
    .where(lambda row: row['state'].lower() == 'california') \
    .order_by('total', reverse=True)

Note

Chaining commands in this way is sometimes not a good idea. Being explicit about each step can lead to clearer code.

Aggregate functions

SQL:

SELECT doctor, COUNT(*), AVG(age), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age) FROM patients GROUP BY doctor;

agate:

doctors = patients.group_by('doctor')
patient_ages = doctors.aggregate([
    ('patient_count', agate.Count()),
    ('age_mean', agate.Mean('age')),
    ('age_median', agate.Median('age'))
])

The resulting table will have four columns: doctor, patient_count, age_mean and age_median.

Emulate Excel

One of agate’s most powerful assets is that instead of a wimpy “formula” language, you have the entire Python language at your disposal. Here are examples of how to translate a few common Excel operations.

Simple formulas

If you need to simulate a simple Excel formula you can use the Formula class to apply an arbitrary function.

Excel:

=($A1 + $B1) / $C1

agate:

def f(row):
    return (row['a'] + row['b']) / row['c']

new_table = table.compute([
    ('new_column', agate.Formula(agate.Number(), f))
])

If this still isn’t enough flexibility, you can also create your own subclass of Computation.

SUM

number_type = agate.Number()

def five_year_total(row):
    columns = ('2009', '2010', '2011', '2012', '2013')

    return sum(row[c] for c in columns)

formula = agate.Formula(number_type, five_year_total)

new_table = table.compute([
    ('five_year_total', formula)
])

TRIM

text_type = agate.Text()

new_table = table.compute([
    ('name_stripped', agate.Formula(text_type, lambda r: r['name'].strip()))
])

CONCATENATE

new_table = table.compute([
    ('full_name', agate.Formula(text_type, lambda r: '%(first_name)s %(middle_name)s %(last_name)s' % r))
])

IF

boolean_type = agate.Boolean()

new_table = table.compute([
    ('mvp_candidate', agate.Formula(boolean_type, lambda r: r['batting_average'] > 0.3))
])

VLOOKUP

There are two ways to get the equivalent of Excel’s VLOOKUP with agate. If your lookup source is another agate Table, then you’ll want to use the Table.join() method:

new_table = mvp_table.join(states, 'state_abbr')

This will add all the columns from the states table to the mvp_table, where their state_abbr columns match.

If your lookup source is a Python dictionary or some other object you can implement the lookup using a Formula computation:

states = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    ...
}

new_table = table.compute([
    ('state_name', agate.Formula(text_type, lambda r: states[r['state_abbr']]))
])

Pivot tables as cross-tabulations

Pivot tables in Excel implement a tremendous range of functionality. Agate divides this functionality into a few different methods.

If what you want is to convert rows to columns to create a “crosstab”, then you’ll want to use the Table.pivot() method:

jobs_by_state_and_year = employees.pivot('state', 'year')

This will generate a table with a row for each value in the state column and a column for each value in the year column. The intersecting cells will contain the counts grouped by state and year. You can pass the aggregation keyword to aggregate some other value, such as Mean or Median.
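
For example, to fill the cells with mean salary rather than counts (the salary column here is hypothetical):

jobs_by_state_and_year = employees.pivot('state', 'year', aggregation=agate.Mean('salary'))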

Pivot tables as summaries

On the other hand, if what you want is to summarize your table with descriptive statistics, then you’ll want to use Table.group_by() and TableSet.aggregate():

jobs = employees.group_by('job_title')
summary = jobs.aggregate([
    ('employee_count', agate.Count()),
    ('salary_mean', agate.Mean('salary')),
    ('salary_median', agate.Median('salary'))
])

The resulting summary table will have four columns: job_title, employee_count, salary_mean and salary_median.

You may also want to look at the Table.normalize() and Table.denormalize() methods for examples of functionality frequently accomplished with Excel’s pivot tables.

Emulate R

c()

Agate’s Table.select() and Table.exclude() are the equivalent of R’s c for selecting columns.

R:

selected <- data[c("last_name", "first_name", "age")]
excluded <- data[, !(names(data) %in% c("last_name", "first_name", "age"))]

agate:

selected = table.select(['last_name', 'first_name', 'age'])
excluded = table.exclude(['last_name', 'first_name', 'age'])

subset

Agate’s Table.where() is the equivalent of R’s subset.

R:

newdata <- subset(data, age >= 20 | age < 10)

agate:

new_table = table.where(lambda row: row['age'] >= 20 or row['age'] < 10)

order

Agate’s Table.order_by() is the equivalent of R’s order.

R:

newdata <- employees[order(employees$last_name),]

agate:

new_table = employees.order_by('last_name')

merge

Agate’s Table.join() is the equivalent of R’s merge.

R:

joined <- merge(employees, states, by="usps")

agate:

joined = employees.join(states, 'usps')

rbind

Agate’s Table.merge() is the equivalent of R’s rbind.

R:

merged <- rbind(first_year, second_year)

agate:

merged = agate.Table.merge([first_year, second_year])

aggregate

Agate’s Table.group_by() and TableSet.aggregate() can be used to recreate the functionality of R’s aggregate.

R:

aggregates <- aggregate(employees$salary, list(job = employees$job), mean)

agate:

jobs = employees.group_by('job')
aggregates = jobs.aggregate([
    ('mean', agate.Mean('salary'))
])

melt

Agate’s Table.normalize() is the equivalent of R’s melt.

R:

melt(employees, id=c("last_name", "first_name"))

agate:

employees.normalize(['last_name', 'first_name'])

cast

Agate’s Table.denormalize() is the equivalent of R’s cast.

R:

melted = melt(employees, id=c("name"))
casted = cast(melted, name~variable, mean)

agate:

normalized = employees.normalize(['name'])
denormalized = normalized.denormalize('name')

Emulate underscore.js

filter

agate’s Table.where() functions exactly like Underscore’s filter.

new_table = table.where(lambda row: row['state'] == 'Texas')

reject

To simulate Underscore’s reject, simply negate the return value of the function you pass into agate’s Table.where().

new_table = table.where(lambda row: not (row['state'] == 'Texas'))

find

agate’s Table.find() works exactly like Underscore’s find.

row = table.find(lambda row: row['state'].startswith('T'))

any

The Any aggregation works like Underscore’s any.

true_or_false = table.aggregate(agate.Any('salaries', lambda d: d > 100000))

You can also use Table.where() to filter to rows that pass the truth test.

all

The All aggregation works like Underscore’s all.

true_or_false = table.aggregate(agate.All('salaries', lambda d: d > 100000))

Homogenize rows

Fill in missing rows in a series. This can be used, for instance, to add rows for missing years in a time series.

Create rows for missing values

We can insert a default row for each value that is missing in a table from a given sequence of values.

Starting with a table like this, we can fill in rows for all missing years:

year female_count male_count
1997 2 1
2000 4 3
2002 4 5
2003 1 2

key = 'year'
expected_values = (1997, 1998, 1999, 2000, 2001, 2002, 2003)

# Your default row should specify column values not in `key`
default_row = (0, 0)

new_table = table.homogenize(key, expected_values, default_row)

The result will be:

year female_count male_count
1997 2 1
1998 0 0
1999 0 0
2000 4 3
2001 0 0
2002 4 5
2003 1 2

Create dynamic rows based on missing values

We can also specify new row values with a value-generating function:

key = 'year'
expected_values = (1997, 1998, 1999, 2000, 2001, 2002, 2003)

# If default row is a function, it should return a full row
def default_row(missing_value):
    return (missing_value, missing_value - 1997, missing_value - 1997)

new_table = table.homogenize(key, expected_values, default_row)

The new table will be:

year female_count male_count
1997 2 1
1998 1 1
1999 2 2
2000 4 3
2001 4 4
2002 4 5
2003 1 2

Renaming and reordering columns

Rename columns

You can rename the columns in a table by using the Table.rename() method and specifying the new column names as an array or dictionary mapping old column names to new ones.

table = agate.Table(rows, column_names=['a', 'b', 'c'])
new_table = table.rename(column_names=['one', 'two', 'three'])
# or
new_table = table.rename(column_names={'a': 'one', 'b': 'two', 'c': 'three'})

Reorder columns

You can reorder the columns in a table by using the Table.select() method and specifying the column names in the order you want:

new_table = table.select(['3rd_column_name', '1st_column_name', '2nd_column_name'])

Transform

Pivot by a single column

The Table.pivot() method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:

name race gender age
Joe white female 20
Jane asian male 20
Jill black female 20
Jim latino male 25
Julia black female 25
Joan asian female 25

In the very simplest case, this table can be pivoted to count the number of occurrences of values in a column:

transformed = table.pivot('race')

Result:

race pivot
white 1
asian 2
black 2
latino 1

Pivot by multiple columns

You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:

transformed = table.pivot(['race', 'gender'])

Result:

race gender pivot
white female 1
asian male 1
black female 2
latino male 1
asian female 1

For the column version, you would do:

transformed = table.pivot('race', 'gender')

Result:

race male female
white 0 1
asian 1 1
black 0 2
latino 1 0

Pivot to sum

The default pivot aggregation is Count but you can also supply other operations. For example, to aggregate each group by Sum of their ages:

transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))

Result:

race male female
white 0 20
asian 20 25
black 0 45
latino 25 0

Pivot to percent of total

Pivot allows you to apply a Computation to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:

transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))

Result:

race male female
white 0 14.8
asian 14.8 18.4
black 0 33.3
latino 18.4 0

Note: actual computed percentages will be much more precise.

It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.

Denormalize key/value columns into separate columns

It’s common for very large datasets to be distributed in a “normalized” format, such as:

name property value
Jane gender female
Jane race black
Jane age 24
... ... ...

The Table.denormalize() method can be used to transform the table so that each unique property has its own column.

transformed = table.denormalize('name', 'property', 'value')

Result:

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

Normalize separate columns into key/value columns

Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

The Table.normalize() method can be used to transform the table so that all the properties and their values share two columns.

transformed = table.normalize('name', ['gender', 'race', 'age'])

Result:

name property value
Jane gender female
Jane race black
Jane age 24
... ... ...

Locales

agate strives to work equally well for users from all parts of the world. This means properly handling foreign currencies, date formats, etc. To facilitate this, agate makes a hard distinction between your locale and the locale of the data you are working with. This allows you to work seamlessly with data from other countries.

Set your locale

Setting your locale will change how numbers are displayed when you print an agate Table or serialize it to, for example, a CSV file. This works the same as it does for any other Python module. See the locale documentation for details. Changing your locale will not affect how numbers are parsed from the files you are using. To change how data is parsed, see Specify locale of numbers.
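
A minimal sketch using the standard library's locale module, as described above (the exact locale name is platform-dependent):

import locale

# Display numbers using German conventions, e.g. 1.234,5
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')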

Specify locale of numbers

To correctly parse numbers from non-US locales, you must pass a locale parameter to the Number constructor. For example, to parse Dutch numbers (which use a period to separate thousands and a comma to separate fractions):

dutch_numbers = agate.Number(locale='nl_NL')

text_type = agate.Text()

column_names = ['city', 'population']
column_types = [text_type, dutch_numbers]

table = agate.Table.from_csv('dutch_cities.csv', column_names, column_types)

Rank

There are many ways to rank a sequence of values. agate strives to find a balance between simple, intuitive ranking and flexibility when you need it.

Competition rank

The basic rank supported by agate is standard “competition ranking”. In this model the values [3, 4, 4, 5] would be ranked [1, 2, 2, 4]. You can apply competition ranking using the Rank computation:

new_table = table.compute([
    ('rank', agate.Rank('value'))
])

Rank descending

Descending competition ranking is specified using the reverse argument.

new_table = table.compute([
    ('rank', agate.Rank('value', reverse=True))
])

Rank change

You can compute the change from one rank to another by combining the Rank and Change computations:

new_table = table.compute([
    ('rank2014', agate.Rank('value2014')),
    ('rank2015', agate.Rank('value2015'))
])

new_table2 = new_table.compute([
    ('rank_change', agate.Change('rank2014', 'rank2015'))
])

Percentile rank

“Percentile rank” is a bit of a misnomer. Really, this is the percentile in which each value in a column is located. This column can be computed for your data using the PercentileRank computation:

new_table = table.compute([
    ('percentile_rank', agate.PercentileRank('value'))
])

Note that there is no entirely standard method for computing percentiles. The percentiles computed in this manner may not agree precisely with those generated by other software. See the Percentiles class documentation for implementation details.

Charts

Agate offers two kinds of built-in charting: very simple text bar charts and SVG charting via leather. Both are intended for efficiently exploring data, rather than producing publication-ready charts.

Text-based bar chart

agate has a builtin text-based bar-chart generator:

table.limit(10).print_bars('State Name', 'TOTAL', width=80)
State Name         TOTAL
ALABAMA           19,582 ▓░░░░░░░░░░░░░
ALASKA             2,705 ▓░░
ARIZONA           46,743 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
ARKANSAS           7,932 ▓░░░░░
CALIFORNIA        76,639 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
COLORADO          21,485 ▓░░░░░░░░░░░░░░░
CONNECTICUT        4,350 ▓░░░
DELAWARE           1,904 ▓░
DIST. OF COLUMBIA  2,185 ▓░
FLORIDA           59,519 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
                         +-------------+------------+------------+-------------+
                         0          20,000       40,000       60,000      80,000

Text-based histogram

Table.print_bars() can be combined with Table.pivot() or Table.bins() to produce fast histograms:

table.bins('TOTAL', start=0, end=100000).print_bars('TOTAL', width=80)
TOTAL              Count
[0 - 10,000)          30 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
[10,000 - 20,000)     12 ▓░░░░░░░░░░░░░░░░░░░░░░
[20,000 - 30,000)      7 ▓░░░░░░░░░░░░░
[30,000 - 40,000)      1 ▓░░
[40,000 - 50,000)      2 ▓░░░░
[50,000 - 60,000)      1 ▓░░
[60,000 - 70,000)      1 ▓░░
[70,000 - 80,000)      1 ▓░░
[80,000 - 90,000)      0 ▓
[90,000 - 100,000]     0 ▓
                         +-------------+------------+------------+-------------+
                         0.0          7.5         15.0         22.5         30.0

SVG bar chart

table.limit(10).bar_chart('State Name', 'TOTAL', 'docs/images/bar_chart.svg')

SVG column chart

table.limit(10).column_chart('State Name', 'TOTAL', 'docs/images/column_chart.svg')

SVG line chart

by_year_exonerated = table.group_by('exonerated')
counts = by_year_exonerated.aggregate([
    ('count', agate.Count())
])

counts.order_by('exonerated').line_chart('exonerated', 'count', 'docs/images/line_chart.svg')

SVG dots chart

table.scatterplot('exonerated', 'age', 'docs/images/dots_chart.svg')

SVG lattice chart

top_crimes = table.group_by('crime').having([
    ('count', agate.Count())
], lambda t: t['count'] > 100)

by_year = top_crimes.group_by('exonerated')

counts = by_year.aggregate([
    ('count', agate.Count())
])

by_crime = counts.group_by('crime')

by_crime.order_by('exonerated').line_chart('exonerated', 'count', 'docs/images/lattice.svg')

Using matplotlib

If you need to make more complex charts, you can always use agate with matplotlib.

Here is an example of how you might generate a line chart:

import pylab

pylab.plot(table.columns['homeruns'], table.columns['wins'])
pylab.xlabel('Homeruns')
pylab.ylabel('Wins')
pylab.title('How homeruns correlate to wins')

pylab.show()

Lookup

Generate new columns by mapping existing data to common lookup tables.

CPI deflation

The agate-lookup extension adds a lookup method to agate’s Table class.

Starting with a table that looks like this:

year cost
1995 2.0
1997 2.2
1996 2.3
2003 4.0
2007 5.0
2005 6.0

We can map the year column to its annual CPI index in one lookup call.

import agatelookup

agatelookup.patch()

join_year_cpi = table.lookup('year', 'cpi')

The returned table will now have a new column:

year cost cpi
1995 2.0 152.383
1997 2.2 160.525
1996 2.3 156.858
2003 4.0 184.000
2007 5.0 207.344
2005 6.0 195.267

A simple computation tacked on to this lookup can then get the 2015 equivalent values of each cost:

from decimal import Decimal

cpi_2015 = Decimal('216.909')

def cpi_adjust_2015(row):
    return (row['cost'] * (cpi_2015 / row['cpi'])).quantize(Decimal('0.01'))

cost_2015 = join_year_cpi.compute([
    ('cost_2015', agate.Formula(agate.Number(), cpi_adjust_2015))
])

And the final table will look like this:

year cost cpi cost_2015
1995 2.0 152.383 2.85
1997 2.2 160.525 2.97
1996 2.3 156.858 3.18
2003 4.0 184.000 4.72
2007 5.0 207.344 5.23
2005 6.0 195.267 6.66

Extensions

The core agate library is designed to rely on as few dependencies as possible. However, in the real world you’re often going to want to interface with more specialized tools, or with other formats, such as SQL or Excel.

Using extensions

agate supports plugin-style extensions using a monkey-patching pattern. Libraries can be created that add new methods onto Table and TableSet. For example, agate-sql adds the ability to read and write tables from a SQL database:

import agate
import agatesql

agatesql.patch()

# After calling patch, the from_sql and to_sql methods are part of the Table class
table = agate.Table.from_sql('postgresql:///database', 'input_table')
table.to_sql('postgresql:///database', 'output_table')

List of extensions

Here is a list of agate extensions that are known to be actively maintained:

  • agate-sql: Read and write tables from SQL databases.
  • agate-excel: Read tables from XLS and XLSX files.
  • agate-dbf: Read tables from DBF files.
  • agate-remote: Read tables from remote files.
  • agate-stats: Additional statistical methods.
  • agate-lookup: Generate new columns from common lookup tables.

Writing your own extensions

Writing your own extensions is straightforward. Create a function that acts as your “patch” and then dynamically add it to Table or TableSet.

import agate

def new_method(self):
    print('I do something to a Table when you call me.')

agate.Table.new_method = new_method

You can also create new classmethods:

def new_class_method(cls):
    print('I make Tables when you call me.')

agate.Table.new_class_method = classmethod(new_class_method)

These methods can now be called on the Table class in your code:

>>> import agate
>>> import myextension
>>> table = agate.Table(rows, column_names, column_types)
>>> table.new_method()
I do something to a Table when you call me.
>>> agate.Table.new_class_method()
I make Tables when you call me.

The same pattern also works for adding methods to TableSet.

API

Table

The Table object is the most important class in agate. Tables are created by supplying row data, column names and subclasses of DataType to the constructor. Once created, the data in a table can not be changed. This concept is central to agate.

Instead of modifying the data, various methods can be used to create new, derivative tables. For example, the Table.select() method creates a new table with only the specified columns. The Table.where() method creates a new table with only those rows that pass a test. And Table.order_by() creates a sorted table. In all of these cases the output is a new Table and the existing table remains unmodified.

Tables are not themselves iterable, but the columns of the table can be accessed via Table.columns and the rows via Table.rows. Both sequences can be accessed either by numeric index or by name. (In the case of rows, row names are optional.)
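
A brief sketch of both access styles (the state column here is hypothetical):

# Columns can be accessed by name or by numeric index
state_column = table.columns['state']
first_column = table.columns[0]

# Rows work the same way; access by name requires row_names to be set
first_row = table.rows[0]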

agate.Table A dataset consisting of rows and columns.

Properties

agate.Table.columns A MappedSequence with column names for keys and Column instances for values.
agate.Table.column_names A tuple of strings.
agate.Table.column_types A tuple of DataType instances.
agate.Table.rows A MappedSequence with row names for keys (if specified) and Row instances for values.
agate.Table.row_names A tuple of strings, if this table has row names.

Creating

agate.Table.from_csv Create a new table from a CSV.
agate.Table.from_json Create a new table from a JSON file.
agate.Table.from_fixed Create a new table from a fixed-width file and a CSV schema.
agate.Table.from_object Create a new table from a Python object.

Saving

agate.Table.to_csv Write this table to a CSV.
agate.Table.to_json Write this table to a JSON file or file-like object.

Basic processing

agate.Table.distinct Create a new table with only unique rows.
agate.Table.exclude Create a new table without the specified columns.
agate.Table.find Find the first row that passes a test.
agate.Table.limit Create a new table with fewer rows.
agate.Table.order_by Create a new table that is sorted.
agate.Table.select Create a new table with only the specified columns.
agate.Table.where Create a new Table with only those rows that pass a test.

Calculating new data

agate.Table.aggregate Apply one or more Aggregation instances to this table.
agate.Table.compute Create a new table by applying one or more Computation instances to each row.

Advanced processing

agate.Table.bins Generates (approximately) evenly sized bins for the values in a column.
agate.Table.denormalize Create a new table with row values converted into columns.
agate.Table.group_by Create a TableSet with a table for each unique key.
agate.Table.homogenize Fill in missing rows in a series.
agate.Table.join Create a new table by joining two tables on common values.
agate.Table.merge Create a new table from a sequence of similar tables.
agate.Table.normalize Create a new table with columns converted into rows.
agate.Table.pivot Create a new table by grouping the data, aggregating those groups, applying a computation, and then organizing the groups into new rows and columns.
agate.Table.rename Create a copy of this table with different column names or row names.

Previewing

agate.Table.print_bars Print a text-based bar chart based on this table.
agate.Table.print_csv Print this table as a CSV.
agate.Table.print_html Print an HTML version of this table.
agate.Table.print_json Print this table as JSON.
agate.Table.print_structure Print this table’s column names and types as a plain-text table.
agate.Table.print_table Print a text-based view of the data in this table.

Charting

agate.Table.bar_chart Render a bar chart using leather.Chart.
agate.Table.column_chart Render a column chart using leather.Chart.
agate.Table.line_chart Render a line chart using leather.Chart.
agate.Table.scatterplot Render a scatterplot using leather.Chart.

Detailed list

class agate.Table(rows, column_names=None, column_types=None, row_names=None, _is_fork=False)

Bases: object

A dataset consisting of rows and columns. Columns refer to “vertical” slices of data that must all be of the same type. Rows refer to “horizontal” slices of data that may (and usually do) contain mixed types.

The sequence of Column instances are retrieved via the Table.columns property. They may be accessed by either numeric index or by unique column name.

The sequence of Row instances are retrieved via the Table.rows property. They may be accessed by either numeric index or, if specified, unique row names.

Parameters:
  • rows – The data as a sequence of any sequences: tuples, lists, etc. If any row has fewer values than the number of columns, it will be filled out with nulls. No row may have more values than the number of columns.
  • column_names – A sequence of string names for each column or None, in which case column names will be automatically assigned using letter_name().
  • column_types – A sequence of instances of DataType or an instance of TypeTester or None in which case a generic TypeTester will be used. Alternatively, a dictionary with column names as keys and instances of DataType as values to specify some types.
  • row_names – Specifies unique names for each row. This parameter is optional. If specified it may be 1) the name of a single column that contains a unique identifier for each row, 2) a key function that takes a Row and returns a unique identifier or 3) a sequence of unique identifiers of the same length as the sequence of rows. The uniqueness of resulting identifiers is not validated, so be certain the values you provide are truly unique. (See the sketch below.)
  • _is_fork – Used internally to skip certain validation steps when data is propagated from an existing table. When True, rows are assumed to be Row instances, rather than raw data.
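
A brief construction sketch using row_names as the name of a unique column (the column and value are hypothetical):

table = agate.Table(rows, ['city', 'population'], [agate.Text(), agate.Number()], row_names='city')

houston = table.rows['Houston']
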
aggregate(aggregations)

Apply one or more Aggregation instances to this table.

Parameters:aggregations – A single Aggregation instance or a sequence of tuples in the format (name, aggregation), where each aggregation is an instance of Aggregation.
Returns:If the input was a single Aggregation then a single result will be returned. If it was a sequence then an OrderedDict of results will be returned.
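
A short sketch of both return shapes (the salary column is hypothetical):

# A single Aggregation returns a single value
total = table.aggregate(agate.Sum('salary'))

# A sequence of (name, aggregation) pairs returns an OrderedDict of results
results = table.aggregate([
    ('salary_total', agate.Sum('salary')),
    ('salary_mean', agate.Mean('salary'))
])
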
bar_chart(label=0, value=1, path=None, width=None, height=None)

Render a bar chart using leather.Chart.

Parameters:
  • label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
  • value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
bins(column_name, count=10, start=None, end=None)

Generates (approximately) evenly sized bins for the values in a column. Bins may not be perfectly even if the spread of the data does not divide evenly, but all values will always be included in some bin.

The resulting table will have two columns. The first will have the same name as the specified column, but will be type Text. The second will be named count and will be of type Number.

Parameters:
  • column_name – The name of the column to bin. Must be of type Number
  • count – The number of bins to create. If not specified then each value will be counted as its own bin.
  • start – The minimum value to start the bins at. If not specified the minimum value in the column will be used.
  • end – The maximum value to end the bins at. If not specified the maximum value in the column will be used.
Returns:

A new Table.
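
For example (the salary column is hypothetical):

binned = table.bins('salary', count=10, start=0, end=100000)
binned.print_table()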

column_chart(label=0, value=1, path=None, width=None, height=None)

Render a column chart using leather.Chart.

Parameters:
  • label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
  • value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
column_names

A tuple of strings.

column_types

A tuple of DataType instances.

columns

A MappedSequence with column names for keys and Column instances for values.

compute(computations, replace=False)

Create a new table by applying one or more Computation instances to each row.

Parameters:
  • computations – A sequence of pairs of new column names and Computation instances.
  • replace – If True then new column names can match existing names, and those columns will be replaced with the computed data.
Returns:

A new Table.

denormalize(key=None, property_column='property', value_column='value', default_value=<object object>, column_types=None)

Create a new table with row values converted into columns.

For example:

name property value
Jane gender female
Jane race black
Jane age 24
... ... ...

Can be denormalized so that each unique value in the property column becomes a new column, with the value column supplying its values.

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

If one or more keys are specified then the resulting table will automatically have row_names set to those keys.

This is the opposite of Table.normalize().

Parameters:
  • key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them. Or, None if there are no key columns.
  • property_column – The column whose values should become column names in the new table.
  • value_column – The column whose values should become the values of the property columns in the new table.
  • default_value – Value to be used for missing values in the denormalized table. If not specified Decimal(0) will be used for aggregations that return Number data and None will be used for all others.
  • column_types – A sequence of column types with length equal to the number of unique values in property_column, or an instance of TypeTester. Defaults to a generic TypeTester.
Returns:

A new Table.
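
A sketch reproducing the example above, assuming table holds the normalized data:

wide = table.denormalize('name', 'property', 'value')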

distinct(key=None)

Create a new table with only unique rows.

Parameters:key – Either the name of a single column to use to identify unique rows, a sequence of such column names, a function that takes a row and returns a value to identify unique rows, or None, in which case the entire row will be checked for uniqueness.
Returns:A new Table.
exclude(key)

Create a new table without the specified columns.

Parameters:key – Either the name of a single column to exclude or a sequence of such names.
Returns:A new Table.
find(test)

Find the first row that passes a test.

Parameters:test (function) – A function that takes a Row and returns True if it matches.
Returns:A single Row if found, or None.
classmethod from_csv(path, column_names=None, column_types=None, row_names=None, skip_lines=0, header=True, sniff_limit=0, encoding='utf-8', **kwargs)

Create a new table from a CSV.

This method uses agate’s builtin CSV reader, which supplies encoding support for both Python 2 and Python 3.

kwargs will be passed through to the CSV reader.

Parameters:
  • path – Filepath or file-like object from which to read CSV data. If a file-like object is specified, it must be seekable. If using Python 2, the file should be opened in binary mode (rb).
  • column_names – See Table.__init__().
  • column_types – See Table.__init__().
  • row_names – See Table.__init__().
  • skip_lines – The number of lines to skip from the top of the file.
  • header – If True, the first row of the CSV is assumed to contain column names. If header and column_names are both specified then a row will be skipped, but column_names will be used.
  • sniff_limit – Limit CSV dialect sniffing to the specified number of bytes. Set to None to sniff the entire file. Defaults to 0 (no sniffing).
  • encoding – Character encoding of the CSV file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
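
For example, a sketch reading a hypothetical Latin-1 file whose first two lines are junk:

table = agate.Table.from_csv('data.csv', skip_lines=2, encoding='latin-1')
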
classmethod from_fixed(path, schema_path, column_names=<object object>, column_types=None, row_names=None, encoding='utf-8', schema_encoding='utf-8')

Create a new table from a fixed-width file and a CSV schema.

Schemas must be in the “ffs” format. There is a repository of such schemas maintained at wireservice/ffs.

Parameters:
  • path – File path or file-like object from which to read fixed-width data.
  • schema_path – File path or file-like object from which to read schema (CSV) data.
  • column_names – By default, these will be parsed from the schema. For alternatives, see Table.__init__().
  • column_types – See Table.__init__().
  • row_names – See Table.__init__().
  • encoding – Character encoding of the fixed-width file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
  • schema_encoding – Character encoding of the schema file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
classmethod from_json(path, row_names=None, key=None, newline=False, column_types=None, **kwargs)

Create a new table from a JSON file.

Once the JSON has been deserialized, the resulting Python object is passed to Table.from_object().

If the file contains a top-level dictionary you may specify what property contains the row list using the key parameter.

kwargs will be passed through to json.load().

Parameters:
  • path – Filepath or file-like object from which to read JSON data.
  • row_names – See Table.__init__().
  • key – The key of the top-level dictionary that contains a list of row arrays.
  • newline – If True then the file will be parsed as “newline-delimited JSON”.
  • column_types – See Table.__init__().
classmethod from_object(obj, row_names=None, column_types=None)

Create a new table from a Python object.

The object should be a list containing a dictionary for each “row”. Nested objects or lists will also be parsed. For example, this object:

{
    'one': {
        'a': 1,
        'b': 2,
        'c': 3
    },
    'two': [4, 5, 6],
    'three': 'd'
}

Would generate these columns and values:

{
    'one/a': 1,
    'one/b': 2,
    'one/c': 3,
    'two.0': 4,
    'two.1': 5,
    'two.2': 6,
    'three': 'd'
}

Column names and types will be inferred from the data.

Not all rows are required to have the same keys. Missing elements will be filled in with null values.

Parameters:
  • obj – The Python object from which to create the table: a list containing a dictionary for each row.
  • row_names – See Table.__init__().
  • column_types – See Table.__init__().
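
A sketch using the nested object shown above, wrapped in a list since each element represents one row:

obj = [{
    'one': {'a': 1, 'b': 2, 'c': 3},
    'two': [4, 5, 6],
    'three': 'd'
}]

table = agate.Table.from_object(obj)
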
group_by(key, key_name=None, key_type=None)

Create a TableSet with a table for each unique key.

Note that group names will always be coerced to a string, regardless of the format of the input column.

Parameters:
  • key – Either the name of a column from this table to group by, or a function that takes a row and returns a value to group by.
  • key_name – A name that describes the grouped properties. Defaults to the column name that was grouped on or “group” if grouping with a key function. See TableSet for more.
  • key_type – An instance of any subclass of DataType. If not provided it will default to Text.
Returns:

A TableSet mapping where the keys are unique values from the key and the values are new Table instances containing the grouped rows.
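
For example, a sketch grouping on a hypothetical race column and inspecting the result:

by_race = table.group_by('race')

for key, group in by_race.items():
    print(key, len(group.rows))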

homogenize(key, compare_values, default_row=None)

Fill in missing rows in a series.

This can be used, for instance, to add rows for missing years in a time series.

Missing rows are found by comparing the values in the key columns with those provided as compare_values.

Values not found in the table will be used to generate new rows with the given default_row.

default_row should be an array of values or an array-generating function. If not specified, the new rows will have None in all columns not specified in key.

If default_row is an array of values, its length should be row length minus the number of column names provided in the key.

If it is an array-generating function, the function should take an array of missing values for each new row and output a full row including those values.

Parameters:
  • key – Either a column name or a sequence of such names.
  • compare_values – Either an array of column values if key is a single column name or a sequence of arrays of values if key is a sequence of names. It can also be a generator that yields either of the two. A row is created for each value or list of values not found in the rows of the table.
  • default_row – An array of values or a function to generate new rows. The length of the input array should be equal to the row length minus the number of key columns. The length of the array generated by the function should be the full row length.
Returns:

A new Table.
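
For example, a sketch assuming a hypothetical two-column table of year and count, where default_row supplies the one non-key value:

filled = table.homogenize('year', [2013, 2014, 2015, 2016], default_row=[0])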

join(right_table, left_key=None, right_key=None, inner=False, full_outer=False, require_match=False, columns=None)

Create a new table by joining two tables on common values. This method implements most varieties of SQL join, in addition to some unique features.

If left_key and right_key are both None then this method will perform a “sequential join”, which is to say it will join on row number. The inner and full_outer arguments will determine whether dangling left-hand and right-hand rows are included, respectively.

If left_key is specified, then a “left outer join” will be performed. This will combine columns from the right_table anywhere that left_key and right_key are equal. Unmatched rows from the left table will be included with the right-hand columns set to None.

If inner is True then an “inner join” will be performed. Unmatched rows from either table will be left out.

If full_outer is True then a “full outer join” will be performed. Unmatched rows from both tables will be included, with the columns in the other table set to None.

In all cases, if right_key is None then left_key will be used for both tables.

If left_key and right_key are column names, the right-hand identifier column will not be included in the output table.

If require_match is True unmatched rows will raise an exception. This is like an “inner join” except any row that doesn’t have a match will raise an exception instead of being dropped. This is useful for enforcing expectations about datasets that should match.

Column names from the right table which also exist in this table will be suffixed with “2” in the new table.

A subset of columns from the right-hand table can be included in the joined table using the columns argument.

Parameters:
  • right_table – The “right” table to join to.
  • left_key – Either the name of a column from this table to join on, the index of a column, a sequence of such column identifiers, a function that takes a row and returns a value to join on, or None, in which case the tables will be joined on row number.
  • right_key – Either the name of a column from right_table to join on, the index of a column, a sequence of such column identifiers, or a function that takes a row and returns a value to join on. If None then left_key will be used for both. If left_key is None then this value is ignored.
  • inner – Perform a SQL-style “inner join” instead of a left outer join. Rows which have no match for left_key will not be included in the output table.
  • full_outer – Perform a SQL-style “full outer” join rather than a left or a right. May not be used in combination with inner.
  • require_match – If true, an exception will be raised if there is a left_key with no matching right_key.
  • columns – A sequence of column names from right_table to include in the final output table. Defaults to all columns not in right_key. Ignored when full_outer is True.
Returns:

A new Table.
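
For example, a sketch joining two hypothetical tables, first as a left outer join and then as an inner join:

joined = students.join(schools, 'school_id', 'id')
matched = students.join(schools, 'school_id', 'id', inner=True)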

limit(start_or_stop=None, stop=None, step=None)

Create a new table with fewer rows.

See also: Python’s builtin slice().

Parameters:
  • start_or_stop – If the only argument, then how many rows to include, otherwise, the index of the first row to include.
  • stop – The index of the last row to include.
  • step – The size of the jump between rows to include. (step=2 will return every other row.)
Returns:

A new Table.

line_chart(x=0, y=1, path=None, width=None, height=None)

Render a line chart using leather.Chart.

Parameters:
  • x – The name or index of a column to plot as the x-axis. Defaults to the first column in the table.
  • y – The name or index of a column to plot as the y-axis. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
classmethod merge(tables, row_names=None, column_names=None)

Create a new table from a sequence of similar tables.

This method will not carry over row names from the merged tables, but new row names can be specified with the row_names argument.

It is possible to limit the columns included in the new Table with the column_names argument. For example, to only include columns from a specific table, set column_names equal to table.column_names.

Parameters:
  • tables – A sequence of Table instances.
  • row_names – See Table for the usage of this parameter.
  • column_names – A sequence of column names to include in the new Table. If not specified, all distinct column names from tables are included.
Returns:

A new Table.

normalize(key, properties, property_column='property', value_column='value', column_types=None)

Create a new table with columns converted into row values.

For example:

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

can be normalized on columns ‘gender’, ‘race’ and ‘age’:

name property value
Jane gender female
Jane race black
Jane age 24
... ... ...

This is the opposite of Table.denormalize().

Parameters:
  • key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them.
  • properties – A column name or a sequence of column names that should be converted to properties in the new table.
  • property_column – The name to use for the column containing the property names.
  • value_column – The name to use for the column containing the property values.
  • column_types – A sequence of two column types for the property and value column in that order or an instance of TypeTester. Defaults to a generic TypeTester.
Returns:

A new Table.
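
A sketch reproducing the example above, assuming table holds the wide data:

long_table = table.normalize('name', ['gender', 'race', 'age'])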

order_by(key, reverse=False)

Create a new table that is sorted.

Parameters:
  • key – Either the name of a single column to sort by, a sequence of such names, or a function that takes a row and returns a value to sort by.
  • reverse – If True then sort in reverse (typically, descending) order.
Returns:

A new Table.

pivot(key=None, pivot=None, aggregation=None, computation=None, default_value=<object object>, key_name=None)

Create a new table by grouping the data, aggregating those groups, applying a computation, and then organizing the groups into new rows and columns.

This is sometimes called a “crosstab”.

name race gender
Joe white male
Jane black female
Josh black male
Jim asian female

This table can be pivoted with key equal to “race” and pivot equal to “gender”. The default aggregation is Count. This would result in the following table.

race male female
white 1 0
black 1 1
asian 0 1

If one or more keys are specified then the resulting table will automatically have row_names set to those keys.

See also the related method Table.denormalize().

Parameters:
  • key – Either the name of a column from this table to group by, a sequence of such column names, a function that takes a row and returns a value to group by, or None, in which case there will be only a single row in the output table.
  • pivot – A column name whose unique values will become columns in the new table, or None in which case there will be a single value column in the output table.
  • aggregation

    An instance of an Aggregation to perform on each group of data in the pivot table. (Each cell is the result of an aggregation of the grouped data.)

    If not specified this defaults to Count with no arguments.

  • computation

    An optional Computation instance to be applied to the aggregated sequence of values before they are transposed into the pivot table.

    Use the class name of the aggregation as your column name argument when constructing your computation. (This is “Count” if using the default value for aggregation.)

  • default_value – Value to be used for missing values in the pivot table. Defaults to Decimal(0). If performing non-mathematical aggregations you may wish to set this to None.
  • key_name – A name for the key column in the output table. This is most useful when the provided key is a function. This argument is not valid when key is a sequence.
Returns:

A new Table.
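
A sketch reproducing the example above with the default Count aggregation, assuming table holds the data shown:

pivoted = table.pivot('race', 'gender')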

print_bars(label_column_name='group', value_column_name='Count', domain=None, width=120, output=<open file '<stdout>', mode 'w'>, printable=False)

Print a text-based bar chart based on this table.

Parameters:
  • label_column_name – The column containing the label values. Defaults to group, which is the default output of Table.pivot() or Table.bins().
  • value_column_name – The column containing the bar values. Defaults to Count, which is the default output of Table.pivot() or Table.bins().
  • domain – A 2-tuple containing the minimum and maximum values for the chart’s x-axis. The domain must be large enough to contain all values in the column.
  • width – The width, in characters, to use for the bar chart. Defaults to 120.
  • output – A file-like object to print to. Defaults to sys.stdout.
  • printable – If true, only printable characters will be output.
print_csv(**kwargs)

Print this table as a CSV.

This is the same as passing sys.stdout to Table.to_csv().

kwargs will be passed on to Table.to_csv().

print_html(max_rows=20, max_columns=6, output=<open file '<stdout>', mode 'w'>, max_column_width=20, locale=None)

Print an HTML version of this table.

Parameters:
  • max_rows – The maximum number of rows to display before truncating the data. This defaults to 20 to prevent accidental printing of the entire table. Pass None to disable the limit.
  • max_columns – The maximum number of columns to display before truncating the data. This defaults to 6 to prevent wrapping in most cases. Pass None to disable the limit.
  • output – A file-like object to print to. Defaults to sys.stdout, unless running in Jupyter. (See above.)
  • max_column_width – Truncate all columns to at most this width. The remainder will be replaced with an ellipsis.
  • locale – Provide a locale you would like to be used to format the output. By default it will use the system’s setting.
print_json(**kwargs)

Print this table as JSON.

This is the same as passing sys.stdout to Table.to_json().

kwargs will be passed on to Table.to_json().

print_structure(output=<open file '<stdout>', mode 'w'>, max_rows=None)

Print this table’s column names and types as a plain-text table.

Parameters:output – The output to print to.
print_table(max_rows=20, max_columns=6, output=<open file '<stdout>', mode 'w'>, max_column_width=20, locale=None, max_precision=3)

Print a text-based view of the data in this table.

The output of this method is GitHub Flavored Markdown (GFM) compatible.

Parameters:
  • max_rows – The maximum number of rows to display before truncating the data. This defaults to 20 to prevent accidental printing of the entire table. Pass None to disable the limit.
  • max_columns – The maximum number of columns to display before truncating the data. This defaults to 6 to prevent wrapping in most cases. Pass None to disable the limit.
  • output – A file-like object to print to.
  • max_column_width – Truncate all columns to at most this width. The remainder will be replaced with an ellipsis.
  • locale – Provide a locale you would like to be used to format the output. By default it will use the system’s setting.
  • max_precision – Limits the maximum precision displayed for Number types. Numbers with lesser precision won’t be affected. This defaults to 3. Pass None to disable the limit.

rename(column_names=None, row_names=None, slug_columns=False, slug_rows=False, **kwargs)

Create a copy of this table with different column names or row names.

By enabling slug_columns or slug_rows and not specifying new names you may slugify the table’s existing names.

kwargs will be passed to the slugify method in python-slugify. See: https://github.com/un33k/python-slugify

Parameters:
  • column_names – New column names for the renamed table. May be either an array or a dictionary mapping existing column names to new names. If not specified, will use this table’s existing column names.
  • row_names – New row names for the renamed table. May be either an array or a dictionary mapping existing row names to new names. If not specified, will use this table’s existing row names.
  • slug_columns – If True, column names will be converted to slugs and duplicate names will have unique identifiers appended.
  • slug_rows – If True, row names will be converted to slugs and duplicate names will have unique identifiers appended.
row_names

A tuple of strings, if this table has row names.

If this table does not have row names, then None.

rows

A MappedSequence with row names for keys (if specified) and Row instances for values.

scatterplot(x=0, y=1, path=None, width=None, height=None)

Render a scatterplot using leather.Chart.

Parameters:
  • x – The name or index of a column to plot as the x-axis. Defaults to the first column in the table.
  • y – The name or index of a column to plot as the y-axis. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
select(key)

Create a new table with only the specified columns.

Parameters:key – Either the name of a single column to include or a sequence of such names.
Returns:A new Table.
to_csv(path, **kwargs)

Write this table to a CSV. This method uses agate’s builtin CSV writer, which supports unicode on both Python 2 and Python 3.

kwargs will be passed through to the CSV writer.

Parameters:path – Filepath or file-like object to write to.
to_json(path, key=None, newline=False, indent=None, **kwargs)

Write this table to a JSON file or file-like object.

kwargs will be passed through to the JSON encoder.

Parameters:
  • path – File path or file-like object to write to.
  • key – If specified, JSON will be output as a hash instead of a list. May be either the name of a column from this table containing unique values or a function that takes a row and returns a unique value.
  • newline – If True, output will be in the form of “newline-delimited JSON”.
  • indent – If specified, the number of spaces to indent the JSON for formatting.
where(test)

Create a new Table with only those rows that pass a test.

Parameters:test (function) – A function that takes a Row and returns True if it should be included in the new Table.
Returns:A new Table.
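
For example, a sketch filtering on a hypothetical age column, with an explicit null check since nulls cannot be compared:

adults = table.where(lambda row: row['age'] is not None and row['age'] >= 18)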

TableSet

The TableSet class collects a set of related tables in a single data structure. The most common way of creating a TableSet is using the Table.group_by() method, which is similar to SQL’s GROUP BY keyword. The resulting set of tables will all have an identical column structure.

TableSet functions as a dictionary. Individual tables in the set can be accessed by using their name as a key. If the table set was created using Table.group_by() then the names of the tables will be the grouping factors found in the original data.

TableSet replicates the majority of the features of Table. When methods such as TableSet.select(), TableSet.where() or TableSet.order_by() are used, the operation is applied to each table in the set and the result is a new TableSet instance made up of entirely new Table instances.

TableSet instances can also contain other TableSets. This means you can chain calls to Table.group_by() and TableSet.group_by() and end up with data grouped across multiple dimensions. TableSet.aggregate() on nested TableSets will then group across multiple dimensions.

agate.TableSet A group of named tables with identical column definitions.

Properties

agate.TableSet.key_name Get the name of the key this TableSet is grouped by.
agate.TableSet.key_type Get the DataType this TableSet is grouped by.
agate.TableSet.column_types Get an ordered list of this TableSet’s column types.
agate.TableSet.column_names Get an ordered list of this TableSet’s column names.

Creating

agate.TableSet.from_csv Create a new TableSet from a directory of CSVs.
agate.TableSet.from_json Create a new TableSet from a directory of JSON files or a single JSON object with key value (Table key and list of row objects) pairs for each Table.

Saving

agate.TableSet.to_csv Write each table in this set to a separate CSV in a given directory.
agate.TableSet.to_json Write TableSet to either a set of JSON files for each table or a single nested JSON file.

Processing

agate.TableSet.aggregate Aggregate data from the tables in this set by performing some set of column operations on the groups and coalescing the results into a new Table.
agate.TableSet.having Create a new TableSet with only those tables that pass a test.
agate.TableSet.merge Convert this TableSet into a single table.

Previewing

agate.TableSet.print_structure Print the keys and row counts of each table in the tableset.

Charting

agate.TableSet.bar_chart Render a lattice/grid of bar charts using leather.Lattice.
agate.TableSet.column_chart Render a lattice/grid of column charts using leather.Lattice.
agate.TableSet.line_chart Render a lattice/grid of line charts using leather.Lattice.
agate.TableSet.scatterplot Render a lattice/grid of scatterplots using leather.Lattice.

Table Proxy Methods

agate.TableSet.bins Calls Table.bins() on each table in the TableSet.
agate.TableSet.compute Calls Table.compute() on each table in the TableSet.
agate.TableSet.denormalize Calls Table.denormalize() on each table in the TableSet.
agate.TableSet.distinct Calls Table.distinct() on each table in the TableSet.
agate.TableSet.exclude Calls Table.exclude() on each table in the TableSet.
agate.TableSet.find Calls Table.find() on each table in the TableSet.
agate.TableSet.group_by Calls Table.group_by() on each table in the TableSet.
agate.TableSet.homogenize Calls Table.homogenize() on each table in the TableSet.
agate.TableSet.join Calls Table.join() on each table in the TableSet.
agate.TableSet.limit Calls Table.limit() on each table in the TableSet.
agate.TableSet.normalize Calls Table.normalize() on each table in the TableSet.
agate.TableSet.order_by Calls Table.order_by() on each table in the TableSet.
agate.TableSet.pivot Calls Table.pivot() on each table in the TableSet.
agate.TableSet.select Calls Table.select() on each table in the TableSet.
agate.TableSet.where Calls Table.where() on each table in the TableSet.

Detailed list

class agate.TableSet(tables, keys, key_name='group', key_type=None, _is_fork=False)

Bases: agate.mapped_sequence.MappedSequence

A group of named tables with identical column definitions. Supports (almost) all the same operations as Table. When executed on a TableSet, any operation that would have returned a new Table instead returns a new TableSet. Any operation that would have returned a single value instead returns a dictionary of values.

TableSet is implemented as a subclass of MappedSequence.

Parameters:
  • tables – A sequence of Table instances.
  • keys – A sequence of keys corresponding to the tables. These may be any type except int.
  • key_name – A name that describes the grouping properties. Used as the column header when the groups are aggregated. Defaults to the column name that was grouped on.
  • key_type – An instance of some subclass of DataType. If not provided it will default to Text.
  • _is_fork – Used internally to skip certain validation steps when data is propagated from an existing TableSet.
aggregate(aggregations)

Aggregate data from the tables in this set by performing some set of column operations on the groups and coalescing the results into a new Table.

aggregations must be a sequence of tuples, where each has two parts: a new_column_name and an Aggregation instance.

The resulting table will have the keys from this TableSet (and any nested TableSets) set as its row_names. See Table.__init__() for more details.

Parameters:aggregations – A list of tuples in the format (new_column_name, aggregation), where each aggregation is an instance of Aggregation.
Returns:A new Table.
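
For example, a sketch that mirrors SQL’s GROUP BY with aggregates (the column names are hypothetical):

by_race = table.group_by('race')

summary = by_race.aggregate([
    ('count', agate.Count()),
    ('median_age', agate.Median('age'))
])
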
bar_chart(label=0, value=1, path=None, width=None, height=None)

Render a lattice/grid of bar charts using leather.Lattice.

Parameters:
  • label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
  • value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
bins(*args, **kwargs)

Calls Table.bins() on each table in the TableSet.

column_chart(label=0, value=1, path=None, width=None, height=None)

Render a lattice/grid of column charts using leather.Lattice.

Parameters:
  • label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
  • value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
column_names

Get an ordered list of this TableSet’s column names.

Returns:A tuple of strings.
column_types

Get an ordered list of this TableSet’s column types.

Returns:A tuple of DataType instances.
compute(*args, **kwargs)

Calls Table.compute() on each table in the TableSet.

count(value) → integer -- return number of occurrences of value
denormalize(*args, **kwargs)

Calls Table.denormalize() on each table in the TableSet.

dict()

Retrieve the contents of this sequence as a collections.OrderedDict.

distinct(*args, **kwargs)

Calls Table.distinct() on each table in the TableSet.

exclude(*args, **kwargs)

Calls Table.exclude() on each table in the TableSet.

find(*args, **kwargs)

Calls Table.find() on each table in the TableSet.

classmethod from_csv(dir_path, column_names=None, column_types=None, row_names=None, header=True, **kwargs)

Create a new TableSet from a directory of CSVs.

See Table.from_csv() for additional details.

Parameters:
  • dir_path – Path to a directory full of CSV files. All CSV files in this directory will be loaded.
  • column_names – See Table.__init__().
  • column_types – See Table.__init__().
  • row_names – See Table.__init__().
  • header – See Table.from_csv().
classmethod from_json(path, column_names=None, column_types=None, keys=None, **kwargs)

Create a new TableSet from a directory of JSON files or a single JSON object with key value (Table key and list of row objects) pairs for each Table.

See Table.from_json() for additional details.

Parameters:
  • path – Path to a directory containing JSON files or filepath/file-like object of nested JSON file.
  • keys – A list of keys of the top-level dictionaries for each file. If specified, length must be equal to number of JSON files in path.
  • column_types – See Table.__init__().
get(key, default=None)

Equivalent to collections.OrderedDict.get().

group_by(*args, **kwargs)

Calls Table.group_by() on each table in the TableSet.

having(aggregations, test)

Create a new TableSet with only those tables that pass a test.

This works by applying a sequence of Aggregation instances to each table. The resulting dictionary of properties is then passed to the test function.

This method does not modify the underlying tables in any way.

Parameters:
  • aggregations – A list of tuples in the format (name, aggregation), where each aggregation is an instance of Aggregation.
  • test (function) – A function that takes a dictionary of aggregated properties and returns True if it should be included in the new TableSet.
Returns:

A new TableSet.
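
For example, a sketch keeping only groups with at least ten rows, assuming by_race was created with Table.group_by():

populous = by_race.having([
    ('count', agate.Count())
], lambda t: t['count'] >= 10)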

homogenize(*args, **kwargs)

Calls Table.homogenize() on each table in the TableSet.

index(value) → integer -- return first index of value.

Raises ValueError if the value is not present.

items()

Equivalent to collections.OrderedDict.items().

join(*args, **kwargs)

Calls Table.join() on each table in the TableSet.

key_name

Get the name of the key this TableSet is grouped by. (If created using Table.group_by() then this is the original column name.)

key_type

Get the DataType this TableSet is grouped by. (If created using Table.group_by() then this is the original column type.)

keys()

Equivalent to collections.OrderedDict.keys().

limit(*args, **kwargs)

Calls Table.limit() on each table in the TableSet.

line_chart(x=0, y=1, path=None, width=None, height=None)

Render a lattice/grid of line charts using leather.Lattice.

Parameters:
  • x – The name or index of a column to plot as the x axis of the chart. Defaults to the first column in the table.
  • y – The name or index of a column to plot as the y axis of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
merge(groups=None, group_name=None, group_type=None)

Convert this TableSet into a single table. This is the inverse of Table.group_by().

Any row_names set on the merged tables will be lost in this process.

Parameters:
  • groups – A list of grouping factors to add to merged rows in a new column. If specified, it should have exactly one element per Table in the TableSet. If not specified or None, the grouping factor will be the name of the Row’s original Table.
  • group_name – This will be the column name of the grouping factors. If None, defaults to the TableSet.key_name.
  • group_type – This will be the column type of the grouping factors. If None, defaults to the TableSet.key_type.
Returns:

A new Table.

normalize(*args, **kwargs)

Calls Table.normalize() on each table in the TableSet.

order_by(*args, **kwargs)

Calls Table.order_by() on each table in the TableSet.

pivot(*args, **kwargs)

Calls Table.pivot() on each table in the TableSet.

print_structure(max_rows=20, output=<open file '<stdout>', mode 'w'>)

Print the keys and row counts of each table in the tableset.

Parameters:
  • max_rows – The maximum number of rows to display before truncating the data. Defaults to 20.
  • output – The output used to print the structure of the Table.
Returns:

None

scatterplot(x=0, y=1, path=None, width=None, height=None)

Render a lattice/grid of scatterplots using leather.Lattice.

Parameters:
  • x – The name or index of a column to plot as the x axis of the chart. Defaults to the first column in the table.
  • y – The name or index of a column to plot as the y axis of the chart. Defaults to the second column in the table.
  • path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
  • width – The width of the output SVG.
  • height – The height of the output SVG.
select(*args, **kwargs)

Calls Table.select() on each table in the TableSet.

to_csv(dir_path, **kwargs)

Write each table in this set to a separate CSV in a given directory.

See Table.to_csv() for additional details.

Parameters:dir_path – Path to the directory to write the CSV files to.
to_json(path, nested=False, indent=None, **kwargs)

Write TableSet to either a set of JSON files for each table or a single nested JSON file.

See Table.to_json() for additional details.

Parameters:
  • path – Path to the directory to write the JSON file(s) to. If nested is True, this should be a file path or file-like object to write to.
  • nested – If True, the output will be a single nested JSON file with each Table’s key paired with a list of row objects. Otherwise, the output will be a set of files for each table. Defaults to False.
  • indent – See Table.to_json().
values()

Equivalent to collections.OrderedDict.values().

where(*args, **kwargs)

Calls Table.where() on each table in the TableSet.

Columns and rows

agate.MappedSequence A generic container for immutable data that can be accessed either by numeric index or by key.
agate.Column Proxy access to column data.
agate.Row A row of data.
class agate.MappedSequence(values, keys=None)

Bases: _abcoll.Sequence

A generic container for immutable data that can be accessed either by numeric index or by key. This is similar to a collections.OrderedDict except that the keys are optional and iteration over it returns the values instead of keys.

This is the base class for both Column and Row.

Parameters:
  • values – A sequence of values.
  • keys – A sequence of keys.
dict()

Retrieve the contents of this sequence as a collections.OrderedDict.

get(key, default=None)

Equivalent to collections.OrderedDict.get().

items()

Equivalent to collections.OrderedDict.items().

keys()

Equivalent to collections.OrderedDict.keys().

values()

Equivalent to collections.OrderedDict.values().

class agate.Column(index, name, data_type, rows, row_names=None)

Bases: agate.mapped_sequence.MappedSequence

Proxy access to column data. Instances of Column should not be constructed directly. They are created by Table instances and are unique to them.

Columns are implemented as a subclass of MappedSequence. They deviate from the underlying implementation in that loading of their data is deferred until it is needed.

Parameters:
  • name – The name of this column.
  • data_type – An instance of DataType.
  • rows – A MappedSequence that contains the Row instances containing the data for this column.
  • row_names – An optional list of row names (keys) for this column.
data_type

This column’s data type.

index

This column’s index.

name

This column’s name.

values()

Get the values in this column, as a tuple.

values_distinct()

Get the distinct values in this column, as a tuple.

values_sorted()

Get the values in this column sorted.

values_without_nulls()

Get the values in this column with any null values removed.

values_without_nulls_sorted()

Get the values in this column with any null values removed and sorted.

class agate.Row(values, keys=None)

Bases: agate.mapped_sequence.MappedSequence

A row of data. Values within a row can be accessed by column name or column index. Rows are immutable and may be shared between Table instances.

Currently row instances are a no-op subclass of MappedSequence. They are being maintained in this fashion in order to support future features.

Data types

Data types define how data should be imported during the creation of a Table.

If column types are not explicitly specified when a Table is created, agate will attempt to guess them. The TypeTester class can be used to control how types are guessed.

agate.DataType Specifies how values should be parsed when creating a Table.

Supported types

agate.Text Data representing text.
agate.Number Data representing numbers.
agate.Boolean Data representing true and false.
agate.Date Data representing dates alone.
agate.DateTime Data representing dates with times.
agate.TimeDelta Data representing the interval between two dates and/or times.

Detailed list

class agate.DataType(null_values=('', 'na', 'n/a', 'none', 'null', '.'))

Bases: object

Specifies how values should be parsed when creating a Table.

Parameters:null_values – A sequence of values which should be cast to None when encountered by this data type.
cast(d)

Coerce a given string value into this column’s data type.

csvify(d)

Format a given native value for CSV serialization.

jsonify(d)

Format a given native value for JSON serialization.

test(d)

Test, for purposes of type inference, if a value could possibly be coerced to this data type.

This is really just a thin wrapper around DataType.cast().

class agate.Text(cast_nulls=True, **kwargs)

Bases: agate.data_types.base.DataType

Data representing text.

Parameters:cast_nulls – If True, values in DEFAULT_NULL_VALUES will be converted to None. Disable to retain them as strings.
cast(d)

Cast a single value to unicode() (str() in Python 3).

Parameters:d – A value to cast.
Returns:unicode() (str() in Python 3) or None
class agate.Number(locale='en_US', group_symbol=None, decimal_symbol=None, currency_symbols=[u'\u060b', u'$', u'\u0192', u'\u17db', u'\xa5', u'\u20a1', u'\u20b1', u'\xa3', u'\u20ac', u'\xa2', u'\ufdfc', u'\u20aa', u'\u20a9', u'\u20ad', u'\u20ae', u'\u20a6', u'\u0e3f', u'\u20a4', u'\u20ab'], **kwargs)

Bases: agate.data_types.base.DataType

Data representing numbers.

Parameters:
  • locale – A locale specification such as en_US or de_DE to use for parsing formatted numbers.
  • group_symbol – A grouping symbol used in the numbers. Overrides the value provided by the specified locale.
  • decimal_symbol – A decimal separator symbol used in the numbers. Overrides the value provided by the specified locale.
  • currency_symbols – A sequence of currency symbols to strip from numbers.
cast(d)

Cast a single value to a decimal.Decimal.

Returns:decimal.Decimal or None.
class agate.Boolean(true_values=('yes', 'y', 'true', 't', '1'), false_values=('no', 'n', 'false', 'f', '0'), null_values=('', 'na', 'n/a', 'none', 'null', '.'))

Bases: agate.data_types.base.DataType

Data representing true and false.

Note that by default numerical 1 and 0 are considered valid boolean values, but other numbers are not.

Parameters:
  • true_values – A sequence of values which should be cast to True when encountered with this type.
  • false_values – A sequence of values which should be cast to False when encountered with this type.
cast(d)

Cast a single value to bool.

Parameters:d – A value to cast.
Returns:bool or None.
class agate.Date(date_format=None, **kwargs)

Bases: agate.data_types.base.DataType

Data representing dates alone.

Parameters:date_format – A formatting string for datetime.datetime.strptime() to use instead of using regex-based parsing.
cast(d)

Cast a single value to a datetime.date.

Parameters:d – A value to cast.
Returns:datetime.date or None.
class agate.DateTime(datetime_format=None, timezone=None, **kwargs)

Bases: agate.data_types.base.DataType

Data representing dates with times.

Parameters:
  • datetime_format – A formatting string for datetime.datetime.strptime() to use instead of using regex-based parsing.
  • timezone – A pytz timezone to apply to each parsed date.
cast(d)

Cast a single value to a datetime.datetime.

Parameters:d – A value to cast.
Returns:datetime.datetime or None.
class agate.TimeDelta(null_values=('', 'na', 'n/a', 'none', 'null', '.'))

Bases: agate.data_types.base.DataType

Data representing the interval between two dates and/or times.

cast(d)

Cast a single value to datetime.timedelta.

Parameters:d – A value to cast.
Returns:datetime.timedelta or None

Type inference

class agate.TypeTester(force={}, limit=None, types=None)

Bases: object

Control how data types are inferred for columns in a given set of data.

This class is used by passing it to the column_types argument of the Table constructor, or the same argument for any other method that creates a Table.

Type inference can be a slow process. To limit the number of rows of data to be tested, pass the limit argument. Note that this may cause errors if your data contains different types of values after the specified number of rows.

By default, data types will be tested against each column in this order:

  1. Boolean
  2. Number
  3. TimeDelta
  4. Date
  5. DateTime
  6. Text

Individual types may be specified using the force argument. The type order may be changed, or entire types disabled, by using the types argument. Beware that changing the order of the types may cause unexpected behavior.

Parameters:
  • force – A dictionary where each key is a column name and each value is a DataType instance that overrides inference.
  • limit – An optional limit on how many rows to evaluate before selecting the most likely type. Note that applying a limit may mean errors arise when the data is cast, if the guess is proven incorrect in later rows of data.
  • types – A sequence of possible types to test against. This can be used to specify what data formats you want to test against. For instance, you may want to exclude TimeDelta from testing. It can also be used to pass options such as locale to Number or cast_nulls to Text. Take care in specifying the order of the list; it is the order in which they are tested. Text should always be last.
run(rows, column_names)

Apply type inference to the provided data and return an array of column types.

Parameters:rows – The data as a sequence of any sequences: tuples, lists, etc.
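
For example, a sketch forcing one hypothetical column to Text and limiting inference to the first 100 rows:

tester = agate.TypeTester(force={
    'postal_code': agate.Text()
}, limit=100)

table = agate.Table.from_csv('data.csv', column_types=tester)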

Aggregations

Aggregations create a new value by summarizing a Column. For example, Mean, when applied to a column containing Number data, returns a single decimal.Decimal value which is the average of all values in that column.

Aggregations can be applied to single columns using the Table.aggregate() method. The result is a single value if one aggregation was applied, or an OrderedDict of results if a sequence of aggregations was applied.

Aggregations can be applied to instances of TableSet using the TableSet.aggregate() method. The result is a new Table with a column for each aggregation and a row for each table in the set.

agate.Aggregation Aggregations create a new value by summarizing a Column.
agate.Summary Apply an arbitrary function to a column.

Basic aggregations

agate.All Check if all values in a column pass a test.
agate.Any Check if any value in a column passes a test.
agate.Count Count occurrences of a value or values.
agate.HasNulls Check if the column contains null values.
agate.Min Find the minimum value in a column.
agate.Max Find the maximum value in a column.
agate.MaxPrecision Find the most decimal places present for any value in this column.

Statistical aggregations

agate.Deciles Calculate the deciles of a column based on its percentiles.
agate.IQR Calculate the interquartile range of a column.
agate.MAD Calculate the median absolute deviation of a column.
agate.Mean Calculate the mean of a column.
agate.Median Calculate the median of a column.
agate.Mode Calculate the mode of a column.
agate.Percentiles Divide a column into 100 equal-size groups using the “CDF” method.
agate.PopulationStDev Calculate the population standard deviation of a column.
agate.PopulationVariance Calculate the population variance of a column.
agate.Quartiles Calculate the quartiles of a column based on its percentiles.
agate.Quintiles Calculate the quintiles of a column based on its percentiles.
agate.StDev Calculate the sample standard deviation of a column.
agate.Sum Calculate the sum of a column.
agate.Variance Calculate the sample variance of a column.

Text aggregations

agate.MaxLength Find the length of the longest string in a column.

Detailed list

class agate.Aggregation

Bases: object

Aggregations create a new value by summarizing a Column.

Aggregations are applied with Table.aggregate() and TableSet.aggregate().

When creating a custom aggregation, ensure that the values returned by Aggregation.run() are of the type specified by Aggregation.get_aggregate_data_type(). This can be ensured by using the DataType.cast() method. See Summary for an example.

get_aggregate_data_type(table)

Get the data type that should be used when using this aggregation with a TableSet to produce a new column.

Should raise UnsupportedAggregationError if this column does not support aggregation into a TableSet. (For example, if it does not return a single value.)

run(table)

Execute this aggregation on a given column and return the result.

validate(table)

Perform any checks necessary to verify this aggregation can run on the provided table without errors. This is called by Table.aggregate() before run().

class agate.All(column_name, test)

Bases: agate.aggregations.base.Aggregation

Check if all values in a column pass a test.

Parameters:
  • column_name – The name of the column to check.
  • test – Either a single value that all values in the column are compared against (for equality) or a function that takes a column value and returns True or False.
run(table)
Returns:bool
class agate.Any(column_name, test)

Bases: agate.aggregations.base.Aggregation

Check if any value in a column passes a test.

Parameters:
  • column_name – The name of the column to check.
  • test – Either a single value that all values in the column are compared against (for equality) or a function that takes a column value and returns True or False.
class agate.Count(column_name=None, value=<object object>)

Bases: agate.aggregations.base.Aggregation

Count occurrences of a value or values.

This aggregation can be used in three ways:

  1. If no arguments are specified, then it will count the number of rows in the table.
  2. If only column_name is specified, then it will count the number of non-null values in that column.
  3. If both column_name and value are specified, then it will count occurrences of a specific value.
Parameters:
  • column_name – The column containing the values to be counted.
  • value – Any value to be counted, including None.
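
A sketch of all three usages, with a hypothetical answer column:

num_rows = table.aggregate(agate.Count())
num_answers = table.aggregate(agate.Count('answer'))
num_yes = table.aggregate(agate.Count('answer', 'yes'))
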
class agate.Deciles(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the deciles of a column based on its percentiles.

Deciles will be equivalent to the 10th, 20th ... 90th percentiles.

“Zeroth” (min value) and “Tenth” (max value) deciles are included for reference and intuitive indexing.

See Percentiles for implementation details.

This aggregation can not be applied to a TableSet.

Parameters:column_name – The name of a column containing Number data.
run(table)
Returns:An instance of Quantiles.
class agate.HasNulls(column_name)

Bases: agate.aggregations.base.Aggregation

Check if the column contains null values.

Parameters:column_name – The name of the column to check.
class agate.IQR(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the interquartile range of a column.

Parameters:column_name – The name of a column containing Number data.
class agate.MAD(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the median absolute deviation of a column.

Parameters:column_name – The name of a column containing Number data.
class agate.Min(column_name)

Bases: agate.aggregations.base.Aggregation

Find the minimum value in a column.

This aggregation can be applied to columns containing Date, DateTime, or Number data.

Parameters:column_name – The name of the column to be searched.
class agate.Max(column_name)

Bases: agate.aggregations.base.Aggregation

Find the maximum value in a column.

This aggregation can be applied to columns containing Date, DateTime, or Number data.

Parameters:column_name – The name of the column to be searched.
class agate.MaxLength(column_name)

Bases: agate.aggregations.base.Aggregation

Find the length of the longest string in a column.

Note: On Python 2.7 this function may miscalculate the length of unicode strings that contain “wide characters”. For details see this StackOverflow answer: http://stackoverflow.com/a/35462951

Parameters:column_name – The name of a column containing Text data.
run(table)
Returns:int.
class agate.MaxPrecision(column_name)

Bases: agate.aggregations.base.Aggregation

Find the most decimal places present for any value in this column.

Parameters:column_name – The name of the column to be searched.
class agate.Mean(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the mean of a column.

Parameters:column_name – The name of a column containing Number data.
class agate.Median(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the median of a column.

Median is equivalent to the 50th percentile. See Percentiles for implementation details.

Parameters:column_name – The name of a column containing Number data.
class agate.Mode(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the mode of a column.

Parameters:column_name – The name of a column containing Number data.
class agate.Percentiles(column_name)

Bases: agate.aggregations.base.Aggregation

Divide a column into 100 equal-size groups using the “CDF” method.

See this explanation of the various methods for computing percentiles.

“Zeroth” (min value) and “Hundredth” (max value) percentiles are included for reference and intuitive indexing.

A reference implementation was provided by pycalcstats.

This aggregation can not be applied to a TableSet.

Parameters:column_name – The name of a column containing Number data.
run(table)
Returns:An instance of Quantiles.
class agate.PopulationStDev(column_name)

Bases: agate.aggregations.stdev.StDev

Calculate the population standard deviation of a column.

For the sample standard deviation see StDev.

Parameters:column_name – The name of a column containing Number data.
class agate.PopulationVariance(column_name)

Bases: agate.aggregations.variance.Variance

Calculate the population variance of a column.

For the sample variance see Variance.

Parameters:column_name – The name of a column containing Number data.
class agate.Quartiles(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the quartiles of a column based on its percentiles.

Quartiles will be equivalent to the 25th, 50th and 75th percentiles.

“Zeroth” (min value) and “Fourth” (max value) quartiles are included for reference and intuitive indexing.

See Percentiles for implementation details.

This aggregation can not be applied to a TableSet.

Parameters:column_name – The name of a column containing Number data.
run(table)
Returns:An instance of Quantiles.
class agate.Quintiles(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the quintiles of a column based on its percentiles.

Quintiles will be equivalent to the 20th, 40th, 60th and 80th percentiles.

“Zeroth” (min value) and “Fifth” (max value) quintiles are included for reference and intuitive indexing.

See Percentiles for implementation details.

This aggregation can not be applied to a TableSet.

Parameters:column_name – The name of a column containing Number data.
run(table)
Returns:An instance of Quantiles.
class agate.StDev(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the sample standard deviation of a column.

For the population standard deviation see PopulationStDev.

Parameters:column_name – The name of a column containing Number data.
class agate.Sum(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the sum of a column.

Parameters:column_name – The name of a column containing Number data.
class agate.Summary(column_name, data_type, func, cast=True)

Bases: agate.aggregations.base.Aggregation

Apply an arbitrary function to a column.

Parameters:
  • column_name – The name of a column to be summarized.
  • data_type – The return type of this aggregation.
  • func – A function which will be passed the column for processing.
  • cast – If True, each return value will be cast to the specified data_type to ensure it is valid. Only disable this if you are certain your summary always returns the correct type.
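
For example, a sketch computing the range of a hypothetical Number column with an arbitrary function:

age_range = table.aggregate(agate.Summary(
    'age',
    agate.Number(),
    lambda column: max(column.values_without_nulls()) - min(column.values_without_nulls())
))
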
class agate.Variance(column_name)

Bases: agate.aggregations.base.Aggregation

Calculate the sample variance of a column.

For the population variance see PopulationVariance.

Parameters:column_name – The name of a column containing Number data.

Computations

Computations create a new value for each Row in a Table. When used with Table.compute() these new values become a new column. For instance, the PercentChange computation takes two column names as arguments and computes the percentage change between them for each row.

There are a variety of basic computations, such as Change and Percent. If none of these meet your needs you can use the Formula computation to apply an arbitrary function to the row. If this still isn’t flexible enough, it’s simple to create a custom computation class by inheriting from Computation.

agate.Computation Computations produce a new column by performing a calculation on each row.
agate.Formula Apply an arbitrary function to each row.

Mathematical computations

agate.Change Calculate the difference between two columns.
agate.Percent Calculate each value’s percentage of a total.
agate.PercentChange Calculate the percent difference between two columns.
agate.PercentileRank Calculate the percentile into which each value falls.
agate.Rank Calculate rank order of the values in a column.

Detailed list

class agate.Change(before_column_name, after_column_name)

Bases: agate.computations.base.Computation

Calculate the difference between two columns.

This calculation can be applied to Number columns to calculate numbers. It can also be applied to Date, DateTime, and TimeDelta columns to calculate time deltas.

Parameters:
  • before_column_name – The name of a column containing the “before” values.
  • after_column_name – The name of a column containing the “after” values.
class agate.Computation

Bases: object

Computations produce a new column by performing a calculation on each row.

Computations are applied with Table.compute().

When implementing a custom computation, ensure that the values returned by Computation.run() are of the type specified by Computation.get_computed_data_type(). This can be ensured by using the DataType.cast() method. See Formula for an example.

get_computed_data_type(table)

Returns an instantiated DataType which will be appended to the table.

run(table)

When invoked with a table, returns a sequence of new column values.

validate(table)

Perform any checks necessary to verify this computation can run on the provided table without errors. This is called by Table.compute() before run().
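
The following is an illustrative sketch of a custom computation that returns each value’s difference from the column mean. The DiffFromMean class, the sample table, and the 'value' column are examples, not part of agate:

import agate

table = agate.Table(
    [(1,), (2,), (6,)],
    ['value'],
    [agate.Number()]
)

class DiffFromMean(agate.Computation):
    def __init__(self, column_name):
        self._column_name = column_name

    def get_computed_data_type(self, table):
        return agate.Number()

    def validate(self, table):
        # Raises KeyError if the column does not exist.
        table.columns[self._column_name]

    def run(self, table):
        mean = table.aggregate(agate.Mean(self._column_name))
        column = table.columns[self._column_name]

        return [None if v is None else v - mean for v in column.values()]

new_table = table.compute([('diff', DiffFromMean('value'))])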

class agate.Formula(data_type, func, cast=True)

Bases: agate.computations.base.Computation

Apply an arbitrary function to each row.

Parameters:
  • data_type – The data type this formula will return.
  • func – The function to be applied to each row. Must return a valid value for the specified data type.
  • cast – If True, each return value will be cast to the specified data_type to ensure it is valid. Only disable this if you are certain your formula always returns the correct type.
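
For example, a minimal Formula sketch that joins two assumed Text columns:

import agate

table = agate.Table(
    [('Jane', 'Doe')],
    ['first_name', 'last_name'],
    [agate.Text(), agate.Text()]
)

full_name = agate.Formula(
    agate.Text(),
    lambda row: '%s %s' % (row['first_name'], row['last_name'])
)

new_table = table.compute([('full_name', full_name)])
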
class agate.Percent(column_name, total=None)

Bases: agate.computations.base.Computation

Calculate each value’s percentage of a total.

Parameters:
  • column_name – The name of a column containing the Number values.
  • total – If specified, the total value to divide each number by. By default, the Sum of the values in the column will be used.
run(table)
Returns:decimal.Decimal
class agate.PercentChange(before_column_name, after_column_name)

Bases: agate.computations.base.Computation

Calculate the percent difference between two columns.

Parameters:
  • before_column_name – The name of a column containing the “before” Number values.
  • after_column_name – The name of a column containing the “after” Number values.
run(table)
Returns:decimal.Decimal
class agate.PercentileRank(column_name, comparer=None, reverse=None)

Bases: agate.computations.rank.Rank

Calculate the percentile into which each value falls.

See Percentiles for implementation details.

Parameters:column_name – The name of a column containing the Number values.
run(table)
Returns:int
class agate.Rank(column_name, comparer=None, reverse=None)

Bases: agate.computations.base.Computation

Calculate rank order of the values in a column.

Uses the “competition” ranking method: if there are four values and the middle two are tied, then the output will be [1, 2, 2, 4].

Null values will always be ranked last.

Parameters:
  • column_name – The name of the column to rank.
  • comparer – An optional comparison function. If not specified, ranking will be ascending, with nulls ranked last.
  • reverse – Reverse sort order before ranking.
run(table)
Returns:int
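
A short, illustrative sketch of ranking a Number column in descending order (the data and column name are assumptions):

import agate

table = agate.Table(
    [(3,), (1,), (3,), (2,)],
    ['value'],
    [agate.Number()]
)

ranked = table.compute([
    ('rank', agate.Rank('value', reverse=True))
])
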
class agate.Slug(column_name, ensure_unique=False, **kwargs)

Bases: agate.computations.base.Computation

Convert text values from one or more columns into slugs. If multiple column names are given, values from those columns will be appended in the given order before standardizing.

Parameters:
  • column_name – The name of a column or a sequence of column names containing Text values.
  • ensure_unique – If True, any duplicate values will be appended with unique identifiers. Defaults to False.
run(table)
Returns:string
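
An illustrative sketch of slugging an assumed Text column:

import agate

table = agate.Table(
    [('Hello World',), ('Hello World',)],
    ['title'],
    [agate.Text()]
)

# With ensure_unique=True, duplicate slugs get unique suffixes appended.
slugged = table.compute([
    ('slug', agate.Slug('title', ensure_unique=True))
])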

CSV reader and writer

Agate contains CSV readers and writers that are intended to be used as a drop-in replacement for Python’s csv module. These versions add unicode support for Python 2 and several other minor features.

Agate methods will use these versions automatically. If you would like to use them in your own code, you can import them like this:

from agate import csv

Due to nuanced differences between the versions, these classes are implemented separately for Python 2 and Python 3. The documentation for both versions is provided below, but only the one for your version of Python is imported with the above code.
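
For example, a minimal sketch of reading a file with the drop-in reader (the file name is illustrative):

from agate import csv

with open('data.csv') as f:
    reader = csv.reader(f)

    for row in reader:
        print(row)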

Python 3

agate.csv_py3.reader A replacement for Python’s csv.reader() that uses csv_py3.Reader.
agate.csv_py3.writer A replacement for Python’s csv.writer() that uses csv_py3.Writer.
agate.csv_py3.Reader A wrapper around Python 3’s builtin csv.reader().
agate.csv_py3.Writer A wrapper around Python 3’s builtin csv.writer().
agate.csv_py3.DictReader A wrapper around Python 3’s builtin csv.DictReader.
agate.csv_py3.DictWriter A wrapper around Python 3’s builtin csv.DictWriter.

Python 2

agate.csv_py2.reader A replacement for Python’s csv.reader() that uses csv_py2.Reader.
agate.csv_py2.writer A replacement for Python’s csv.writer() that uses csv_py2.Writer.
agate.csv_py2.Reader A unicode-aware CSV reader.
agate.csv_py2.Writer A unicode-aware CSV writer.
agate.csv_py2.DictReader A unicode-aware CSV DictReader.
agate.csv_py2.DictWriter A unicode-aware CSV DictWriter.

Python 3 details

agate.csv_py3.reader(*args, **kwargs)

A replacement for Python’s csv.reader() that uses csv_py3.Reader.

agate.csv_py3.writer(*args, **kwargs)

A replacement for Python’s csv.writer() that uses csv_py3.Writer.

class agate.csv_py3.Reader(f, field_size_limit=None, line_numbers=False, header=True, **kwargs)

Bases: six.Iterator

A wrapper around Python 3’s builtin csv.reader().

class agate.csv_py3.Writer(f, line_numbers=False, **kwargs)

Bases: object

A wrapper around Python 3’s builtin csv.writer().

class agate.csv_py3.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)

Bases: csv.DictReader

A wrapper around Python 3’s builtin csv.DictReader.

class agate.csv_py3.DictWriter(f, fieldnames, line_numbers=False, **kwargs)

Bases: csv.DictWriter

A wrapper around Python 3’s builtin csv.DictWriter.

Python 2 details

agate.csv_py2.reader(*args, **kwargs)

A replacement for Python’s csv.reader() that uses csv_py2.Reader.

agate.csv_py2.writer(*args, **kwargs)

A replacement for Python’s csv.writer() that uses csv_py2.Writer.

class agate.csv_py2.Reader(f, encoding='utf-8', field_size_limit=None, line_numbers=False, header=True, **kwargs)

Bases: agate.csv_py2.UnicodeReader

A unicode-aware CSV reader.

class agate.csv_py2.Writer(f, encoding='utf-8', line_numbers=False, **kwargs)

Bases: agate.csv_py2.UnicodeWriter

A unicode-aware CSV writer.

class agate.csv_py2.DictReader(f, fieldnames=None, restkey=None, restval=None, *args, **kwargs)

Bases: agate.csv_py2.UnicodeDictReader

A unicode-aware CSV DictReader.

class agate.csv_py2.DictWriter(f, fieldnames, encoding='utf-8', line_numbers=False, **kwargs)

Bases: agate.csv_py2.UnicodeDictWriter

A unicode-aware CSV DictWriter.

Fixed-width reader

Agate contains a fixed-width file reader that is designed to work like Python’s csv module.

These readers work with CSV-formatted schemas, such as those maintained at wireservice/ffs.

agate.fixed.reader A wrapper around fixed.Reader, so that it can be used in the same way as a normal CSV reader.
agate.fixed.Reader Reads a fixed-width file using a column schema in CSV format.
agate.fixed.DictReader A fixed-width reader that returns collections.OrderedDict rather than a list.

Detailed list

agate.fixed.reader(*args, **kwargs)

A wrapper around fixed.Reader, so that it can be used in the same way as a normal CSV reader.

class agate.fixed.Reader(f, schema_f)

Bases: six.Iterator

Reads a fixed-width file using a column schema in CSV format.

This works almost exactly like Python’s built-in CSV reader.

Schemas must be in the “ffs” format, with column, start, and length columns. There is a repository of such schemas maintained at wireservice/ffs.

fieldnames

The names of the columns read from the schema.

class agate.fixed.DictReader(f, schema_f)

Bases: agate.fixed.Reader

A fixed-width reader that returns collections.OrderedDict rather than a list.
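
A minimal, illustrative sketch of reading a fixed-width file (the file names are assumptions; the schema CSV must define column, start and length):

from agate import fixed

with open('data.fixed') as f, open('schema.csv') as schema_f:
    reader = fixed.reader(f, schema_f)

    for row in reader:
        print(row)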

Miscellaneous

agate.Patchable
agate.NullOrder Dummy object used for sorting in place of None.
agate.Quantiles A class representing quantiles (percentiles, quartiles, etc.) for a given column of Number data.
class agate.NullOrder

Bases: object

Dummy object used for sorting in place of None.

Sorts as “greater than everything but other nulls.”

class agate.Quantiles(quantiles)

Bases: _abcoll.Sequence

A class representing quantiles (percentiles, quartiles, etc.) for a given column of Number data.

locate(value)

Identify which quantile a given value is part of.
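
An illustrative sketch of locating a value within computed percentiles (the data and column name are assumptions):

from decimal import Decimal

import agate

table = agate.Table(
    [(i,) for i in range(1, 101)],
    ['value'],
    [agate.Number()]
)

percentiles = table.aggregate(agate.Percentiles('value'))

# Returns the percentile into which the value falls.
percentiles.locate(Decimal('42'))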

Exceptions

agate.DataTypeError A calculation was attempted with an invalid DataType.
agate.UnsupportedAggregationError An Aggregation was attempted which is not supported.
agate.CastError A column value can not be cast to the correct type.
agate.FieldSizeLimitError A field in a CSV file exceeds the maximum length.
exception agate.DataTypeError

Bases: exceptions.TypeError

A calculation was attempted with an invalid DataType.

exception agate.UnsupportedAggregationError

Bases: exceptions.TypeError

An Aggregation was attempted which is not supported.

For example, if a Percentiles is applied to a TableSet.

exception agate.CastError

Bases: exceptions.Exception

A column value can not be cast to the correct type.

exception agate.FieldSizeLimitError(limit)

Bases: exceptions.Exception

A field in a CSV file exceeds the maximum length.

This length may be the default or one set by the user.

Warnings

class agate.NullCalculationWarning

Bases: exceptions.RuntimeWarning

Warning raised if a calculation which can not logically account for null values is performed on a Column containing nulls.

class agate.DuplicateColumnWarning

Bases: exceptions.RuntimeWarning

Warning raised if multiple columns with the same name are added to a new Table.

agate.warn_null_calculation(operation, column)
agate.warn_duplicate_column(column_name, column_rename)

Unit testing helpers

class agate.AgateTestCase(methodName='runTest')

Bases: unittest2.case.TestCase

Unittest case for quickly asserting logic about tables.

assertColumnNames(table, names)

Verify the column names in the given table match what is expected.

assertColumnTypes(table, types)

Verify the column types in the given table are of the expected types.

assertRowNames(table, names)

Verify the row names in the given table match what is expected.

assertRows(table, rows)

Verify the row data in the given table matches what is expected.
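
A minimal, illustrative test case sketch using these assertions (this sketch assumes assertColumnTypes is passed data type classes):

import agate

class TestExample(agate.AgateTestCase):
    def test_table(self):
        table = agate.Table(
            [('a', 1), ('b', 2)],
            ['letter', 'number'],
            [agate.Text(), agate.Number()]
        )

        self.assertColumnNames(table, ['letter', 'number'])
        self.assertColumnTypes(table, [agate.Text, agate.Number])
        self.assertRows(table, [('a', 1), ('b', 2)])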

Config

This module contains the global configuration for agate. Users should use get_option() and set_option() to modify the global configuration.

Available configuration options:

Option                      Description                                Default value
default_locale              Default locale for number formatting       default_locale('LC_NUMERIC') or 'en_US'
horizontal_line_char        Character to render for horizontal lines   u'-'
vertical_line_char          Character to render for vertical lines     u'|'
bar_char                    Character to render for bar chart units    u'░'
printable_bar_char          Printable character for bar chart units    u':'
zero_line_char              Character to render for zero line units    u'▓'
printable_zero_line_char    Printable character for zero line units    u'|'
tick_char                   Character to render for axis ticks         u'+'
ellipsis_chars              Characters to render for ellipsis          u'...'
agate.config.get_option(key)

Get a global configuration option for agate.

Parameters:key – The name of the configuration option.
agate.config.set_option(key, value)

Set a global configuration option for agate.

Parameters:
  • key – The name of the configuration option.
  • value – The new value to set for the configuration option.
agate.config.set_options(options)

Set a dictionary of options simultaneously.

Parameters:options – A dictionary of option names and values.
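
An illustrative sketch of reading and setting options (the chosen characters are arbitrary):

from agate import config

# Use '#' for bar chart units instead of the default character.
config.set_option('bar_char', u'#')

# Set several options at once.
config.set_options({
    'horizontal_line_char': u'=',
    'vertical_line_char': u'|'
})

config.get_option('bar_char')  # u'#'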

Contributing

agate actively encourages contributions from people of all genders, races, ethnicities, ages, creeds, nationalities, persuasions, alignments, sizes, shapes, and journalistic affiliations. You are welcome here.

We seek contributions from developers and non-developers of all skill levels. We will typically accept bug fixes, documentation updates, and new cookbook recipes with minimal fuss. If you want to work on a larger feature—great! The maintainers will be happy to provide feedback and code review on your implementation.

Before making any changes or additions to agate, please be sure to read about the principles of agate in the About section of the documentation.

Process for documentation

Not a developer? That’s fine! As long as you can use git (there are many tutorials), you can contribute to agate. Please follow this process:

  1. Fork the project on Github.
  2. If you don’t have a specific task in mind, check out the issue tracker and find a documentation ticket that needs to be done.
  3. Comment on the ticket letting everyone know you’re going to be working on it so that nobody duplicates your effort.
  4. Write the documentation. Documentation files live in the docs directory and are in reStructuredText format.
  5. Add yourself to the AUTHORS file if you aren’t already there.
  6. Once your contribution is complete, submit a pull request on Github.
  7. Wait for it to either be merged by a maintainer or to receive feedback about what needs to be revised.
  8. Rejoice!

Process for code

Hacker? We’d love to have you hack with us. Please follow this process to make your contribution:

  1. Fork the project on Github.
  2. If you don’t have a specific task in mind, check out the issue tracker and find a task that needs to be done and is of a scope you can realistically expect to complete in a few days. Don’t worry about the priority of the issues at first, but try to choose something you’ll enjoy. You’re much more likely to finish something to the point it can be merged if it’s something you really enjoy hacking on.
  3. If you already have a task you know you want to work on, open a ticket or comment on the existing ticket letting everyone know you’re going to be working on it. It’s also good practice to provide some general idea of how you plan on resolving the issue so that other developers can make suggestions.
  4. Write tests for the feature you’re building. Follow the format of the existing tests in the test directory to see how this works. You can run all the tests with the command nosetests tests.
  5. Verify your tests work on all supported versions of Python by running tox.
  6. Write the code. Try to stay consistent with the style and organization of the existing codebase. A good patch won’t be refused for stylistic reasons, but large parts of it may be rewritten and nobody wants that.
  7. As you are coding, periodically merge in work from the master branch and verify you haven’t broken anything by running the test suite.
  8. Write documentation. This means docstrings on all classes and methods, including parameter explanations. It also means, when relevant, cookbook recipes and updates to the agate user tutorial.
  9. Add yourself to the AUTHORS file if you aren’t already there.
  10. Once your contribution is complete, tested, and has documentation, submit a pull request on Github.
  11. Wait for it to either be merged by a maintainer or to receive feedback about what needs to be revisited.
  12. Rejoice!

Licensing

To the extent that they care, contributors should keep in mind that the source of agate, and therefore any contributions to it, is licensed under the permissive MIT license. By submitting a patch or pull request you are agreeing to release your code under this license. You will be acknowledged in the AUTHORS list, the commit history, and the hearts and minds of journalists everywhere.

Release process

This is the release process for agate:

  1. Verify all unit tests pass with fresh environments: tox -r.
  2. Verify 100% test coverage: nosetests --with-coverage tests.
  3. Ensure any new modules have been added to setup.py’s packages list.
  4. Ensure any new public interfaces have been added to the documentation.
  5. Ensure TableSet proxy methods have been added for new Table methods.
  6. Make sure the example script still works: python example.py.
  7. Ensure python charts.py works and has been run recently.
  8. Ensure CHANGELOG.rst is up to date. Add the release date and summary.
  9. Create a release tag: git tag -a x.y.z -m "x.y.z release."
  10. Push tags upstream: git push --tags
  11. If this is a major release, merge master into stable: git checkout stable; git merge master; git push
  12. Upload to PyPI: python setup.py sdist bdist_wheel upload.
  13. Flag the release to build on RTFD.
  14. Update the “default version” on RTFD to the latest.
  15. Rev to latest version: docs/conf.py, docs/tutorial.rst, setup.py, CHANGELOG.rst need updates.
  16. Find/replace en/[old version] to en/[new version] in tutorial.ipynb.
  17. Commit revision: git commit -am "Update to version x.y.z for development.".

License

The MIT License

Copyright (c) 2017 Christopher Groskopf and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Changelog

1.6.0 - February 28, 2017

This update should not cause any breaking changes; however, it is being classified as a major release because the dependency on awesome-slugify, which is licensed under GPLv3, has been replaced with python-slugify, which is licensed under MIT.

  • Suppress warning from babel about Time Zone expressions on Python 3.6. (#665)
  • Reimplemented slugify with python-slugify instead of awesome-slugify. (#660)
  • Slugify renaming of duplicate values is now consistent with Table.__init__(). (#615)

1.5.5 - December 29, 2016

  • Added a “full outer join” example to the SQL section of the cookbook. (#658)
  • Warnings are now more explicit when column names are missing. (#652)
  • Date.cast() will no longer parse strings like 05_leslie3d_base as dates. (#653)
  • Text.cast() will no longer strip leading or trailing whitespace. (#654)
  • Fixed 'NoneType' object has no attribute 'groupdict' error in TimeDelta.cast(). (#656)

1.5.4 - December 27, 2016

  • Cleaned up handling of warnings in tests.
  • Blank column names are now treated as unspecified (letter names will be generated).

1.5.3 - December 26, 2016

This is a minor release that adds one feature: sequential joins (by row number). It also fixes several small bugs blocking a downstream release of csvkit.

  • Fixed a bug where empty Table column names were initialized as a list instead of a tuple.
  • Table.join() can now join by row numbers—a sequential join.
  • Table.join() now supports full outer joins via the full_outer keyword.
  • Table.join() can now accept column indices instead of column names.
  • Table.from_csv() now buffers input files to prevent issues with using STDIN as an input.

1.5.2 - December 24, 2016

  • Improved handling of non-ascii encoded CSV files under Python 2.

1.5.1 - December 23, 2016

This is a minor release fixing several small bugs that were blocking a downstream release of csvkit.

  • Documented differing behavior of MaxLength under Python 2. (#649)
  • agate is now tested against Python 3.6. (#650)
  • Fix bug when MaxLength was called on an all-null column.
  • Update extensions documentation to match new API. (#645)
  • Fix bug in Change and PercentChange where 0 values could cause None to be returned incorrectly.

1.5.0 - November 16, 2016

This release adds SVG charting via the leather charting library. Chart methods have been added for both Table and TableSet. (The latter create lattice plots.) See the revised tutorial and new cookbook entries for examples. Leather is still an early library. Please report any bugs.

Also in this release are a Slug computation and a variety of small fixes and improvements.

The complete list of changes is as follows:

  • Remove support for monkey-patching of extensions. (#594)
  • TableSet methods which proxy Table methods now appear in the API docs. (#640)
  • Any and All aggregations no longer behave differently for boolean data. (#636)
  • Any and All aggregations now accept a single value as a test argument, in addition to a function.
  • Any and All aggregations now require a test argument.
  • Tables rendered by Table.print_table() are now GitHub Flavored Markdown (GFM) compatible. (#626)
  • The agate tutorial has been converted to a Jupyter Notebook.
  • Table now supports len as a proxy for len(table.rows).
  • Simple SVG charting is now integrated via leather.
  • Added First computation. (#634)
  • Table.print_table() now has a max_precision argument to limit Number precision. (#544)
  • Slug computation now accepts an array of column names to merge. (#617)
  • Cookbook: standardize column values with Slugify computation. (#613)
  • Cookbook: slugify/standardize row and column names. (#612)
  • Fixed condition that prevents integer row names to allow bools in Table.__init__(). (#627)
  • PercentChange is now null-safe, returns None for null values. (#623)
  • Table can now be iterated, yielding Row instances. (Previously it was necessary to iterate table.rows.)

1.4.0 - May 26, 2016

This release adds several new features, includes numerous small bug fixes, and improves performance for common use cases. There are some minor breaking changes, but few users are likely to encounter them. The most important changes in this release are:

  1. There is now a TableSet.having() method, which behaves similarly to SQL’s HAVING keyword.
  2. Table.from_csv() is much faster. In particular, the type inference routines for parsing numbers have been optimized.
  3. The Table.compute() method now accepts a replace keyword which allows new columns to replace existing columns “in place.” (As with all agate operations, a new table is still created.)
  4. There is now a Slug computation which can be used to compute a column of slugs. The Table.rename() method has also added new options for slugifying column and row names.

The complete list of changes is as follows:

  • Added a deprecation warning for patch methods. New extensions should not use it. (#594)
  • Added Slug computation (#466)
  • Added slug_columns and slug_rows arguments to Table.rename(). (#466)
  • Added utils.slugify() to standardize a sequence of strings. (#466)
  • Table.__init__() now prints row and column on CastError. (#593)
  • Fix null sorting in Table.order_by() when ordering by multiple columns. (#607)
  • Implemented configuration system.
  • Fixed bug in Table.print_bars() when value_column contains None (#608)
  • Table.print_table() now restricts header on max_column_width. (#605)
  • Cookbook: filling gaps in a dataset with Table.homogenize. (#538)
  • Reduced memory usage and improved performance of Table.from_csv().
  • Table.from_csv() no longer accepts a sequence of row ids for skip_lines.
  • Number.cast() is now three times as fast.
  • Number now accepts group_symbol, decimal_symbol and currency_symbols arguments. (#224)
  • Tutorial: clean up state data under computing columns (#570)
  • Table.__init__() now explicitly checks that row_names are not ints. (#322)
  • Cookbook: CPI deflation, agate-lookup. (#559)
  • Table.bins() now includes values outside start or end in computed column_names. (#596)
  • Fixed bug in Table.bins() where start or end arguments were ignored when specified alone. (#599)
  • Table.compute() now accepts a replace argument that allows columns to be overwritten. (#597)
  • Table.from_fixed() now creates an agate table from a fixed-width file. (#358)
  • fixed now implements a general-purpose fixed-width file reader. (#358)
  • TypeTester now correctly parses negative currency values as Number. (#595)
  • Cookbook: removing a column (select and exclude). (#592)
  • Cookbook: overriding specific column types. (#591)
  • TableSet now has a TableSet._fork() method used internally for deriving new tables.
  • Added an example of SQL’s HAVING to the cookbook.
  • Table.aggregate() interface has been revised to be more similar to TableSet.aggregate().
  • TableSet.having() is now implemented. (#587)
  • There is now a better error when a forced column name does not exist. (#591)
  • Arguments to Table.print_html() now mirror Table.print_table().

1.3.1 - March 30, 2016

The major feature of this release is new API documentation. Several minor features and bug fixes are also included. There are no major breaking changes in this release.

Internally, the agate codebase has been reorganized to be more modular, but this should be invisible to most users.

  • The MaxLength aggregation now returns a Decimal object. (#574)
  • Fixed an edge case where datetimes were parsed as dates. (#568)
  • Fixed column alignment in tutorial tables. (#572)
  • Table.print_table() now defaults to printing 20 rows and 6 columns. (#589)
  • Added Eli Murray to AUTHORS.
  • Table.__init__() now accepts a dict to specify partial column types. (#580)
  • Table.from_csv() now accepts a skip_lines argument. (#581)
  • Moved every Aggregation and Computation into their own modules. (#565)
  • Column and Row are now importable from agate.
  • Completely reorganized the API documentation.
  • Moved unit tests into modules to match new code organization.
  • Moved major Table and TableSet methods into their own modules.
  • Fixed bug when using non-unicode encodings with Table.from_csv(). (#560)
  • Table.homogenize() now accepts an array of values as compare values if key is a single column name. (#539)

1.3.0 - February 28, 2016

This version implements several new features and includes two major breaking changes.

Please take note of the following breaking changes:

  1. There is no longer a Length aggregation. The more obvious Count is now used instead.
  2. Agate’s replacements for Python’s CSV reader and writer have been moved to the agate.csv namespace. To use as a drop-in replacement: from agate import csv.

The major new features in this release are primarily related to transforming (reshaping) tables. They are:

  1. Table.normalize() for converting columns to rows.
  2. Table.denormalize() for converting rows to columns.
  3. Table.pivot() for generating “crosstabs”.
  4. Table.homogenize() for filling gaps in data series.

Please see the following complete list of changes for a variety of other bug fixes and improvements.

  • Moved CSV reader/writer to agate.csv namespace.
  • Added numerous new examples to the R section of the cookbook. (#529-#535)
  • Updated Excel cookbook entry for pivot tables. (#536)
  • Updated Excel cookbook entry for VLOOKUP. (#537)
  • Fix number rendering in Table.print_table() on Windows. (#528)
  • Added cookbook examples of using Table.pivot() to count frequency/distribution.
  • Table.bins() now has smarter output column names. (#524)
  • Table.bins() is now a wrapper around pivot. (#522)
  • Table.counts() has been removed. Use Table.pivot() instead. (#508)
  • Count can now count non-null values in a column.
  • Removed Length. Count now works without any arguments. (#520)
  • Table.pivot() implemented. (#495)
  • Table.denormalize() implemented. (#493)
  • Added columns argument to Table.join(). (#479)
  • Cookbook: Custom statistics/agate.Summary
  • Added Kevin Schaul to AUTHORS.
  • Quantiles.locate() now correctly returns Decimal instances. (#509)
  • Cookbook: Filter for distinct values of a column (#498)
  • Added Column.values_distinct() (#498)
  • Cookbook: Fuzzy phonetic search example. (#207)
  • Cookbook: Create a table from a remote file. (#473)
  • Added printable argument to Table.print_bars() to use only printable characters. (#500)
  • MappedSequence now throws an explicit error on __setitem__. (#499)
  • Added require_match argument to Table.join(). (#480)
  • Cookbook: Rename columns in a table. (#469)
  • Table.normalize() implemented. (#487)
  • Added Percent computation with example in Cookbook. (#490)
  • Added Ben Welsh to AUTHORS.
  • Table.__init__() now throws a warning if auto-generated columns are used. (#483)
  • Table.__init__() no longer fails on duplicate columns. Instead it renames them and throws a warning. (#484)
  • Table.merge() now takes a column_names argument to specify columns included in new table. (#481)
  • Table.select() now accepts a single column name as a key.
  • Table.exclude() now accepts a single column name as a key.
  • Added Table.homogenize() to find gaps in a table and fill them with default rows. (#407)
  • Table.distinct() now accepts sequences of column names as a key.
  • Table.join() now accepts sequences of column names as either a left or right key. (#475)
  • Table.order_by() now accepts a sequence of column names as a key.
  • Cookbook: Create a table from a DBF file. (#472)
  • Cookbook: Create a table from an Excel spreadsheet.
  • Added explicit error if a filename is passed to the Table constructor. (#438)

1.2.2 - February 5, 2016

This release adds several minor features. The only breaking change is that default column names will now be lowercase instead of uppercase. If you depended on these names in your scripts you will need to update them accordingly.

  • TypeTester no longer takes a locale argument. Use types instead.
  • TypeTester now takes a types argument that is a list of possible types to test. (#461)
  • Null conversion can now be disabled for Text by passing cast_nulls=False. (#460)
  • Default column names are now lowercase letters instead of uppercase. (#464)
  • Table.merge() can now merge tables with different columns or columns in a different order. (#465)
  • MappedSequence.get() will no longer raise KeyError if a default is not provided. (#467)
  • Number can now test/cast the long type on Python 2.

1.2.1 - February 5, 2016

This release implements several new features and bug fixes. There are no significant breaking changes.

Special thanks to Neil Bedi for his extensive contributions to this release.

  • Added a max_column_width argument to Table.print_table(). Defaults to 20. (#442)
  • Table.from_json() now defers most functionality to Table.from_object().
  • Implemented Table.from_object() for parsing JSON-like Python objects.
  • Fixed a bug that prevented Table.order_by() on empty table. (#454)
  • Table.from_json() and TableSet.from_json() now have column_types as an optional argument. (#451)
  • csv.Reader now has line_numbers and header options to add a column for line numbers. (#447)
  • Renamed maxfieldsize to field_size_limit in csv.Reader for consistency. (#447)
  • Table.from_csv() now has a sniff_limit option to use csv.Sniffer. (#444)
  • csv.Sniffer implemented. (#444)
  • Table.__init__() no longer fails on empty rows. (#445)
  • TableSet.from_json() implemented. (#373)
  • Fixed a bug that breaks TypeTester.run() on variable row length. (#440)
  • Added TableSet.__str__() to display Table keys and row counts. (#418)
  • Fixed a bug that incorrectly checked for column_types equivalence in Table.merge() and TableSet.__init__(). (#435)
  • TableSet.merge() now has the ability to specify grouping factors with group, group_name and group_type. (#406)
  • Table can now be constructed with None for some column names. Those columns will receive letter names. (#432)
  • Slightly changed the parsing of dates and datetimes from strings.
  • Numbers are now written to CSV without extra zeros after the decimal point. (#429)
  • Made it possible for datetime.date instances to be considered valid DateTime inputs. (#427)
  • Changed preference order in type testing so Date is preferred to DateTime.
  • Removed float_precision argument from Number. (#428)
  • AgateTestCase is now available as agate.AgateTestCase. (#426)
  • TableSet.to_json() now has an indent option for use with nested.
  • TableSet.to_json() now has a nested option for writing a single, nested JSON file. (#417)
  • TestCase.assertRowNames() and TestCase.assertColumnNames() now validate the row and column instance keys.
  • Fixed a bug that prevented Table.rename() from renaming column names in Row instances. (#423)

1.2.0 - January 18, 2016

This version introduces one breaking change, which is only relevant if you are using custom Computation subclasses.

  1. Computation has been modified so that Computation.run() takes a Table instance as its argument, rather than a single row. It must return a sequence of values to use for a new column. In addition, the Computation._prepare() method has been renamed to Computation.validate() to more accurately describe its function. These changes were made to facilitate computing moving averages, streaks and other values that require data for the full column.

1.1.0 - November 4, 2015

This version of agate introduces three major changes.

  1. Table, Table.from_csv() and TableSet.from_csv() now all take column_names and column_types as separate arguments instead of as a sequence of tuples. This was done to enable more flexible type inference and to streamline the API.
  2. The interfaces for TableSet.aggregate() and Table.compute() have been changed. In both cases the new column name now comes first. Aggregations have also been modified so that the input column name is an argument to the aggregation class, rather than a third element in the tuple.
  3. This version drops support for Python 2.6. Testing and bug-fixing for this version was taking substantial time with no evidence that anyone was actually using it. Also, multiple dependencies claim to not support 2.6, even though agate’s tests were passing.

1.0.1 - October 29, 2015

  • TypeTester now takes a “limit” arg that restricts how many rows it tests. (#332)
  • Table.from_csv now supports CSVs with neither headers nor manual column names.
  • Tables can now be created with automatically generated column names. (#331)
  • File handles passed to Table.to_csv are now left open. (#330)
  • Added Table.print_csv method. (#307, #339)
  • Fixed stripping currency symbols when casting Numbers from strings. (#333)
  • Fixed two major join issues. (#336)

1.0.0 - October 22, 2015

  • Table.from_csv now defaults to TypeTester() if column_info is not provided. (#324)
  • New tutorial section: “Navigating table data” (#315)
  • 100% test coverage reached. (#312)
  • NullCalculationError is now a warning instead of an error. (#311)
  • TableSet is now a subclass of MappedSequence.
  • Rows and Columns are now subclasses of MappedSequence.
  • Add Column.values_without_nulls_sorted().
  • Column.get_data_without_nulls() is now Column.values_without_nulls().
  • Column.get_data_sorted() is now Column.values_sorted().
  • Column.get_data() is now Column.values().
  • Columns can now be sliced.
  • Columns can now be indexed by row name. (#301)
  • Added support for Python 3.5.
  • Row objects can now be sliced. (#303)
  • Replaced RowSequence and ColumnSequence with MappedSequence.
  • Replace RowDoesNotExistError with KeyError.
  • Replaced ColumnDoesNotExistError with IndexError.
  • Removed unnecessary custom RowIterator, ColumnIterator and CellIterator.
  • Performance improvements for Table “forks”. (where, limit, etc)
  • TableSet keys are now converted to row names during aggregation. (#291)
  • Removed fancy __repr__ implementations. Use __str__ instead. (#290)
  • Rows can now be accessed by name as well as index. (#282)
  • Added row_names argument to Table constructor. (#282)
  • Removed Row.table and Row.index properties. (#287)
  • Columns can now be accessed by index as well as name. (#281)
  • Added column name and type validation to Table constructor. (#285)
  • Table now supports variable-length rows during construction. (#39)
  • aggregations.Summary implemented for generic aggregations. (#181)
  • Fix TableSet.key_type being lost after proxying Table methods. (#278)
  • Massive performance increases for joins. (#277)
  • Added join benchmark. (#73)

0.11.0 - October 6, 2015

  • Implemented __repr__ for Table, TableSet, Column and Row. (#261)
  • Row.index property added.
  • Column constructor no longer takes a data_type argument.
  • Column.index and Column.name properties added.
  • Table.counts implemented. (#271)
  • Table.bins implemented. (#267, #227)
  • Table.join now raises ColumnDoesNotExistError. (#264)
  • Table.select now raises ColumnDoesNotExistError.
  • computations.ZScores moved into agate-stats.
  • computations.Rank cmp argument renamed comparer.
  • aggregations.MaxPrecision added. (#265)
  • Table.print_bars added.
  • Table.pretty_print renamed Table.print_table.
  • Reimplement Table method proxying via @allow_tableset_proxy decorator. (#263)
  • Add agate-stats references to docs.
  • Move stdev_outliers, mad_outliers and pearson_correlation into agate-stats. (#260)
  • Prevent issues with applying patches multiple times. (#258)

0.10.0 - September 22, 2015

  • Add reverse and cmp arguments to Rank computation. (#248)
  • Document how to use agate-sql to read/write SQL tables. (#238, #241)
  • Document how to write extensions.
  • Add monkeypatching extensibility pattern via utils.Patchable.
  • Reversed order of argument pairs for Table.compute. (#249)
  • TableSet.merge method can be used to ungroup data. (#253)
  • Columns with identical names are now suffixed “2” after a Table.join.
  • Duplicate key columns are no longer included in the result of a Table.join. (#250)
  • Table.join right_key no longer necessary if identical to left_key. (#254)
  • Table.inner_join is no more. Use the inner keyword to Table.join.
  • Table.left_outer_join is now Table.join.

0.9.0 - September 14, 2015

  • Add many missing unit tests. Up to 99% coverage.
  • Add property accessors for TableSet.key_name and TableSet.key_type. (#247)
  • Table.rows and Table.columns are now behind properties. (#247)
  • Column.data_type is now a property. (#247)
  • Table[Set].get_column_types() is now the Table[Set].column_types property. (#247)
  • Table[Set].get_column_names() is now the Table[Set].column_names property. (#247)
  • Table.pretty_print now displays consistent decimal places for each Number column.
  • Discrete data types (Number, Date etc) are now right-aligned in Table.pretty_print.
  • Implement aggregation result caching. (#245)
  • Reimplement Percentiles, Quartiles, etc as aggregations.
  • UnsupportedAggregationError is now used to disable TableSet aggregations.
  • Replaced several exceptions with more general DataTypeError.
  • Column type information can now be accessed as Column.data_type.
  • Eliminated Column subclasses. Restructured around DataType classes.
  • Table.merge implemented. (#9)
  • Cookbook: guess column types. (#230)
  • Fix issue where all group keys were being cast to text. (#235)
  • Table.group_by will now default key_type to the type of the grouping column. (#234)
  • Add Matt Riggott to AUTHORS. (#231)
  • Support file-like objects in Table.to_csv and Table.from_csv. (#229)
  • Fix bug when applying multiple computations with Table.compute.

0.8.0 - September 9, 2015

  • Cookbook: dealing with locales. (#220)
  • Cookbook: working with dates and times.
  • Add timezone support to DateTimeType.
  • Use pytimeparse instead of python-dateutil. (#221)
  • Handle percents and currency symbols when casting numbers. (#217)
  • Table.format is now Table.pretty_print. (#223)
  • Rename TextType to Text, NumberType to Number, etc.
  • Rename agate.ColumnType to agate.DataType (#216)
  • Rename agate.column_types to agate.data_types.
  • Implement locale support for number parsing. (#116)
  • Cookbook: ranking. (#110)
  • Cookbook: date change and date ranking. (#113)
  • Add tests for unicode support. (#138)
  • Fix computations.ZScores calculation. (#123)
  • Differentiate sample and population variance and stdev. (#208)
  • Support for overriding column inference with “force”.
  • Competition ranking implemented as default. (#125)
  • TypeTester: robust type inference. (#210)

0.7.0 - September 3, 2015

  • Cookbook: USA Today diversity index.
  • Cookbook: filter to top x%. (#47)
  • Cookbook: fuzzy string search example. (#176)
  • Values to coerce to true/false can now be overridden for BooleanType.
  • Values to coerce to null can now be overridden for all ColumnType subclasses. (#206)
  • Add key_type argument to TableSet and Table.group_by. (#205)
  • Nested TableSets and multi-dimensional aggregates. (#204)
  • TableSet.aggregate will now use key_name as the group column name. (#203)
  • Added key_name argument to TableSet and Table.group_by.
  • Added Length aggregation and removed count from TableSet.aggregate output. (#203)
  • Fix error messages for RowDoesNotExistError and ColumnDoesNotExistError.

0.6.0 - September 1, 2015

  • Fix missing package definition in setup.py.
  • Split Analysis off into the proof library.
  • Change computation now works with DateType, DateTimeType and TimeDeltaType. (#159)
  • TimeDeltaType and TimeDeltaColumn implemented.
  • NonNullAggregation class removed.
  • Some private Column methods made public. (#183)
  • Rename agate.aggregators to agate.aggregations.
  • TableSet.to_csv implemented. (#195)
  • TableSet.from_csv implemented. (#194)
  • Table.to_csv implemented (#169)
  • Table.from_csv implemented. (#168)
  • Added Table.format method for pretty-printing tables. (#191)
  • Analysis class now implements a caching workflow. (#171)

0.5.0 - August 28, 2015

  • Table now takes (column_name, column_type) pairs. (#180)
  • Renamed the library to agate. (#179)
  • Results of common column operations are now cached using a common memoize decorator. (#162)
  • Deprecated support for Python version 3.2.
  • Added support for Python wheel packaging. (#127)
  • Add PercentileRank computation and usage example to cookbook. (#152)
  • Add indexed change example to cookbook. (#151)
  • Add annual change example to cookbook. (#150)
  • Column.aggregate now invokes Aggregations.
  • Column.any, NumberColumn.sum, etc. converted to Aggregations.
  • Implement Aggregation and subclasses. (#155)
  • Move ColumnType subclasses and ColumnOperation subclasses into new modules.
  • Table.percent_change, Table.rank and Table.zscores reimplemented as Computers.
  • Computer implemented. Table.compute reimplemented. (#147)
  • NumberColumn.iqr (inter-quartile range) implemented. (#102)
  • Remove Column.counts as it is not the best way.
  • Implement ColumnOperation and subclasses.
  • Table.aggregate migrated to TableSet.aggregate.
  • Table.group_by now supports grouping by a key function. (#140)
  • NumberColumn.deciles implemented.
  • NumberColumn.quintiles implemented. (#46)
  • NumberColumn.quartiles implemented. (#45)
  • Added robust test case for NumberColumn.percentiles. (#129)
  • NumberColumn.percentiles reimplemented using new method. (#130)
  • Reorganized and modularized column implementations.
  • Table.group_by now returns a TableSet.
  • Implement TableSet object. (#141)

0.4.0 - September 27, 2014

  • Upgrade to python-dateutil 2.2. (#134)
  • Wrote introductory tutorial. (#133)
  • Reorganize documentation (#132)
  • Add John Heasly to AUTHORS.
  • Implement percentile. (#35)
  • no_null_computations now accepts args. (#122)
  • Table.z_scores implemented. (#123)
  • DateTimeColumn implemented. (#23)
  • Column.counts now returns dict instead of Table. (#109)
  • ColumnType.create_column renamed _create_column. (#118)
  • Added Mick O’Brien to AUTHORS. (#121)
  • Pearson correlation implemented. (#103)

0.3.0

  • DateType.date_format implemented. (#112)
  • Create ColumnType classes to simplify data parsing.
  • DateColumn implemented. (#7)
  • Cookbook: Excel pivot tables. (#41)
  • Cookbook: statistics, including outlier detection. (#82)
  • Cookbook: emulating Underscore’s any and all. (#107)
  • Parameter documentation for method parameters. (#108)
  • Table.rank now accepts a column name or key function.
  • Optionally use cdecimal for improved performance. (#106)
  • Smart naming of aggregate columns.
  • Duplicate columns names are now an error. (#92)
  • BooleanColumn implemented. (#6)
  • TextColumn.max_length implemented. (#95)
  • Table.find implemented. (#14)
  • Better error handling in Table.__init__. (#38)
  • Collapse IntColumn and FloatColumn into NumberColumn. (#64)
  • Table.mad_outliers implemented. (#93)
  • Column.mad implemented. (#93)
  • Table.stdev_outliers implemented. (#86)
  • Table.group_by implemented. (#3)
  • Cookbook: emulating R. (#81)
  • Table.left_outer_join now accepts column names or key functions. (#80)
  • Table.inner_join now accepts column names or key functions. (#80)
  • Table.distinct now accepts a column name or key function. (#80)
  • Table.order_by now accepts a column name or key function. (#80)
  • Table.rank implemented. (#15)
  • Reached 100% test coverage. (#76)
  • Tests for Column._cast methods. (#20)
  • Table.distinct implemented. (#83)
  • Use assertSequenceEqual in tests. (#84)
  • Docs: features section. (#87)
  • Cookbook: emulating SQL. (#79)
  • Table.left_outer_join implemented. (#11)
  • Table.inner_join implemented. (#11)

0.2.0

  • Python 3.2, 3.3 and 3.4 support. (#52)
  • Documented supported platforms.
  • Cookbook: csvkit. (#36)
  • Cookbook: glob syntax. (#28)
  • Cookbook: filter to values in range. (#30)
  • RowDoesNotExistError implemented. (#70)
  • ColumnDoesNotExistError implemented. (#71)
  • Cookbook: percent change. (#67)
  • Cookbook: sampling. (#59)
  • Cookbook: random sort order. (#68)
  • Eliminate Table.get_data.
  • Use tuples everywhere. (#66)
  • Fixes for Python 2.6 compatibility. (#53)
  • Cookbook: multi-column sorting. (#13)
  • Cookbook: simple sorting.
  • Destructive Table ops now deepcopy row data. (#63)
  • Non-destructive Table ops now share row data. (#63)
  • Table.sort_by now accepts a function. (#65)
  • Cookbook: pygal.
  • Cookbook: Matplotlib.
  • Cookbook: VLOOKUP. (#40)
  • Cookbook: Excel formulas. (#44)
  • Cookbook: Rounding to two decimal places. (#49)
  • Better repr for Column and Row. (#56)
  • Cookbook: Filter by regex. (#27)
  • Cookbook: Underscore filter & reject. (#57)
  • Table.limit implemented. (#58)
  • Cookbook: writing a CSV. (#51)
  • Kill Table.filter and Table.reject. (#55)
  • Column.map removed. (#43)
  • Column instance & data caching implemented. (#42)
  • Table.select implemented. (#32)
  • Eliminate repeated column index lookups. (#25)
  • Precise DecimalColumn tests.
  • Use Decimal type everywhere internally.
  • FloatColumn converted to DecimalColumn. (#17)
  • Added Eric Sagara to AUTHORS. (#48)
  • NumberColumn.variance implemented. (#1)
  • Cookbook: loading a CSV. (#37)
  • Table.percent_change implemented. (#16)
  • Table.compute implemented. (#31)
  • Table.filter and Table.reject now take funcs. (#24)
  • Column.count implemented. (#12)
  • Column.counts implemented. (#8)
  • Column.all implemented. (#5)
  • Column.any implemented. (#4)
  • Added Jeff Larson to AUTHORS. (#18)
  • NumberColumn.mode implemented. (#18)

0.1.0

  • Initial prototype

Show me docs

  • About - why you should use agate and the principles that guide its development
  • Install - how to install for users and developers
  • Tutorial - a step-by-step guide to start using agate
  • Cookbook - sample code showing how to accomplish dozens of common tasks, including comparisons to SQL, R, etc.
  • Extensions - a list of libraries that extend agate functionality and how to build your own
  • API - technical documentation for every agate feature
  • Changelog - a record of every change made to agate for each release

Show me code

import agate

purchases = agate.Table.from_csv('examples/realdata/ks_1033_data.csv')

by_county = purchases.group_by('county')

totals = by_county.aggregate([
    ('county_cost', agate.Sum('total_cost'))
])

totals = totals.order_by('county_cost', reverse=True)
totals.limit(10).print_bars('county', 'county_cost', width=80)
county     county_cost
SEDGWICK    977,174.45 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
COFFEY      691,749.03 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
MONTGOMERY  447,581.20 ▓░░░░░░░░░░░░░░░░░░░░░░░░░
JOHNSON     420,628.00 ▓░░░░░░░░░░░░░░░░░░░░░░░░
SALINE      245,450.24 ▓░░░░░░░░░░░░░░
FINNEY      171,862.20 ▓░░░░░░░░░░
BROWN       145,254.96 ▓░░░░░░░░
KIOWA        97,974.00 ▓░░░░░
WILSON       74,747.10 ▓░░░░
FORD         70,780.00 ▓░░░░
                       +-------------+-------------+-------------+-------------+
                       0          250,000       500,000       750,000  1,000,000

This example, along with detailed comments, is available as a Jupyter notebook.

Join us

  • Contributing - guidance for developers who want to contribute to agate
  • Release process - the process for maintainers to publish new releases
  • License - a copy of the MIT open source license covering agate

Who we are

agate is made by a community. The following individuals have contributed code, documentation, or expertise to agate:
