agate 1.6.0¶
agate is a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.
agate was previously known as journalism.
Important links:
- Documentation: http://agate.rtfd.org
- Repository: https://github.com/wireservice/agate
- Issues: https://github.com/wireservice/agate/issues
About agate¶
Why agate?¶
- A readable and user-friendly API.
- A complete set of SQL-like operations.
- Unicode support everywhere.
- Decimal precision everywhere.
- Exhaustive user documentation.
- Pluggable extensions that add SQL integration, Excel support, and more.
- Designed with IPython, Jupyter and Atom/Hydrogen in mind.
- Pure Python. No C dependencies to compile.
- Exhaustive test coverage.
- MIT licensed and free for all purposes.
- Zealously zen.
- Made with love.
Principles¶
agate is intended to fill a very particular programming niche. It should not be allowed to become as complex as numpy or pandas. Please bear in mind the following principles when considering a new feature:
- Humans have less time than computers. Optimize for humans.
- Most datasets are small. Don’t optimize for “big data”.
- Text is data. It must always be a first-class citizen.
- Python gets it right. Make it work like Python does.
- Human lives are nasty, brutish and short. Make it easy.
- Mutability leads to confusion. Processes that alter data must create new copies.
- Extensions are the way. Don’t add it to core unless everybody needs it.
Installation¶
Users¶
To use agate, install it with pip:
pip install agate
Note
Need more speed? Upgrade to Python 3. It’s 3-5x faster than Python 2.
If you must use Python 2, you can pip install cdecimal for a performance boost.
Developers¶
If you are a developer who also wants to hack on agate, install it from git:
git clone git://github.com/wireservice/agate.git
cd agate
mkvirtualenv agate
# If running Python 3 (strongly recommended for development)
pip install -r requirements-py3.txt
# If running Python 2
pip install -r requirements-py2.txt
python setup.py develop
tox
Note
To run the agate tests with coverage:
nosetests --with-coverage tests
Tutorial¶
The agate tutorial is now available in new-and-improved Jupyter Notebook format.
Find it on Github!
Cookbook¶
Welcome to the agate cookbook, a source of how-to’s and use cases.
Creating tables¶
From data in memory¶
From a list of lists.
column_names = ['letter', 'number']
column_types = [agate.Text(), agate.Number()]
rows = [
    ('a', 1),
    ('b', 2),
    ('c', None)
]
table = agate.Table(rows, column_names, column_types)
From a list of dictionaries.
rows = [
    dict(letter='a', number=1),
    dict(letter='b', number=2),
    dict(letter='c', number=None)
]
table = agate.Table.from_object(rows)
From a CSV¶
By default, loading a table from a CSV will use agate’s builtin TypeTester
to infer column types:
table = agate.Table.from_csv('filename.csv')
Override type inference¶
In some cases agate’s TypeTester
may guess incorrectly. To override the type for some columns and use TypeTester for the rest, pass a dictionary to the column_types
argument.
specified_types = {
    'column_name_one': agate.Text(),
    'column_name_two': agate.Number()
}
table = agate.Table.from_csv('filename.csv', column_types=specified_types)
Internally, this uses a generic TypeTester and overrides inference for your specified columns via its force argument.
Limit type inference¶
For large datasets TypeTester may be unreasonably slow. To limit the amount of data it uses, you can specify the limit argument. Note that if data after the limit invalidates the TypeTester’s inference, you may get errors when the data is loaded.
tester = agate.TypeTester(limit=100)
table = agate.Table.from_csv('filename.csv', column_types=tester)
Manually specify columns¶
If you know the types of your data you may find it more efficient to manually specify the names and types of your columns. This also gives you an opportunity to rename columns when you load them.
text_type = agate.Text()
number_type = agate.Number()
column_names = ['city', 'area', 'population']
column_types = [text_type, number_type, number_type]
table = agate.Table.from_csv('population.csv', column_names, column_types)
Or, you can use this method to load data from a file that does not have a header row:
table = agate.Table.from_csv('population.csv', column_names, column_types, header=False)
From a unicode CSV¶
You don’t have to do anything special. It just works!
From a latin1 CSV¶
table = agate.Table.from_csv('census.csv', encoding='latin1')
From a semicolon delimited CSV¶
Normally, agate will automatically guess the delimiter of your CSV, but if that guess fails you can specify it manually:
table = agate.Table.from_csv('filename.csv', delimiter=';')
From a TSV (tab-delimited CSV)¶
This is the same as the previous example, but in this case we specify that the delimiter is a tab:
table = agate.Table.from_csv('filename.csv', delimiter='\t')
From JSON¶
table = agate.Table.from_json('filename.json')
From newline-delimited JSON¶
table = agate.Table.from_json('filename.json', newline=True)
From a SQL database¶
Use the agate-sql extension.
import agatesql
agatesql.patch()
table = agate.Table.from_sql('postgresql:///database', 'input_table')
From an Excel spreadsheet¶
Use the agate-excel extension. It supports both .xls and .xlsx files.
import agateexcel
agateexcel.patch()
table = agate.Table.from_xls('test.xls', sheet='data')
table2 = agate.Table.from_xlsx('test.xlsx', sheet='data')
From a DBF table¶
DBF is the file format used to hold tabular data for ArcGIS shapefiles. Use the agate-dbf extension.
import agatedbf
agatedbf.patch()
table = agate.Table.from_dbf('test.dbf')
From a remote file¶
Use the agate-remote extension.
import agateremote
agateremote.patch()
table = agate.Table.from_url('https://raw.githubusercontent.com/onyxfish/agate/master/examples/test.csv')
agate-remote also lets you create an Archive, which is a reference to a group of tables with a known path structure.
archive = agateremote.Archive('https://github.com/vincentarelbundock/Rdatasets/raw/master/csv/')
table = archive.get_table('sandwich/PublicSchools.csv')
Save a table¶
To a CSV¶
table.to_csv('filename.csv')
To JSON¶
table.to_json('filename.json')
To newline-delimited JSON¶
table.to_json('filename.json', newline=True)
Remove columns¶
Include specific columns¶
Create a new table with only a specific set of columns:
include_columns = ['column_name_one', 'column_name_two']
new_table = table.select(include_columns)
Exclude specific columns¶
Create a new table without a specific set of columns:
exclude_columns = ['column_name_one', 'column_name_two']
new_table = table.exclude(exclude_columns)
Filter rows¶
By regex¶
You can use Python’s builtin re module to introduce a regular expression into a Table.where() query.
For example, here we find all states that start with “C”.
import re
new_table = table.where(lambda row: re.match('^C', row['state']))
This can also be useful for finding values that don’t match your expectations. For example, finding all values in the “phone number” column that don’t look like phone numbers:
new_table = table.where(lambda row: not re.match(r'\d{3}-\d{3}-\d{4}', row['phone']))
By glob¶
Hate regexes? You can use glob (fnmatch) syntax too!
from fnmatch import fnmatch
new_table = table.where(lambda row: fnmatch('C*', row['state']))
Values within a range¶
This snippet filters the dataset to incomes between 100,000 and 200,000.
new_table = table.where(lambda row: 100000 < row['income'] < 200000)
Dates within a range¶
This snippet filters the dataset to events during the summer of 2015:
import datetime
new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))
If you want to filter to events during the summer of any year:
new_table = table.where(lambda row: 6 <= row['date'].month <= 8)
Top N percent¶
To filter a dataset to the top 10 percent of values, we first compute the percentiles for the column and then use the result in the Table.where() truth test:
percentiles = table.aggregate(agate.Percentiles('salary'))
top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])
Random sample¶
By combining a random sort with limiting, we can effectively get a random sample from a table.
import random
randomized = table.order_by(lambda row: random.random())
sampled = randomized.limit(10)
Ordered sample¶
We can also get an ordered sample by simply using the step parameter of the Table.limit() method to get every Nth row.
sampled = table.limit(step=10)
Distinct values¶
You can retrieve a distinct list of values in a column using Column.values_distinct() or Table.distinct(). Table.distinct() returns the entire row, so it’s necessary to chain a select on the specific column.
columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)
new_table.columns['value'].values_distinct()
# or
new_table.distinct('value').columns['value'].values()
(Decimal('1'), Decimal('2'), Decimal('5'))
Sort¶
Reverse order¶
The order of any sort can be reversed by using the reverse keyword:
new_table = table.order_by('birth_date', reverse=True)
Multiple columns¶
Because Python’s internal sorting works natively with sequences, we can implement multi-column sort by returning a tuple from the key function.
new_table = table.order_by(lambda row: (row['last_name'], row['first_name']))
This table will now be ordered by last_name, then first_name.
Random order¶
import random
new_table = table.order_by(lambda row: random.random())
Search¶
Exact search¶
Find all individuals with the last_name “Groskopf”:
family = table.where(lambda r: r['last_name'] == 'Groskopf')
Fuzzy search by edit distance¶
By leveraging an existing Python library for computing the Levenshtein edit distance it is trivially easy to implement a fuzzy string search.
For example, to find all names within 2 edits of “Groskopf”:
from Levenshtein import distance
fuzzy_family = table.where(lambda r: distance(r['last_name'], 'Groskopf') <= 2)
These results will now include all those “Grosskopfs” and “Groskoffs” whose mail I am always getting.
Fuzzy search by phonetic similarity¶
By using Fuzzy to calculate phonetic similarity, it is possible to implement a fuzzy phonetic search.
For example to find all rows with first_name phonetically similar to “Catherine”:
import fuzzy
dmetaphone = fuzzy.DMetaphone(4)
phonetic_search = dmetaphone('Catherine')
def phonetic_match(r):
return any(x in dmetaphone(r['first_name']) for x in phonetic_search)
phonetic_family = table.where(lambda r: phonetic_match(r))
Standardize names and values¶
Standardize row and columns names¶
The Table.rename() method has arguments to convert row or column names to slugs and append unique identifiers to duplicate values.
Using an existing table object:
# Convert column names to unique slugs
table.rename(slug_columns=True)
# Convert row names to unique slugs
table.rename(slug_rows=True)
# Convert both column and row names to unique slugs
table.rename(slug_columns=True, slug_rows=True)
Standardize column values¶
agate has a Slug computation that can be used to standardize text column values. The computation has an option to append unique identifiers to duplicate values.
Using an existing table object:
# Convert the values in column 'title' to slugs
new_table = table.compute([
    ('title-slug', agate.Slug('title'))
])

# Convert the values in column 'title' to unique slugs
new_table = table.compute([
    ('title-slug', agate.Slug('title', ensure_unique=True))
])
Statistics¶
Common descriptive and aggregate statistics are included with the core agate library. For additional statistical methods beyond the scope of agate consider using the agate-stats extension or integrating with scipy.
Descriptive statistics¶
agate includes a full set of standard descriptive statistics that can be applied to any column containing Number data.
table.aggregate(agate.Sum('salary'))
table.aggregate(agate.Min('salary'))
table.aggregate(agate.Max('salary'))
table.aggregate(agate.Mean('salary'))
table.aggregate(agate.Median('salary'))
table.aggregate(agate.Mode('salary'))
table.aggregate(agate.Variance('salary'))
table.aggregate(agate.StDev('salary'))
table.aggregate(agate.MAD('salary'))
Or, get several at once:
table.aggregate([
    agate.Min('salary'),
    agate.Mean('salary'),
    agate.Max('salary')
])
Aggregate statistics¶
You can also generate aggregate statistics for subsets of data (sometimes referred to as “rolling up”):
doctors = patients.group_by('doctor')
patient_ages = doctors.aggregate([
    ('patient_count', agate.Count()),
    ('age_mean', agate.Mean('age')),
    ('age_median', agate.Median('age'))
])
The resulting table will have four columns: doctor, patient_count, age_mean and age_median.
You can roll up by multiple columns by chaining agate’s Table.group_by() method.
doctors_by_state = patients.group_by('state').group_by('doctor')
Distribution by count (frequency)¶
Counting the number of each unique value in a column can be accomplished with the Table.pivot() method:
# Counts of a single column's values
table.pivot('doctor')
# Counts of all combinations of more than one column's values
table.pivot(['doctor', 'hospital'])
The resulting tables will have a column for each key column and a Count column counting the number of instances of each value.
Distribution by percent¶
Table.pivot() can also be used to calculate the distribution of values as a percentage of the total number:
# Percents of a single column's values
table.pivot('doctor', computation=agate.Percent('Count'))
# Percents of all combinations of more than one column's values
table.pivot(['doctor', 'hospital'], computation=agate.Percent('Count'))
The output table will be the same format as the previous example, except the value column will be named Percent.
Identify outliers¶
The agate-stats extension adds methods for finding outliers.
import agatestats
agatestats.patch()
outliers = table.stdev_outliers('salary', deviations=3, reject=False)
By specifying reject=True you can instead return a table including only those values not identified as outliers.
not_outliers = table.stdev_outliers('salary', deviations=3, reject=True)
The second, more robust, method for identifying outliers is by identifying values which are more than some number of “median absolute deviations” from the median (typically 3).
outliers = table.mad_outliers('salary', deviations=3, reject=False)
As with the first example, you can specify reject=True to exclude outliers in the resulting table.
Custom statistics¶
You can also generate custom aggregated statistics for your data by defining your own ‘summary’ aggregation. This might be especially useful for performing calculations unique to your data. Here’s a simple example:
# Create a custom summary aggregation with agate.Summary
# Input a column name, a return data type and a function to apply on the column
count_millionaires = agate.Summary('salary', agate.Number(), lambda r: sum(salary > 1000000 for salary in r.values()))
table.aggregate([
    count_millionaires
])
Your custom aggregation can be used to determine both descriptive and aggregate statistics shown above.
Compute new values¶
Change¶
new_table = table.compute([
    ('2000_change', agate.Change('2000', '2001')),
    ('2001_change', agate.Change('2001', '2002')),
    ('2002_change', agate.Change('2002', '2003'))
])
Or, better yet, compute the whole decade using a loop:
computations = []

for year in range(2000, 2010):
    change = agate.Change(str(year), str(year + 1))
    computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Percent¶
Calculate the percentage for each value in a column with Percent. Values are divided by the sum of the column by default.
columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)
new_table = new_table.compute([
    ('percent', agate.Percent('value'))
])
new_table.print_table()
| value | percent |
| ----- | ------- |
| 1 | 10 |
| 2 | 20 |
| 2 | 20 |
| 5 | 50 |
Override the denominator with a keyword argument.
new_table = new_table.compute([
    ('percent', agate.Percent('value', 5))
])
new_table.print_table()
| value | percent |
| ----- | ------- |
| 1 | 20 |
| 2 | 40 |
| 2 | 40 |
| 5 | 100 |
Percent change¶
Want percent change instead of value change? Just swap out the Computation:
computations = []

for year in range(2000, 2010):
    change = agate.PercentChange(str(year), str(year + 1))
    computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Indexed/cumulative change¶
Need your change indexed to a starting year? Just fix the first argument:
computations = []

for year in range(2000, 2010):
    change = agate.Change('2000', str(year + 1))
    computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Of course you can also use PercentChange if you need percents rather than values.
Round to two decimal places¶
agate stores numerical values using Python’s decimal.Decimal type. This data type ensures numerical precision beyond what is supported by the native float() type; however, because of this we can not use Python’s builtin round() function. Instead we must use decimal.Decimal.quantize().
We can use Table.compute() to apply the quantize to generate a rounded column from an existing one:
from decimal import Decimal
number_type = agate.Number()
def round_price(row):
    return row['price'].quantize(Decimal('0.01'))
new_table = table.compute([
    ('price_rounded', agate.Formula(number_type, round_price))
])
To round to one decimal place you would simply change 0.01 to 0.1.
Difference between dates¶
Calculating the difference between dates (or dates and times) works exactly the same as it does for numbers:
new_table = table.compute([
    ('age_at_death', agate.Change('born', 'died'))
])
Levenshtein edit distance¶
The Levenshtein edit distance is a common measure of string similarity. It can be used, for instance, to check for typos between manually-entered names and a version that is known to be spelled correctly.
Implementing Levenshtein requires writing a custom Computation. To save ourselves building the whole thing from scratch, we will lean on the python-Levenshtein library for the actual algorithm.
import agate
from Levenshtein import distance
class LevenshteinDistance(agate.Computation):
    """
    Computes Levenshtein edit distance between the column and a given string.
    """
    def __init__(self, column_name, compare_string):
        self._column_name = column_name
        self._compare_string = compare_string

    def get_computed_data_type(self, table):
        """
        The return value is a numerical distance.
        """
        return agate.Number()

    def validate(self, table):
        """
        Verify the column is text.
        """
        column = table.columns[self._column_name]

        if not isinstance(column.data_type, agate.Text):
            raise agate.DataTypeError('Can only be applied to Text data.')

    def run(self, table):
        """
        Find the distance, returning null when the input column was null.
        """
        new_column = []

        for row in table.rows:
            val = row[self._column_name]

            if val is None:
                new_column.append(None)
            else:
                new_column.append(distance(val, self._compare_string))

        return new_column
This code can now be applied to any Table just as any other Computation would be:
new_table = table.compute([
    ('distance', LevenshteinDistance('column_name', 'string to compare'))
])
The resulting column will contain an integer measuring the edit distance between the value in the column and the comparison string.
USA Today Diversity Index¶
The USA Today Diversity Index is a widely cited method for evaluating the racial diversity of a given area. Using a custom Computation makes it simple to calculate.
Assuming that your data has a column for the total population, another for the population of each race and a final column for the hispanic population, you can implement the diversity index like this:
class USATodayDiversityIndex(agate.Computation):
    def get_computed_data_type(self, table):
        return agate.Number()

    def run(self, table):
        new_column = []

        for row in table.rows:
            race_squares = 0

            for race in ['white', 'black', 'asian', 'american_indian', 'pacific_islander']:
                race_squares += (row[race] / row['population']) ** 2

            hispanic_squares = (row['hispanic'] / row['population']) ** 2
            hispanic_squares += (1 - (row['hispanic'] / row['population'])) ** 2

            new_column.append((1 - (race_squares * hispanic_squares)) * 100)

        return new_column
We apply the diversity index like any other computation:
with_index = table.compute([
    ('diversity_index', USATodayDiversityIndex())
])
Simple Moving Average¶
A simple moving average is the average of some number of prior values in a series. It is typically used to smooth out variation in time series data.
The following custom Computation will compute a simple moving average. This example assumes your data is already sorted.
class SimpleMovingAverage(agate.Computation):
    """
    Computes the simple moving average of a column over some interval.
    """
    def __init__(self, column_name, interval):
        self._column_name = column_name
        self._interval = interval

    def get_computed_data_type(self, table):
        """
        The return value is a numerical average.
        """
        return agate.Number()

    def validate(self, table):
        """
        Verify the column is numerical.
        """
        column = table.columns[self._column_name]

        if not isinstance(column.data_type, agate.Number):
            raise agate.DataTypeError('Can only be applied to Number data.')

    def run(self, table):
        new_column = []

        for i, row in enumerate(table.rows):
            if i < self._interval:
                new_column.append(None)
            else:
                values = tuple(r[self._column_name] for r in table.rows[i - self._interval:i])

                if None in values:
                    new_column.append(None)
                else:
                    new_column.append(sum(values) / self._interval)

        return new_column
You would use the simple moving average like so:
with_average = table.compute([
    ('six_month_moving_average', SimpleMovingAverage('price', 6))
])
Dates and times¶
Specify a date format¶
By default agate will attempt to guess the format of a Date or DateTime column. In some cases, it may not be possible to automatically figure out the format of a date. In this case you can supply a datetime.datetime.strptime() formatting string to specify how the dates should be parsed. For example, if your dates were formatted as “15-03-15” (March 15th, 2015) then you could specify:
date_type = agate.Date('%d-%m-%y')
Another use for this feature is if you have a column that contains extraneous data. For instance, imagine that your column contains hours and minutes, but they are always zero. It would make more sense to load that data as type Date and ignore the extra time information:
date_type = agate.Date('%m/%d/%Y 00:00')
Specify a timezone¶
Timezones are hard. Under normal circumstances (no arguments specified), agate will not try to parse timezone information, nor will it apply a timezone to the datetime.datetime instances it creates. (They will be naive in Python parlance.) There are two ways to force timezone data into your agate columns.
The first is to use a format string, as shown above, and specify a pattern for timezone information:
datetime_type = agate.DateTime('%Y-%m-%d %H:%M:%S%z')
The second way is to specify a timezone as an argument to the type constructor:
import pytz
eastern = pytz.timezone('US/Eastern')
datetime_type = agate.DateTime(timezone=eastern)
In this case all timezones that are processed will be set to have the Eastern timezone. Note, the timezone will be set, not converted. You cannot use this method to convert your timezones from UTC to another timezone. To do that see Convert timezones.
Calculate a time difference¶
Convert timezones¶
If you load data from a spreadsheet in one timezone and you need to convert it to another, you can do this using a Formula. Your datetime column must have timezone data for the following example to work. See Specify a timezone.
import pytz

us_eastern = pytz.timezone('US/Eastern')
datetime_type = agate.DateTime(timezone=us_eastern)

column_names = ['what', 'when']
column_types = [agate.Text(), datetime_type]

table = agate.Table.from_csv('events.csv', column_names, column_types)

rome = pytz.timezone('Europe/Rome')
timezone_shifter = agate.Formula(agate.DateTime(), lambda r: r['when'].astimezone(rome))

table = table.compute([
    ('when_in_rome', timezone_shifter)
])
Emulate SQL¶
agate’s command structure is very similar to SQL. The primary difference between agate and SQL is that commands like SELECT and WHERE explicitly create new tables. You can chain them together as you would with SQL, but be aware each command is actually creating a new table.
Note
All examples in this section use the PostgreSQL dialect for comparison.
If you want to read and write data from SQL, see From a SQL database.
WHERE¶
SQL:
SELECT * FROM table WHERE LOWER(state) = 'california';
agate:
new_table = table.where(lambda row: row['state'].lower() == 'california')
ORDER BY¶
SQL:
SELECT * FROM table ORDER BY total DESC;
agate:
new_table = table.order_by(lambda row: row['total'], reverse=True)
DISTINCT¶
SQL:
SELECT DISTINCT ON (state) * FROM table;
agate:
new_table = table.distinct('state')
Note
Unlike most SQL implementations, agate always returns the full row. Use Table.select() if you want to filter the columns first.
INNER JOIN¶
SQL (two ways):
SELECT * FROM patient, doctor WHERE patient.doctor = doctor.id;
SELECT * FROM patient INNER JOIN doctor ON (patient.doctor = doctor.id);
agate:
joined = patients.join(doctors, 'doctor', 'id', inner=True)
LEFT OUTER JOIN¶
SQL:
SELECT * FROM patient LEFT OUTER JOIN doctor ON (patient.doctor = doctor.id);
agate:
joined = patients.join(doctors, 'doctor', 'id')
FULL OUTER JOIN¶
SQL:
SELECT * FROM patient FULL OUTER JOIN doctor ON (patient.doctor = doctor.id);
agate:
joined = patients.join(doctors, 'doctor', 'id', full_outer=True)
GROUP BY¶
agate’s Table.group_by() works slightly differently than SQL’s. It does not require an aggregate function. Instead it returns a TableSet. To see how to perform the equivalent of a SQL aggregate, see below.
doctors = patients.group_by('doctor')
You can group by two or more columns by chaining the command.
doctors_by_state = patients.group_by('state').group_by('doctor')
HAVING¶
agate’s TableSet.having() works very similarly to SQL’s keyword of the same name.
doctors = patients.group_by('doctor')
popular_doctors = doctors.having([
    ('patient_count', agate.Count())
], lambda t: t['patient_count'] > 100)
This filters to only those doctors whose table includes at least 100 results. You can add as many aggregations as you want to the list and each will be available, by name, in the test function you pass.
For example, here we filter to popular doctors with an average review of at least three stars:
doctors = patients.group_by('doctor')
popular_doctors = doctors.having([
    ('patient_count', agate.Count()),
    ('average_stars', agate.Mean('stars'))
], lambda t: t['patient_count'] > 100 and t['average_stars'] >= 3)
Chain commands together¶
SQL:
SELECT state, total FROM table WHERE LOWER(state) = 'california' ORDER BY total DESC;
agate:
new_table = table \
    .select(['state', 'total']) \
    .where(lambda row: row['state'].lower() == 'california') \
    .order_by('total', reverse=True)
Note
Chaining commands in this way is sometimes not a good idea. Being explicit about each step can lead to clearer code.
Aggregate functions¶
SQL:
SELECT mean(age), median(age) FROM patients GROUP BY doctor;
agate:
doctors = patients.group_by('doctor')
patient_ages = doctors.aggregate([
    ('patient_count', agate.Count()),
    ('age_mean', agate.Mean('age')),
    ('age_median', agate.Median('age'))
])
The resulting table will have four columns: doctor, patient_count, age_mean and age_median.
Emulate Excel¶
One of agate’s most powerful assets is that instead of a wimpy “formula” language, you have the entire Python language at your disposal. Here are examples of how to translate a few common Excel operations.
Simple formulas¶
If you need to simulate a simple Excel formula you can use the Formula class to apply an arbitrary function.
Excel:
=($A1 + $B1) / $C1
agate:
def f(row):
    return (row['a'] + row['b']) / row['c']
new_table = table.compute([
('new_column', agate.Formula(agate.Number(), f))
])
If this still isn’t enough flexibility, you can also create your own subclass of Computation.
SUM¶
number_type = agate.Number()
def five_year_total(row):
    columns = ('2009', '2010', '2011', '2012', '2013')

    return sum(row[c] for c in columns)
formula = agate.Formula(number_type, five_year_total)
new_table = table.compute([
    ('five_year_total', formula)
])
TRIM¶
new_table = table.compute([
    ('name_stripped', agate.Formula(text_type, lambda r: r['name'].strip()))
])
CONCATENATE¶
new_table = table.compute([
    ('full_name', agate.Formula(text_type, lambda r: '%(first_name)s %(middle_name)s %(last_name)s' % r))
])
IF¶
new_table = table.compute([
    ('mvp_candidate', agate.Formula(boolean_type, lambda r: r['batting_average'] > 0.3))
])
VLOOKUP¶
There are two ways to get the equivalent of Excel’s VLOOKUP with agate. If your lookup source is another agate Table, then you’ll want to use the Table.join() method:
new_table = mvp_table.join(states, 'state_abbr')
This will add all the columns from the states table to the mvp_table, where their state_abbr columns match.
If your lookup source is a Python dictionary or some other object you can implement the lookup using a Formula computation:
states = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    ...
}

new_table = table.compute([
    ('mvp_candidate', agate.Formula(text_type, lambda r: states[r['state_abbr']]))
])
Pivot tables as cross-tabulations¶
Pivot tables in Excel implement a tremendous range of functionality. Agate divides this functionality into a few different methods.
If what you want is to convert rows to columns to create a “crosstab”, then you’ll want to use the Table.pivot() method:
jobs_by_state_and_year = employees.pivot('state', 'year')
This will generate a table with a row for each value in the state column and a column for each value in the year column. The intersecting cells will contain the counts grouped by state and year. You can pass the aggregation keyword to aggregate some other value, such as Mean or Median.
Pivot tables as summaries¶
On the other hand, if what you want is to summarize your table with descriptive statistics, then you’ll want to use Table.group_by() and TableSet.aggregate():
jobs = employees.group_by('job_title')
summary = jobs.aggregate([
    ('employee_count', agate.Count()),
    ('salary_mean', agate.Mean('salary')),
    ('salary_median', agate.Median('salary'))
])
The resulting summary table will have four columns: job_title, employee_count, salary_mean and salary_median.
You may also want to look at the Table.normalize() and Table.denormalize() methods for examples of functionality frequently accomplished with Excel’s pivot tables.
Emulate R¶
c()¶
Agate’s Table.select() and Table.exclude() are the equivalent of R’s c for selecting columns.
R:
selected <- data[c("last_name", "first_name", "age")]
excluded <- data[!c("last_name", "first_name", "age")]
agate:
selected = table.select(['last_name', 'first_name', 'age'])
excluded = table.exclude(['last_name', 'first_name', 'age'])
subset¶
Agate’s Table.where() is the equivalent of R’s subset.
R:
newdata <- subset(data, age >= 20 | age < 10)
agate:
new_table = table.where(lambda row: row['age'] >= 20 or row['age'] < 10)
order¶
Agate’s Table.order_by() is the equivalent of R’s order.
R:
newdata <- employees[order(last_name),]
agate:
new_table = employees.order_by('last_name')
merge¶
Agate’s Table.join() is the equivalent of R’s merge.
R:
joined <- merge(employees, states, by="usps")
agate:
joined = employees.join(states, 'usps')
rbind¶
Agate’s Table.merge() is the equivalent of R’s rbind.
R:
merged <- rbind(first_year, second_year)
agate:
merged = agate.Table.merge(first_year, second_year)
aggregate¶
Agate’s Table.group_by() and TableSet.aggregate() can be used to recreate the functionality of R’s aggregate.
R:
aggregates = aggregate(employees$salary, list(job = employees$job), mean)
agate:
jobs = employees.group_by('job')
aggregates = jobs.aggregate([
    ('mean', agate.Mean('salary'))
])
melt¶
Agate’s Table.normalize()
is the equivalent of R’s melt
.
R:
melt(employees, id=c("last_name", "first_name"))
agate:
employees.normalize(['last_name', 'first_name'])
cast¶
Agate’s Table.denormalize()
is the equivalent of R’s cast
.
R:
melted = melt(employees, id=c("name"))
casted = cast(melted, name~variable, mean)
agate:
normalized = employees.normalize(['name'])
denormalized = normalized.denormalize('name')
Emulate underscore.js¶
filter¶
agate’s Table.where()
functions exactly like Underscore’s filter
.
new_table = table.where(lambda row: row['state'] == 'Texas')
reject¶
To simulate Underscore’s reject
, simply negate the return value of the function you pass into agate’s Table.where()
.
new_table = table.where(lambda row: not (row['state'] == 'Texas'))
find¶
agate’s Table.find()
works exactly like Underscore’s find
.
row = table.find(lambda row: row['state'].startswith('T'))
any¶
The Any
aggregation works like Underscore’s any
.
true_or_false = table.aggregate(agate.Any('salaries', lambda d: d > 100000))
You can also use Table.where() to filter to the rows that pass the truth test.
Homogenize rows¶
Fill in missing rows in a series. This can be used, for instance, to add rows for missing years in a time series.
Create rows for missing values¶
We can insert a default row for each value that is missing in a table from a given sequence of values.
Starting with a table like this, we can fill in rows for all missing years:
year | female_count | male_count |
---|---|---|
1997 | 2 | 1 |
2000 | 4 | 3 |
2002 | 4 | 5 |
2003 | 1 | 2 |
key = 'year'
expected_values = (1997, 1998, 1999, 2000, 2001, 2002, 2003)
# Your default row should specify column values not in `key`
default_row = (0, 0)
new_table = table.homogenize(key, expected_values, default_row)
The result will be:
year | female_count | male_count |
---|---|---|
1997 | 2 | 1 |
1998 | 0 | 0 |
1999 | 0 | 0 |
2000 | 4 | 3 |
2001 | 0 | 0 |
2002 | 4 | 5 |
2003 | 1 | 2 |
Create dynamic rows based on missing values¶
We can also specify new row values with a value-generating function:
key = 'year'
expected_values = (1997, 1998, 1999, 2000, 2001, 2002, 2003)
# If default row is a function, it should return a full row
def default_row(missing_value):
    return (missing_value, missing_value - 1997, missing_value - 1997)

new_table = table.homogenize(key, expected_values, default_row)
The new table will be:
year | female_count | male_count |
---|---|---|
1997 | 2 | 1 |
1998 | 1 | 1 |
1999 | 2 | 2 |
2000 | 4 | 3 |
2001 | 4 | 4 |
2002 | 4 | 5 |
2003 | 1 | 2 |
Renaming and reordering columns¶
Rename columns¶
You can rename the columns in a table by using the Table.rename()
method and specifying the new column names as an array or dictionary mapping old column names to new ones.
table = agate.Table(rows, column_names=['a', 'b', 'c'])
new_table = table.rename(column_names=['one', 'two', 'three'])
# or
new_table = table.rename(column_names={'a': 'one', 'b': 'two', 'c': 'three'})
Reorder columns¶
You can reorder the columns in a table by using the Table.select()
method and specifying the column names in the order you want:
new_table = table.select(['3rd_column_name', '1st_column_name', '2nd_column_name'])
Transform¶
Pivot by a single column¶
The Table.pivot()
method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:
name | race | gender | age |
---|---|---|---|
Joe | white | female | 20 |
Jane | asian | male | 20 |
Jill | black | female | 20 |
Jim | latino | male | 25 |
Julia | black | female | 25 |
Joan | asian | female | 25 |
In the very simplest case, this table can be pivoted to count the number of occurrences of values in a column:
transformed = table.pivot('race')
Result:
race | pivot |
---|---|
white | 1 |
asian | 2 |
black | 2 |
latino | 1 |
Pivot by multiple columns¶
You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:
transformed = table.pivot(['race', 'gender'])
Result:
race | gender | pivot |
---|---|---|
white | female | 1 |
asian | male | 1 |
black | female | 2 |
latino | male | 1 |
asian | female | 1 |
For the column version, you would do:
transformed = table.pivot('race', 'gender')
Result:
race | male | female |
---|---|---|
white | 0 | 1 |
asian | 1 | 1 |
black | 0 | 2 |
latino | 1 | 0 |
Pivot to sum¶
The default pivot aggregation is Count
but you can also supply other operations. For example, to aggregate each group by Sum
of their ages:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
race | male | female |
---|---|---|
white | 0 | 20 |
asian | 20 | 25 |
black | 0 | 45 |
latino | 25 | 0 |
Pivot to percent of total¶
Pivot allows you to apply a Computation
to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
race | male | female |
---|---|---|
white | 0 | 14.8 |
asian | 14.8 | 18.5 |
black | 0 | 33.3 |
latino | 18.5 | 0 |
Note: actual computed percentages will be much more precise.
It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.
Denormalize key/value columns into separate columns¶
It’s common for very large datasets to be distributed in a “normalized” format, such as:
name | property | value |
---|---|---|
Jane | gender | female |
Jane | race | black |
Jane | age | 24 |
... | ... | ... |
The Table.denormalize()
method can be used to transform the table so that each unique property has its own column.
transformed = table.denormalize('name', 'property', 'value')
Result:
name | gender | race | age |
---|---|---|---|
Jane | female | black | 24 |
Jack | male | white | 35 |
Joe | male | black | 28 |
Normalize separate columns into key/value columns¶
Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:
name | gender | race | age |
---|---|---|---|
Jane | female | black | 24 |
Jack | male | white | 35 |
Joe | male | black | 28 |
The Table.normalize()
method can be used to transform the table so that all the properties and their values share two columns.
transformed = table.normalize('name', ['gender', 'race', 'age'])
Result:
name | property | value |
---|---|---|
Jane | gender | female |
Jane | race | black |
Jane | age | 24 |
... | ... | ... |
Locales¶
agate strives to work equally well for users from all parts of the world. This means properly handling foreign currencies, date formats, etc. To facilitate this, agate makes a hard distinction between your locale and the locale of the data you are working with. This allows you to work seamlessly with data from other countries.
Set your locale¶
Setting your locale will change how numbers are displayed when you print an agate Table or serialize it to, for example, a CSV file. This works the same as it does for any other Python module. See the locale documentation for details. Changing your locale will not affect how numbers are parsed from the files you are using. To change how data is parsed see Specify locale of numbers.
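As a minimal sketch, changing your display locale uses the standard library's locale module. The locale name below is only an example and must be installed on your system, so the sketch falls back to the user default:

```python
import locale

# 'de_DE.UTF-8' is an example locale name; it must be installed
# on your system for setlocale to accept it.
try:
    locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
except locale.Error:
    # Fall back to the user's default environment locale
    locale.setlocale(locale.LC_ALL, '')
```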
Specify locale of numbers¶
To correctly parse numbers from non-US locales, you must pass a locale
parameter to the Number
constructor. For example, to parse Dutch numbers (which use a period to separate thousands and a comma to separate fractions):
dutch_numbers = agate.Number(locale='nl_NL')
column_names = ['city', 'population']
column_types = [agate.Text(), dutch_numbers]
table = agate.Table.from_csv('dutch_cities.csv', column_names, column_types)
Rank¶
There are many ways to rank a sequence of values. agate strives to find a balance between simple, intuitive ranking and flexibility when you need it.
Competition rank¶
The basic rank supported by agate is standard “competition ranking”. In this model the values [3, 4, 4, 5]
would be ranked [1, 2, 2, 4]
. You can apply competition ranking using the Rank
computation:
new_table = table.compute([
    ('rank', agate.Rank('value'))
])
Rank descending¶
Descending competition ranking is specified using the reverse
argument.
new_table = table.compute([
    ('rank', agate.Rank('value', reverse=True))
])
Rank change¶
You can compute the change from one rank to another by combining the Rank
and Change
computations:
new_table = table.compute([
    ('rank2014', agate.Rank('value2014')),
    ('rank2015', agate.Rank('value2015'))
])

new_table2 = new_table.compute([
    ('rank_change', agate.Change('rank2014', 'rank2015'))
])
Percentile rank¶
“Percentile rank” is a bit of a misnomer. Really, this is the percentile in which each value in a column is located. This column can be computed for your data using the PercentileRank
computation:
new_table = table.compute([
    ('percentile_rank', agate.PercentileRank('value'))
])
Note that there is no entirely standard method for computing percentiles. The percentiles computed in this manner may not agree precisely with those generated by other software. See the Percentiles
class documentation for implementation details.
Charts¶
Agate offers two kinds of built-in charting: very simple text bar charts and SVG charting via leather. Both are intended for efficiently exploring data, rather than producing publication-ready charts.
Text-based bar chart¶
agate has a builtin text-based bar-chart generator:
table.limit(10).print_bars('State Name', 'TOTAL', width=80)
State Name TOTAL
ALABAMA 19,582 ▓░░░░░░░░░░░░░
ALASKA 2,705 ▓░░
ARIZONA 46,743 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
ARKANSAS 7,932 ▓░░░░░
CALIFORNIA 76,639 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
COLORADO 21,485 ▓░░░░░░░░░░░░░░░
CONNECTICUT 4,350 ▓░░░
DELAWARE 1,904 ▓░
DIST. OF COLUMBIA 2,185 ▓░
FLORIDA 59,519 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
+-------------+------------+------------+-------------+
0 20,000 40,000 60,000 80,000
Text-based histogram¶
Table.print_bars()
can be combined with Table.pivot()
or Table.bins()
to produce fast histograms:
table.bins('TOTAL', start=0, end=100000).print_bars('TOTAL', width=80)
TOTAL Count
[0 - 10,000) 30 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
[10,000 - 20,000) 12 ▓░░░░░░░░░░░░░░░░░░░░░░
[20,000 - 30,000) 7 ▓░░░░░░░░░░░░░
[30,000 - 40,000) 1 ▓░░
[40,000 - 50,000) 2 ▓░░░░
[50,000 - 60,000) 1 ▓░░
[60,000 - 70,000) 1 ▓░░
[70,000 - 80,000) 1 ▓░░
[80,000 - 90,000) 0 ▓
[90,000 - 100,000] 0 ▓
+-------------+------------+------------+-------------+
0.0 7.5 15.0 22.5 30.0
SVG column chart¶
table.limit(10).column_chart('State Name', 'TOTAL', 'docs/images/column_chart.svg')
SVG line chart¶
by_year_exonerated = table.group_by('exonerated')
counts = by_year_exonerated.aggregate([
    ('count', agate.Count())
])
counts.order_by('exonerated').line_chart('exonerated', 'count', 'docs/images/line_chart.svg')
SVG lattice chart¶
top_crimes = table.group_by('crime').having([
    ('count', agate.Count())
], lambda t: t['count'] > 100)

by_year = top_crimes.group_by('exonerated')
counts = by_year.aggregate([
    ('count', agate.Count())
])

by_crime = counts.group_by('crime')
by_crime.order_by('exonerated').line_chart('exonerated', 'count', 'docs/images/lattice.svg')
Using matplotlib¶
If you need to make more complex charts, you can always use agate with matplotlib.
Here is an example of how you might generate a line chart:
import pylab
pylab.plot(table.columns['homeruns'], table.columns['wins'])
pylab.xlabel('Homeruns')
pylab.ylabel('Wins')
pylab.title('How homeruns correlate to wins')
pylab.show()
Lookup¶
Generate new columns by mapping existing data to common lookup tables.
CPI deflation¶
The agate-lookup extension adds a lookup
method to agate’s Table class.
Starting with a table that looks like this:
year | cost |
---|---|
1995 | 2.0 |
1997 | 2.2 |
1996 | 2.3 |
2003 | 4.0 |
2007 | 5.0 |
2005 | 6.0 |
We can map the year
column to its annual CPI index in one lookup call.
import agatelookup
agatelookup.patch()
join_year_cpi = table.lookup('year', 'cpi')
The returned table will now have a new column:
year | cost | cpi |
---|---|---|
1995 | 2.0 | 152.383 |
1997 | 2.2 | 160.525 |
1996 | 2.3 | 156.858 |
2003 | 4.0 | 184.000 |
2007 | 5.0 | 207.344 |
2005 | 6.0 | 195.267 |
A simple computation tacked on to this lookup can then get the 2015 equivalent values of each cost:
from decimal import Decimal

cpi_2015 = Decimal('216.909')

def cpi_adjust_2015(row):
    return (row['cost'] * (cpi_2015 / row['cpi'])).quantize(Decimal('0.01'))

cost_2015 = join_year_cpi.compute([
    ('cost_2015', agate.Formula(agate.Number(), cpi_adjust_2015))
])
And the final table will look like this:
year | cost | cpi | cost_2015 |
---|---|---|---|
1995 | 2.0 | 152.383 | 2.85 |
1997 | 2.2 | 160.525 | 2.97 |
1996 | 2.3 | 156.858 | 3.18 |
2003 | 4.0 | 184.000 | 4.72 |
2007 | 5.0 | 207.344 | 5.23 |
2005 | 6.0 | 195.267 | 6.66 |
Basics¶
Coming from other tools¶
- SQL
- Excel
- R
- Underscore.js
- Pandas (coming soon!)
Advanced techniques¶
- Filling missing rows in a dataset
- Renaming and reordering columns
- Transforming data (pivot/normalize/denormalize)
- Setting your locale and working with foreign data
- Ranking a sequence of data
- Creating simple charts
- Mapping columns to common lookup tables
Have a common use case that isn’t covered? Please submit an issue on the GitHub repository.
Extensions¶
The core agate library is designed to rely on as few dependencies as possible. However, in the real world you’re often going to want to interface with more specialized tools, or with other formats, such as SQL or Excel.
Using extensions¶
agate supports plugin-style extensions using a monkey-patching pattern. Libraries can be created that add new methods onto Table and TableSet. For example, agate-sql adds the ability to read and write tables from a SQL database:
import agate
import agatesql
# After calling patch the from_sql and to_sql methods are now part of the Table class
table = agate.Table.from_sql('postgresql:///database', 'input_table')
table.to_sql('postgresql:///database', 'output_table')
List of extensions¶
Here is a list of agate extensions that are known to be actively maintained:
- agate-sql: Read and write tables in SQL databases
- agate-stats: Additional statistical methods
- agate-excel: Read excel tables (xls and xlsx)
- agate-dbf: Read dbf tables (from shapefiles)
- agate-remote: Read from remote files
- agate-lookup: Instantly join to hosted lookup tables.
Writing your own extensions¶
Writing your own extensions is straightforward. Create a function that acts as your “patch” and then dynamically add it to Table
or TableSet
.
import agate
def new_method(self):
    print('I do something to a Table when you call me.')

agate.Table.new_method = new_method
You can also create new classmethods:
def new_class_method(cls):
    print('I make Tables when you call me.')

agate.Table.new_class_method = classmethod(new_class_method)
These methods can now be called on Table
class in your code:
>>> import agate
>>> import myextension
>>> table = agate.Table(rows, column_names, column_types)
>>> table.new_method()
I do something to a Table when you call me.
>>> agate.Table.new_class_method()
I make Tables when you call me.
The same pattern also works for adding methods to TableSet
.
API¶
Table¶
The Table
object is the most important class in agate. Tables are
created by supplying row data, column names and subclasses of DataType
to the constructor. Once created, the data in a table can not be changed.
This concept is central to agate.
Instead of modifying the data, various methods can be used to create new,
derivative tables. For example, the Table.select()
method creates a new
table with only the specified columns. The Table.where()
method creates
a new table with only those rows that pass a test. And Table.order_by()
creates a sorted table. In all of these cases the output is a new Table
and the existing table remains unmodified.
Tables are not themselves iterable, but the columns of the table can be
accessed via Table.columns
and the rows via Table.rows
. Both
sequences can be accessed either by numeric index or by name. (In the case of
rows, row names are optional.)
agate.Table |
A dataset consisting of rows and columns. |
Properties¶
agate.Table.columns |
A MappedSequence with column names for keys and Column instances for values. |
agate.Table.column_names |
A tuple of strings. |
agate.Table.column_types |
A tuple of DataType instances. |
agate.Table.rows |
A MappedSequence with row names for keys (if specified) and Row instances for values. |
agate.Table.row_names |
A tuple of strings, if this table has row names. |
Creating¶
agate.Table.from_csv |
Create a new table from a CSV. |
agate.Table.from_json |
Create a new table from a JSON file. |
agate.Table.from_fixed |
Create a new table from a fixed-width file and a CSV schema. |
agate.Table.from_object |
Create a new table from a Python object. |
Saving¶
agate.Table.to_csv |
Write this table to a CSV. |
agate.Table.to_json |
Write this table to a JSON file or file-like object. |
Basic processing¶
agate.Table.distinct |
Create a new table with only unique rows. |
agate.Table.exclude |
Create a new table without the specified columns. |
agate.Table.find |
Find the first row that passes a test. |
agate.Table.limit |
Create a new table with fewer rows. |
agate.Table.order_by |
Create a new table that is sorted. |
agate.Table.select |
Create a new table with only the specified columns. |
agate.Table.where |
Create a new Table with only those rows that pass a test. |
Calculating new data¶
agate.Table.aggregate |
Apply one or more Aggregation instances to this table. |
agate.Table.compute |
Create a new table by applying one or more Computation instances to each row. |
Advanced processing¶
agate.Table.bins |
Generates (approximately) evenly sized bins for the values in a column. |
agate.Table.denormalize |
Create a new table with row values converted into columns. |
agate.Table.group_by |
Create a TableSet with a table for each unique key. |
agate.Table.homogenize |
Fill in missing rows in a series. |
agate.Table.join |
Create a new table by joining two tables on common values. |
agate.Table.merge |
Create a new table from a sequence of similar tables. |
agate.Table.normalize |
Create a new table with columns converted into row values. |
agate.Table.pivot |
Create a new table by grouping the data, aggregating those groups, applying a computation, and then organizing the groups into new rows and columns. |
agate.Table.rename |
Create a copy of this table with different column names or row names. |
Previewing¶
agate.Table.print_bars |
Print a text-based bar chart based on this table. |
agate.Table.print_csv |
Print this table as a CSV. |
agate.Table.print_html |
Print an HTML version of this table. |
agate.Table.print_json |
Print this table as JSON. |
agate.Table.print_structure |
Print this table’s column names and types as a plain-text table. |
agate.Table.print_table |
Print a text-based view of the data in this table. |
Charting¶
agate.Table.bar_chart |
Render a bar chart using leather.Chart . |
agate.Table.column_chart |
Render a column chart using leather.Chart . |
agate.Table.line_chart |
Render a line chart using leather.Chart . |
agate.Table.scatterplot |
Render a scatterplot using leather.Chart . |
Detailed list¶
class agate.Table(rows, column_names=None, column_types=None, row_names=None, _is_fork=False)¶
Bases: object
A dataset consisting of rows and columns. Columns refer to “vertical” slices of data that must all be of the same type. Rows refer to “horizontal” slices of data that may (and usually do) contain mixed types.
The sequence of Column instances is retrieved via the Table.columns property. They may be accessed by either numeric index or by unique column name.
The sequence of Row instances is retrieved via the Table.rows property. They may be accessed by either numeric index or, if specified, unique row names.
Parameters:
- rows – The data as a sequence of any sequences: tuples, lists, etc. If any row has fewer values than the number of columns, it will be filled out with nulls. No row may have more values than the number of columns.
- column_names – A sequence of string names for each column or None, in which case column names will be automatically assigned using letter_name().
- column_types – A sequence of instances of DataType, or an instance of TypeTester, or None, in which case a generic TypeTester will be used. Alternatively, a dictionary with column names as keys and instances of DataType as values to specify some types.
- row_names – Specifies unique names for each row. This parameter is optional. If specified it may be 1) the name of a single column that contains a unique identifier for each row, 2) a key function that takes a Row and returns a unique identifier or 3) a sequence of unique identifiers of the same length as the sequence of rows. The uniqueness of resulting identifiers is not validated, so be certain the values you provide are truly unique.
- _is_fork – Used internally to skip certain validation steps when data is propagated from an existing table. When True, rows are assumed to be Row instances, rather than raw data.
aggregate(aggregations)¶
Apply one or more Aggregation instances to this table.
Parameters: aggregations – A single Aggregation instance or a sequence of tuples in the format (name, aggregation), where each aggregation is an instance of Aggregation.
Returns: If the input was a single Aggregation then a single result will be returned. If it was a sequence then an OrderedDict of results will be returned.
bar_chart(label=0, value=1, path=None, width=None, height=None)¶
Render a bar chart using leather.Chart.
Parameters:
- label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
- value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
- width – The width of the output SVG.
- height – The height of the output SVG.
bins(column_name, count=10, start=None, end=None)¶
Generates (approximately) evenly sized bins for the values in a column. Bins may not be perfectly even if the spread of the data does not divide evenly, but all values will always be included in some bin.
The resulting table will have two columns. The first will have the same name as the specified column, but will be of type Text. The second will be named count and will be of type Number.
Parameters:
- column_name – The name of the column to bin. Must be of type Number.
- count – The number of bins to create. If not specified then each value will be counted as its own bin.
- start – The minimum value to start the bins at. If not specified the minimum value in the column will be used.
- end – The maximum value to end the bins at. If not specified the maximum value in the column will be used.
Returns: A new Table.
column_chart(label=0, value=1, path=None, width=None, height=None)¶
Render a column chart using leather.Chart.
Parameters:
- label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
- value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If None and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string.
- width – The width of the output SVG.
- height – The height of the output SVG.
column_names¶
A tuple of strings.
columns¶
A MappedSequence with column names for keys and Column instances for values.
compute(computations, replace=False)¶
Create a new table by applying one or more Computation instances to each row.
Parameters:
- computations – A sequence of pairs of new column names and Computation instances.
- replace – If True then new column names can match existing names, and those columns will be replaced with the computed data.
Returns: A new Table.
denormalize(key=None, property_column='property', value_column='value', default_value=<object object>, column_types=None)¶
Create a new table with row values converted into columns.
For example:
name | property | value |
---|---|---|
Jane | gender | female |
Jane | race | black |
Jane | age | 24 |
... | ... | ... |
Can be denormalized so that each unique value in property_column becomes a column, with value_column used for its values.
name | gender | race | age |
---|---|---|---|
Jane | female | black | 24 |
Jack | male | white | 35 |
Joe | male | black | 28 |
If one or more keys are specified then the resulting table will automatically have row_names set to those keys.
This is the opposite of Table.normalize().
Parameters:
- key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them. Or, None if there are no key columns.
- property_column – The column whose values should become column names in the new table.
- value_column – The column whose values should become the values of the new columns.
- default_value – Value to be used for missing values in the pivot table. If not specified Decimal(0) will be used for aggregations that return Number data and None will be used for all others.
- column_types – A sequence of column types with length equal to number of unique values in property_column or an instance of TypeTester. Defaults to a generic TypeTester.
Returns: A new Table.
distinct(key=None)¶
Create a new table with only unique rows.
Parameters: key – Either the name of a single column to use to identify unique rows, a sequence of such column names, a function that takes a row and returns a value to identify unique rows, or None, in which case the entire row will be checked for uniqueness.
Returns: A new Table.
exclude(key)¶
Create a new table without the specified columns.
Parameters: key – Either the name of a single column to exclude or a sequence of such names.
Returns: A new Table.
find(test)¶
Find the first row that passes a test.
Parameters: test (function) – A function that takes a Row and returns True if it matches.
Returns: A single Row if found, or None.
classmethod from_csv(path, column_names=None, column_types=None, row_names=None, skip_lines=0, header=True, sniff_limit=0, encoding='utf-8', **kwargs)¶
Create a new table from a CSV.
This method uses agate’s builtin CSV reader, which supplies encoding support for both Python 2 and Python 3.
kwargs will be passed through to the CSV reader.
Parameters:
- path – Filepath or file-like object from which to read CSV data. If a file-like object is specified, it must be seekable. If using Python 2, the file should be opened in binary mode (rb).
- column_names – See Table.__init__().
- column_types – See Table.__init__().
- row_names – See Table.__init__().
- skip_lines – The number of lines to skip from the top of the file. Note that skip lines will not work with
- header – If True, the first row of the CSV is assumed to contain column names. If header and column_names are both specified then a row will be skipped, but column_names will be used.
- sniff_limit – Limit CSV dialect sniffing to the specified number of bytes. Set to None to sniff the entire file. Defaults to 0 (no sniffing).
- encoding – Character encoding of the CSV file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
classmethod from_fixed(path, schema_path, column_names=<object object>, column_types=None, row_names=None, encoding='utf-8', schema_encoding='utf-8')¶
Create a new table from a fixed-width file and a CSV schema.
Schemas must be in the “ffs” format. There is a repository of such schemas maintained at wireservice/ffs.
Parameters:
- path – File path or file-like object from which to read fixed-width data.
- schema_path – File path or file-like object from which to read schema (CSV) data.
- column_names – By default, these will be parsed from the schema. For alternatives, see Table.__init__().
- column_types – See Table.__init__().
- row_names – See Table.__init__().
- encoding – Character encoding of the fixed-width file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
- schema_encoding – Character encoding of the schema file. Note: if passing in a file handle it is assumed you have already opened it with the correct encoding specified.
classmethod from_json(path, row_names=None, key=None, newline=False, column_types=None, **kwargs)¶
Create a new table from a JSON file.
Once the JSON has been deserialized, the resulting Python object is passed to Table.from_object().
If the file contains a top-level dictionary you may specify what property contains the row list using the key parameter.
kwargs will be passed through to json.load().
Parameters:
- path – Filepath or file-like object from which to read JSON data.
- row_names – See Table.__init__().
- key – The key of the top-level dictionary that contains a list of row arrays.
- newline – If True then the file will be parsed as “newline-delimited JSON”.
- column_types – See Table.__init__().
classmethod from_object(obj, row_names=None, column_types=None)¶
Create a new table from a Python object.
The object should be a list containing a dictionary for each “row”. Nested objects or lists will also be parsed. For example, this object:
{ 'one': { 'a': 1, 'b': 2, 'c': 3 }, 'two': [4, 5, 6], 'three': 'd' }
Would generate these columns and values:
{ 'one/a': 1, 'one/b': 2, 'one/c': 3, 'two.0': 4, 'two.1': 5, 'two.2': 6, 'three': 'd' }
Column names and types will be inferred from the data.
Not all rows are required to have the same keys. Missing elements will be filled in with null values.
Parameters:
- obj – The Python object from which to create the table (see above).
- row_names – See Table.__init__().
- column_types – See Table.__init__().
group_by(key, key_name=None, key_type=None)¶
Create a TableSet with a table for each unique key.
Note that group names will always be coerced to a string, regardless of the format of the input column.
Parameters:
- key – Either the name of a column from this table to group by, or a function that takes a row and returns a value to group by.
- key_name – A name that describes the grouped properties. Defaults to the column name that was grouped on or “group” if grouping with a key function. See TableSet for more.
- key_type – An instance of any subclass of DataType. If not provided it will default to a Text.
Returns: A TableSet mapping where the keys are unique values from the key and the values are new Table instances containing the grouped rows.
homogenize(key, compare_values, default_row=None)¶
Fill in missing rows in a series.
This can be used, for instance, to add rows for missing years in a time series.
Missing rows are found by comparing the values in the key columns with those provided as compare_values.
Values not found in the table will be used to generate new rows with the given default_row.
default_row should be an array of values or an array-generating function. If not specified, the new rows will have None in all columns not specified in key.
If default_row is an array of values, its length should be the row length minus the number of column names provided in the key.
If it is an array-generating function, the function should take an array of missing values for each new row and output a full row including those values.
Parameters:
- key – Either a column name or a sequence of such names.
- compare_values – Either an array of column values if key is a single column name or a sequence of arrays of values if key is a sequence of names. It can also be a generator that yields either of the two. A row is created for each value or list of values not found in the rows of the table.
- default_row – An array of values or a function to generate new rows. The length of the input array should be equal to row length minus column_names count. The length of the array generated by the function should be the row length.
Returns: A new Table.
-
join
(right_table, left_key=None, right_key=None, inner=False, full_outer=False, require_match=False, columns=None)¶ Create a new table by joining two tables on common values. This method implements most varieties of SQL join, in addition to some unique features.
If left_key and right_key are both None then this method will perform a “sequential join”, which is to say it will join on row number. The inner and full_outer arguments will determine whether dangling left-hand and right-hand rows are included, respectively.
If left_key is specified, then a “left outer join” will be performed. This will combine columns from the right_table anywhere that left_key and right_key are equal. Unmatched rows from the left table will be included with the right-hand columns set to None.
If inner is True then an “inner join” will be performed. Unmatched rows from either table will be left out.
If full_outer is True then a “full outer join” will be performed. Unmatched rows from both tables will be included, with the columns in the other table set to None.
In all cases, if right_key is None then left_key will be used for both tables.
If left_key and right_key are column names, the right-hand identifier column will not be included in the output table.
If require_match is True, unmatched rows will raise an exception. This is like an “inner join” except any row that doesn’t have a match will raise an exception instead of being dropped. This is useful for enforcing expectations about datasets that should match.
Column names from the right table which also exist in this table will be suffixed “2” in the new table.
A subset of columns from the right-hand table can be included in the joined table using the
columns
argument.Parameters: - right_table – The “right” table to join to.
- left_key – Either the name of a column from this table to join on, the index of a column, a sequence of such column identifiers, a function that takes a row and returns a value to join on, or None, in which case the tables will be joined on row number.
- right_key – Either the name of a column from right_table to join on, the index of a column, a sequence of such column identifiers, or a function that takes a row and returns a value to join on. If None then left_key will be used for both. If left_key is None then this value is ignored.
- inner – Perform a SQL-style “inner join” instead of a left outer join. Rows which have no match for left_key will not be included in the output table.
- full_outer – Perform a SQL-style “full outer” join rather than a left or a right. May not be used in combination with inner.
- require_match – If true, an exception will be raised if there is a left_key with no matching right_key.
- columns – A sequence of column names from right_table to include in the final output table. Defaults to all columns not in right_key. Ignored when full_outer is True.
Returns: A new
Table
.
-
limit
(start_or_stop=None, stop=None, step=None)¶ Create a new table with fewer rows.
See also: Python’s builtin
slice()
.Parameters: - start_or_stop – If the only argument, then how many rows to include, otherwise, the index of the first row to include.
- stop – The index of the last row to include.
- step – The size of the jump between rows to include. (step=2 will return every other row.)
Returns: A new
Table
.
-
line_chart
(x=0, y=1, path=None, width=None, height=None)¶ Render a line chart using
leather.Chart
.Parameters: - x – The name or index of a column to plot as the x-axis. Defaults to the first column in the table.
- y – The name or index of a column to plot as the y-axis. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
classmethod
merge
(tables, row_names=None, column_names=None)¶ Create a new table from a sequence of similar tables.
This method will not carry over row names from the merged tables, but new row names can be specified with the
row_names
argument.It is possible to limit the columns included in the new
Table
with the column_names
argument. For example, to only include columns from a specific table, set column_names equal to table.column_names
Returns: A new
Table
.
-
normalize
(key, properties, property_column='property', value_column='value', column_types=None)¶ Create a new table with columns converted into property and value rows.
For example:
name   gender  race   age
Jane   female  black  24
Jack   male    white  35
Joe    male    black  28

can be normalized on columns ‘gender’, ‘race’ and ‘age’:
name  property  value
Jane  gender    female
Jane  race      black
Jane  age       24
...   ...       ...

This is the opposite of
Table.denormalize()
Parameters: - key – A column name or a sequence of column names that should be maintained as they are in the normalized table. Typically these are the table’s unique identifiers and any metadata about them.
- properties – A column name or a sequence of column names that should be converted to properties in the new table.
- property_column – The name to use for the column containing the property names.
- value_column – The name to use for the column containing the property values.
- column_types – A sequence of two column types for the property and value column in
that order or an instance of
TypeTester
. Defaults to a genericTypeTester
.
Returns: A new
Table
.
-
order_by
(key, reverse=False)¶ Create a new table that is sorted.
Parameters: - key – Either the name of a single column to sort by, a sequence of such
names, or a
function
that takes a row and returns a value to sort by. - reverse – If True then sort in reverse (typically, descending) order.
Returns: A new
Table
.
-
pivot
(key=None, pivot=None, aggregation=None, computation=None, default_value=<object object>, key_name=None)¶ Create a new table by grouping the data, aggregating those groups, applying a computation, and then organizing the groups into new rows and columns.
This is sometimes called a “crosstab”.
name  race   gender
Joe   white  male
Jane  black  female
Josh  black  male
Jim   asian  female

This table can be pivoted with
key
equal to “race” and pivot
equal to “gender”. The default aggregation is Count
. This would result in the following table:

race   male  female
white  1     0
black  1     1
asian  0     1

If one or more keys are specified then the resulting table will automatically have
row_names
set to those keys.See also the related method
Table.denormalize()
.Parameters: - key – Either the name of a column from the this table to group by, a
sequence of such column names, a
function
that takes a row and returns a value to group by, orNone
, in which case there will be only a single row in the output table. - pivot – A column name whose unique values will become columns in the new
table, or
None
in which case there will be a single value column in the output table. - aggregation –
An instance of an
Aggregation
to perform on each group of data in the pivot table. (Each cell is the result of an aggregation of the grouped data.)If not specified this defaults to
Count
with no arguments. - computation –
An optional
Computation
instance to be applied to the aggregated sequence of values before they are transposed into the pivot table.Use the class name of the aggregation as your column name argument when constructing your computation. (This is “Count” if using the default value for
aggregation
.) - default_value – Value to be used for missing values in the pivot table. Defaults to
Decimal(0)
. If performing non-mathematical aggregations you may wish to set this toNone
. - key_name – A name for the key column in the output table. This is most
useful when the provided key is a function. This argument is not
valid when
key
is a sequence.
Returns: A new
Table
.
-
print_bars
(label_column_name='group', value_column_name='Count', domain=None, width=120, output=<open file '<stdout>', mode 'w'>, printable=False)¶ Print a text-based bar chart based on this table.
Parameters: - label_column_name – The column containing the label values. Defaults to
group
, which is the default output ofTable.pivot()
orTable.bins()
. - value_column_name – The column containing the bar values. Defaults to
Count
, which is the default output ofTable.pivot()
orTable.bins()
. - domain – A 2-tuple containing the minimum and maximum values for the chart’s x-axis. The domain must be large enough to contain all values in the column.
- width – The width, in characters, to use for the bar chart. Defaults to
120
. - output – A file-like object to print to. Defaults to
sys.stdout
- printable – If true, only printable characters will be output.
-
print_csv
(**kwargs)¶ Print this table as a CSV.
This is the same as passing
sys.stdout
to Table.to_csv()
. kwargs
will be passed on to Table.to_csv()
.
-
print_html
(max_rows=20, max_columns=6, output=<open file '<stdout>', mode 'w'>, max_column_width=20, locale=None)¶ Print an HTML version of this table.
Parameters: - max_rows – The maximum number of rows to display before truncating the data. This
defaults to
20
to prevent accidental printing of the entire table. PassNone
to disable the limit. - max_columns – The maximum number of columns to display before truncating the data.
This defaults to
6
to prevent wrapping in most cases. PassNone
to disable the limit. - output – A file-like object to print to. Defaults to
sys.stdout
, unless running in Jupyter. (See above.) - max_column_width – Truncate all columns to at most this width. The remainder will be replaced with ellipsis.
- locale – Provide a locale you would like to be used to format the output. By default it will use the system’s setting.
-
print_json
(**kwargs)¶ Print this table as JSON.
This is the same as passing
sys.stdout
to Table.to_json()
. kwargs
will be passed on to Table.to_json()
.
-
print_structure
(output=<open file '<stdout>', mode 'w'>, max_rows=None)¶ Print this table’s column names and types as a plain-text table.
Parameters: output – The output to print to.
-
print_table
(max_rows=20, max_columns=6, output=<open file '<stdout>', mode 'w'>, max_column_width=20, locale=None, max_precision=3)¶ Print a text-based view of the data in this table.
The output of this method is Github Friendly Markdown (GFM) compatible.
Parameters: - max_rows – The maximum number of rows to display before truncating the data. This
defaults to
20
to prevent accidental printing of the entire table. PassNone
to disable the limit. - max_columns – The maximum number of columns to display before truncating the data.
This defaults to
6
to prevent wrapping in most cases. PassNone
to disable the limit. - output – A file-like object to print to.
- max_column_width – Truncate all columns to at most this width. The remainder will be replaced with ellipsis.
- locale – Provide a locale you would like to be used to format the output. By default it will use the system’s setting.
- max_precision – Limits the maximum precision displayed for number types. Numbers with lesser precision won’t be affected. This defaults to
3
. Pass None
to disable the limit.
-
rename
(column_names=None, row_names=None, slug_columns=False, slug_rows=False, **kwargs)¶ Create a copy of this table with different column names or row names.
By enabling
slug_columns
or slug_rows
and not specifying new names you may slugify the table’s existing names. kwargs
will be passed to the slugify method in python-slugify. See: https://github.com/un33k/python-slugifyParameters: - column_names – New column names for the renamed table. May be either an array or a dictionary mapping existing column names to new names. If not specified, will use this table’s existing column names.
- row_names – New row names for the renamed table. May be either an array or a dictionary mapping existing row names to new names. If not specified, will use this table’s existing row names.
- slug_columns – If True, column names will be converted to slugs and duplicate names will have unique identifiers appended.
- slug_rows – If True, row names will be converted to slugs and duplicate names will have unique identifiers appended.
-
row_names
¶ A tuple of strings, if this table has row names.
If this table does not have row names, then
None
.
-
scatterplot
(x=0, y=1, path=None, width=None, height=None)¶ Render a scatterplot using
leather.Chart
.Parameters: - x – The name or index of a column to plot as the x-axis. Defaults to the first column in the table.
- y – The name or index of a column to plot as the y-axis. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
select
(key)¶ Create a new table with only the specified columns.
Parameters: key – Either the name of a single column to include or a sequence of such names. Returns: A new Table
.
-
to_csv
(path, **kwargs)¶ Write this table to a CSV. This method uses agate’s builtin CSV writer, which supports unicode on both Python 2 and Python 3.
kwargs will be passed through to the CSV writer.
Parameters: path – Filepath or file-like object to write to.
-
to_json
(path, key=None, newline=False, indent=None, **kwargs)¶ Write this table to a JSON file or file-like object.
kwargs
will be passed through to the JSON encoder.Parameters: - path – File path or file-like object to write to.
- key – If specified, JSON will be output as a hash instead of a list. May
be either the name of a column from this table containing
unique values or a
function
that takes a row and returns a unique value. - newline – If True, output will be in the form of “newline-delimited JSON”.
- indent – If specified, the number of spaces to indent the JSON for formatting.
TableSet¶
The TableSet
class collects a set of related tables in a single data
structure. The most common way of creating a TableSet
is using the
Table.group_by()
method, which is similar to SQL’s GROUP BY
keyword.
The resulting set of tables will all have identical column structure.
TableSet
functions as a dictionary. Individual tables in the set can
be accessed by using their name as a key. If the table set was created using
Table.group_by()
then the names of the tables will be the grouping
factors found in the original data.
TableSet
replicates the majority of the features of Table
.
When methods such as TableSet.select()
, TableSet.where()
or
TableSet.order_by()
are used, the operation is applied to each table
in the set and the result is a new TableSet
instance made up of
entirely new Table
instances.
TableSet
instances can also contain other TableSets. This means you
can chain calls to Table.group_by()
and TableSet.group_by()
and end up with data grouped across multiple dimensions.
TableSet.aggregate()
on nested TableSets will then group across multiple
dimensions.
agate.TableSet |
A group of named tables with identical column definitions. |
Properties¶
agate.TableSet.key_name |
Get the name of the key this TableSet is grouped by. |
agate.TableSet.key_type |
Get the DataType this TableSet is grouped by. |
agate.TableSet.column_types |
Get an ordered list of this TableSet ‘s column types. |
agate.TableSet.column_names |
Get an ordered list of this TableSet ‘s column names. |
Creating¶
agate.TableSet.from_csv |
Create a new TableSet from a directory of CSVs. |
agate.TableSet.from_json |
Create a new TableSet from a directory of JSON files or a single JSON object with key value (Table key and list of row objects) pairs for each Table . |
Saving¶
agate.TableSet.to_csv |
Write each table in this set to a separate CSV in a given directory. |
agate.TableSet.to_json |
Write TableSet to either a set of JSON files for each table or a single nested JSON file. |
Processing¶
agate.TableSet.aggregate |
Aggregate data from the tables in this set by performing some set of column operations on the groups and coalescing the results into a new Table . |
agate.TableSet.having |
Create a new TableSet with only those tables that pass a test. |
agate.TableSet.merge |
Convert this TableSet into a single table. |
Previewing¶
agate.TableSet.print_structure |
Print the keys and row counts of each table in the tableset. |
Charting¶
agate.TableSet.bar_chart |
Render a lattice/grid of bar charts using leather.Lattice . |
agate.TableSet.column_chart |
Render a lattice/grid of column charts using leather.Lattice . |
agate.TableSet.line_chart |
Render a lattice/grid of line charts using leather.Lattice . |
agate.TableSet.scatterplot |
Render a lattice/grid of scatterplots using leather.Lattice . |
Table Proxy Methods¶
agate.TableSet.bins |
Calls Table.bins() on each table in the TableSet. |
agate.TableSet.compute |
Calls Table.compute() on each table in the TableSet. |
agate.TableSet.denormalize |
Calls Table.denormalize() on each table in the TableSet. |
agate.TableSet.distinct |
Calls Table.distinct() on each table in the TableSet. |
agate.TableSet.exclude |
Calls Table.exclude() on each table in the TableSet. |
agate.TableSet.find |
Calls Table.find() on each table in the TableSet. |
agate.TableSet.group_by |
Calls Table.group_by() on each table in the TableSet. |
agate.TableSet.homogenize |
Calls Table.homogenize() on each table in the TableSet. |
agate.TableSet.join |
Calls Table.join() on each table in the TableSet. |
agate.TableSet.limit |
Calls Table.limit() on each table in the TableSet. |
agate.TableSet.normalize |
Calls Table.normalize() on each table in the TableSet. |
agate.TableSet.order_by |
Calls Table.order_by() on each table in the TableSet. |
agate.TableSet.pivot |
Calls Table.pivot() on each table in the TableSet. |
agate.TableSet.select |
Calls Table.select() on each table in the TableSet. |
agate.TableSet.where |
Calls Table.where() on each table in the TableSet. |
Detailed list¶
-
class
agate.
TableSet
(tables, keys, key_name='group', key_type=None, _is_fork=False)¶ Bases:
agate.mapped_sequence.MappedSequence
A group of named tables with identical column definitions. Supports (almost) all the same operations as
Table
. When executed on aTableSet
, any operation that would have returned a newTable
instead returns a newTableSet
. Any operation that would have returned a single value instead returns a dictionary of values.TableSet is implemented as a subclass of
MappedSequence
Parameters: - tables – A sequence of
Table
instances. - keys – A sequence of keys corresponding to the tables. These may be any type
except
int
. - key_name – A name that describes the grouping properties. Used as the column header when the groups are aggregated. Defaults to the column name that was grouped on.
- key_type – An instance of some subclass of
DataType
. If not provided it will default to Text. - _is_fork – Used internally to skip certain validation steps when data is propagated from an existing tableset.
-
aggregate
(aggregations)¶ Aggregate data from the tables in this set by performing some set of column operations on the groups and coalescing the results into a new
Table
.aggregations
must be a sequence of tuples, where each has two parts: a new_column_name
and an Aggregation
instance.The resulting table will have the keys from this
TableSet
(and any nested TableSets) set as its row_names
. See Table.__init__()
for more details.Parameters: aggregations – A list of tuples in the format (new_column_name, aggregation)
, where each aggregation
is an instance of Aggregation
.Returns: A new Table
.
-
bar_chart
(label=0, value=1, path=None, width=None, height=None)¶ Render a lattice/grid of bar charts using
leather.Lattice
.Parameters: - label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
- value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
bins
(*args, **kwargs)¶ Calls
Table.bins()
on each table in the TableSet.
-
column_chart
(label=0, value=1, path=None, width=None, height=None)¶ Render a lattice/grid of column charts using
leather.Lattice
.Parameters: - label – The name or index of a column to plot as the labels of the chart. Defaults to the first column in the table.
- value – The name or index of a column to plot as the values of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
column_types
¶ Get an ordered list of this
TableSet
‘s column types.Returns: A tuple
ofDataType
instances.
-
compute
(*args, **kwargs)¶ Calls
Table.compute()
on each table in the TableSet.
-
count
(value) → integer -- return number of occurrences of value¶
-
denormalize
(*args, **kwargs)¶ Calls
Table.denormalize()
on each table in the TableSet.
-
dict
()¶ Retrieve the contents of this sequence as a
collections.OrderedDict
.
-
distinct
(*args, **kwargs)¶ Calls
Table.distinct()
on each table in the TableSet.
-
exclude
(*args, **kwargs)¶ Calls
Table.exclude()
on each table in the TableSet.
-
find
(*args, **kwargs)¶ Calls
Table.find()
on each table in the TableSet.
-
classmethod
from_csv
(dir_path, column_names=None, column_types=None, row_names=None, header=True, **kwargs)¶ Create a new
TableSet
from a directory of CSVs.See
Table.from_csv()
for additional details.Parameters: - dir_path – Path to a directory full of CSV files. All CSV files in this directory will be loaded.
- column_names – See
Table.__init__()
. - column_types – See
Table.__init__()
. - row_names – See
Table.__init__()
. - header – See
Table.from_csv()
.
-
classmethod
from_json
(path, column_names=None, column_types=None, keys=None, **kwargs)¶ Create a new
TableSet
from a directory of JSON files or a single JSON object with key value (Table key and list of row objects) pairs for eachTable
.See
Table.from_json()
for additional details.Parameters: - path – Path to a directory containing JSON files or filepath/file-like object of nested JSON file.
- keys – A list of keys of the top-level dictionaries for each file. If specified, length must be equal to number of JSON files in path.
- column_types – See
Table.__init__()
.
-
get
(key, default=None)¶ Equivalent to
collections.OrderedDict.get()
.
-
group_by
(*args, **kwargs)¶ Calls
Table.group_by()
on each table in the TableSet.
-
having
(aggregations, test)¶ Create a new
TableSet
with only those tables that pass a test.This works by applying a sequence of
Aggregation
instances to each table. The resulting dictionary of properties is then passed to thetest
function.This method does not modify the underlying tables in any way.
Parameters: - aggregations – A list of tuples in the format
(name, aggregation)
, where eachaggregation
is an instance ofAggregation
. - test (
function
) – A function that takes a dictionary of aggregated properties and returnsTrue
if it should be included in the newTableSet
.
Returns: A new
TableSet
.
-
homogenize
(*args, **kwargs)¶ Calls
Table.homogenize()
on each table in the TableSet.
-
index
(value) → integer -- return first index of value.¶ Raises ValueError if the value is not present.
-
items
()¶ Equivalent to
collections.OrderedDict.items()
.
-
join
(*args, **kwargs)¶ Calls
Table.join()
on each table in the TableSet.
-
key_name
¶ Get the name of the key this TableSet is grouped by. (If created using
Table.group_by()
then this is the original column name.)
-
key_type
¶ Get the
DataType
this TableSet is grouped by. (If created usingTable.group_by()
then this is the original column type.)
-
keys
()¶ Equivalent to
collections.OrderedDict.keys()
.
-
limit
(*args, **kwargs)¶ Calls
Table.limit()
on each table in the TableSet.
-
line_chart
(x=0, y=1, path=None, width=None, height=None)¶ Render a lattice/grid of line charts using
leather.Lattice
.Parameters: - x – The name or index of a column to plot as the x axis of the chart. Defaults to the first column in the table.
- y – The name or index of a column to plot as the y axis of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
merge
(groups=None, group_name=None, group_type=None)¶ Convert this TableSet into a single table. This is the inverse of
Table.group_by()
.Any row_names set on the merged tables will be lost in this process.
Parameters: - groups – A list of grouping factors to add to merged rows in a new column.
If specified, it should have exactly one element per
Table
in theTableSet
. If not specified or None, the grouping factor will be the name of theRow
‘s original Table. - group_name – This will be the column name of the grouping factors. If None,
defaults to the
TableSet.key_name
. - group_type – This will be the column type of the grouping factors. If None,
defaults to the
TableSet.key_type
.
Returns: A new
Table
.
-
normalize
(*args, **kwargs)¶ Calls
Table.normalize()
on each table in the TableSet.
-
order_by
(*args, **kwargs)¶ Calls
Table.order_by()
on each table in the TableSet.
-
pivot
(*args, **kwargs)¶ Calls
Table.pivot()
on each table in the TableSet.
-
print_structure
(max_rows=20, output=<open file '<stdout>', mode 'w'>)¶ Print the keys and row counts of each table in the tableset.
Parameters: - max_rows – The maximum number of rows to display before truncating the data. Defaults to 20.
- output – The output used to print the structure of the
Table
.
Returns: None
-
scatterplot
(x=0, y=1, path=None, width=None, height=None)¶ Render a lattice/grid of scatterplots using
leather.Lattice
.Parameters: - x – The name or index of a column to plot as the x axis of the chart. Defaults to the first column in the table.
- y – The name or index of a column to plot as the y axis of the chart. Defaults to the second column in the table.
- path – If specified, the resulting SVG will be saved to this location. If
None
and running in IPython, then the SVG will be rendered inline. Otherwise, the SVG data will be returned as a string. - width – The width of the output SVG.
- height – The height of the output SVG.
-
select
(*args, **kwargs)¶ Calls
Table.select()
on each table in the TableSet.
-
to_csv
(dir_path, **kwargs)¶ Write each table in this set to a separate CSV in a given directory.
See
Table.to_csv()
for additional details.Parameters: dir_path – Path to the directory to write the CSV files to.
-
to_json
(path, nested=False, indent=None, **kwargs)¶ Write
TableSet
to either a set of JSON files for each table or a single nested JSON file.See
Table.to_json()
for additional details.Parameters: - path – Path to the directory to write the JSON file(s) to. If nested is True, this should be a file path or file-like object to write to.
- nested – If True, the output will be a single nested JSON file with each Table’s key paired with a list of row objects. Otherwise, the output will be a set of files for each table. Defaults to False.
- indent – See
Table.to_json()
.
-
values
()¶ Equivalent to
collections.OrderedDict.values()
.
-
where
(*args, **kwargs)¶ Calls
Table.where()
on each table in the TableSet.
Columns and rows¶
agate.MappedSequence |
A generic container for immutable data that can be accessed either by numeric index or by key. |
agate.Column |
Proxy access to column data. |
agate.Row |
A row of data. |
-
class
agate.
MappedSequence
(values, keys=None)¶ Bases:
_abcoll.Sequence
A generic container for immutable data that can be accessed either by numeric index or by key. This is similar to a
collections.OrderedDict
except that the keys are optional and iteration over it returns the values instead of keys.This is the base class for both
Column
andRow
.Parameters: - values – A sequence of values.
- keys – A sequence of keys.
-
dict
()¶ Retrieve the contents of this sequence as a
collections.OrderedDict
.
-
get
(key, default=None)¶ Equivalent to
collections.OrderedDict.get()
.
-
items
()¶ Equivalent to
collections.OrderedDict.items()
.
-
keys
()¶ Equivalent to
collections.OrderedDict.keys()
.
-
values
()¶ Equivalent to
collections.OrderedDict.values()
.
-
class
agate.
Column
(index, name, data_type, rows, row_names=None)¶ Bases:
agate.mapped_sequence.MappedSequence
Proxy access to column data. Instances of
Column
should not be constructed directly. They are created byTable
instances and are unique to them.Columns are implemented as subclass of
MappedSequence
. They deviate from the underlying implementation in that loading of their data is deferred until it is needed.Parameters: - name – The name of this column.
- data_type – An instance of
DataType
. - rows – A
MappedSequence
that contains theRow
instances containing the data for this column. - row_names – An optional list of row names (keys) for this column.
-
data_type
¶ This column’s data type.
-
index
¶ This column’s index.
-
name
¶ This column’s name.
-
values
()¶ Get the values in this column, as a tuple.
-
values_distinct
()¶ Get the distinct values in this column, as a tuple.
-
values_sorted
()¶ Get the values in this column sorted.
-
values_without_nulls
()¶ Get the values in this column with any null values removed.
-
values_without_nulls_sorted
()¶ Get the values in this column with any null values removed and sorted.
-
class
agate.
Row
(values, keys=None)¶ Bases:
agate.mapped_sequence.MappedSequence
A row of data. Values within a row can be accessed by column name or column index. Rows are immutable and may be shared between
Table
instances.Currently row instances are a no-op subclass of
MappedSequence
. They are being maintained in this fashion in order to support future features.
Data types¶
Data types define how data should be imported during the creation of a
Table
.
If column types are not explicitly specified when a Table
is created,
agate will attempt to guess them. The TypeTester
class can be used to
control how types are guessed.
agate.DataType |
Specifies how values should be parsed when creating a Table . |
Supported types¶
agate.Text |
Data representing text. |
agate.Number |
Data representing numbers. |
agate.Boolean |
Data representing true and false. |
agate.Date |
Data representing dates alone. |
agate.DateTime |
Data representing dates with times. |
agate.TimeDelta |
Data representing the interval between two dates and/or times. |
Detailed list¶
- class agate.DataType(null_values=('', 'na', 'n/a', 'none', 'null', '.'))¶
  Bases: object
  Specifies how values should be parsed when creating a Table.
  Parameters: null_values – A sequence of values which should be cast to None when encountered by this data type.
  - cast(d)¶ Coerce a given string value into this column’s data type.
  - csvify(d)¶ Format a given native value for CSV serialization.
  - jsonify(d)¶ Format a given native value for JSON serialization.
  - test(d)¶ Test, for purposes of type inference, if a value could possibly be coerced to this data type.
    This is really just a thin wrapper around DataType.cast().
- class agate.Text(cast_nulls=True, **kwargs)¶
  Bases: agate.data_types.base.DataType
  Data representing text.
  Parameters: cast_nulls – If True, values in DEFAULT_NULL_VALUES will be converted to None. Disable to retain them as strings.
  - cast(d)¶ Cast a single value to unicode() (str() in Python 3).
    Parameters: d – A value to cast.
    Returns: unicode() (str() in Python 3) or None
- class agate.Number(locale='en_US', group_symbol=None, decimal_symbol=None, currency_symbols=[u'\u060b', u'$', u'\u0192', u'\u17db', u'\xa5', u'\u20a1', u'\u20b1', u'\xa3', u'\u20ac', u'\xa2', u'\ufdfc', u'\u20aa', u'\u20a9', u'\u20ad', u'\u20ae', u'\u20a6', u'\u0e3f', u'\u20a4', u'\u20ab'], **kwargs)¶
  Bases: agate.data_types.base.DataType
  Data representing numbers.
  Parameters:
  - locale – A locale specification such as en_US or de_DE to use for parsing formatted numbers.
  - group_symbol – A grouping symbol used in the numbers. Overrides the value provided by the specified locale.
  - decimal_symbol – A decimal separator symbol used in the numbers. Overrides the value provided by the specified locale.
  - currency_symbols – A sequence of currency symbols to strip from numbers.
  - cast(d)¶ Cast a single value to a decimal.Decimal.
    Returns: decimal.Decimal or None.
- class agate.Boolean(true_values=('yes', 'y', 'true', 't', '1'), false_values=('no', 'n', 'false', 'f', '0'), null_values=('', 'na', 'n/a', 'none', 'null', '.'))¶
  Bases: agate.data_types.base.DataType
  Data representing true and false.
  Note that, by default, the numbers 1 and 0 are considered valid boolean values, but other numbers are not.
  Parameters:
  - true_values – A sequence of values which should be cast to True when encountered by this type.
  - false_values – A sequence of values which should be cast to False when encountered by this type.
- class agate.Date(date_format=None, **kwargs)¶
  Bases: agate.data_types.base.DataType
  Data representing dates alone.
  Parameters: date_format – A formatting string for datetime.datetime.strptime() to use instead of using regex-based parsing.
  - cast(d)¶ Cast a single value to a datetime.date.
    Parameters: date_format – An optional datetime.strptime() format string for parsing dates in this column.
    Returns: datetime.date or None.
- class agate.DateTime(datetime_format=None, timezone=None, **kwargs)¶
  Bases: agate.data_types.base.DataType
  Data representing dates with times.
  Parameters:
  - datetime_format – A formatting string for datetime.datetime.strptime() to use instead of using regex-based parsing.
  - timezone – A pytz timezone to apply to each parsed date.
  - cast(d)¶ Cast a single value to a datetime.datetime.
    Parameters: datetime_format – An optional datetime.strptime() format string for parsing datetimes in this column.
    Returns: datetime.datetime or None.
- class agate.TimeDelta(null_values=('', 'na', 'n/a', 'none', 'null', '.'))¶
  Bases: agate.data_types.base.DataType
  Data representing the interval between two dates and/or times.
  - cast(d)¶ Cast a single value to a datetime.timedelta.
    Parameters: d – A value to cast.
    Returns: datetime.timedelta or None
Type inference¶
- class agate.TypeTester(force={}, limit=None, types=None)¶
  Bases: object
  Controls how data types are inferred for the columns in a given set of data.
  This class is used by passing it to the column_types argument of the Table constructor, or to the same argument of any other method that creates a Table.
  Type inference can be a slow process. To limit the number of rows of data to be tested, pass the limit argument. Note that this may cause errors if your data contains different types of values after the specified number of rows.
  By default, data types will be tested against each column in this order:
  Individual types may be specified using the force argument. The type order may be changed, or entire types disabled, by using the types argument. Beware that changing the order of the types may cause unexpected behavior.
  Parameters:
  - force – A dictionary where each key is a column name and each value is a DataType instance that overrides inference.
  - limit – An optional limit on how many rows to evaluate before selecting the most likely type. Note that applying a limit may mean errors arise when the data is cast, if the guess proves incorrect in later rows of data.
  - types – A sequence of possible types to test against. This can be used to specify which data formats you want to test against. For instance, you may want to exclude TimeDelta from testing. It can also be used to pass options such as locale to Number or cast_nulls to Text. Take care in specifying the order of the list: it is the order the types are tested in, and Text should always be last.
  - run(rows, column_names)¶ Apply type inference to the provided data and return an array of column types.
    Parameters: rows – The data as a sequence of any sequences: tuples, lists, etc.
Aggregations¶
Aggregations create a new value by summarizing a Column. For example, Mean, when applied to a column containing Number data, returns a single decimal.Decimal value which is the average of all values in that column.
Aggregations can be applied to single columns using the Table.aggregate() method. The result is a single value if a single aggregation was applied, or a tuple of values if a sequence of aggregations was applied.
Aggregations can be applied to instances of TableSet using the TableSet.aggregate() method. The result is a new Table with a column for each aggregation and a row for each table in the set.
agate.Aggregation |
Aggregations create a new value by summarizing a Column . |
agate.Summary |
Apply an arbitrary function to a column. |
Basic aggregations¶
agate.All |
Check if all values in a column pass a test. |
agate.Any |
Check if any value in a column passes a test. |
agate.Count |
Count occurrences of a value or values. |
agate.HasNulls |
Check if the column contains null values. |
agate.Min |
Find the minimum value in a column. |
agate.Max |
Find the maximum value in a column. |
agate.MaxPrecision |
Find the most decimal places present for any value in this column. |
Statistical aggregations¶
agate.Deciles |
Calculate the deciles of a column based on its percentiles. |
agate.IQR |
Calculate the interquartile range of a column. |
agate.MAD |
Calculate the median absolute deviation of a column. |
agate.Mean |
Calculate the mean of a column. |
agate.Median |
Calculate the median of a column. |
agate.Mode |
Calculate the mode of a column. |
agate.Percentiles |
Divide a column into 100 equal-size groups using the “CDF” method. |
agate.PopulationStDev |
Calculate the population standard deviation of a column. |
agate.PopulationVariance |
Calculate the population variance of a column. |
agate.Quartiles |
Calculate the quartiles of a column based on its percentiles. |
agate.Quintiles |
Calculate the quintiles of a column based on its percentiles. |
agate.StDev |
Calculate the sample standard deviation of a column. |
agate.Sum |
Calculate the sum of a column. |
agate.Variance |
Calculate the sample variance of a column. |
Text aggregations¶
agate.MaxLength |
Find the length of the longest string in a column. |
Detailed list¶
- class agate.Aggregation¶
  Bases: object
  Aggregations create a new value by summarizing a Column.
  Aggregations are applied with Table.aggregate() and TableSet.aggregate().
  When creating a custom aggregation, ensure that the values returned by Aggregation.run() are of the type specified by Aggregation.get_aggregate_data_type(). This can be ensured by using the DataType.cast() method. See Summary for an example.
  - get_aggregate_data_type(table)¶ Get the data type that should be used when using this aggregation with a TableSet to produce a new column.
    Should raise UnsupportedAggregationError if this column does not support aggregation into a TableSet. (For example, if it does not return a single value.)
  - run(table)¶ Execute this aggregation on a given column and return the result.
  - validate(table)¶ Perform any checks necessary to verify this aggregation can run on the provided table without errors. This is called by Table.aggregate() before run().
- class agate.All(column_name, test)¶
  Bases: agate.aggregations.base.Aggregation
  Check if all values in a column pass a test.
  Parameters:
  - column_name – The name of the column to check.
  - test – Either a single value that all values in the column are compared against (for equality) or a function that takes a column value and returns True or False.
- class agate.Any(column_name, test)¶
  Bases: agate.aggregations.base.Aggregation
  Check if any value in a column passes a test.
  Parameters:
  - column_name – The name of the column to check.
  - test – Either a single value that all values in the column are compared against (for equality) or a function that takes a column value and returns True or False.
- class agate.Count(column_name=None, value=<object object>)¶
  Bases: agate.aggregations.base.Aggregation
  Count occurrences of a value or values.
  This aggregation can be used in three ways:
  - If no arguments are specified, then it will count the number of rows in the table.
  - If only column_name is specified, then it will count the number of non-null values in that column.
  - If both column_name and value are specified, then it will count occurrences of a specific value.
  Parameters:
  - column_name – The column containing the values to be counted.
  - value – Any value to be counted, including None.
- class agate.Deciles(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the deciles of a column based on its percentiles.
  Deciles will be equivalent to the 10th, 20th ... 90th percentiles.
  “Zeroth” (min value) and “Tenth” (max value) deciles are included for reference and intuitive indexing.
  See Percentiles for implementation details.
  This aggregation can not be applied to a TableSet.
  Parameters: column_name – The name of a column containing Number data.
  - run(table)¶ Returns: An instance of Quantiles.
- class agate.HasNulls(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Check if the column contains null values.
  Parameters: column_name – The name of the column to check.
- class agate.IQR(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the interquartile range of a column.
  Parameters: column_name – The name of a column containing Number data.
- class agate.MAD(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the median absolute deviation of a column.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Min(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Find the minimum value in a column.
  This aggregation can be applied to columns containing Date, DateTime, or Number data.
  Parameters: column_name – The name of the column to be searched.
- class agate.Max(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Find the maximum value in a column.
  This aggregation can be applied to columns containing Date, DateTime, or Number data.
  Parameters: column_name – The name of the column to be searched.
- class agate.MaxLength(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Find the length of the longest string in a column.
  Note: On Python 2.7 this function may miscalculate the length of unicode strings that contain “wide characters”. For details see this StackOverflow answer: http://stackoverflow.com/a/35462951
  Parameters: column_name – The name of a column containing Text data.
- class agate.MaxPrecision(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Find the most decimal places present for any value in this column.
  Parameters: column_name – The name of the column to be searched.
- class agate.Mean(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the mean of a column.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Median(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the median of a column.
  Median is equivalent to the 50th percentile. See Percentiles for implementation details.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Mode(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the mode of a column.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Percentiles(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Divide a column into 100 equal-size groups using the “CDF” method.
  See this explanation of the various methods for computing percentiles.
  “Zeroth” (min value) and “Hundredth” (max value) percentiles are included for reference and intuitive indexing.
  A reference implementation was provided by pycalcstats.
  This aggregation can not be applied to a TableSet.
  Parameters: column_name – The name of a column containing Number data.
  - run(table)¶ Returns: An instance of Quantiles.
- class agate.PopulationStDev(column_name)¶
  Bases: agate.aggregations.stdev.StDev
  Calculate the population standard deviation of a column.
  For the sample standard deviation see StDev.
  Parameters: column_name – The name of a column containing Number data.
- class agate.PopulationVariance(column_name)¶
  Bases: agate.aggregations.variance.Variance
  Calculate the population variance of a column.
  For the sample variance see Variance.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Quartiles(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the quartiles of a column based on its percentiles.
  Quartiles will be equivalent to the 25th, 50th and 75th percentiles.
  “Zeroth” (min value) and “Fourth” (max value) quartiles are included for reference and intuitive indexing.
  See Percentiles for implementation details.
  This aggregation can not be applied to a TableSet.
  Parameters: column_name – The name of a column containing Number data.
  - run(table)¶ Returns: An instance of Quantiles.
- class agate.Quintiles(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the quintiles of a column based on its percentiles.
  Quintiles will be equivalent to the 20th, 40th, 60th and 80th percentiles.
  “Zeroth” (min value) and “Fifth” (max value) quintiles are included for reference and intuitive indexing.
  See Percentiles for implementation details.
  This aggregation can not be applied to a TableSet.
  Parameters: column_name – The name of a column containing Number data.
  - run(table)¶ Returns: An instance of Quantiles.
- class agate.StDev(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the sample standard deviation of a column.
  For the population standard deviation see PopulationStDev.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Sum(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the sum of a column.
  Parameters: column_name – The name of a column containing Number data.
- class agate.Summary(column_name, data_type, func, cast=True)¶
  Bases: agate.aggregations.base.Aggregation
  Apply an arbitrary function to a column.
  Parameters:
  - column_name – The name of a column to be summarized.
  - data_type – The return type of this aggregation.
  - func – A function which will be passed the column for processing.
  - cast – If True, each return value will be cast to the specified data_type to ensure it is valid. Only disable this if you are certain your summary always returns the correct type.
- class agate.Variance(column_name)¶
  Bases: agate.aggregations.base.Aggregation
  Calculate the sample variance of a column.
  For the population variance see PopulationVariance.
  Parameters: column_name – The name of a column containing Number data.
Computations¶
Computations create a new value for each Row in a Table. When used with Table.compute() these new values become a new column. For instance, the PercentChange computation takes two column names as arguments and computes the percentage change between them for each row.
There are a variety of basic computations, such as Change and Percent. If none of these meet your needs you can use the Formula computation to apply an arbitrary function to the row. If this still isn’t flexible enough, it’s simple to create a custom computation class by inheriting from Computation.
agate.Computation |
Computations produce a new column by performing a calculation on each row. |
agate.Formula |
Apply an arbitrary function to each row. |
Mathematical computations¶
agate.Change |
Calculate the difference between two columns. |
agate.Percent |
Calculate each value’s percentage of a total. |
agate.PercentChange |
Calculate the percent difference between two columns. |
agate.PercentileRank |
Calculate the percentile into which each value falls. |
agate.Rank |
Calculate rank order of the values in a column. |
Detailed list¶
- class agate.Change(before_column_name, after_column_name)¶
  Bases: agate.computations.base.Computation
  Calculate the difference between two columns.
  This calculation can be applied to Number columns to calculate numbers. It can also be applied to Date, DateTime, and TimeDelta columns to calculate time deltas.
  Parameters:
  - before_column_name – The name of a column containing the “before” values.
  - after_column_name – The name of a column containing the “after” values.
- class agate.Computation¶
  Bases: object
  Computations produce a new column by performing a calculation on each row.
  Computations are applied with Table.compute() and TableSet.compute().
  When implementing a custom computation, ensure that the values returned by Computation.run() are of the type specified by Computation.get_computed_data_type(). This can be ensured by using the DataType.cast() method. See Formula for an example.
  - get_computed_data_type(table)¶ Returns an instantiated DataType which will be appended to the table.
  - run(table)¶ When invoked with a table, returns a sequence of new column values.
  - validate(table)¶ Perform any checks necessary to verify this computation can run on the provided table without errors. This is called by Table.compute() before run().
- class agate.Formula(data_type, func, cast=True)¶
  Bases: agate.computations.base.Computation
  Apply an arbitrary function to each row.
  Parameters:
  - data_type – The data type this formula will return.
  - func – The function to be applied to each row. Must return a valid value for the specified data type.
  - cast – If True, each return value will be cast to the specified data_type to ensure it is valid. Only disable this if you are certain your formula always returns the correct type.
- class agate.Percent(column_name, total=None)¶
  Bases: agate.computations.base.Computation
  Calculate each value’s percentage of a total.
  Parameters:
  - run(table)¶ Returns: decimal.Decimal
- class agate.PercentChange(before_column_name, after_column_name)¶
  Bases: agate.computations.base.Computation
  Calculate the percent difference between two columns.
  Parameters:
  - run(table)¶ Returns: decimal.Decimal
- class agate.PercentileRank(column_name, comparer=None, reverse=None)¶
  Bases: agate.computations.rank.Rank
  Calculate the percentile into which each value falls.
  See Percentiles for implementation details.
  Parameters: column_name – The name of a column containing the Number values.
- class agate.Rank(column_name, comparer=None, reverse=None)¶
  Bases: agate.computations.base.Computation
  Calculate the rank order of the values in a column.
  Uses the “competition” ranking method: if there are four values and the middle two are tied, then the output will be [1, 2, 2, 4].
  Null values will always be ranked last.
  Parameters:
  - column_name – The name of the column to rank.
  - comparer – An optional comparison function. If not specified, ranking will be ascending, with nulls ranked last.
  - reverse – Reverse the sort order before ranking.
- class agate.Slug(column_name, ensure_unique=False, **kwargs)¶
  Bases: agate.computations.base.Computation
  Convert text values from one or more columns into slugs. If multiple column names are given, values from those columns will be appended in the given order before standardizing.
  Parameters:
  - column_name – The name of a column or a sequence of column names containing Text values.
  - ensure_unique – If True, any duplicate values will be appended with unique identifiers. Defaults to False.
  - run(table)¶ Returns: string
CSV reader and writer¶
Agate contains CSV readers and writers that are intended to be used as drop-in replacements for csv. These versions add unicode support for Python 2 and several other minor features.
Agate methods will use these versions automatically. If you would like to use them in your own code, you can import them like this:
from agate import csv
Due to nuanced differences between the versions, these classes are implemented separately for Python 2 and Python 3. The documentation for both versions is provided below, but only the one for your version of Python is imported with the above code.
Python 3¶
agate.csv_py3.reader |
A replacement for Python’s csv.reader() that uses csv_py3.Reader . |
agate.csv_py3.writer |
A replacement for Python’s csv.writer() that uses csv_py3.Writer . |
agate.csv_py3.Reader |
A wrapper around Python 3’s builtin csv.reader() . |
agate.csv_py3.Writer |
A wrapper around Python 3’s builtin csv.writer() . |
agate.csv_py3.DictReader |
A wrapper around Python 3’s builtin csv.DictReader . |
agate.csv_py3.DictWriter |
A wrapper around Python 3’s builtin csv.DictWriter . |
Python 2¶
agate.csv_py2.reader |
A replacement for Python’s csv.reader() that uses csv_py2.Reader . |
agate.csv_py2.writer |
A replacement for Python’s csv.writer() that uses csv_py2.Writer . |
agate.csv_py2.Reader |
A unicode-aware CSV reader. |
agate.csv_py2.Writer |
A unicode-aware CSV writer. |
agate.csv_py2.DictReader |
A unicode-aware CSV DictReader. |
agate.csv_py2.DictWriter |
A unicode-aware CSV DictWriter. |
Python 3 details¶
- agate.csv_py3.reader(*args, **kwargs)¶ A replacement for Python’s csv.reader() that uses csv_py3.Reader.
- agate.csv_py3.writer(*args, **kwargs)¶ A replacement for Python’s csv.writer() that uses csv_py3.Writer.
- class agate.csv_py3.Reader(f, field_size_limit=None, line_numbers=False, header=True, **kwargs)¶
  Bases: six.Iterator
  A wrapper around Python 3’s builtin csv.reader().
- class agate.csv_py3.Writer(f, line_numbers=False, **kwargs)¶
  Bases: object
  A wrapper around Python 3’s builtin csv.writer().
- class agate.csv_py3.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)¶
  Bases: csv.DictReader
  A wrapper around Python 3’s builtin csv.DictReader.
- class agate.csv_py3.DictWriter(f, fieldnames, line_numbers=False, **kwargs)¶
  Bases: csv.DictWriter
  A wrapper around Python 3’s builtin csv.DictWriter.
Python 2 details¶
- agate.csv_py2.reader(*args, **kwargs)¶ A replacement for Python’s csv.reader() that uses csv_py2.Reader.
- agate.csv_py2.writer(*args, **kwargs)¶ A replacement for Python’s csv.writer() that uses csv_py2.Writer.
- class agate.csv_py2.Reader(f, encoding='utf-8', field_size_limit=None, line_numbers=False, header=True, **kwargs)¶
  Bases: agate.csv_py2.UnicodeReader
  A unicode-aware CSV reader.
- class agate.csv_py2.Writer(f, encoding='utf-8', line_numbers=False, **kwargs)¶
  Bases: agate.csv_py2.UnicodeWriter
  A unicode-aware CSV writer.
- class agate.csv_py2.DictReader(f, fieldnames=None, restkey=None, restval=None, *args, **kwargs)¶
  Bases: agate.csv_py2.UnicodeDictReader
  A unicode-aware CSV DictReader.
- class agate.csv_py2.DictWriter(f, fieldnames, encoding='utf-8', line_numbers=False, **kwargs)¶
  Bases: agate.csv_py2.UnicodeDictWriter
  A unicode-aware CSV DictWriter.
Fixed-width reader¶
Agate contains a fixed-width file reader that is designed to work like Python’s csv.
These readers work with CSV-formatted schemas, such as those maintained at wireservice/ffs.
agate.fixed.reader |
A wrapper around fixed.Reader , so that it can be used in the same way as a normal CSV reader. |
agate.fixed.Reader |
Reads a fixed-width file using a column schema in CSV format. |
agate.fixed.DictReader |
A fixed-width reader that returns collections.OrderedDict rather than a list. |
Detailed list¶
- agate.fixed.reader(*args, **kwargs)¶ A wrapper around fixed.Reader, so that it can be used in the same way as a normal CSV reader.
- class agate.fixed.Reader(f, schema_f)¶
  Bases: six.Iterator
  Reads a fixed-width file using a column schema in CSV format.
  This works almost exactly like Python’s built-in CSV reader.
  Schemas must be in the “ffs” format, with column, start, and length columns. There is a repository of such schemas maintained at wireservice/ffs.
  - fieldnames¶ The names of the columns read from the schema.
- class agate.fixed.DictReader(f, schema_f)¶
  Bases: agate.fixed.Reader
  A fixed-width reader that returns collections.OrderedDict rather than a list.
Miscellaneous¶
agate.Patchable |
|
agate.NullOrder |
Dummy object used for sorting in place of None. |
agate.Quantiles |
A class representing quantiles (percentiles, quartiles, etc.) for a given column of Number data. |
Exceptions¶
agate.DataTypeError |
A calculation was attempted with an invalid DataType . |
agate.UnsupportedAggregationError |
An Aggregation was attempted which is not supported. |
agate.CastError |
A column value can not be cast to the correct type. |
agate.FieldSizeLimitError |
A field in a CSV file exceeds the maximum length. |
- exception agate.DataTypeError¶
  Bases: exceptions.TypeError
  A calculation was attempted with an invalid DataType.
- exception agate.UnsupportedAggregationError¶
  Bases: exceptions.TypeError
  An Aggregation was attempted which is not supported.
  For example, if a Percentiles is applied to a TableSet.
- exception agate.CastError¶
  Bases: exceptions.Exception
  A column value can not be cast to the correct type.
- exception agate.FieldSizeLimitError(limit)¶
  Bases: exceptions.Exception
  A field in a CSV file exceeds the maximum length.
  This length may be the default or one set by the user.
Warnings¶
- class agate.NullCalculationWarning¶
  Bases: exceptions.RuntimeWarning
  Warning raised if a calculation which can not logically account for null values is performed on a Column containing nulls.
- class agate.DuplicateColumnWarning¶
  Bases: exceptions.RuntimeWarning
  Warning raised if multiple columns with the same name are added to a new Table.
- agate.warn_null_calculation(operation, column)¶
- agate.warn_duplicate_column(column_name, column_rename)¶
Unit testing helpers¶
- class agate.AgateTestCase(methodName='runTest')¶
  Bases: unittest2.case.TestCase
  Unittest case for quickly asserting logic about tables.
  - assertColumnNames(table, names)¶ Verify the column names in the given table match what is expected.
  - assertColumnTypes(table, types)¶ Verify the column types in the given table are of the expected types.
  - assertRowNames(table, names)¶ Verify the row names in the given table match what is expected.
  - assertRows(table, rows)¶ Verify the row data in the given table matches what is expected.
Config¶
This module contains the global configuration for agate. Users should use get_option() and set_option() to modify the global configuration.
Available configuration options:
Option | Description | Default value |
---|---|---|
default_locale | Default locale for number formatting | default_locale('LC_NUMERIC') or 'en_US' |
horizontal_line_char | Character to render for horizontal lines | u'-' |
vertical_line_char | Character to render for vertical lines | u'|' |
bar_char | Character to render for bar chart units | u'░' |
printable_bar_char | Printable character for bar chart units | u':' |
zero_line_char | Character to render for zero line units | u'▓' |
printable_zero_line_char | Printable character for zero line units | u'|' |
tick_char | Character to render for axis ticks | u'+' |
ellipsis_chars | Characters to render for ellipsis | u'...' |
- agate.config.get_option(key)¶ Get a global configuration option for agate.
  Parameters: key – The name of the configuration option.
- agate.config.set_option(key, value)¶ Set a global configuration option for agate.
  Parameters:
  - key – The name of the configuration option.
  - value – The new value to set for the configuration option.
- agate.config.set_options(options)¶ Set a dictionary of options simultaneously.
  Parameters: options – A dictionary of option names and values.
Contributing¶
agate actively encourages contributions from people of all genders, races, ethnicities, ages, creeds, nationalities, persuasions, alignments, sizes, shapes, and journalistic affiliations. You are welcome here.
We seek contributions from developers and non-developers of all skill levels. We will typically accept bug fixes, documentation updates, and new cookbook recipes with minimal fuss. If you want to work on a larger feature—great! The maintainers will be happy to provide feedback and code review on your implementation.
Before making any changes or additions to agate, please be sure to read about the principles of agate in the About section of the documentation.
Process for documentation¶
Not a developer? That’s fine! As long as you can use git (there are many tutorials) then you can contribute to agate. Please follow this process:
- Fork the project on Github.
- If you don’t have a specific task in mind, check out the issue tracker and find a documentation ticket that needs to be done.
- Comment on the ticket letting everyone know you’re going to be working on it so that nobody duplicates your effort.
- Write the documentation. Documentation files live in the docs directory and are in Restructured Text Format.
- Add yourself to the AUTHORS file if you aren’t already there.
- Once your contribution is complete, submit a pull request on Github.
- Wait for it to either be merged by a maintainer or to receive feedback about what needs to be revised.
- Rejoice!
Process for code¶
Hacker? We’d love to have you hack with us. Please follow this process to make your contribution:
- Fork the project on Github.
- If you don’t have a specific task in mind, check out the issue tracker and find a task that needs to be done and is of a scope you can realistically expect to complete in a few days. Don’t worry about the priority of the issues at first, but try to choose something you’ll enjoy. You’re much more likely to finish something to the point it can be merged if it’s something you really enjoy hacking on.
- If you already have a task you know you want to work on, open a ticket or comment on the existing ticket letting everyone know you’re going to be working on it. It’s also good practice to provide some general idea of how you plan on resolving the issue so that other developers can make suggestions.
- Write tests for the feature you're building. Follow the format of the existing tests in the test directory to see how this works. You can run all the tests with the command `nosetests tests`.
- Verify your tests work on all supported versions of Python by running `tox`.
- Write the code. Try to stay consistent with the style and organization of the existing codebase. A good patch won't be refused for stylistic reasons, but large parts of it may be rewritten and nobody wants that.
- As you are coding, periodically merge in work from the master branch and verify you haven’t broken anything by running the test suite.
- Write documentation. This means docstrings on all classes and methods, including parameter explanations. It also means, when relevant, cookbook recipes and updates to the agate user tutorial.
- Add yourself to the AUTHORS file if you aren’t already there.
- Once your contribution is complete, tested, and has documentation, submit a pull request on Github.
- Wait for it to either be merged by a maintainer or to receive feedback about what needs to be revisited.
- Rejoice!
Licensing¶
To the extent that they care, contributors should keep in mind that the source of agate, and therefore any contribution to it, is licensed under the permissive MIT license. By submitting a patch or pull request you are agreeing to release your code under this license. You will be acknowledged in the AUTHORS list, the commit history and the hearts and minds of journalists everywhere.
Release process¶
This is the release process for agate:
- Verify all unit tests pass with fresh environments: `tox -r`.
- Verify 100% test coverage: `nosetests --with-coverage tests`.
- Ensure any new modules have been added to setup.py's `packages` list.
- Ensure any new public interfaces have been added to the documentation.
- Ensure TableSet proxy methods have been added for new Table methods.
- Make sure the example script still works: `python example.py`.
- Ensure `python charts.py` works and has been run recently.
- Ensure `CHANGELOG.rst` is up to date. Add the release date and summary.
- Create a release tag: `git tag -a x.y.z -m "x.y.z release."`
- Push tags upstream: `git push --tags`
- If this is a major release, merge `master` into `stable`: `git checkout stable; git merge master; git push`
- Upload to PyPI: `python setup.py sdist bdist_wheel upload`.
- Flag the release to build on RTFD.
- Update the "default version" on RTFD to the latest.
- Rev to the latest version: `docs/conf.py`, `docs/tutorial.rst`, `setup.py` and `CHANGELOG.rst` need updates.
- Find/replace `en/[old version]` with `en/[new version]` in `tutorial.ipynb`.
- Commit the revision: `git commit -am "Update to version x.y.z for development."`
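The tagging and publishing steps above can be sketched as a short shell dry run. This is an illustration only: it prints the commands rather than executing them, and `x.y.z` stands in for the real version number.

```shell
VERSION="x.y.z"

# Dry run: print each release command instead of executing it.
TAG_CMD="git tag -a $VERSION -m \"$VERSION release.\""
echo "$TAG_CMD"
echo "git push --tags"
echo "python setup.py sdist bdist_wheel upload"
```

Replacing each `echo` with the actual command (and `VERSION` with the release number) yields the live release sequence.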
License¶
The MIT License
Copyright (c) 2017 Christopher Groskopf and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Changelog¶
1.6.0 - February 28, 2017¶
This update should not cause any breaking changes; however, it is being classified as a major release because the dependency on awesome-slugify, which is licensed under GPLv3, has been replaced with python-slugify, which is licensed under MIT.
- Suppress warning from babel about Time Zone expressions on Python 3.6. (#665)
- Reimplemented slugify with python-slugify instead of awesome-slugify. (#660)
- Slugify renaming of duplicate values is now consistent with `Table.__init__()`. (#615)
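The slugify behavior mentioned above, lowercasing values and de-duplicating repeats with a numeric suffix, can be illustrated with a minimal plain-Python sketch. This is not agate's actual implementation; the `slugify` helper below is hypothetical.

```python
import re


def slugify(values, ensure_unique=False):
    """Minimal slug generator: lowercase, collapse non-alphanumeric runs
    to underscores, and optionally rename duplicates with a suffix."""
    slugs = []
    for value in values:
        slug = re.sub(r'[^a-z0-9]+', '_', value.lower()).strip('_')
        if ensure_unique:
            candidate = slug
            i = 1
            while candidate in slugs:
                i += 1
                candidate = '%s_%d' % (slug, i)
            slug = candidate
        slugs.append(slug)
    return slugs
```

For example, `slugify(['A B', 'A B'], ensure_unique=True)` yields `['a_b', 'a_b_2']`, mirroring how duplicate column names are renamed.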
1.5.5 - December 29, 2016¶
- Added a “full outer join” example to the SQL section of the cookbook. (#658)
- Warnings are now more explicit when column names are missing. (#652)
- `Date.cast()` will no longer parse strings like `05_leslie3d_base` as dates. (#653)
- `Text.cast()` will no longer strip leading or trailing whitespace. (#654)
- Fixed `'NoneType' object has no attribute 'groupdict'` error in `TimeDelta.cast()`. (#656)
1.5.4 - December 27, 2016¶
- Cleaned up handling of warnings in tests.
- Blank column names are now treated as unspecified (letter names will be generated).
1.5.3 - December 26, 2016¶
This is a minor release that adds one feature: sequential joins (by row number). It also fixes several small bugs blocking a downstream release of csvkit.
- Fixed a bug where empty `Table` column names would be initialized as a list instead of a tuple.
- `Table.join()` can now join by row numbers, performing a sequential join.
- `Table.join()` now supports full outer joins via the `full_outer` keyword.
- `Table.join()` can now accept column indices instead of column names.
- `Table.from_csv()` now buffers input files to prevent issues with using STDIN as an input.
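The full outer join semantics described above (keep unmatched rows from both sides) can be illustrated with plain dicts. This sketch is not agate's API; the `full_outer_join` helper and its row format are hypothetical.

```python
def full_outer_join(left, right, key):
    """Full outer join of two lists of dicts on `key`: matched rows are
    merged, unmatched rows from either side are kept as-is."""
    right_index = {row[key]: row for row in right}
    matched = set()
    joined = []
    for row in left:
        other = right_index.get(row[key])
        if other is not None:
            matched.add(row[key])
        joined.append({**row, **(other or {})})
    for row in right:
        if row[key] not in matched:
            joined.append(row)  # right-only row survives the join
    return joined
```

A sequential join, by contrast, simply pairs rows positionally, e.g. `zip(left, right)`.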
1.5.2 - December 24, 2016¶
- Improved handling of non-ASCII encoded CSV files under Python 2.
1.5.1 - December 23, 2016¶
This is a minor release fixing several small bugs that were blocking a downstream release of csvkit.
- Documented differing behavior of `MaxLength` under Python 2. (#649)
- agate is now tested against Python 3.6. (#650)
- Fixed a bug when `MaxLength` was called on an all-null column.
- Updated the extensions documentation to match the new API. (#645)
- Fixed a bug in `Change` and `PercentChange` where `0` values could cause `None` to be returned incorrectly.
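The distinction behind that last fix, `None` inputs should propagate as `None` while a legitimate `0` value should still be computed, can be sketched in a few lines. This is illustrative only, not agate's implementation; the `percent_change` helper is hypothetical.

```python
from decimal import Decimal


def percent_change(before, after):
    """Null-safe percent change: None propagates as None, but a real
    zero value is computed normally rather than being treated as null."""
    if before is None or after is None:
        return None
    return (Decimal(after) - Decimal(before)) / Decimal(before) * 100
```

So `percent_change(None, 5)` is `None`, while `percent_change(4, 0)` is `-100`, not `None`.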
1.5.0 - November 16, 2016¶
This release adds SVG charting via the leather charting library. Chart methods have been added for both `Table` and `TableSet`. (The latter create lattice plots.) See the revised tutorial and new cookbook entries for examples. Leather is still an early library. Please report any bugs.
Also in this release are a `Slugify` computation and a variety of small fixes and improvements.
The complete list of changes is as follows:
- Remove support for monkey-patching of extensions. (#594)
- `TableSet` methods which proxy `Table` methods now appear in the API docs. (#640)
- `Any` and `All` aggregations no longer behave differently for boolean data. (#636)
- `Any` and `All` aggregations now accept a single value as a test argument, in addition to a function.
- `Any` and `All` aggregations now require a test argument.
- Tables rendered by `Table.print_table()` are now GitHub Flavored Markdown (GFM) compatible. (#626)
- The agate tutorial has been converted to a Jupyter Notebook.
- `Table` now supports `len` as a proxy for `len(table.rows)`.
- Simple SVG charting is now integrated via leather.
- Added `First` computation. (#634)
- `Table.print_table()` now has a `max_precision` argument to limit Number precision. (#544)
- The `Slug` computation now accepts an array of column names to merge. (#617)
- Cookbook: standardize column values with the `Slugify` computation. (#613)
- Cookbook: slugify/standardize row and column names. (#612)
- Fixed the condition that prevents integer row names so that bools are allowed in `Table.__init__()`. (#627)
- `PercentChange` is now null-safe, returning `None` for null values. (#623)
- `Table` can now be iterated, yielding `Row` instances. (Previously it was necessary to iterate `table.rows`.)
1.4.0 - May 26, 2016¶
This release adds several new features, includes numerous small bug fixes, and improves performance for common use cases. There are some minor breaking changes, but few users are likely to encounter them. The most important changes in this release are:
- There is now a `TableSet.having()` method, which behaves similarly to SQL's `HAVING` keyword.
- `Table.from_csv()` is much faster. In particular, the type inference routines for parsing numbers have been optimized.
- The `Table.compute()` method now accepts a `replace` keyword which allows new columns to replace existing columns "in place." (As with all agate operations, a new table is still created.)
- There is now a `Slug` computation which can be used to compute a column of slugs. The `Table.rename()` method has also added new options for slugifying column and row names.
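The `HAVING`-style filtering mentioned above keeps only the groups whose aggregated value passes a test. A minimal plain-Python sketch of that idea (not agate's API; `having`, its arguments and the dict-of-lists group format are hypothetical):

```python
def having(groups, aggregation, test):
    """SQL HAVING analogue: aggregate each group's rows, then keep only
    the groups whose aggregate passes the test function."""
    return {
        key: rows
        for key, rows in groups.items()
        if test(aggregation(rows))
    }
```

For example, with `groups = {'a': [1, 2, 3], 'b': [10, 20]}`, filtering on `sum` with `lambda total: total > 10` keeps only group `'b'`.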
The complete list of changes is as follows:
- Added a deprecation warning for `patch` methods. New extensions should not use it. (#594)
- Added `Slug` computation. (#466)
- Added `slug_columns` and `slug_rows` arguments to `Table.rename()`. (#466)
- Added `utils.slugify()` to standardize a sequence of strings. (#466)
- `Table.__init__()` now prints the row and column on `CastError`. (#593)
- Fix null sorting in `Table.order_by()` when ordering by multiple columns. (#607)
- Implemented configuration system.
- Fixed bug in `Table.print_bars()` when `value_column` contains `None`. (#608)
- `Table.print_table()` now restricts the header to `max_column_width`. (#605)
- Cookbook: filling gaps in a dataset with `Table.homogenize`. (#538)
- Reduced memory usage and improved performance of `Table.from_csv()`.
- `Table.from_csv()` no longer accepts a sequence of row ids for `skip_lines`.
- `Number.cast()` is now three times as fast.
- `Number` now accepts `group_symbol`, `decimal_symbol` and `currency_symbols` arguments. (#224)
- Tutorial: clean up state data under computing columns. (#570)
- `Table.__init__()` now explicitly checks that `row_names` are not ints. (#322)
- Cookbook: CPI deflation, agate-lookup. (#559)
- `Table.bins()` now includes values outside `start` or `end` in the computed `column_names`. (#596)
- Fixed bug in `Table.bins()` where `start` or `end` arguments were ignored when specified alone. (#599)
- `Table.compute()` now accepts a `replace` argument that allows columns to be overwritten. (#597)
- `Table.from_fixed()` now creates an agate table from a fixed-width file. (#358)
- `fixed` now implements a general-purpose fixed-width file reader. (#358)
- `TypeTester` now correctly parses negative currency values as Number. (#595)
- Cookbook: removing a column (select and exclude). (#592)
- Cookbook: overriding specific column types. (#591)
- `TableSet` now has a `TableSet._fork()` method used internally for deriving new tables.
- Added an example of SQL's `HAVING` to the cookbook.
- The `Table.aggregate()` interface has been revised to be more similar to `TableSet.aggregate()`.
- `TableSet.having()` is now implemented. (#587)
- There is now a better error when a forced column name does not exist. (#591)
- Arguments to `Table.print_html()` now mirror `Table.print_table()`.
1.3.1 - March 30, 2016¶
The major feature of this release is new API documentation. Several minor features and bug fixes are also included. There are no major breaking changes in this release.
Internally, the agate codebase has been reorganized to be more modular, but this should be invisible to most users.
- The `MaxLength` aggregation now returns a Decimal object. (#574)
- Fixed an edge case where datetimes were parsed as dates. (#568)
- Fixed column alignment in tutorial tables. (#572)
- `Table.print_table()` now defaults to printing `20` rows and `6` columns. (#589)
- Added Eli Murray to AUTHORS.
- `Table.__init__()` now accepts a dict to specify partial column types. (#580)
- `Table.from_csv()` now accepts a `skip_lines` argument. (#581)
- Moved every `Aggregation` and `Computation` into its own module. (#565)
- `Column` and `Row` are now importable from agate.
- Completely reorganized the API documentation.
- Moved unit tests into modules to match the new code organization.
- Moved major `Table` and `TableSet` methods into their own modules.
- Fixed bug when using non-unicode encodings with `Table.from_csv()`. (#560)
- `Table.homogenize()` now accepts an array of values as compare values if the key is a single column name. (#539)
1.3.0 - February 28, 2016¶
This version implements several new features and includes two major breaking changes.
Please take note of the following breaking changes:
- There is no longer a `Length` aggregation. The more obvious `Count` is now used instead.
- Agate's replacements for Python's CSV reader and writer have been moved to the `agate.csv` namespace. To use as a drop-in replacement: `from agate import csv`.
The major new features in this release are primarily related to transforming (reshaping) tables. They are:
- `Table.normalize()` for converting columns to rows.
- `Table.denormalize()` for converting rows to columns.
- `Table.pivot()` for generating "crosstabs".
- `Table.homogenize()` for filling gaps in data series.
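The normalize reshape (columns become rows) can be pictured with plain dicts: each listed property column turns into its own (property, value) row. This is an illustrative sketch, not agate's implementation; the `normalize` helper and its row format are hypothetical.

```python
def normalize(rows, key, properties):
    """Convert columns to rows: one output row per (input row, property),
    carrying the key column along for identification."""
    out = []
    for row in rows:
        for prop in properties:
            out.append({key: row[key], 'property': prop, 'value': row[prop]})
    return out
```

Denormalize is the inverse operation: it gathers (property, value) rows back into columns.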
Please see the following complete list of changes for a variety of other bug fixes and improvements.
- Moved CSV reader/writer to the `agate.csv` namespace.
- Added numerous new examples to the R section of the cookbook. (#529-#535)
- Updated Excel cookbook entry for pivot tables. (#536)
- Updated Excel cookbook entry for VLOOKUP. (#537)
- Fix number rendering in `Table.print_table()` on Windows. (#528)
- Added cookbook examples of using `Table.pivot()` to count frequency/distribution.
- `Table.bins()` now has smarter output column names. (#524)
- `Table.bins()` is now a wrapper around pivot. (#522)
- `Table.counts()` has been removed. Use `Table.pivot()` instead. (#508)
- `Count` can now count non-null values in a column.
- Removed `Length`. `Count` now works without any arguments. (#520)
- `Table.pivot()` implemented. (#495)
- `Table.denormalize()` implemented. (#493)
- Added `columns` argument to `Table.join()`. (#479)
- Cookbook: Custom statistics/agate.Summary.
- Added Kevin Schaul to AUTHORS.
- `Quantiles.locate()` now correctly returns Decimal instances. (#509)
- Cookbook: Filter for distinct values of a column. (#498)
- Added `Column.values_distinct()`. (#498)
- Cookbook: Fuzzy phonetic search example. (#207)
- Cookbook: Create a table from a remote file. (#473)
- Added `printable` argument to `Table.print_bars()` to use only printable characters. (#500)
- `MappedSequence` now throws an explicit error on `__setitem__`. (#499)
- Added `require_match` argument to `Table.join()`. (#480)
- Cookbook: Rename columns in a table. (#469)
- `Table.normalize()` implemented. (#487)
- Added `Percent` computation with example in Cookbook. (#490)
- Added Ben Welsh to AUTHORS.
- `Table.__init__()` now throws a warning if auto-generated columns are used. (#483)
- `Table.__init__()` no longer fails on duplicate columns. Instead it renames them and throws a warning. (#484)
- `Table.merge()` now takes a `column_names` argument to specify columns included in the new table. (#481)
- `Table.select()` now accepts a single column name as a key.
- `Table.exclude()` now accepts a single column name as a key.
- Added `Table.homogenize()` to find gaps in a table and fill them with default rows. (#407)
- `Table.distinct()` now accepts a sequence of column names as a key.
- `Table.order_by()` now accepts a sequence of column names as a key.
- `Table.join()` now accepts a sequence of column names as either a left or right key. (#475)
- Cookbook: Create a table from a DBF file. (#472)
- Cookbook: Create a table from an Excel spreadsheet.
- Added explicit error if a filename is passed to the `Table` constructor. (#438)
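The frequency/distribution use of `Table.pivot()` mentioned in the cookbook entries above boils down to counting distinct values in a column. A minimal plain-Python sketch of that idea (not agate's implementation; the `pivot_counts` helper and its row format are hypothetical):

```python
from collections import Counter


def pivot_counts(rows, key):
    """Frequency 'crosstab': count how often each distinct value
    appears in the given column."""
    return Counter(row[key] for row in rows)
```

Real pivots generalize this to two dimensions (one column's values as rows, another's as columns) and to aggregations other than counting.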
1.2.2 - February 5, 2016¶
This release adds several minor features. The only breaking change is that default column names will now be lowercase instead of uppercase. If you depended on these names in your scripts you will need to update them accordingly.
- `TypeTester` no longer takes a `locale` argument. Use `types` instead.
- `TypeTester` now takes a `types` argument that is a list of possible types to test. (#461)
- Null conversion can now be disabled for `Text` by passing `cast_nulls=False`. (#460)
- Default column names are now lowercase letters instead of uppercase. (#464)
- `Table.merge()` can now merge tables with different columns or columns in a different order. (#465)
- `MappedSequence.get()` will no longer raise `KeyError` if a default is not provided. (#467)
- `Number` can now test/cast the `long` type on Python 2.
1.2.1 - February 5, 2016¶
This release implements several new features and bug fixes. There are no significant breaking changes.
Special thanks to Neil Bedi for his extensive contributions to this release.
- Added a `max_column_width` argument to `Table.print_table()`. Defaults to `20`. (#442)
- `Table.from_json()` now defers most functionality to `Table.from_object()`.
- Implemented `Table.from_object()` for parsing JSON-like Python objects.
- Fixed a bug that prevented `Table.order_by()` on an empty table. (#454)
- `Table.from_json()` and `TableSet.from_json()` now have `column_types` as an optional argument. (#451)
- `csv.Reader` now has `line_numbers` and `header` options to add a column for line numbers. (#447)
- Renamed `maxfieldsize` to `field_size_limit` in `csv.Reader` for consistency. (#447)
- `Table.from_csv()` now has a `sniff_limit` option to use `csv.Sniffer`. (#444)
- `csv.Sniffer` implemented. (#444)
- `Table.__init__()` no longer fails on empty rows. (#445)
- `TableSet.from_json()` implemented. (#373)
- Fixed a bug that breaks `TypeTester.run()` on variable row length. (#440)
- Added `TableSet.__str__()` to display `Table` keys and row counts. (#418)
- Fixed a bug that incorrectly checked for `column_types` equivalence in `Table.merge()` and `TableSet.__init__()`. (#435)
- `TableSet.merge()` now has the ability to specify grouping factors with `group`, `group_name` and `group_type`. (#406)
- `Table` can now be constructed with `None` for some column names. Those columns will receive letter names. (#432)
- Slightly changed the parsing of dates and datetimes from strings.
- Numbers are now written to CSV without extra zeros after the decimal point. (#429)
- Made it possible for `datetime.date` instances to be considered valid `DateTime` inputs. (#427)
- Changed preference order in type testing so `Date` is preferred to `DateTime`.
- Removed `float_precision` argument from `Number`. (#428)
- `AgateTestCase` is now available as `agate.AgateTestCase`. (#426)
- `TableSet.to_json()` now has an `indent` option for use with `nested`.
- `TableSet.to_json()` now has a `nested` option for writing a single, nested JSON file. (#417)
- `TestCase.assertRowNames()` and `TestCase.assertColumnNames()` now validate the row and column instance keys.
- Fixed a bug that prevented `Table.rename()` from renaming column names in `Row` instances. (#423)
1.2.0 - January 18, 2016¶
This version introduces one breaking change, which is only relevant if you are using custom `Computation` subclasses.
`Computation` has been modified so that `Computation.run()` takes a `Table` instance as its argument, rather than a single row. It must return a sequence of values to use for a new column. In addition, the `Computation._prepare()` method has been renamed to `Computation.validate()` to more accurately describe its function. These changes were made to facilitate computing moving averages, streaks and other values that require data for the full column.
- Existing `Aggregation` subclasses have been updated to use `Aggregate.validate()`. (This brings a noticeable performance boost.)
- `Aggregation` now has an `Aggregation.validate()` method that functions identically to `Computation.validate()`. (#421)
- `Change.validate()` now correctly raises `DataTypeError`.
- Added a `SimpleMovingAverage` implementation to the cookbook's examples of custom `Computation` classes.
- `Computation._prepare()` has been renamed to `Computation.validate()`.
- `Computation.run()` now takes a `Table` instance as an argument. (#415)
- Fix a bug in Python 2 where printing a table could raise `decimal.InvalidOperation`. (#412)
- Fix `Rank` so it returns Decimal. (#411)
- Added Taurus Olson to AUTHORS.
- Printing a table will now print the table's structure.
- `Table.print_structure()` implemented. (#393)
- Added Geoffrey Hing to AUTHORS.
- `Table.print_html()` implemented. (#408)
- Instances of `Date` and `DateTime` can now be pickled. (#362)
- `AgateTestCase` is available as `agate.testcase.AgateTestCase` for extensions to use. (#384)
- `Table.exclude()` implemented. Opposite of `Table.select()`. (#388)
- `Table.merge()` now accepts a `row_names` argument. (#403)
- `Formula` now automatically casts computed values to the specified data type unless `cast` is set to `False`. (#398)
- Added Neil Bedi to AUTHORS.
- `Table.rename()` is implemented. (#389)
- `TableSet.to_json()` is implemented. (#374)
- `Table.to_csv()` and `Table.to_json()` will now create the target directory if it does not exist. (#392)
- `Boolean` will now correctly cast numerical `0` and `1`. (#386)
- `Table.merge()` now consistently maps column names to rows. (#402)
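The motivation for passing the full table to `Computation.run()` is visible in a moving average: each output value depends on a window of preceding values, not just the current row. A minimal plain-Python sketch of that column-wise computation (illustrative only, not the cookbook's `SimpleMovingAverage`; the `simple_moving_average` helper is hypothetical):

```python
def simple_moving_average(values, window):
    """Column-wise computation: each output needs a window of preceding
    values, which is why run() must see the whole column at once."""
    out = []
    for i in range(len(values)):
        if i < window - 1:
            out.append(None)  # not enough preceding values yet
        else:
            out.append(sum(values[i - window + 1:i + 1]) / window)
    return out
```

A row-at-a-time interface could never compute this, since a single row carries no memory of its predecessors.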
1.1.0 - November 4, 2015¶
This version of agate introduces three major changes.
- `Table`, `Table.from_csv()` and `TableSet.from_csv()` now all take `column_names` and `column_types` as separate arguments instead of as a sequence of tuples. This was done to enable more flexible type inference and to streamline the API.
- The interfaces for `TableSet.aggregate()` and `Table.compute()` have been changed. In both cases the new column name now comes first. Aggregations have also been modified so that the input column name is an argument to the aggregation class, rather than a third element in the tuple.
- This version drops support for Python 2.6. Testing and bug-fixing for this version was taking substantial time with no evidence that anyone was actually using it. Also, multiple dependencies claim not to support 2.6, even though agate's tests were passing.
- DataTypes now have `DataType.csvify()` and `DataType.jsonify()` methods for serializing native values.
- Added a dependency on isodate for handling ISO8601-formatted dates. (#233)
- `Aggregation` results are no longer cached. (#378)
- Removed the `Column.aggregate` method. Use `Table.aggregate()` instead. (#378)
- Added `Table.aggregate()` for aggregating single column results. (#378)
- `Aggregation` subclasses now take column names as their first argument. (#378)
- `TableSet.aggregate()` and `Table.compute()` now take the new column name as the first argument. (#378)
- Remove support for Python 2.6.
- `Table.to_json()` is implemented. (#345)
- `Table.from_json()` is implemented. (#344, #347)
- `Date` and `DateTime` type testing now takes the specified format into account. (#361)
- The `Number` data type now takes a `float_precision` argument.
- `Number` data types now work with native float values. (#370)
- `TypeTester` can now validate Python native types (not just strings). (#367)
- `TypeTester` can now be used with the `Table` constructor, not just `Table.from_csv()`. (#350)
- `Table`, `Table.from_csv()` and `TableSet.from_csv()` now take `column_names` and `column_types` as separate parameters. (#350)
- `DEFAULT_NULL_VALUES` (the list of strings that mean null) is now importable from `agate`.
- `Table.from_csv()` and `Table.to_csv()` are now unicode-safe without separately importing csvkit.
- `agate` can now be used as a drop-in replacement for Python's `csv` module.
- Migrated csvkit's unicode CSV reading/writing support into agate. (#354)
1.0.1 - October 29, 2015¶
- TypeTester now takes a “limit” arg that restricts how many rows it tests. (#332)
- Table.from_csv now supports CSVs with neither headers nor manual column names.
- Tables can now be created with automatically generated column names. (#331)
- File handles passed to Table.to_csv are now left open. (#330)
- Added Table.print_csv method. (#307, #339)
- Fixed stripping currency symbols when casting Numbers from strings. (#333)
- Fixed two major join issues. (#336)
1.0.0 - October 22, 2015¶
- Table.from_csv now defaults to TypeTester() if column_info is not provided. (#324)
- New tutorial section: “Navigating table data” (#315)
- 100% test coverage reached. (#312)
- NullCalculationError is now a warning instead of an error. (#311)
- TableSet is now a subclass of MappedSequence.
- Rows and Columns are now subclasses of MappedSequence.
- Add Column.values_without_nulls_sorted().
- Column.get_data_without_nulls() is now Column.values_without_nulls().
- Column.get_data_sorted() is now Column.values_sorted().
- Column.get_data() is now Column.values().
- Columns can now be sliced.
- Columns can now be indexed by row name. (#301)
- Added support for Python 3.5.
- Row objects can now be sliced. (#303)
- Replaced RowSequence and ColumnSequence with MappedSequence.
- Replace RowDoesNotExistError with KeyError.
- Replaced ColumnDoesNotExistError with IndexError.
- Removed unnecessary custom RowIterator, ColumnIterator and CellIterator.
- Performance improvements for Table “forks”. (where, limit, etc)
- TableSet keys are now converted to row names during aggregation. (#291)
- Removed fancy __repr__ implementations. Use __str__ instead. (#290)
- Rows can now be accessed by name as well as index. (#282)
- Added row_names argument to Table constructor. (#282)
- Removed Row.table and Row.index properties. (#287)
- Columns can now be accessed by index as well as name. (#281)
- Added column name and type validation to Table constructor. (#285)
- Table now supports variable-length rows during construction. (#39)
- aggregations.Summary implemented for generic aggregations. (#181)
- Fix TableSet.key_type being lost after proxying Table methods. (#278)
- Massive performance increases for joins. (#277)
- Added join benchmark. (#73)
0.11.0 - October 6, 2015¶
- Implemented __repr__ for Table, TableSet, Column and Row. (#261)
- Row.index property added.
- Column constructor no longer takes a data_type argument.
- Column.index and Column.name properties added.
- Table.counts implemented. (#271)
- Table.bins implemented. (#267, #227)
- Table.join now raises ColumnDoesNotExistError. (#264)
- Table.select now raises ColumnDoesNotExistError.
- computations.ZScores moved into agate-stats.
- computations.Rank cmp argument renamed comparer.
- aggregations.MaxPrecision added. (#265)
- Table.print_bars added.
- Table.pretty_print renamed Table.print_table.
- Reimplement Table method proxying via @allow_tableset_proxy decorator. (#263)
- Add agate-stats references to docs.
- Move stdev_outliers, mad_outliers and pearson_correlation into agate-stats. (#260)
- Prevent issues with applying patches multiple times. (#258)
0.10.0 - September 22, 2015¶
- Add reverse and cmp arguments to Rank computation. (#248)
- Document how to use agate-sql to read/write SQL tables. (#238, #241)
- Document how to write extensions.
- Add monkeypatching extensibility pattern via utils.Patchable.
- Reversed order of argument pairs for Table.compute. (#249)
- TableSet.merge method can be used to ungroup data. (#253)
- Columns with identical names are now suffixed “2” after a Table.join.
- Duplicate key columns are no longer included in the result of a Table.join. (#250)
- Table.join right_key no longer necessary if identical to left_key. (#254)
- Table.inner_join is no more. Use the inner keyword to Table.join.
- Table.left_outer_join is now Table.join.
0.9.0 - September 14, 2015¶
- Add many missing unit tests. Up to 99% coverage.
- Add property accessors for TableSet.key_name and TableSet.key_type. (#247)
- Table.rows and Table.columns are now behind properties. (#247)
- Column.data_type is now a property. (#247)
- Table[Set].get_column_types() is now the Table[Set].column_types property. (#247)
- Table[Set].get_column_names() is now the Table[Set].column_names property. (#247)
- Table.pretty_print now displays consistent decimal places for each Number column.
- Discrete data types (Number, Date etc) are now right-aligned in Table.pretty_print.
- Implement aggregation result caching. (#245)
- Reimplement Percentiles, Quartiles, etc as aggregations.
- UnsupportedAggregationError is now used to disable TableSet aggregations.
- Replaced several exceptions with more general DataTypeError.
- Column type information can now be accessed as Column.data_type.
- Eliminated Column subclasses. Restructured around DataType classes.
- Table.merge implemented. (#9)
- Cookbook: guess column types. (#230)
- Fix issue where all group keys were being cast to text. (#235)
- Table.group_by will now default key_type to the type of the grouping column. (#234)
- Add Matt Riggott to AUTHORS. (#231)
- Support file-like objects in Table.to_csv and Table.from_csv. (#229)
- Fix bug when applying multiple computations with Table.compute.
0.8.0 - September 9, 2015¶
- Cookbook: dealing with locales. (#220)
- Cookbook: working with dates and times.
- Add timezone support to DateTimeType.
- Use pytimeparse instead of python-dateutil. (#221)
- Handle percents and currency symbols when casting numbers. (#217)
- Table.format is now Table.pretty_print. (#223)
- Rename TextType to Text, NumberType to Number, etc.
- Rename agate.ColumnType to agate.DataType (#216)
- Rename agate.column_types to agate.data_types.
- Implement locale support for number parsing. (#116)
- Cookbook: ranking. (#110)
- Cookbook: date change and date ranking. (#113)
- Add tests for unicode support. (#138)
- Fix computations.ZScores calculation. (#123)
- Differentiate sample and population variance and stdev. (#208)
- Support for overriding column inference with “force”.
- Competition ranking implemented as default. (#125)
- TypeTester: robust type inference. (#210)
0.7.0 - September 3, 2015¶
- Cookbook: USA Today diversity index.
- Cookbook: filter to top x%. (#47)
- Cookbook: fuzzy string search example. (#176)
- Values to coerce to true/false can now be overridden for BooleanType.
- Values to coerce to null can now be overridden for all ColumnType subclasses. (#206)
- Add key_type argument to TableSet and Table.group_by. (#205)
- Nested TableSet’s and multi-dimensional aggregates. (#204)
- TableSet.aggregate will now use key_name as the group column name. (#203)
- Added key_name argument to TableSet and Table.group_by.
- Added Length aggregation and removed count from TableSet.aggregate output. (#203)
- Fix error messages for RowDoesNotExistError and ColumnDoesNotExistError.
0.6.0 - September 1, 2015¶
- Fix missing package definition in setup.py.
- Split Analysis off into the proof library.
- Change computation now works with DateType, DateTimeType and TimeDeltaType. (#159)
- TimeDeltaType and TimeDeltaColumn implemented.
- NonNullAggregation class removed.
- Some private Column methods made public. (#183)
- Rename agate.aggegators to agate.aggregations.
- TableSet.to_csv implemented. (#195)
- TableSet.from_csv implemented. (#194)
- Table.to_csv implemented (#169)
- Table.from_csv implemented. (#168)
- Added Table.format method for pretty-printing tables. (#191)
- Analysis class now implements a caching workflow. (#171)
0.5.0 - August 28, 2015¶
- Table now takes (column_name, column_type) pairs. (#180)
- Renamed the library to agate. (#179)
- Results of common column operations are now cached using a common memoize decorator. (#162)
- Eliminated support for Python version 3.2.
- Added support for Python wheel packaging. (#127)
- Add PercentileRank computation and usage example to cookbook. (#152)
- Add indexed change example to cookbook. (#151)
- Add annual change example to cookbook. (#150)
- Column.aggregate now invokes Aggregations.
- Column.any, NumberColumn.sum, etc. converted to Aggregations.
- Implement Aggregation and subclasses. (#155)
- Move ColumnType subclasses and ColumnOperation subclasses into new modules.
- Table.percent_change, Table.rank and Table.zscores reimplemented as Computers.
- Computer implemented. Table.compute reimplemented. (#147)
- NumberColumn.iqr (inter-quartile range) implemented. (#102)
- Remove Column.counts as it is not the best way.
- Implement ColumnOperation and subclasses.
- Table.aggregate migrated to TableSet.aggregate.
- Table.group_by now supports grouping by a key function. (#140)
- NumberColumn.deciles implemented.
- NumberColumn.quintiles implemented. (#46)
- NumberColumn.quartiles implemented. (#45)
- Added robust test case for NumberColumn.percentiles. (#129)
- NumberColumn.percentiles reimplemented using new method. (#130)
- Reorganized and modularized column implementations.
- Table.group_by now returns a TableSet.
- Implement TableSet object. (#141)
0.4.0 - September 27, 2014¶
- Upgrade to python-dateutil 2.2. (#134)
- Wrote introductory tutorial. (#133)
- Reorganize documentation (#132)
- Add John Heasly to AUTHORS.
- Implement percentile. (#35)
- no_null_computations now accepts args. (#122)
- Table.z_scores implemented. (#123)
- DateTimeColumn implemented. (#23)
- Column.counts now returns dict instead of Table. (#109)
- ColumnType.create_column renamed _create_column. (#118)
- Added Mick O’Brien to AUTHORS. (#121)
- Pearson correlation implemented. (#103)
0.3.0¶
- DateType.date_format implemented. (#112)
- Create ColumnType classes to simplify data parsing.
- DateColumn implemented. (#7)
- Cookbook: Excel pivot tables. (#41)
- Cookbook: statistics, including outlier detection. (#82)
- Cookbook: emulating Underscore’s any and all. (#107)
- Parameter documentation for method parameters. (#108)
- Table.rank now accepts a column name or key function.
- Optionally use cdecimal for improved performance. (#106)
- Smart naming of aggregate columns.
- Duplicate columns names are now an error. (#92)
- BooleanColumn implemented. (#6)
- TextColumn.max_length implemented. (#95)
- Table.find implemented. (#14)
- Better error handling in Table.__init__. (#38)
- Collapse IntColumn and FloatColumn into NumberColumn. (#64)
- Table.mad_outliers implemented. (#93)
- Column.mad implemented. (#93)
- Table.stdev_outliers implemented. (#86)
- Table.group_by implemented. (#3)
- Cookbook: emulating R. (#81)
- Table.left_outer_join now accepts column names or key functions. (#80)
- Table.inner_join now accepts column names or key functions. (#80)
- Table.distinct now accepts a column name or key function. (#80)
- Table.order_by now accepts a column name or key function. (#80)
- Table.rank implemented. (#15)
- Reached 100% test coverage. (#76)
- Tests for Column._cast methods. (#20)
- Table.distinct implemented. (#83)
- Use assertSequenceEqual in tests. (#84)
- Docs: features section. (#87)
- Cookbook: emulating SQL. (#79)
- Table.left_outer_join implemented. (#11)
- Table.inner_join implemented. (#11)
0.2.0¶
- Python 3.2, 3.3 and 3.4 support. (#52)
- Documented supported platforms.
- Cookbook: csvkit. (#36)
- Cookbook: glob syntax. (#28)
- Cookbook: filter to values in range. (#30)
- RowDoesNotExistError implemented. (#70)
- ColumnDoesNotExistError implemented. (#71)
- Cookbook: percent change. (#67)
- Cookbook: sampling. (#59)
- Cookbook: random sort order. (#68)
- Eliminate Table.get_data.
- Use tuples everywhere. (#66)
- Fixes for Python 2.6 compatibility. (#53)
- Cookbook: multi-column sorting. (#13)
- Cookbook: simple sorting.
- Destructive Table ops now deepcopy row data. (#63)
- Non-destructive Table ops now share row data. (#63)
- Table.sort_by now accepts a function. (#65)
- Cookbook: pygal.
- Cookbook: Matplotlib.
- Cookbook: VLOOKUP. (#40)
- Cookbook: Excel formulas. (#44)
- Cookbook: Rounding to two decimal places. (#49)
- Better repr for Column and Row. (#56)
- Cookbook: Filter by regex. (#27)
- Cookbook: Underscore filter & reject. (#57)
- Table.limit implemented. (#58)
- Cookbook: writing a CSV. (#51)
- Kill Table.filter and Table.reject. (#55)
- Column.map removed. (#43)
- Column instance & data caching implemented. (#42)
- Table.select implemented. (#32)
- Eliminate repeated column index lookups. (#25)
- Precise DecimalColumn tests.
- Use Decimal type everywhere internally.
- FloatColumn converted to DecimalColumn. (#17)
- Added Eric Sagara to AUTHORS. (#48)
- NumberColumn.variance implemented. (#1)
- Cookbook: loading a CSV. (#37)
- Table.percent_change implemented. (#16)
- Table.compute implemented. (#31)
- Table.filter and Table.reject now take funcs. (#24)
- Column.count implemented. (#12)
- Column.counts implemented. (#8)
- Column.all implemented. (#5)
- Column.any implemented. (#4)
- Added Jeff Larson to AUTHORS. (#18)
- NumberColumn.mode implemented. (#18)
0.1.0¶
- Initial prototype
Show me docs¶
- About - why you should use agate and the principles that guide its development
- Install - how to install for users and developers
- Tutorial - a step-by-step guide to start using agate
- Cookbook - sample code showing how to accomplish dozens of common tasks, including comparisons to SQL, R, etc.
- Extensions - a list of libraries that extend agate functionality and how to build your own
- API - technical documentation for every agate feature
- Changelog - a record of every change made to agate for each release
Show me code¶
import agate

# Load the source CSV; agate infers column types automatically.
purchases = agate.Table.from_csv('examples/realdata/ks_1033_data.csv')

# Group rows by county, then sum total_cost within each group.
by_county = purchases.group_by('county')
totals = by_county.aggregate([
    ('county_cost', agate.Sum('total_cost'))
])

# Rank counties from most to least spent and chart the top ten.
totals = totals.order_by('county_cost', reverse=True)
totals.limit(10).print_bars('county', 'county_cost', width=80)
county county_cost
SEDGWICK 977,174.45 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
COFFEY 691,749.03 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
MONTGOMERY 447,581.20 ▓░░░░░░░░░░░░░░░░░░░░░░░░░
JOHNSON 420,628.00 ▓░░░░░░░░░░░░░░░░░░░░░░░░
SALINE 245,450.24 ▓░░░░░░░░░░░░░░
FINNEY 171,862.20 ▓░░░░░░░░░░
BROWN 145,254.96 ▓░░░░░░░░
KIOWA 97,974.00 ▓░░░░░
WILSON 74,747.10 ▓░░░░
FORD 70,780.00 ▓░░░░
+-------------+-------------+-------------+-------------+
0 250,000 500,000 750,000 1,000,000
This example, along with detailed comments, is available as a Jupyter notebook.
Join us¶
- Contributing - guidance for developers who want to contribute to agate
- Release process - the process for maintainers to publish new releases
- License - a copy of the MIT open source license covering agate
Who we are¶
agate is made by a community. The following individuals have contributed code, documentation, or expertise to agate: