Filter rows¶

By regex¶

You can use Python’s builtin re module to introduce a regular expression into a Table.where() query.

For example, here we find all states that start with “C”.

import re

new_table = table.where(lambda row: re.match('^C', row['state']))

This can also be useful for finding values that don’t match your expectations. For example, finding all values in the “phone number” column that don’t look like phone numbers:

new_table = table.where(lambda row: not re.match('\d{3}-\d{3}-\d{4}', row['phone']))

By glob¶

Hate regexes? You can use glob (fnmatch) syntax too!

from fnmatch import fnmatch

new_table = table.where(lambda row: fnmatch('C*', row['state']))

Values within a range¶

This snippet filters the dataset to incomes between 100,000 and 200,000.

new_table = table.where(lambda row: 100000 < row['income'] < 200000)

Dates within a range¶

This snippet filters the dataset to events during the summer of 2015:

import datetime

new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))

If you want to filter to events during the summer of any year:

new_table = table.where(lambda row: 6 <= row['date'].month <= 8)

Top N percent¶

To filter a dataset to the top 10% percent of values we first compute the percentiles for the column and then use the result in the Table.where() truth test:

percentiles = table.aggregate(agate.Percentiles('salary'))
top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])

Random sample¶

By combining a random sort with limiting, we can effectively get a random sample from a table.

import random

randomized = table.order_by(lambda row: random.random())
sampled = table.limit(10)

Ordered sample¶

With can also get an ordered sample by simply using the step parameter of the Table.limit() method to get every Nth row.

sampled = table.limit(step=10)

Distinct values¶

You can retrieve a distinct list of values in a column using Column.values_distinct() or Table.distinct().

Table.distinct() returns the entire row so it’s necessary to chain a select on the specific column.

columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)

new_table.columns['value'].values_distinct()
# or
new_table.distinct('value').columns['value'].values()
(Decimal('1'), Decimal('2'), Decimal('5'))