Common descriptive and aggregate statistics are included with the core agate library. For additional statistical methods beyond the scope of agate consider using the agate-stats extension or integrating with scipy.
agate includes a full set of standard descriptive statistics that can be applied to any column containing
table.aggregate(Sum('salary')) table.aggregate(Min('salary')) table.aggregate(Max('salary')) table.aggregate(Mean('salary')) table.aggregate(Median('salary')) table.aggregate(Mode('salary')) table.aggregate(Variance('salary')) table.aggregate(StDev('salary')) table.aggregate(MAD('salary'))
Or, get several at once:
table.aggregate([ Min('salary'), Mean('salary'), Max('salary') ])
You can also generate aggregate statistics for subsets of data (sometimes referred to as “rolling up”):
doctors = patients.group_by('doctor') patient_ages = doctors.aggregate([ ('patient_count', agate.Count()) ('age_mean', agate.Mean('age')), ('age_median', agate.Median('age')) ])
The resulting table will have four columns:
Distribution by count (frequency)¶
Counting the number of each unique value in a column can be accomplished with the
# Counts of a single column's values table.pivot('doctor') # Counts of all combinations of more than one column's values table.pivot(['doctor', 'hospital'])
The resulting tables will have a column for each key column and another
Count column counting the number of instances of each value.
Distribution by percent¶
Table.pivot() can also be used to calculate the distribution of values as a percentage of the total number:
# Percents of a single column's values table.pivot('doctor', computation=Percent('Count')) # Percents of all combinations of more than one column's values table.pivot(['doctor', 'hospital'], computation=Percent('Count'))
The output table will be the same format as the previous example, except the value column will be named
The agate-stats extension adds methods for finding outliers.
import agatestats agatestats.patch() outliers = table.stdev_outliers('salary', deviations=3, reject=False)
reject=True you can instead return a table including only those values not identified as outliers.
not_outliers = table.stdev_outliers('salary', deviations=3, reject=True)
The second, more robust, method for identifying outliers is by identifying values which are more than some number of “median absolute deviations” from the median (typically 3).
outliers = table.mad_outliers('salary', deviations=3, reject=False)
As with the first example, you can specify
reject=True to exclude outliers in the resulting table.
You can also generate custom aggregated statistics for your data by defining your own ‘summary’ aggregation. This might be especially useful for performing calculations unique to your data. Here’s a simple example:
# Create a custom summary aggregation with agate.Summary # Input a column name, a return data type and a function to apply on the column count_millionaires = agate.Summary('salary', agate.Number(), lambda r: sum(salary > 1000000 for salary in r.values())) table.aggregate([ count_millionaires ])
Your custom aggregation can be used to determine both descriptive and aggregate statistics shown above.