Transform¶
Pivot by a single column¶
The Table.pivot()
method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:
name | race | gender | age |
---|---|---|---|
Joe | white | female | 20 |
Jane | asian | male | 20 |
Jill | black | female | 20 |
Jim | latino | male | 25 |
Julia | black | female | 25 |
Joan | asian | female | 25 |
In the very simplest case, this table can be pivoted to count the number occurences of values in a column:
transformed = table.pivot('race')
Result:
race | pivot |
---|---|
white | 1 |
asian | 2 |
black | 2 |
latino | 1 |
Pivot by multiple columns¶
You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:
transformed = table.pivot(['race', 'gender'])
Result:
race | gender | pivot |
---|---|---|
white | female | 1 |
asian | male | 1 |
black | female | 2 |
latino | male | 1 |
asian | female | 1 |
For the column, version you would do:
transformed = table.pivot('race', 'gender')
Result:
race | male | female |
---|---|---|
white | 0 | 1 |
asian | 1 | 1 |
black | 0 | 2 |
latino | 1 | 0 |
Pivot to sum¶
The default pivot aggregation is Count
but you can also supply other operations. For example, to aggregate each group by Sum
of their ages:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
race | male | female |
---|---|---|
white | 0 | 20 |
asian | 20 | 25 |
black | 0 | 45 |
latino | 25 | 0 |
Pivot to percent of total¶
Pivot allows you to apply a Computation
to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
race | male | female |
---|---|---|
white | 0 | 14.8 |
asian | 14.8 | 18.4 |
black | 0 | 33.3 |
latino | 18.4 | 0 |
Note: actual computed percentages will be much more precise.
It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.
Denormalize key/value columns into separate columns¶
It’s common for very large datasets to be distributed in a “normalized” format, such as:
name | property | value |
---|---|---|
Jane | gender | female |
Jane | race | black |
Jane | age | 24 |
… | … | … |
The Table.denormalize()
method can be used to transform the table so that each unique property has its own column.
transformed = table.denormalize('name', 'property', 'value')
Result:
name | gender | race | age |
---|---|---|---|
Jane | female | black | 24 |
Jack | male | white | 35 |
Joe | male | black | 28 |
Normalize separate columns into key/value columns¶
Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:
name | gender | race | age |
---|---|---|---|
Jane | female | black | 24 |
Jack | male | white | 35 |
Joe | male | black | 28 |
The Table.normalize()
method can be used to transform the table so that all the properties and their values share two columns.
transformed = table.normalize('name', ['gender', 'race', 'age'])
Result:
name | property | value |
---|---|---|
Jane | gender | female |
Jane | race | black |
Jane | age | 24 |
… | … | … |