Transform#
Pivot by a single column#
The Table.pivot()
method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:
name |
race |
gender |
age |
---|---|---|---|
Joe |
white |
female |
20 |
Jane |
asian |
male |
20 |
Jill |
black |
female |
20 |
Jim |
latino |
male |
25 |
Julia |
black |
female |
25 |
Joan |
asian |
female |
25 |
In the very simplest case, this table can be pivoted to count the number occurences of values in a column:
transformed = table.pivot('race')
Result:
race |
pivot |
---|---|
white |
1 |
asian |
2 |
black |
2 |
latino |
1 |
Pivot by multiple columns#
You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:
transformed = table.pivot(['race', 'gender'])
Result:
race |
gender |
pivot |
---|---|---|
white |
female |
1 |
asian |
male |
1 |
black |
female |
2 |
latino |
male |
1 |
asian |
female |
1 |
For the column, version you would do:
transformed = table.pivot('race', 'gender')
Result:
race |
male |
female |
---|---|---|
white |
0 |
1 |
asian |
1 |
1 |
black |
0 |
2 |
latino |
1 |
0 |
Pivot to sum#
The default pivot aggregation is Count
but you can also supply other operations. For example, to aggregate each group by Sum
of their ages:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
race |
male |
female |
---|---|---|
white |
0 |
20 |
asian |
20 |
25 |
black |
0 |
45 |
latino |
25 |
0 |
Pivot to percent of total#
Pivot allows you to apply a Computation
to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
race |
male |
female |
---|---|---|
white |
0 |
14.8 |
asian |
14.8 |
18.4 |
black |
0 |
33.3 |
latino |
18.4 |
0 |
Note: actual computed percentages will be much more precise.
It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.
Denormalize key/value columns into separate columns#
It’s common for very large datasets to be distributed in a “normalized” format, such as:
name |
property |
value |
---|---|---|
Jane |
gender |
female |
Jane |
race |
black |
Jane |
age |
24 |
… |
… |
… |
The Table.denormalize()
method can be used to transform the table so that each unique property has its own column.
transformed = table.denormalize('name', 'property', 'value')
Result:
name |
gender |
race |
age |
---|---|---|---|
Jane |
female |
black |
24 |
Jack |
male |
white |
35 |
Joe |
male |
black |
28 |
Normalize separate columns into key/value columns#
Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:
name |
gender |
race |
age |
---|---|---|---|
Jane |
female |
black |
24 |
Jack |
male |
white |
35 |
Joe |
male |
black |
28 |
The Table.normalize()
method can be used to transform the table so that all the properties and their values share two columns.
transformed = table.normalize('name', ['gender', 'race', 'age'])
Result:
name |
property |
value |
---|---|---|
Jane |
gender |
female |
Jane |
race |
black |
Jane |
age |
24 |
… |
… |
… |