Transform

Pivot by a single column

The Table.pivot() method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:

name race gender age
Joe white female 20
Jane asian male 20
Jill black female 20
Jim latino male 25
Julia black female 25
Joan asian female 25

In the very simplest case, this table can be pivoted to count the number occurences of values in a column:

transformed = table.pivot('race')

Result:

race pivot
white 1
asian 2
black 2
latino 1

Pivot by multiple columns

You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:

transformed = table.pivot(['race', 'gender'])

Result:

race gender pivot
white female 1
asian male 1
black female 2
latino male 1
asian female 1

For the column, version you would do:

transformed = table.pivot('race', 'gender')

Result:

race male female
white 0 1
asian 1 1
black 0 2
latino 1 0

Pivot to sum

The default pivot aggregation is Count but you can also supply other operations. For example, to aggregate each group by Sum of their ages:

transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
race male female
white 0 20
asian 20 25
black 0 45
latino 25 0

Pivot to percent of total

Pivot allows you to apply a Computation to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:

transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
race male female
white 0 14.8
asian 14.8 18.4
black 0 33.3
latino 18.4 0

Note: actual computed percentages will be much more precise.

It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.

Denormalize key/value columns into separate columns

It’s common for very large datasets to be distributed in a “normalized” format, such as:

name property value
Jane gender female
Jane race black
Jane age 24

The Table.denormalize() method can be used to transform the table so that each unique property has its own column.

transformed = table.denormalize('name', 'property', 'value')

Result:

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

Normalize separate columns into key/value columns

Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:

name gender race age
Jane female black 24
Jack male white 35
Joe male black 28

The Table.normalize() method can be used to transform the table so that all the properties and their values share two columns.

transformed = table.normalize('name', ['gender', 'race', 'age'])

Result:

name property value
Jane gender female
Jane race black
Jane age 24