journalism 0.3.0 (alpha)

About

journalism is a Python library that takes the horror out of basic data analysis and manipulation. It is an alternative to numpy and pandas that optimizes for human performance, rather than CPU performance.

It is inspired by underscore.js and all the other libraries that know how to get the hell out of the way and let us do Journalism.

Important links:

Features

Why use journalism?

  • A clean, readable API.
  • Optimized for exploratory use in the shell.
  • A full set of SQL-like operations.
  • Full unicode support.
  • Decimal precision everywhere.
  • Pure Python. It works everywhere.
  • 100% test coverage.
  • Extensive user documentation.
  • Access to the full power of Python in every command.
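
The "Decimal precision everywhere" point is easiest to see with the standard library alone. The snippet below is a minimal sketch that does not use journalism at all; it only shows the kind of rounding surprise that binary floats produce and that decimal arithmetic avoids.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimals built from strings keep the exact values that were typed.
decimal_total = Decimal('0.1') + Decimal('0.2')
print(decimal_total == Decimal('0.3'))  # True
```

For money, counts and percentages, which make up most newsroom data, the second behavior is usually the one people expect.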

Principles

journalism is intended to fill a very particular programming niche: that of non-professional data analysts who need to get shit done quickly. These are the principles of its development:

  • Humans have less time than computers. Always optimize for humans.
  • Most datasets are simple and small. Never optimize for quants.
  • Text is data. It must be a first-class citizen.
  • Python gets it right. Make it work like Python does.
  • Humans are busy, stupid, lazy, etc. It must be easy.
  • Mutability is confusion. Processes that alter data must create new copies.
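
The last principle can be illustrated with a rough sketch. `TinyTable` below is a hypothetical stand-in written for this example, not journalism's actual implementation; it only assumes that operations such as `where` and `order_by` hand back a new object and leave the original alone.

```python
# Hypothetical stand-in for a table; not journalism's actual implementation.
class TinyTable(object):
    def __init__(self, rows):
        # Store rows as a tuple of tuples so they cannot be changed in place.
        self.rows = tuple(tuple(r) for r in rows)

    def where(self, test):
        # Return a new TinyTable; self is left untouched.
        return TinyTable(r for r in self.rows if test(r))

    def order_by(self, key, reverse=False):
        # Sorting also produces a new TinyTable rather than reordering self.
        return TinyTable(sorted(self.rows, key=key, reverse=reverse))


original = TinyTable([('a', 3), ('b', 1), ('c', 2)])
big = original.where(lambda r: r[1] > 1)
ranked = big.order_by(lambda r: r[1], reverse=True)

print(original.rows)  # (('a', 3), ('b', 1), ('c', 2))
print(ranked.rows)    # (('a', 3), ('c', 2))
```

Because every step returns a fresh object, any intermediate result can be kept around and re-examined in the shell without worrying that a later step changed it.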

But why not...

  • numpy: It’s hard.
  • pandas: It’s hard.
  • R: Don’t even get me started.
  • SAS: You have that kind of money?
  • SQL: It’s not code.
  • An ORM: Have you actually tried this?

I’m not reinventing the wheel, I’m just putting on the right size tires.

Installation

Users

If you only want to use journalism, install it this way:

pip install journalism

Note

Need more speed? If you’re running Python 2.6, 2.7 or 3.2, you can pip install cdecimal for a significant speed boost. This isn’t installed automatically because it can create additional complications.
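
If your own scripts also do decimal math, one common way to pick up cdecimal when it is installed is an import fallback. This is a general Python pattern shown as a sketch; it is not necessarily how journalism handles it internally.

```python
try:
    # cdecimal is a separate install (pip install cdecimal).
    from cdecimal import Decimal
except ImportError:
    # Fall back to the standard library's decimal module.
    from decimal import Decimal

print(Decimal('1.10') + Decimal('2.20'))  # 3.30
```

Either way the rest of the script uses the same `Decimal` name, so nothing else has to change.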

Developers

If you are a developer who also wants to hack on journalism, install it this way:

git clone git://github.com/onyxfish/journalism.git
cd journalism
mkvirtualenv --no-site-packages journalism
pip install -r requirements.txt
python setup.py develop
nosetests --with-coverage --cover-package=journalism

Supported platforms

journalism supports the following versions of Python:

  • Python 2.6+
  • Python 3.2+
  • Latest PyPy

It is tested on OS X, but due to its minimal dependencies it should work fine on both Linux and Windows.

Usage

Here is an example of how to use journalism, using financial aid data from data.gov:

#!/usr/bin/env python

import csv

from journalism import Table, TextType, NumberType

text_type = TextType()
number_type = NumberType()

COLUMNS = (
    ('state', text_type),
    ('state_abbr', text_type),
    ('9_11_gi_bill1', number_type),
    ('montgomery_gi_bill_active', number_type),
    ('montgomery_gi_bill_reserve', number_type),
    ('dependants', number_type),
    ('reserve', number_type),
    ('vietnam', number_type),
    ('total', number_type)
)

COLUMN_NAMES = tuple(c[0] for c in COLUMNS)
COLUMN_TYPES = tuple(c[1] for c in COLUMNS)

with open('examples/realdata/Datagov_FY10_EDU_recp_by_State.csv') as f:
    # Skip headers
    next(f)
    next(f)
    next(f)

    rows = list(csv.reader(f))

# Trim cruft off end
rows = rows[:-2]

# Create the table
table = Table(rows, COLUMN_TYPES, COLUMN_NAMES)

# Remove Philippines and Puerto Rico
states = table.where(lambda r: r['state_abbr'] not in ('PR', 'PH'))

# Sum total of all states
print('Total of all states: %i' % states.columns['total'].sum())

# Sort state total, descending
order_by_total_desc = states.order_by('total', reverse=True)

# Grab just the top 5 states
top_five = order_by_total_desc.rows[:5]

# Number the ranks from 1 rather than 0
for i, row in enumerate(top_five, 1):
    print('# %i: %s %i' % (i, row['state'], row['total']))

with open('sorted.csv', 'w') as f:
    writer = csv.writer(f)

    writer.writerow(order_by_total_desc.get_column_names())
    writer.writerows(order_by_total_desc.rows)

# Grab just the bottom state
last_place = order_by_total_desc.rows[-1]

print('Lowest state: %(state)s %(total)i' % last_place)

# Calculate the standard deviation of the state totals
stdev = states.columns['total'].stdev()

print('Standard deviation of totals: %.2f' % stdev)
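
The loading steps at the top of the example (skip three header lines, read the rest, trim two trailing rows) can be tried without the data file. The sketch below uses only the standard library and a made-up in-memory sample; the state names and numbers are invented for illustration and are not taken from the data.gov file.

```python
import csv
import io

# Made-up stand-in for the real file: three header lines, data, two footer lines.
SAMPLE = u"""Title line
Subtitle line
state,state_abbr,total
Alpha,AA,10
Beta,BB,30
Gamma,GG,20
Footer note
Source note
"""

f = io.StringIO(SAMPLE)

# Skip the three header lines, as in the example above.
next(f)
next(f)
next(f)

rows = list(csv.reader(f))

# Trim the two footer rows.
rows = rows[:-2]

print(rows)  # [['Alpha', 'AA', '10'], ['Beta', 'BB', '30'], ['Gamma', 'GG', '20']]
```

Every value is still a string at this point. In the example above, turning them into numbers is presumably the job of the column types passed to `Table`.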

Contributing

Want to hack on journalism? Here’s how:

Authors

The following individuals have contributed code to journalism:

  • Christopher Groskopf
  • Jeff Larson
  • Eric Sagara

License

The MIT License

Copyright (c) 2014 Christopher Groskopf and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Changelog

0.3.0

  • DateType.date_format implemented. (#112)
  • Create ColumnType classes to simplify data parsing.
  • DateColumn implemented. (#7)
  • Cookbook: Excel pivot tables. (#41)
  • Cookbook: statistics, including outlier detection. (#82)
  • Cookbook: emulating Underscore’s any and all. (#107)
  • Documentation for method parameters. (#108)
  • Table.rank now accepts a column name or key function.
  • Optionally use cdecimal for improved performance. (#106)
  • Smart naming of aggregate columns.
  • Duplicate column names are now an error. (#92)
  • BooleanColumn implemented. (#6)
  • TextColumn.max_length implemented. (#95)
  • Table.find implemented. (#14)
  • Better error handling in Table.__init__. (#38)
  • Collapse IntColumn and FloatColumn into NumberColumn. (#64)
  • Table.mad_outliers implemented. (#93)
  • Column.mad implemented. (#93)
  • Table.stdev_outliers implemented. (#86)
  • Table.group_by implemented. (#3)
  • Cookbook: emulating R. (#81)
  • Table.left_outer_join now accepts column names or key functions. (#80)
  • Table.inner_join now accepts column names or key functions. (#80)
  • Table.distinct now accepts a column name or key function. (#80)
  • Table.order_by now accepts a column name or key function. (#80)
  • Table.rank implemented. (#15)
  • Reached 100% test coverage. (#76)
  • Tests for Column._cast methods. (#20)
  • Table.distinct implemented. (#83)
  • Use assertSequenceEqual in tests. (#84)
  • Docs: features section. (#87)
  • Cookbook: emulating SQL. (#79)
  • Table.left_outer_join implemented. (#11)
  • Table.inner_join implemented. (#11)

0.2.0

  • Python 3.2, 3.3 and 3.4 support. (#52)
  • Documented supported platforms.
  • Cookbook: csvkit. (#36)
  • Cookbook: glob syntax. (#28)
  • Cookbook: filter to values in range. (#30)
  • RowDoesNotExistError implemented. (#70)
  • ColumnDoesNotExistError implemented. (#71)
  • Cookbook: percent change. (#67)
  • Cookbook: sampling. (#59)
  • Cookbook: random sort order. (#68)
  • Eliminate Table.get_data.
  • Use tuples everywhere. (#66)
  • Fixes for Python 2.6 compatibility. (#53)
  • Cookbook: multi-column sorting. (#13)
  • Cookbook: simple sorting.
  • Destructive Table ops now deepcopy row data. (#63)
  • Non-destructive Table ops now share row data. (#63)
  • Table.sort_by now accepts a function. (#65)
  • Cookbook: pygal.
  • Cookbook: Matplotlib.
  • Cookbook: VLOOKUP. (#40)
  • Cookbook: Excel formulas. (#44)
  • Cookbook: Rounding to two decimal places. (#49)
  • Better repr for Column and Row. (#56)
  • Cookbook: Filter by regex. (#27)
  • Cookbook: Underscore filter & reject. (#57)
  • Table.limit implemented. (#58)
  • Cookbook: writing a CSV. (#51)
  • Kill Table.filter and Table.reject. (#55)
  • Column.map removed. (#43)
  • Column instance & data caching implemented. (#42)
  • Table.select implemented. (#32)
  • Eliminate repeated column index lookups. (#25)
  • Precise DecimalColumn tests.
  • Use Decimal type everywhere internally.
  • FloatColumn converted to DecimalColumn. (#17)
  • Added Eric Sagara to AUTHORS. (#48)
  • NumberColumn.variance implemented. (#1)
  • Cookbook: loading a CSV. (#37)
  • Table.percent_change implemented. (#16)
  • Table.compute implemented. (#31)
  • Table.filter and Table.reject now take funcs. (#24)
  • Column.count implemented. (#12)
  • Column.counts implemented. (#8)
  • Column.all implemented. (#5)
  • Column.any implemented. (#4)
  • Added Jeff Larson to AUTHORS. (#18)
  • NumberColumn.mode implemented. (#18)

0.1.0

  • Initial prototype
