journalism 0.2.0 (pre-alpha)
journalism is a Python library that takes the horror out of basic data analysis and manipulation. It is an alternative to numpy and pandas that optimizes for human performance, rather than CPU performance.
It is inspired by underscore.js and all the other libraries that know how to get the hell out of the way and let us do Journalism.
journalism is intended to fill a very particular programming niche: that of non-professional data analysts who need to get shit done quickly. These are the principles of its development:
- Humans have less time than computers. Always optimize for humans.
- Most datasets are simple and small. Never optimize for quants.
- Text is data. It must be a first-class citizen.
- Python gets it right. Make it work like Python does.
- Humans are busy, stupid, lazy, etc. It must be easy.
- Mutability is confusion. Processes that alter data must create new copies.
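The last principle can be shown without journalism at all. What follows is a minimal sketch in plain Python, assuming rows are stored as tuples; `with_column` is a hypothetical helper made up for this illustration, not part of journalism's API.

```python
# Hypothetical helper for illustration only; not journalism's implementation.
def with_column(rows, compute):
    """Return new rows, each extended with compute(row).

    The input rows are tuples, so they cannot be altered in place.
    """
    return tuple(row + (compute(row),) for row in rows)


original = (('a', 1, 2), ('b', 3, 4))

# Add a computed column holding the sum of the two numeric values.
extended = with_column(original, lambda r: r[1] + r[2])

print(original)  # (('a', 1, 2), ('b', 3, 4))
print(extended)  # (('a', 1, 2, 3), ('b', 3, 4, 7))
```

Because the operation hands back a new object, the original data is still there if the result turns out to be wrong.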
But why not...
- numpy: It’s hard.
- pandas: It’s hard.
- R: Don’t even get me started.
- SAS: You have that kind of money?
- SQL: It’s not code.
- An ORM: Have you actually tried this?
I’m not reinventing the wheel, I’m just putting on the right size tires.
If you only want to use journalism, install it this way:
pip install journalism
If you are a developer who also wants to hack on journalism, install it this way:
git clone git://github.com/onyxfish/journalism.git
cd journalism
mkvirtualenv --no-site-packages journalism
pip install -r requirements.txt
python setup.py develop
nosetests --with-coverage --cover-package=journalism
Here is an example of how to use journalism, using financial aid data from data.gov:
#!/usr/bin/env python

import csv

from journalism import Table, TextColumn, IntColumn

# Each entry pairs a column name with its column type
COLUMNS = (
    ('state', TextColumn),
    ('state_abbr', TextColumn),
    ('9_11_gi_bill1', IntColumn),
    ('montgomery_gi_bill_active', IntColumn),
    ('montgomery_gi_bill_reserve', IntColumn),
    ('dependants', IntColumn),
    ('reserve', IntColumn),
    ('vietnam', IntColumn),
    ('total', IntColumn)
)

# Split the pairs into parallel tuples of names and types
COLUMN_NAMES = tuple(c[0] for c in COLUMNS)
COLUMN_TYPES = tuple(c[1] for c in COLUMNS)

with open('examples/realdata/Datagov_FY10_EDU_recp_by_State.csv') as f:
    # Skip headers
    next(f)
    next(f)
    next(f)

    rows = list(csv.reader(f))

# Trim cruft off end
rows = rows[:-2]

# Create the table
table = Table(rows, COLUMN_TYPES, COLUMN_NAMES, cast=True)

# Remove Philippines and Puerto Rico
states = table.where(lambda r: r['state_abbr'] not in ('PR', 'PH'))

# Sum total of all states
print('Total of all states: %i' % states.columns['total'].sum())

# Sort state total, descending
order_by_total_desc = states.order_by(lambda r: r['total'], reverse=True)

# Grab just the top 5 states
top_five = order_by_total_desc.rows[:5]

# Number the ranks from 1 rather than 0
for i, row in enumerate(top_five, 1):
    print('# %i: %s %i' % (i, row['state'], row['total']))

# Write the sorted table out as a CSV
with open('sorted.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(order_by_total_desc.get_column_names())
    writer.writerows(order_by_total_desc.rows)

# Grab just the bottom state
last_place = order_by_total_desc.rows[-1]

print('Lowest state: %(state)s %(total)i' % last_place)

# Calculate the standard deviation for the state totals
stdev = states.columns['total'].stdev()

print('Standard deviation of totals: %.2f' % stdev)
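To make clear what the filter, sum and sort steps in that example compute, here is a rough standard-library equivalent. It is only a sketch: the rows are made-up placeholder values (not the data.gov figures), columns are addressed by position rather than by name, and none of this reflects how journalism implements these operations internally.

```python
from operator import itemgetter

# Made-up placeholder rows: (state, state_abbr, total).
rows = (
    ('Alpha', 'AA', 300),
    ('Beta', 'BB', 100),
    ('Puerto Rico', 'PR', 50),
    ('Gamma', 'CC', 200),
)

# Rough equivalent of table.where(...): keep only rows that pass the test.
states = tuple(r for r in rows if r[1] not in ('PR', 'PH'))

# Rough equivalent of states.columns['total'].sum().
total = sum(r[2] for r in states)

# Rough equivalent of states.order_by(..., reverse=True).
# sorted() builds a new list and leaves `states` in its original order.
by_total_desc = sorted(states, key=itemgetter(2), reverse=True)

# Slice off the top two rows.
top_two = by_total_desc[:2]

print(total)                     # 600
print([r[0] for r in top_two])   # ['Alpha', 'Gamma']
```

The journalism version reads better mostly because rows can be addressed by column name (`r['total']`) instead of by index.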
Need some more specific examples? Try these out:
- The basics
- Modifying data
- Plotting with matplotlib
- Plotting with pygal
- Emulating Excel
- Emulating Underscore.js
Want to hack on journalism? Here’s how:
The MIT License
Copyright (c) 2014 Christopher Groskopf and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Changelog
- Python 3.2, 3.3 and 3.4 support. (#52)
- Documented supported platforms.
- Cookbook: csvkit. (#36)
- Cookbook: glob syntax. (#28)
- Cookbook: filter to values in range. (#30)
- RowDoesNotExistError implemented. (#70)
- ColumnDoesNotExistError implemented. (#71)
- Cookbook: percent change. (#67)
- Cookbook: sampling. (#59)
- Cookbook: random sort order. (#68)
- Eliminate Table.get_data.
- Use tuples everywhere. (#66)
- Fixes for Python 2.6 compatibility. (#53)
- Cookbook: multi-column sorting. (#13)
- Cookbook: simple sorting.
- Destructive Table ops now deepcopy row data. (#63)
- Non-destructive Table ops now share row data. (#63)
- Table.sort_by now accepts a function. (#65)
- Cookbook: pygal.
- Cookbook: Matplotlib.
- Cookbook: VLOOKUP. (#40)
- Cookbook: Excel formulas. (#44)
- Cookbook: Rounding to two decimal places. (#49)
- Better repr for Column and Row. (#56)
- Cookbook: Filter by regex. (#27)
- Cookbook: Underscore filter & reject. (#57)
- Table.limit implemented. (#58)
- Cookbook: writing a CSV. (#51)
- Kill Table.filter and Table.reject. (#55)
- Column.map removed. (#43)
- Column instance & data caching implemented. (#42)
- Table.select implemented. (#32)
- Eliminate repeated column index lookups. (#25)
- Precise DecimalColumn tests.
- Use Decimal type everywhere internally.
- FloatColumn converted to DecimalColumn. (#17)
- Added Eric Sagara to AUTHORS. (#48)
- NumberColumn.variance implemented. (#1)
- Cookbook: loading a CSV. (#37)
- Table.percent_change implemented. (#16)
- Table.compute implemented. (#31)
- Table.filter and Table.reject now take funcs. (#24)
- Column.count implemented. (#12)
- Column.counts implemented. (#8)
- Column.all implemented. (#5)
- Column.any implemented. (#4)
- Added Jeff Larson to AUTHORS. (#18)
- NumberColumn.mode implemented. (#18)
- Initial prototype