journalism 0.3.0 (alpha)
journalism is a Python library that takes the horror out of basic data analysis and manipulation. It is an alternative to numpy and pandas that optimizes for human performance, rather than CPU performance.
It is inspired by underscore.js and all the other libraries that know how to get the hell out of the way and let us do Journalism.
Why use journalism?
- A clean, readable API.
- Optimized for exploratory use in the shell.
- A full set of SQL-like operations.
- Full unicode support.
- Decimal precision everywhere.
- Pure Python. It works everywhere.
- 100% test coverage.
- Extensive user documentation.
- Access to the full power of Python in every command.
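The "Decimal precision everywhere" point is easiest to see next to binary floats. The sketch below uses only the standard library's decimal module, not journalism itself, so it illustrates the motivation rather than the library's internals.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2
float_matches = (float_sum == 0.3)

# Decimals built from strings keep the exact base-10 value.
decimal_sum = Decimal('0.1') + Decimal('0.2')
decimal_matches = (decimal_sum == Decimal('0.3'))

print(float_matches)    # False
print(decimal_matches)  # True
```

This matters most for money and other reported figures, where a stray 0.30000000000000004 is an error rather than a rounding detail.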
journalism is intended to fill a very particular programming niche: that of non-professional data analysts who need to get shit done quickly. These are the principles of its development:
- Humans have less time than computers. Always optimize for humans.
- Most datasets are simple and small. Never optimize for quants.
- Text is data. It must be a first-class citizen.
- Python gets it right. Make it work like Python does.
- Humans are busy, stupid, lazy, etc. It must be easy.
- Mutability is confusion. Processes that alter data must create new copies.
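The last principle, that processes which alter data must create new copies, can be sketched in plain Python. The where and order_by helpers below are hypothetical stand-ins written for this illustration, not journalism's implementation, and the rows are invented.

```python
from decimal import Decimal

# Rows are tuples, so they cannot be changed in place.
rows = (
    ('Alpha', Decimal('10')),
    ('Beta', Decimal('25')),
    ('Gamma', Decimal('5')),
)


def where(rows, test):
    """Return a new tuple of the rows that pass `test`; the input is untouched."""
    return tuple(row for row in rows if test(row))


def order_by(rows, key, reverse=False):
    """Return a new, sorted tuple of rows; the input keeps its original order."""
    return tuple(sorted(rows, key=key, reverse=reverse))


big = where(rows, lambda r: r[1] >= 10)
ranked = order_by(big, key=lambda r: r[1], reverse=True)

print(len(rows))     # 3: the original still holds every row
print(ranked[0][0])  # Beta
```

Because every step returns a new value, any intermediate result can be inspected in the shell without worrying that a later step changed it.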
But why not...
- numpy: It’s hard.
- pandas: It’s hard.
- R: Don’t even get me started.
- SAS: You have that kind of money?
- SQL: It’s not code.
- An ORM: Have you actually tried this?
I’m not reinventing the wheel, I’m just putting on the right size tires.
If you only want to use journalism, install it this way:
pip install journalism
Need more speed? If you’re running Python 2.6, 2.7 or 3.2, you can pip install cdecimal for a significant speed boost. This isn’t installed automatically because it can create additional complications.
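A common way to make an accelerator like this optional is a guarded import. The snippet below is a minimal sketch of that pattern and an assumption about how such a fallback could be wired up; it is not taken from journalism's source. On Python 3.3 and later the standard decimal module already ships with a C implementation, so the fallback branch is the one that runs there.

```python
# Prefer the C implementation when it is installed, otherwise fall back
# to the decimal module from the standard library.
try:
    from cdecimal import Decimal
except ImportError:
    from decimal import Decimal

total = Decimal('1.10') + Decimal('2.20')
print(total)  # 3.30
```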
If you are a developer who also wants to hack on journalism, install it this way:
    git clone git://github.com/onyxfish/journalism.git
    cd journalism
    mkvirtualenv --no-site-packages journalism
    pip install -r requirements.txt
    python setup.py develop
    nosetests --with-coverage --cover-package=journalism
Here is an example of how to use journalism, using financial aid data from data.gov:
    #!/usr/bin/env python

    import csv

    from journalism import Table, TextType, NumberType

    text_type = TextType()
    number_type = NumberType()

    COLUMNS = (
        ('state', text_type),
        ('state_abbr', text_type),
        ('9_11_gi_bill1', number_type),
        ('montogomery_gi_bill_active', number_type),
        ('montgomery_gi_bill_reserve', number_type),
        ('dependants', number_type),
        ('reserve', number_type),
        ('vietnam', number_type),
        ('total', number_type)
    )

    # Split the (name, type) pairs into parallel tuples
    COLUMN_NAMES = tuple(c[0] for c in COLUMNS)
    COLUMN_TYPES = tuple(c[1] for c in COLUMNS)

    with open('examples/realdata/Datagov_FY10_EDU_recp_by_State.csv') as f:
        # Skip headers
        next(f)
        next(f)
        next(f)

        rows = list(csv.reader(f))

    # Trim cruft off end
    rows = rows[:-2]

    # Create the table
    table = Table(rows, COLUMN_TYPES, COLUMN_NAMES)

    # Remove Philippines and Puerto Rico
    states = table.where(lambda r: r['state_abbr'] not in ('PR', 'PH'))

    # Sum total of all states
    print('Total of all states: %i' % states.columns['total'].sum())

    # Sort state total, descending
    order_by_total_desc = states.order_by('total', reverse=True)

    # Grab just the top 5 states
    top_five = order_by_total_desc.rows[:5]

    for i, row in enumerate(top_five):
        print('# %i: %s %i' % (i, row['state'], row['total']))

    with open('sorted.csv', 'w') as f:
        writer = csv.writer(f)

        writer.writerow(order_by_total_desc.get_column_names())
        writer.writerows(order_by_total_desc.rows)

    # Grab just the bottom state
    last_place = order_by_total_desc.rows[-1]

    print('Lowest state: %(state)s %(total)i' % last_place)

    # Calculate the standard deviation for the state totals
    stdev = states.columns['total'].stdev()

    print('Standard deviation of totals: %.2f' % stdev)
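The loading steps in the example (skip three header lines, read the rest as CSV, drop two trailing lines) can be tried without the data.gov file. The sketch below uses only the standard library and an invented in-memory stand-in for the file; the names and numbers are made up for illustration.

```python
import csv
import io
from decimal import Decimal

# Invented stand-in for the real file: three header lines, data rows,
# then two lines of trailing notes.
RAW = """Report title
Fiscal year
state,state_abbr,total
Alpha,AA,100
Beta,BB,250
Gamma,GG,50
Source: made up
End of report
"""

f = io.StringIO(RAW)

# Skip the three header lines, as the example above does.
next(f)
next(f)
next(f)

rows = list(csv.reader(f))

# Trim the two trailing non-data lines.
rows = rows[:-2]

# Cast the numeric column to Decimal by hand. In the real example the
# column types passed to Table appear to take care of this step.
totals = [Decimal(row[2]) for row in rows]

print(len(rows))    # 3
print(sum(totals))  # 400
```

If a file has a different number of header or footer lines, only the count of next(f) calls and the slice need to change.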
Need some more specific examples? Try these out:
- The basics
- Modifying data
- Emulating SQL
- Emulating Excel
- Emulating R
- Emulating Underscore.js
- Plotting with matplotlib
- Plotting with pygal
Want to hack on journalism? Here’s how:
The MIT License
Copyright (c) 2014 Christopher Groskopf and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Changelog
- DateType.date_format implemented. (#112)
- Create ColumnType classes to simplify data parsing.
- DateColumn implemented. (#7)
- Cookbook: Excel pivot tables. (#41)
- Cookbook: statistics, including outlier detection. (#82)
- Cookbook: emulating Underscore’s any and all. (#107)
- Documentation for method parameters. (#108)
- Table.rank now accepts a column name or key function.
- Optionally use cdecimal for improved performance. (#106)
- Smart naming of aggregate columns.
- Duplicate column names are now an error. (#92)
- BooleanColumn implemented. (#6)
- TextColumn.max_length implemented. (#95)
- Table.find implemented. (#14)
- Better error handling in Table.__init__. (#38)
- Collapse IntColumn and FloatColumn into NumberColumn. (#64)
- Table.mad_outliers implemented. (#93)
- Column.mad implemented. (#93)
- Table.stdev_outliers implemented. (#86)
- Table.group_by implemented. (#3)
- Cookbook: emulating R. (#81)
- Table.left_outer_join now accepts column names or key functions. (#80)
- Table.inner_join now accepts column names or key functions. (#80)
- Table.distinct now accepts a column name or key function. (#80)
- Table.order_by now accepts a column name or key function. (#80)
- Table.rank implemented. (#15)
- Reached 100% test coverage. (#76)
- Tests for Column._cast methods. (#20)
- Table.distinct implemented. (#83)
- Use assertSequenceEqual in tests. (#84)
- Docs: features section. (#87)
- Cookbook: emulating SQL. (#79)
- Table.left_outer_join implemented. (#11)
- Table.inner_join implemented. (#11)
- Python 3.2, 3.3 and 3.4 support. (#52)
- Documented supported platforms.
- Cookbook: csvkit. (#36)
- Cookbook: glob syntax. (#28)
- Cookbook: filter to values in range. (#30)
- RowDoesNotExistError implemented. (#70)
- ColumnDoesNotExistError implemented. (#71)
- Cookbook: percent change. (#67)
- Cookbook: sampling. (#59)
- Cookbook: random sort order. (#68)
- Eliminate Table.get_data.
- Use tuples everywhere. (#66)
- Fixes for Python 2.6 compatibility. (#53)
- Cookbook: multi-column sorting. (#13)
- Cookbook: simple sorting.
- Destructive Table ops now deepcopy row data. (#63)
- Non-destructive Table ops now share row data. (#63)
- Table.sort_by now accepts a function. (#65)
- Cookbook: pygal.
- Cookbook: Matplotlib.
- Cookbook: VLOOKUP. (#40)
- Cookbook: Excel formulas. (#44)
- Cookbook: Rounding to two decimal places. (#49)
- Better repr for Column and Row. (#56)
- Cookbook: Filter by regex. (#27)
- Cookbook: Underscore filter & reject. (#57)
- Table.limit implemented. (#58)
- Cookbook: writing a CSV. (#51)
- Kill Table.filter and Table.reject. (#55)
- Column.map removed. (#43)
- Column instance & data caching implemented. (#42)
- Table.select implemented. (#32)
- Eliminate repeated column index lookups. (#25)
- Precise DecimalColumn tests.
- Use Decimal type everywhere internally.
- FloatColumn converted to DecimalColumn. (#17)
- Added Eric Sagara to AUTHORS. (#48)
- NumberColumn.variance implemented. (#1)
- Cookbook: loading a CSV. (#37)
- Table.percent_change implemented. (#16)
- Table.compute implemented. (#31)
- Table.filter and Table.reject now take funcs. (#24)
- Column.count implemented. (#12)
- Column.counts implemented. (#8)
- Column.all implemented. (#5)
- Column.any implemented. (#4)
- Added Jeff Larson to AUTHORS. (#18)
- NumberColumn.mode implemented. (#18)
- Initial prototype