dstats
About
There are many applications available for simple statistical analysis of data,
but none that I am aware of provides an easy-to-use command line interface
that is compatible with a standard shell toolchain. On one end, data files need
to be munged or transformed before they can be imported into an application
such as R or Excel; just prepping the data for import can be time consuming. On
the other end, many developers spend precious time writing throwaway scripts,
sometimes leveraging statistics libraries and sometimes not. The goal of this
project is to provide one or more command line tools that make it easier to
obtain descriptive statistics. For example, I would like to be able to issue a
command such as
zcat somefile.dat | cut -f3 -d',' | dstats -t sd
to get the third column of a compressed CSV datafile and compute the standard
deviation of that dataset.
This project is in a pre-alpha stage, currently taking high-level concepts and
munging some code around to try things out. Expect obvious bugs, missing input
validation, and other issues at this time.
Design Goals
- Correctness: no matter what, the correct answers must be computed.
- Large values/dataset size: arbitrarily large input values and dataset sizes should be supported.
- When possible, use streaming computation: fast, low memory footprint.
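The streaming goal can be illustrated with a single-pass update that keeps only a handful of numbers in memory. The following is a minimal sketch using Welford's method, not the actual dstats implementation; the class name RunningStats and the choice of sample variance (n - 1 denominator) are assumptions made for illustration.

```python
import math


class RunningStats:
    """Single-pass summary statistics (hypothetical helper, not dstats itself)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one value into the running statistics in O(1) time and memory."""
        self.count += 1
        self.total += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        # Welford's update: adjust the mean, then accumulate the squared
        # deviation using both the old and the new mean.
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Assumption: sample variance (n - 1). Whether dstats reports sample
        # or population variance is not stated in this README.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    def sd(self):
        return math.sqrt(self.variance())
```

Feeding each input line through push() never stores the dataset, so memory use stays constant no matter how many values arrive. Plain floats are used here, so this sketch does not address the arbitrarily-large-values goal.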
What
For now there is a simple command-line tool named dstats, which computes a simple set of summary statistics for a given input dataset.
dstats [-utqv?] [file ...]
If file argument(s) are not given, input from stdin is assumed. By default, the following datapoints are computed:
- count
- sum
- min
- max
- mean
- variance
- standard deviation
The -t T option controls which datapoints are (possibly) computed and displayed, where T is a comma-separated list of the datapoints of interest. For example, the default setting is -t count,sum,min,max,mean,var,sd.
By default, the statistic values are displayed as label=value pairs. For example:
count=44 sum=2386.00 min=3.00 max=100.00 mean=54.23 var=722.60 sd=26.88
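For post-processing, the label=value form is straightforward to split back into fields. The sketch below assumes the pairs are separated by whitespace on a single line, as in the example above; parse_stats_line is a hypothetical helper and not part of dstats.

```python
def parse_stats_line(line):
    """Turn a line such as 'count=44 sum=2386.00' into a dict of floats.

    Assumes whitespace-separated label=value pairs, as in the README example.
    """
    stats = {}
    for field in line.split():
        label, _, value = field.partition("=")
        stats[label] = float(value)
    return stats
```

Calling parse_stats_line on the example line gives a dictionary keyed by count, sum, min, max, mean, var and sd.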
The -q option provides 'quieter' output, suppressing the label for each datapoint. This is less convenient for human consumption, but useful for post-processing.
The -u N option provides updated feedback, printing intermediate statistics values after every N lines of input processed.
FAQ
1. Is it fast?
Computing the statistics for a file containing 10M random integers takes
approximately 5 seconds on a 2.8GHz Core 2 Duo iMac. YMMV.
2. What about [median|quantile|quartiles]?
In a stream processing mode with no residual memory use, computing exact values for these datapoints is impossible. In straightforward implementations, exact values require runtime access to a large portion of the dataset, if not all of it.
Some thought has been put into providing a second command line tool along the lines of fivenum that can provide inexact or exact values, depending on the input dataset size and the computing resources available.
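One way such a tool could trade exactness for memory is reservoir sampling: keep every value while the input fits within a fixed budget, and fall back to a uniform random sample once it does not. This is only a sketch of that idea, not a planned design; the class name QuantileSketch, the default capacity, and the linear-interpolation rule are all assumptions.

```python
import random


class QuantileSketch:
    """Exact quantiles for small inputs, sampled estimates for large ones."""

    def __init__(self, capacity=10000, seed=None):
        self.capacity = capacity  # maximum number of values kept in memory
        self.seen = 0             # total number of values pushed so far
        self.sample = []
        self._rng = random.Random(seed)

    def push(self, x):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Reservoir sampling (Algorithm R): the new value replaces a
            # random slot with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    @property
    def exact(self):
        """True while every input value is still held in memory."""
        return self.seen <= self.capacity

    def quantile(self, q):
        """Linearly interpolated quantile of the retained values, 0 <= q <= 1."""
        if not self.sample:
            raise ValueError("no data")
        data = sorted(self.sample)
        pos = q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)
```

While exact is true, quantile(0.5) is the true median under this interpolation rule; afterwards it is an estimate whose quality depends on the capacity chosen.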
To Do
- Proper input validation
- Unit tests
- Man page
- GMP