dstats


About

Many applications are available for simple statistical analysis of data, but none that I am aware of provides an easy-to-use command-line interface that is compatible with a standard shell toolchain. On one end, data files need to be munged or transformed before they can be imported into an application such as R or Excel, and just preparing the data for import can be time-consuming. On the other end, many developers spend precious time writing throwaway scripts, sometimes leveraging statistics libraries and sometimes not. The goal of this project is to provide one or more command-line tools that make descriptive statistics easier to obtain. For example, I would like to be able to issue a command such as zcat somefile.dat | cut -f3 -d',' | dstats -t sd to extract the third column of a compressed CSV data file and compute the standard deviation of that column.

This project is in a pre-alpha stage: high-level concepts are still being worked out, and code is being moved around to try things out. Expect obvious bugs, missing input validation, and other issues at this time.

Design Goals

Correctness: no matter what, the correct answers must be computed.
Large values/dataset size: arbitrarily large input values and dataset sizes should be supported.
Streaming computation: when possible, compute statistics in a streaming fashion, which is fast and has a low memory footprint.
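
To make the streaming goal concrete, here is a minimal Python sketch of single-pass summary statistics using Welford's update. It is an illustration only and is not taken from the dstats source: the class name RunningStats is hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since this page does not say which definition dstats uses.

```python
import math


class RunningStats:
    """Single-pass summary statistics; memory use does not grow with input size."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None
        self._mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, x):
        """Fold one value into the running statistics (Welford's update)."""
        self.count += 1
        self.total += x
        self.min = x if self.min is None else min(self.min, x)
        self.max = x if self.max is None else max(self.max, x)
        delta = x - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (x - self._mean)

    @property
    def mean(self):
        return self._mean

    @property
    def variance(self):
        # Assumption: sample variance with an n - 1 denominator.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def sd(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
```

Each value is seen once and then discarded, so the footprint stays constant regardless of dataset size. Python integers are arbitrary precision, so the running sum of integer input does not overflow; the mean and variance are still floating-point values and carry rounding error.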

What

For now there is a simple command-line tool named dstats, which computes a simple set of summary statistics for a given input dataset.

dstats [-utqv?] [file ...]

If file argument(s) are not given, input from stdin is assumed. By default, the following datapoints are computed:

count
sum
min
max
mean
variance
standard deviation

The -t T option controls which datapoints are displayed (and, where possible, which are computed). The value T is a comma-separated list of the datapoints of interest. For example, the default setting is equivalent to -t count,sum,min,max,mean,var,sd.
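
As a rough illustration of how such a comma-separated list could be interpreted, the Python sketch below splits the list and rejects unknown names. The function parse_datapoints and its error behavior are hypothetical; only the seven datapoint names come from the default setting shown above.

```python
# Datapoint names taken from the default -t setting documented above.
KNOWN = ("count", "sum", "min", "max", "mean", "var", "sd")


def parse_datapoints(spec):
    """Split a comma-separated datapoint list, preserving the requested order."""
    names = [name.strip() for name in spec.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN]
    if unknown:
        # Hypothetical behavior: the real tool may handle unknown names differently.
        raise ValueError("unknown datapoint(s): " + ", ".join(unknown))
    return names


selected = parse_datapoints("mean,sd")
```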

By default, the statistic values are displayed as label=value pairs. For example:

count=44 sum=2386.00 min=3.00 max=100.00 mean=54.23 var=722.60 sd=26.88
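
Output in this form is easy to consume from a script. The Python sketch below parses the sample line above into a dictionary; the function name parse_stats is hypothetical, and the sketch assumes every value is numeric and that pairs are separated by whitespace, as in the sample.

```python
def parse_stats(line):
    """Turn whitespace-separated label=value pairs into a dict of floats."""
    stats = {}
    for pair in line.split():
        label, _, value = pair.partition("=")
        stats[label] = float(value)
    return stats


sample = "count=44 sum=2386.00 min=3.00 max=100.00 mean=54.23 var=722.60 sd=26.88"
stats = parse_stats(sample)
```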

The -q option provides 'quieter' output, suppressing the label for each datapoint. This is less convenient for human consumption but useful for post-processing.

The -u N option provides updated feedback, printing intermediate statistics values after every N lines of input processed.
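
The idea of periodic updates can be sketched in Python as a generator that reports the running mean after every N input lines. This is not the dstats implementation: the function running_means is hypothetical, it tracks only the mean for brevity, and emitting a final report for a trailing partial batch is an assumption.

```python
def running_means(lines, every):
    """Yield (lines_seen, mean_so_far) after every `every` lines, plus a final report."""
    count = 0
    total = 0.0
    for line in lines:
        total += float(line)  # float() tolerates surrounding whitespace and newlines
        count += 1
        if count % every == 0:
            yield count, total / count
    # Assumption: report once more if the input ended between updates.
    if count and count % every != 0:
        yield count, total / count


updates = list(running_means(["1", "2", "3", "4", "5"], every=2))
```

Because the generator consumes its input lazily, it can be fed sys.stdin directly and will produce updates while data is still arriving.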

FAQ

1. Is it fast?

Computing the statistics for a file containing 10M random integers takes approximately 5 seconds on a 2.8 GHz Core 2 Duo iMac. YMMV.

2. What about [median|quantile|quartiles]?

In a stream-processing mode with no residual memory use, computing exact values for these datapoints is impossible. In trivial implementations, exact values require runtime access to a large portion of the dataset, if not all of it.

Some thought has been put into providing a second command-line tool along the lines of fivenum that can provide inexact or exact values, depending on the input dataset size and the computing resources available.
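
One way the exact/inexact trade-off could work is sketched below in Python: keep every value while the input fits within a memory budget, and fall back to a uniform reservoir sample (Algorithm R) once it does not. This is only a sketch of the idea, not a design for the proposed tool; the function median_estimate and its max_keep and seed parameters are hypothetical.

```python
import random
import statistics


def median_estimate(values, max_keep=10_000, seed=0):
    """Return (median, exact).

    The median is exact while the input fits in max_keep values; beyond that
    it is an estimate computed from a uniform random sample of the input.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for x in values:
        seen += 1
        if len(reservoir) < max_keep:
            reservoir.append(x)
        else:
            # Algorithm R: the new value replaces a random slot
            # with probability max_keep / seen.
            j = rng.randrange(seen)
            if j < max_keep:
                reservoir[j] = x
    if not reservoir:
        raise ValueError("no input values")
    return statistics.median(reservoir), seen <= max_keep


small_median, small_exact = median_estimate([5, 1, 3])
large_median, large_exact = median_estimate(range(1000), max_keep=100)
```

Memory use is bounded by max_keep values regardless of input size, at the cost of an approximate answer for large inputs.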

To Do

Proper input validation
Unit tests
Man page
GMP