For example, run

zcat somefile.dat | cut -f3 -d',' | dstats -t sd

to get the third column of a compressed CSV datafile and compute the standard deviation of that dataset.
This project is in a pre-alpha stage, currently taking high-level concepts and munging code around to try things out. Expect obvious bugs, missing input validation, and other issues at this time.
The first tool is dstats, which computes a simple set of summary statistics for a given input dataset.
dstats [-utqv?] [file ...]
If no file arguments are given, input is read from stdin. By default, the following datapoints are computed: count, sum, minimum, maximum, mean, variance, and standard deviation.
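For instance, these two invocations should be equivalent (somefile.dat stands in for any data file):

dstats somefile.dat
dstats < somefile.dat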
The -t T option controls which datapoints are computed and displayed, where T is a comma-separated list of the datapoints of interest. For example, the default setting is equivalent to -t count,sum,min,max,mean,var,sd.
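For example, to report only the mean and standard deviation of the integers 1 through 100 (the output shown is illustrative, assuming sample statistics):

seq 1 100 | dstats -t mean,sd
mean=50.50 sd=29.01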
By default, the statistic values are displayed as label=value pairs. For example:

count=44 sum=2386.00 min=3.00 max=100.00 mean=54.23 var=722.60 sd=26.88
The -q option is used to provide 'quieter' output, suppressing the label for each datapoint. This may not be as convenient for human consumption, but is useful for post-processing.
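For instance, to capture just the mean in a shell variable for later use (assuming -q prints bare values in the requested order):

mean=$(dstats -q -t mean somefile.dat)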
The -u N option provides updated feedback, printing intermediate statistic values after every N lines of input are processed.
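For example, to watch the statistics converge over a million-line stream (the line counts here are arbitrary illustrative values):

seq 1 1000000 | dstats -u 250000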
1. Is it fast?
Computing the statistics for a file containing 10M random integers takes approximately 5 seconds on a 2.8GHz Core 2 Duo iMac. YMMV.
2. What about [median|quantile|quartiles]?
In a stream-processing mode with no residual memory use, computing exact values for these datapoints is impossible: a trivial implementation requires runtime access to a large portion of, if not the entire, dataset.
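To illustrate the asymmetry, a running mean and standard deviation can be maintained in constant memory with a Welford-style update (an awk sketch, not necessarily how dstats implements it):

# one pass, O(1) memory: n (count), m (running mean), s (sum of squared deviations)
awk '{ n++; d = $1 - m; m += d / n; s += d * ($1 - m) }
END { if (n > 1) print "mean=" m, "sd=" sqrt(s / (n - 1)) }' somefile.dat

An exact median, by contrast, has no constant-memory update; the entire dataset must be held (or sorted) before the middle element is known:

# sort buffers everything, and the awk array keeps every value in memory
sort -n somefile.dat | awk '{ a[NR] = $1 }
END { m = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2; print "median=" m }'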
Some thought has been put into providing a second command-line tool along the lines of fivenum that can provide exact or inexact values, depending on the input dataset size and the computing resources available.