8 Statistics
This chapter describes the statistical functions provided by the Science Collection. The basic statistical functions include functions to compute the mean, variance, and standard deviation, More advanced functions allow you to calculate absolute deviation, skewness, and kurtosis, as well as the median and arbitrary percentiles. The algorithms use recurrance relations to compute average quantities in a stable way, without large intermediate values that might overflow.
The functions described in this chapter are defined in the "statistics.rkt" file in the Science Collection and are made available using the form:
(require (planet williams/science/statistics)) |
8.1 Running Statistics
A running statistics object accumulates a minimal set of statistics (n, min, max, mean, variance, and standard deviation) for a set of data values. A running statistics object does not require that a sequence (e.g., list or vector) of the data value be maintained.
procedure
(statistics? x) → boolean?
x : any/c
procedure
procedure
(statistics-reset! s) → void?
s : statistics?
procedure
(statistics-tally! s x) → void?
s : statistics? x : real?
procedure
s : statistics?
procedure
(statistics-min s) → real?
s : statistics?
procedure
(statistics-max s) → real?
s : statistics?
procedure
(statistics-mean s) → real?
s : statistics?
procedure
(statistics-variance s) → real?
s : statistics?
procedure
s : statistics?
8.2 Running Statistics Example
This example generated 100 random numbers between 0.0 and 10.0 and maintains running statistics on the values.
#lang racket (require (planet williams/science/statistics) (planet williams/science/random-distributions)) (define (test) (let ((stat (make-statistics))) (for ((i (in-range 100))) (statistics-tally! stat (random-flat 0.0 10.0))) (printf "Running Statistics Example~n") (printf " n = ~a~n" (statistics-n stat)) (printf " min = ~a~n" (statistics-min stat)) (printf " max = ~a~n" (statistics-max stat)) (printf " mean = ~a~n" (statistics-mean stat)) (printf " variance = ~a~n" (statistics-variance stat)) (printf "standard deviation = ~a~n" (statistics-standard-deviation stat)))) (test)
Produces the following results.
Running Statistics Example |
n = 100 |
min = 0.11100957474903939 |
max = 9.938914540059452 |
mean = 5.466640451797567 |
variance = 8.677003172428925 |
standard deviation = 2.945675333846031 |
8.3 Mean, Standard Deviation, and Variance
procedure
data : sequence-of-real? (unchecked-mean data) → real? data : sequence-of-real?
procedure
(mean-and-variance data) →
real? (>=/c 0.0) data : sequence-of-real?
(unchecked-mean-and-variance data) →
real? (>=/c 0.0) data : sequence-of-real?
procedure
data : sequence-of-real? mean : real? (unchecked-variance data mean) → (>=/c 0.0) data : sequence-of-real? mean : real? (variance data) → (>=/c 0.0) data : sequence-of-real? (unchecked-variance data) → (>=/c 0.0) data : sequence-of-real?
procedure
(standard-deviation data mean) → (>=/c 0.0)
data : sequence-of-real? mean : real? (unchecked-standard-deviation data mean) → (>=/c 0.0) data : sequence-of-real? mean : real? (standard-deviation data) → (>=/c 0.0) data : sequence-of-real? (unchecked-standard-deviation data) → (>=/c 0.0) data : sequence-of-real?
procedure
(sum-of-squares data mean) → (>=/c 0.0)
data : sequence-of-real? mean : real? (unchecked-sum-of-squares data mean) → (>=/c 0.0) data : sequence-of-real? mean : real? (sum-of-squares data) → (>=/c 0.0) data : sequence-of-real? (unchecked-sum-of-squares data) → (>=/c 0.0) data : sequence-of-real?
procedure
(variance-with-fixed-mean data mean) → (>=/c 0.0)
data : sequence-of-real? mean : real?
(unchecked-variance-with-fixed-mean data mean) → (>=/c 0.0) data : sequence-of-real? mean : real?
procedure
(standard-deviation-with-fixed-mean data mean) → (>=/c 0.0) data : sequence-of-real? mean : real?
(unchecked-standard-deviation-with-fixed-mean data mean) → (>=/c 0.0) data : sequence-of-real? mean : real?
8.4 Absolute Deviation
procedure
(absolute-deviation data mean) → (>=/c 0.0)
data : sequence-of-real? mean : real? (unchecked-absolute-deviation data mean) → (>=/c 0.0) data : sequence-of-real? mean : real? (absolute-deviation data) → (>=/c 0.0) data : sequence-of-real? (unchecked-absolute-deviation data) → (>=/c 0.0) data : sequence-of-real?
8.5 Higher Moments (Skewness and Kurtosis)
procedure
data : sequence-of-real? mean : real? sd : (>=/c 0.0) (unchecked-skew data mean sd) → real? data : sequence-of-real? mean : real? sd : (>=/c 0.0) (skew data) → real? data : sequence-of-real? (unchecked-skew data) → real? data : sequence-of-real?
procedure
data : sequence-of-real? mean : real? sd : (>=/c 0.0) (unchecked-kurtosis data mean sd) → real? data : sequence-of-real? mean : real? sd : (>=/c 0.0) (kurtosis data) → real? data : sequence-of-real? (unchecked-kurtosis data) → real? data : sequence-of-real?
8.6 Autocorrelation
procedure
(lag-1-autocorrelation data mean) → real?
data : nonempty-sequence-of-real? mean : real? (unchecked-lag-1-autocorrelation data mean) → real? data : nonempty-sequence-of-real? mean : real? (lag-1-autocorrelation data) → real? data : nonempty-sequence-of-real? (unchecked-lag-1-autocorrelation data) → real? data : nonempty-sequence-of-real?
8.7 Covariance
procedure
(covariance data1 data2 mean1 mean2) → real?
data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real? mean1 : real? mean2 : real?
(unchecked-covariance data1 data2 mean1 mean2) → real? data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real? mean1 : real? mean2 : real? (covariance data1 data2) → real? data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real? (unchecked-covariance data1 data2) → real? data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real?
8.8 Correlation
procedure
(correlation data1 data2) → real?
data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real? (unchecked-correlation data1 data2) → real? data1 : nonempty-sequence-of-real? data2 : nonempty-sequence-of-real?
8.9 Weighted Samples
procedure
(weighted-mean weights data) → real?
weights : sequence-of-real? data : sequence-of-real? (unchecked-weighted-mean weights data) → real? weights : sequence-of-real? data : sequence-of-real?
procedure
(weighted-variance weights data wmean) → (>=/c 0.0)
weights : sequence-of-real? data : sequence-of-real? wmean : real?
(unchecked-weighted-variance weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real? (weighted-variance weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? (unchecked-weighted-variance weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
procedure
(weighted-standard-deviation weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real?
(unchecked-weighted-standard-deviation weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real? (weighted-standard-deviation weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
(unchecked-weighted-standard-deviation weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
procedure
(weighted-variance-with-fixed-mean weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-reals? wmean : real?
(unchecked-weighted-variance-with-fixed-mean weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-reals? wmean : real?
procedure
(weighted-standard-deviation-with-fixed-mean weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real?
(unchecked-weighted-standard-deviation-with-fixed-mean weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real?
procedure
(weighted-absolute-deviation weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real?
(unchecked-weighted-absolute-deviation weights data wmean) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real? (weighted-absolute-deviation weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
(unchecked-weighted-absolute-deviation weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
procedure
(weighted-skew weights data wmean wsd) → (>=/c 0.0)
weights : sequence-of-real? data : sequence-of-real? wmean : real? wsd : (>=/c 0.0)
(unchecked-weighted-skew weights data wmean wsd) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real? wsd : (>=/c 0.0) (weighted-skew weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? (unchecked-weighted-skew weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
procedure
(weighted-kurtosis weights data wmean wsd) → (>=/c 0.0)
weights : sequence-of-real? data : sequence-of-real? wmean : real? wsd : (>=/c 0.0)
(unchecked-weighted-kurtosis weights data wmean wsd) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? wmean : real? wsd : (>=/c 0.0) (weighted-kurtosis weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real? (unchecked-weighted-kurtosis weights data) → (>=/c 0.0) weights : sequence-of-real? data : sequence-of-real?
8.10 Maximum and Minimum
procedure
data : nonempty-sequence-of-real? (unchecked-maximum data) → real? data : nonempty-sequence-of-real?
procedure
data : nonempty-sequence-of-real? (unchecked-minimum data) → real? data : nonempty-sequence-of-real?
procedure
(minimum-maximum data) →
real? real? data : nonempty-sequence-of-real?
(unchecked-minimum-maximum data) →
real? real? data : nonempty-sequence-of-real?
procedure
(maximum-index data) → exact-nonnegative-integer?
data : nonempty-sequence-of-real? (unchecked-maximum-index data) → exact-nonnegative-integer? data : nonempty-sequence-of-real?
procedure
(minimum-index data) → exact-nonnegative-integer?
data : nonempty-sequence-of-real? (unchecked-minimum-index data) → exact-nonnegative-integer? data : nonempty-sequence-of-real?
procedure
(minimum-maximum-index data) →
exact-nonnegative-ineger? exact-nonnegative-integer? data : nonempty-sequence-of-real? (unchecked-minimum-maximum-index data)
→
exact-nonnegative-ineger? exact-nonnegative-integer? data : nonempty-sequence-of-real?
8.11 Median and Quantiles
Thw median and quantile functions described in this section operate on sorted data. The contracts for these functions enforce this. Also, for convenience we use quantiles measured on a scale of 0 to 1 instead of percentiles, which use a scale of 0 to 100).
procedure
(median-from-sorted-data sorted-data) → real?
sorted-data : nonempty-sorted-vector-of-real? (unchecked-median-from-sorted-data sorted-data) → real? sorted-data : nonempty-sorted-vector-of-real?
procedure
(quantile-from-sorted-data sorted-data f) → real?
sorted-data : nonempty-sorted-vector-of-real? f : (real-in 0.0 1.0)
(unchecked-quantile-from-sorted-data sorted-data f) → real? sorted-data : nonempty-sorted-vector-of-real? f : (real-in 0.0 1.0)
The quantile is found by interpolation using the formula:
quantile = 1 - delta(x[i]) + delta(x(i + 1))
where i is floor((n - 1) × f) and delta is (n - 1) × f - 1.
8.12 Statistics Example
This example generates two vectors from a unit Gaussian distribution and a vector of conse squared weighting data. All of the vectors are of length 1,000. Thes data are used to test all of the statistics functions.
#lang racket (require (planet williams/science/random-distributions/gaussian) (planet williams/science/statistics) (planet williams/science/math)) (define (naive-sort! data) (let loop () (let ((n (vector-length data)) (sorted? #t)) (do ((i 1 (+ i 1))) ((= i n) data) (when (< (vector-ref data i) (vector-ref data (- i 1))) (let ((t (vector-ref data i))) (vector-set! data i (vector-ref data (- i 1))) (vector-set! data (- i 1) t) (set! sorted? #f)))) (unless sorted? (loop))))) (let ((data1 (make-vector 1000)) (data2 (make-vector 1000)) (w (make-vector 1000))) (for ((i (in-range 1000))) (vector-set! data1 i (random-unit-gaussian)) (vector-set! data2 i (random-unit-gaussian)) (vector-set! w i (expt (cos (- (* 2.0 pi (/ i 1000.0)) pi)) 2))) (printf "Statistics Example~n") (printf " mean = ~a~n" (mean data1)) (printf " variance = ~a~n" (variance data1)) (printf " standard deviation = ~a~n" (standard-deviation data1)) (printf " variance from 0.0 = ~a~n" (variance-with-fixed-mean data1 0.0)) (printf " standard deviation from 0.0 = ~a~n" (standard-deviation-with-fixed-mean data1 0.0)) (printf " absolute deviation = ~a~n" (absolute-deviation data1)) (printf " absolute deviation from 0.0 = ~a~n" (absolute-deviation data1 0.0)) (printf " skew = ~a~n" (skew data1)) (printf " kurtosis = ~a~n" (kurtosis data1)) (printf " lag-1 autocorrelation = ~a~n" (lag-1-autocorrelation data1)) (printf " covariance = ~a~n" (covariance data1 data2)) (printf " weighted mean = ~a~n" (weighted-mean w data1)) (printf " weighted variance = ~a~n" (weighted-variance w data1)) (printf " weighted standard deviation = ~a~n" (weighted-standard-deviation w data1)) (printf " weighted variance from 0.0 = ~a~n" (weighted-variance-with-fixed-mean w data1 0.0)) (printf "weighted standard deviation from 0.0 = ~a~n" (weighted-standard-deviation-with-fixed-mean w data1 0.0)) (printf " weighted absolute deviation = ~a~n" (weighted-absolute-deviation w data1)) (printf "weighted absolute deviation from 0.0 = ~a~n" (weighted-absolute-deviation w data1 0.0)) (printf " weighted skew = ~a~n" (weighted-skew w data1)) (printf " weighted kurtosis = ~a~n" (weighted-kurtosis w data1)) (printf " maximum = ~a~n" (maximum data1)) (printf " minimum = ~a~n" (minimum data1)) (printf " index of maximum value = ~a~n" (maximum-index data1)) (printf " index of minimum value = ~a~n" (minimum-index data1)) (naive-sort! data1) (printf " median = ~a~n" (median-from-sorted-data data1)) (printf " 10% quantile = ~a~n" (quantile-from-sorted-data data1 0.1)) (printf " 20% quantile = ~a~n" (quantile-from-sorted-data data1 0.2)) (printf " 30% quantile = ~a~n" (quantile-from-sorted-data data1 0.3)) (printf " 40% quantile = ~a~n" (quantile-from-sorted-data data1 0.4)) (printf " 50% quantile = ~a~n" (quantile-from-sorted-data data1 0.5)) (printf " 60% quantile = ~a~n" (quantile-from-sorted-data data1 0.6)) (printf " 70% quantile = ~a~n" (quantile-from-sorted-data data1 0.7)) (printf " 80% quantile = ~a~n" (quantile-from-sorted-data data1 0.8)) (printf " 90% quantile = ~a~n" (quantile-from-sorted-data data1 0.9)))
Produces the following output:
Statistics Example |
mean = 0.03457693091555611 |
variance = 1.0285343857083435 |
standard deviation = 1.0141668431320083 |
variance from 0.0 = 1.028701415474174 |
standard deviation from 0.0 = 1.014249188056946 |
absolute deviation = 0.7987180852601665 |
absolute deviation from 0.0 = 0.7987898146946209 |
skew = 0.04340293467117837 |
kurtosis = 0.17722452271702993 |
lag-1 autocorrelation = 0.0029930889831972143 |
covariance = 0.005782911085590894 |
weighted mean = 0.05096139259270008 |
weighted variance = 1.0500293763787367 |
weighted standard deviation = 1.0247094107007786 |
weighted variance from 0.0 = 1.0510513958491579 |
weighted standard deviation from 0.0 = 1.0252079768755011 |
weighted absolute deviation = 0.8054378524718832 |
weighted absolute deviation from 0.0 = 0.8052440544958938 |
weighted skew = 0.046448729539282155 |
weighted kurtosis = 0.3050060704791675 |
maximum = 3.731148814104969 |
minimum = -3.327265864298485 |
index of maximum value = 502 |
index of minimum value = 476 |
median = 0.019281803306206644 |
10% quantile = -1.243869878615807 |
20% quantile = -0.7816243947573505 |
30% quantile = -0.4708703241429585 |
40% quantile = -0.2299309332835332 |
50% quantile = 0.019281803306206644 |
60% quantile = 0.30022966479982344 |
70% quantile = 0.5317978807508836 |
80% quantile = 0.832291888537874 |
90% quantile = 1.3061151234700463 |