# Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

## Weighted sum and mean

`Base.sum`

— Function`sum(v::AbstractArray, w::AbstractWeights{<:Real}; [dims])`

Compute the weighted sum of an array `v`

with weights `w`

, optionally over the dimension `dims`

.

`Base.sum!`

— Function```
sum!(R::AbstractArray, A::AbstractArray,
w::AbstractWeights{<:Real}, dim::Int;
init::Bool=true)
```

Compute the weighted sum of `A`

with weights `w`

over the dimension `dim`

and store the result in `R`

. If `init=false`

, the sum is added to `R`

rather than starting from zero.

`StatsBase.wsum`

— Function`wsum(v, w::AbstractVector, [dim])`

Compute the weighted sum of an array `v`

with weights `w`

, optionally over the dimension `dim`

.

`StatsBase.wsum!`

— Function```
wsum!(R::AbstractArray, A::AbstractArray,
w::AbstractVector, dim::Int;
init::Bool=true)
```

Compute the weighted sum of `A`

with weights `w`

over the dimension `dim`

and store the result in `R`

. If `init=false`

, the sum is added to `R`

rather than starting from zero.

`Statistics.mean`

— Function`mean(A::AbstractArray, w::AbstractWeights[, dims::Int])`

Compute the weighted mean of array `A`

with weight vector `w`

(of type `AbstractWeights`

). If `dim`

is provided, compute the weighted mean along dimension `dims`

.

**Examples**

```
n = 20
x = rand(n)
w = rand(n)
mean(x, weights(w))
```

`Statistics.mean!`

— Function`mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights[; dims=nothing])`

Compute the weighted mean of array `A`

with weight vector `w`

(of type `AbstractWeights`

) along dimension `dims`

, and write results to `R`

.

## Means

The package provides functions to compute means of different kinds.

`StatsBase.geomean`

— Function`geomean(a)`

Return the geometric mean of a collection.

`StatsBase.harmmean`

— Function`harmmean(a)`

Return the harmonic mean of a collection.

`StatsBase.genmean`

— Function`genmean(a, p)`

Return the generalized/power mean with exponent `p`

of a real-valued array, i.e. $\left( \frac{1}{n} \sum_{i=1}^n a_i^p \right)^{\frac{1}{p}}$, where `n = length(a)`

. It is taken to be the geometric mean when `p == 0`

.

## Moments

`Statistics.var`

— Function`var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the variance of a real-valued array `x`

, optionally over a dimension `dim`

. Observations in `x`

are weighted using weight vector `w`

. The uncorrected (when `corrected=false`

) sample variance is defined as:

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`

) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

`AnalyticWeights`

: $\frac{1}{\sum w - \sum {w^2} / \sum w}$`FrequencyWeights`

: $\frac{1}{\sum{w} - 1}$`ProbabilityWeights`

: $\frac{n}{(n - 1) \sum w}$ where $n$ equals`count(!iszero, w)`

`Weights`

:`ArgumentError`

(bias correction not supported)

`Statistics.std`

— Function`std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)`

Compute the standard deviation of a real-valued array `x`

, optionally over a dimension `dim`

. Observations in `x`

are weighted using weight vector `w`

. The uncorrected (when `corrected=false`

) sample standard deviation is defined as:

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when `corrected=true`

) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

`AnalyticWeights`

: $\frac{1}{\sum w - \sum {w^2} / \sum w}$`FrequencyWeights`

: $\frac{1}{\sum{w} - 1}$`ProbabilityWeights`

: $\frac{n}{(n - 1) \sum w}$ where $n$ equals`count(!iszero, w)`

`Weights`

:`ArgumentError`

(bias correction not supported)

`StatsBase.mean_and_var`

— Function`mean_and_var(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, var)`

Return the mean and variance of collection `x`

. If `x`

is an `AbstractArray`

, `dim`

can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w`

can be specified to weight the estimates. Finally, bias correction is be applied to the variance calculation if `corrected=true`

. See `var`

documentation for more details.

`StatsBase.mean_and_std`

— Function`mean_and_std(x, [w::AbstractWeights], [dim]; corrected=false) -> (mean, std)`

Return the mean and standard deviation of collection `x`

. If `x`

is an `AbstractArray`

, `dim`

can be specified as a tuple to compute statistics over these dimensions. A weighting vector `w`

can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if `corrected=true`

. See `std`

documentation for more details.

`StatsBase.skewness`

— Function`skewness(v, [wv::AbstractWeights], m=mean(v))`

Compute the standardized skewness of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

`StatsBase.kurtosis`

— Function`kurtosis(v, [wv::AbstractWeights], m=mean(v))`

Compute the excess kurtosis of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

`StatsBase.moment`

— Function`moment(v, k, [wv::AbstractWeights], m=mean(v))`

Return the `k`

th order central moment of a real-valued array `v`

, optionally specifying a weighting vector `wv`

and a center `m`

.

## Measurements of Variation

`StatsBase.span`

— Function`span(x)`

Return the span of a collection, i.e. the range `minimum(x):maximum(x)`

. The minimum and maximum of `x`

are computed in one pass using `extrema`

.

`StatsBase.variation`

— Function`variation(x, m=mean(x))`

Return the coefficient of variation of collection `x`

, optionally specifying a precomputed mean `m`

. The coefficient of variation is the ratio of the standard deviation to the mean.

`StatsBase.sem`

— Function`sem(x)`

Return the standard error of the mean of collection `x`

, i.e. `sqrt(var(x, corrected=true) / length(x))`

.

`StatsBase.mad`

— Function`mad(x; center=median(x), normalize=true)`

Compute the median absolute deviation (MAD) of collection `x`

around `center`

(by default, around the median).

If `normalize`

is set to `true`

, the MAD is multiplied by `1 / quantile(Normal(), 3/4) ≈ 1.4826`

, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

`StatsBase.mad!`

— Function`StatsBase.mad!(x; center=median!(x), normalize=true)`

Compute the median absolute deviation (MAD) of array `x`

around `center`

(by default, around the median), overwriting `x`

in the process. `x`

must be able to hold values of generated by calling `middle`

on its elements (for example an integer vector is not appropriate since `middle`

can produce non-integer values).

If `normalize`

is set to `true`

, the MAD is multiplied by `1 / quantile(Normal(), 3/4) ≈ 1.4826`

, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

## Z-scores

`StatsBase.zscore`

— Function`zscore(X, [μ, σ])`

Compute the z-scores of `X`

, optionally specifying a precomputed mean `μ`

and standard deviation `σ`

. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

`μ`

and `σ`

should be both scalars or both arrays. The computation is broadcasting. In particular, when `μ`

and `σ`

are arrays, they should have the same size, and `size(μ, i) == 1 || size(μ, i) == size(X, i)`

for each dimension.

`StatsBase.zscore!`

— Function`zscore!([Z], X, μ, σ)`

Compute the z-scores of an array `X`

with mean `μ`

and standard deviation `σ`

. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array `Z`

is provided, the scores are stored in `Z`

and it must have the same shape as `X`

. Otherwise `X`

is overwritten.

## Entropy and Related Functions

`StatsBase.entropy`

— Function`entropy(p, [b])`

Compute the entropy of a collection of probabilities `p`

, optionally specifying a real number `b`

such that the entropy is scaled by `1/log(b)`

. Elements with probability 0 or 1 add 0 to the entropy.

`StatsBase.renyientropy`

— Function`renyientropy(p, α)`

Compute the Rényi (generalized) entropy of order `α`

of an array `p`

.

`StatsBase.crossentropy`

— Function`crossentropy(p, q, [b])`

Compute the cross entropy between `p`

and `q`

, optionally specifying a real number `b`

such that the result is scaled by `1/log(b)`

.

`StatsBase.kldivergence`

— Function`kldivergence(p, q, [b])`

Compute the Kullback-Leibler divergence from `q`

to `p`

, also called the relative entropy of `p`

with respect to `q`

, that is the sum `pᵢ * log(pᵢ / qᵢ)`

. Optionally a real number `b`

can be specified such that the divergence is scaled by `1/log(b)`

.

## Quantile and Related Functions

`StatsBase.percentile`

— Function`percentile(x, p)`

Return the `p`

th percentile of a collection `x`

, i.e. `quantile(x, p / 100)`

.

`StatsBase.iqr`

— Function`iqr(x)`

Compute the interquartile range (IQR) of collection `x`

, i.e. the 75th percentile minus the 25th percentile.

`StatsBase.nquantile`

— Function`nquantile(x, n::Integer)`

Return the n-quantiles of collection `x`

, i.e. the values which partition `v`

into `n`

subsets of nearly equal size.

Equivalent to `quantile(x, [0:n]/n)`

. For example, `nquantiles(x, 5)`

returns a vector of quantiles, respectively at `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]`

.

`Statistics.quantile`

— Function`quantile(v, w::AbstractWeights, p)`

Compute the weighted quantiles of a vector `v`

at a specified set of probability values `p`

, using weights given by a weight vector `w`

(of type `AbstractWeights`

). Weights must not be negative. The weights and data vectors must have the same length. `NaN`

is returned if `x`

contains any `NaN`

values. An error is raised if `w`

contains any `NaN`

values.

With `FrequencyWeights`

, the function returns the same result as `quantile`

for a vector with repeated values. Weights must be integers.

With non `FrequencyWeights`

, denote $N$ the length of the vector, $w$ the vector of weights, $h = p (\sum_{i<= N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$ and $S_k = \sum_{i<=k} w_i$ the cumulative weight for each observation, define $v_{k+1}$ the smallest element of `v`

such that $S_{k+1}$ is strictly superior to $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted `quantile`

.

`Statistics.median`

— Method`median(v::RealVector, w::AbstractWeights)`

Compute the weighted median of `v`

with weights `w`

(of type `AbstractWeights`

). See the documentation for `quantile`

for more details.

## Mode and Modes

`StatsBase.mode`

— Function```
mode(a, [r])
mode(a::AbstractArray, wv::AbstractWeights)
```

Return the mode (most common number) of an array, optionally over a specified range `r`

or weighted via a vector `wv`

. If several modes exist, the first one (in order of appearance) is returned.

`StatsBase.modes`

— Function```
modes(a, [r])::Vector
mode(a::AbstractArray, wv::AbstractWeights)::Vector
```

Return all modes (most common numbers) of an array, optionally over a specified range `r`

or weighted via vector `wv`

.

## Summary Statistics

`StatsBase.summarystats`

— Function`summarystats(a)`

Compute summary statistics for a real-valued array `a`

. Returns a `SummaryStats`

object containing the mean, minimum, 25th percentile, median, 75th percentile, and maxmimum.

`DataAPI.describe`

— Function`describe(a)`

Pretty-print the summary statistics provided by `summarystats`

: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

## Reliability Measures

`StatsBase.cronbachalpha`

— Function`cronbachalpha(covmatrix::AbstractMatrix{<:Real})`

Calculate Cronbach's alpha (1951) from a covariance matrix `covmatrix`

according to the formula:

where $k$ is the number of items, i.e. columns, $\sigma_i^2$ the item variance, and $\sigma_{ij}$ the inter-item covariance.

Returns a `CronbachAlpha`

object that holds:

`alpha`

: the Cronbach's alpha score for all items, i.e. columns, in`covmatrix`

; and`dropped`

: a vector giving Cronbach's alpha scores if a specific item, i.e. column, is dropped from`covmatrix`

.

**Example**

```
julia> using StatsBase
julia> cov_X = [10 6 6 6;
6 11 6 6;
6 6 12 6;
6 6 6 13];
julia> cronbachalpha(cov_X)
Cronbach's alpha for all items: 0.8136
Cronbach's alpha if an item is dropped:
item 1: 0.7500
item 2: 0.7606
item 3: 0.7714
item 4: 0.7826
```