If you’re just interested in the code, it’s super easy to use pingouin to do this:
import numpy as np
import pingouin as pg

# 100 observations in 3 dimensions, drawn from a standard normal
data = np.random.normal(size=(100, 3))

# Henze-Zirkler test of multivariate normality at the 5% level
output = pg.multivariate_normality(data, alpha=.05)
print(output.hz, output.pval, output.normal)
Sometimes, it’s useful to know when a sample looks a lot like a normal distribution. In a single dimension, we can use
scipy’s scipy.stats.normaltest to check whether a sample is normal. This test combines D’Agostino and Pearson’s
tests, which measure the skew and kurtosis of the sample and report when those differ from what a comparable normal
population would produce. Unfortunately, this test doesn’t generalize immediately to multiple dimensions, which makes a
new test necessary.
One example is the Henze-Zirkler test, which is based on a non-negative functional \(D\) that measures the
distance between two distribution functions and has the property that \(D(N_d(0, I_d), Q) = 0\) if and only if \(Q\) is the
multivariate normal distribution with mean zero and identity covariance matrix. In practice, the Henze-Zirkler test computes a
weighted integral of the squared difference between the empirical characteristic function (ECF) of the standardized
sample and its pointwise limit under normality, the characteristic function of \(N_d(0, I_d)\).
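To make the ECF concrete, here is a minimal sketch (the function name and test values are my own, purely illustrative). For standardized normal data, \(\psi_n(t)\) should land close to \(\exp(-\|t\|^2/2)\):

import numpy as np

def empirical_cf(Y, t):
    # psi_n(t) = (1/n) * sum_j exp(i * <t, Y_j>) for an (n, d) sample Y
    return np.mean(np.exp(1j * Y @ t))

rng = np.random.default_rng(0)
Y = rng.standard_normal(size=(1000, 3))   # already standardized
t = np.array([0.5, -0.2, 0.1])
# Should be small: the ECF converges to the N_d(0, I_d) characteristic function
print(abs(empirical_cf(Y, t) - np.exp(-0.5 * t @ t)))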
The test statistic can be calculated as:
$$ T_{\alpha} = n\left(4\,I_{\{S\ \mathrm{singular}\}} + D_{n,\alpha}\,I_{\{S\ \mathrm{nonsingular}\}}\right) $$
where $S$ is the sample covariance matrix, $I$ is an indicator function, and:
$$ D_{n,\alpha} = \int_{\mathbb{R}^d} \left| \psi_n(t) - \exp\left(-\tfrac{1}{2} \|t\|^2\right) \right|^2 \phi_{\alpha}(t) \, dt $$
with $\psi_n$ being the empirical characteristic function, $\phi_{\alpha}$ being a weighting function with parameter $\alpha$:
$$ \phi_{\alpha}(t) = (2\pi\alpha^2)^{-d/2}\exp\left(-\frac{\|t\|^2}{2\alpha^2}\right), \quad t \in \mathbb{R}^d $$
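Since \(\phi_{\alpha}\) is exactly the density of \(N_d(0, \alpha^2 I_d)\) (note that this \(\alpha\) is the smoothing parameter, not the significance level passed to pingouin), \(D_{n,\alpha}\) is an expectation over \(t \sim N_d(0, \alpha^2 I_d)\) and can be approximated by Monte Carlo. This is purely an illustrative sketch of the definition — the actual test evaluates the integral in closed form rather than numerically:

import numpy as np

def D_monte_carlo(Y, alpha, n_draws=10_000, seed=0):
    # D_{n,alpha} = E[ |psi_n(t) - exp(-||t||^2 / 2)|^2 ], t ~ N(0, alpha^2 I_d)
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    t = rng.normal(scale=alpha, size=(n_draws, d))
    ecf = np.mean(np.exp(1j * t @ Y.T), axis=1)       # psi_n at each draw of t
    target = np.exp(-0.5 * np.sum(t**2, axis=1))      # N_d(0, I_d) char. function
    return np.mean(np.abs(ecf - target) ** 2)

rng = np.random.default_rng(1)
Y = rng.standard_normal(size=(200, 2))
print(D_monte_carlo(Y, alpha=1.0))   # near zero for standardized normal data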
Notice that because \(D_{n,\alpha}\) is undefined when \(S\) is singular, we set it to 4 (the maximum possible value) in that case. Because \(T_{\alpha}\) is approximately log-normally distributed, we can use the log-normal distribution to compute a p-value under the null hypothesis.
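One way to see the log-normal approximation in action is to simulate the null distribution of the statistic using pingouin itself and fit a log-normal to the results (this sanity check is my own, not part of the library):

import numpy as np
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
# Null distribution: HZ statistics from repeated draws of truly normal data
hz_values = np.array([
    pg.multivariate_normality(rng.normal(size=(100, 3))).hz
    for _ in range(500)
])
# Fit a log-normal (location pinned at 0) and check a tail probability
shape, loc, scale = stats.lognorm.fit(hz_values, floc=0)
q = np.quantile(hz_values, 0.95)
print(stats.lognorm.sf(q, shape, loc=loc, scale=scale))  # should be close to 0.05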