Title: | Graphical Univariate/Multivariate Assessments for Normality Assumption |
---|---|
Description: | Graphical methods for testing the multivariate normality assumption. The methods include assessment of the score function and of cumulant generating functions, independent transformations, and linear transformations. |
Authors: | Huong Tran [aut, cre], Ravindra Khattree [aut] |
Maintainer: | Huong Tran <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-03-26 06:50:08 UTC |
Source: | https://github.com/huongtran53/plotnormtest |
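A linear combination of the third/fourth derivatives of the CGF is asymptotically a
univariate Gaussian process with mean 0. These functions compute its covariance between any two points $t$ and $s$.
We consider vectors $t$ and $s$ of the form $t = t^* 1_p$ and $s = s^* 1_p$, where $t^*$ and $s^*$ are scalars and $1_p$ is the vector of ones.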
mt3_covLtLs(l, p, bigt = seq(-1, 1, 0.05)/sqrt(p), sTtTs = NULL, seed = 1)
mt4_covLtLs(l, p, bigt = seq(-1, 1, 0.05)/sqrt(p), sTtTs = NULL, seed = 1)
l | vector of linear-combination coefficients, of length equal to the number of distinct derivatives; see l_dhCGF(). |
p | dimension of the multivariate random vector from which the data are collected. |
bigt | array of values at which the process is evaluated; default is seq(-1, 1, 0.05)/sqrt(p). |
sTtTs | covariance matrix of the vector of derivatives; see mt3_covTtTs() and mt4_covTtTs(). Default is NULL. |
seed | random seed used to estimate the supremum of the univariate Gaussian process obtained from the linear combination. |
sLtLs: covariance matrix of the linear combination of distinct derivatives, which is a zero-mean Gaussian process.
m.supLt: Monte Carlo estimate of the supremum of this Gaussian process.
mt3_covLtLs returns values related to the use of third derivatives.
mt4_covLtLs returns values related to the use of fourth derivatives.
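To illustrate what a Monte Carlo estimate of such a supremum involves, the sketch below simulates a zero-mean Gaussian process on a finite grid from a given covariance matrix and averages the largest absolute value over the simulated paths. The helper name `sup_gp_mc`, the toy covariance and the use of the absolute value are assumptions made for this example; the package's own estimator may differ.

```r
# Hypothetical helper (not part of the package): Monte Carlo estimate of
# E[sup_t |L_t|] for a zero-mean Gaussian process observed on a finite grid.
sup_gp_mc <- function(sigma, nsim = 2000, seed = 1) {
  set.seed(seed)
  k <- nrow(sigma)
  eig <- eigen(sigma, symmetric = TRUE)
  # Matrix square root from the eigendecomposition; tiny negative
  # eigenvalues caused by rounding are clipped at zero.
  root <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)), k)
  paths <- root %*% matrix(rnorm(k * nsim), nrow = k)  # one path per column
  mean(apply(abs(paths), 2, max))
}

grid <- seq(-1, 1, 0.4)
sigma <- exp(-outer(grid, grid, "-")^2)  # toy covariance, assumed for illustration
m_sup <- sup_gp_mc(sigma)
m_sup
```

Fixing the seed inside the helper makes the estimate reproducible, which mirrors the role of the `seed` argument described above.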
bigt <- seq(-1, 1, .4)
p <- 3
# Third derivatives
lT3 <- l_dhCGF(p)[[1]]
l3 <- rep(1/sqrt(lT3), lT3)
mt3_covLtLs(l = l3, p = p, bigt = bigt/sqrt(p), seed = 1)
# Fourth derivatives
lT4 <- l_dhCGF(p)[[2]]
l4 <- rep(1/sqrt(lT4), lT4)
mt4_covLtLs(l = l4, p = p, bigt = bigt/sqrt(p), seed = 1)
Stacking the third/fourth derivatives of the sample CGF
gives a vector which, under the normality assumption on the data, is asymptotically
normally distributed with zero mean and a covariance matrix.
More specifically, covTtTs computes this covariance between any two
points of the form $t = t^* 1_p$ and $s = s^* 1_p$.
mt3_covTtTs(bigt, p = 1, pos.matrix = NULL)
mt4_covTtTs(bigt, p = 1, pos.matrix = NULL)
bigt | array containing the values at which the covariance is evaluated. |
p | dimension of the multivariate random vector from which the data are collected; default is 1. |
pos.matrix | matrix containing the positions of the derivatives. Default is NULL. |
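The number of distinct third derivatives is $\binom{p+2}{3} = p(p+1)(p+2)/6$.
The number of distinct fourth derivatives is $\binom{p+3}{4} = p(p+1)(p+2)(p+3)/24$.
For each pair of points $(t, s)$, covTtTs returns a square covariance
matrix whose order is the number of distinct third or fourth derivatives, respectively.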
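Because mixed partial derivatives do not depend on the order of differentiation, the distinct derivatives of order k in p variables correspond to multisets of k indices. The sketch below counts them and cross-checks the third-order count by listing non-decreasing index triples. The helper name `n_distinct_deriv` is hypothetical, and the counts should agree with `l_dhCGF(p)` only on the assumption that "distinct" has this meaning in the package.

```r
# Hypothetical helper: number of distinct k-th order partial derivatives of a
# smooth function of p variables (multisets of size k drawn from p indices).
n_distinct_deriv <- function(p, k) choose(p + k - 1, k)

# Cross-check by listing the non-decreasing index triples j1 <= j2 <= j3.
p <- 3
idx <- expand.grid(j1 = 1:p, j2 = 1:p, j3 = 1:p)
triples <- idx[idx$j1 <= idx$j2 & idx$j2 <= idx$j3, ]

n_distinct_deriv(p, 3)  # 10
nrow(triples)           # 10
n_distinct_deriv(p, 4)  # 15
```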
A two-dimensional upper triangular array whose size equals the
length of bigt. Each element contains the covariance matrix of the
derivative sequences between two points $t$ and $s$.
mt3_covTtTs returns the result for third derivatives.
mt4_covTtTs returns the result for fourth derivatives.
bigt <- seq(-1, 1, .4)
p <- 2
# Third derivatives
mt3_pos.matrix <- mt3_pos(p)
sTsTt3 <- mt3_covTtTs(bigt = bigt, p = p, pos.matrix = mt3_pos.matrix)
dim(sTsTt3)
sTsTt3[1:5, 1:5]
# Fourth derivatives
mt4_pos.matrix <- mt4_pos(p)
sTsTt4 <- mt4_covTtTs(bigt = bigt, p = p, pos.matrix = mt4_pos.matrix)
dim(sTsTt4)
sTsTt4[1:5, 1:5]
Stacking the derivatives up to the third/fourth order of the sample MGF
gives a vector which, under the normality assumption, is asymptotically
multivariate normal with zero mean and a covariance matrix.
covZtZs calculates this covariance between any two points $t$ and $s$ in $\mathbb{R}^p$.
mt3_covZtZs(t, s, pos.matrix = NULL)
mt4_covZtZs(t, s, pos.matrix = NULL)
t, s | vectors of length $p$. |
pos.matrix | matrix containing the positions of the derivatives. Default is NULL. |
mt3_covZtZs returns the covariance matrix relating to the use of third derivatives.
mt4_covZtZs returns the covariance matrix relating to the use of fourth derivatives. It also contains the information on third derivatives given by mt3_covZtZs.
set.seed(1)
p <- 3
x <- MASS::mvrnorm(100, rep(0, p), diag(p))
t <- rep(0.2, p)
s <- rep(-.3, p)
# Using third derivatives
pos.matrix3 <- mt3_pos(p)
sZtZs3 <- mt3_covZtZs(t, s, pos.matrix = pos.matrix3)
dim(sZtZs3)
sZtZs3[1:5, 1:5]
# Using fourth derivatives
sZtZs4 <- mt4_covZtZs(t, s)
dim(sZtZs4)
sZtZs4[1:5, 1:5]
Get the third/fourth derivatives of the sample CGF at a given point.
d3hCGF(myt, x)
d4hCGF(myt, x)
l_dhCGF(p)
dhCGF1D(t, x)
myt, t | numeric vector of length $p$. |
x | data matrix. |
p | dimension. |
Estimator of standardized cumulant function is
and its
order derivatives is defined as
where are the corresponding components
of vector
.
d3hCGF returns the sequence of third derivatives of the empirical CGF, ordered by the index $(j_1, j_2, j_3)$.
d4hCGF returns the sequence of fourth derivatives of the empirical CGF, ordered by the index $(j_1, j_2, j_3, j_4)$.
l_dhCGF returns the number of distinct third and fourth derivatives.
dhCGF1D returns the third/fourth derivatives of the univariate empirical CGF, which are d3hCGF and d4hCGF when $p = 1$.
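For the univariate case, the third and fourth derivatives of an empirical CGF of the form log(mean(exp(t z))) have closed forms: they are the third and fourth cumulants of the data reweighted by exp(t z). The sketch below computes them this way. The helper name `emp_cgf_d34` and the standardization by `sd()` are assumptions made for illustration and need not match what `dhCGF1D` does internally.

```r
# Hypothetical helper (not the package's code): third and fourth derivatives
# of the univariate empirical CGF K(t) = log(mean(exp(t * z))).
emp_cgf_d34 <- function(t, x) {
  z <- (x - mean(x)) / sd(x)          # standardization choice is an assumption
  w <- exp(t * z) / sum(exp(t * z))   # exponential tilting weights
  m <- vapply(1:4, function(k) sum(w * z^k), numeric(1))  # tilted raw moments
  c(d3 = m[3] - 3 * m[2] * m[1] + 2 * m[1]^3,
    d4 = m[4] - 4 * m[3] * m[1] - 3 * m[2]^2 + 12 * m[2] * m[1]^2 - 6 * m[1]^4)
}

set.seed(1)
x <- rnorm(100)
d34 <- emp_cgf_d34(0.3, x)
d34
```

For normal data both values should be close to zero, which is what the graphical tests in this package exploit.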
p <- 3
# Number of distinct derivatives
l_dhCGF(p)
set.seed(1)
x <- MASS::mvrnorm(100, rep(0, p), diag(p))
myt <- rep(.2, p)
d3hCGF(myt = myt, x = x)
d4hCGF(myt = myt, x = x)
# Univariate data
set.seed(1)
x <- rnorm(100)
t <- .3
dhCGF1D(t, x)
Get the polynomial term in the expression of the derivatives of the moment
generating function of the standard multivariate normal distribution $N_p(0, I_p)$, with
respect to a given component and its exponent, up to the eighth order.
dMGF(tab, t, coef = TRUE)
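tab | a data frame whose first column contains the indices of components of a multivariate random vector and whose second column contains the corresponding exponents. |
t | vector in $\mathbb{R}^p$. |
coef | logical, TRUE or FALSE; default is TRUE. |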
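For a standard multivariate normal random vector $Z \sim N_p(0, I_p)$, the moment generating function is $M(t) = \exp(t^\top t / 2)$.
For example, $\partial^4 M(t) / \partial t_2^4 = (t_2^4 + 6 t_2^2 + 3)\exp(t^\top t / 2)$.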
Value of derivatives.
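As a sanity check on the kind of polynomial term involved, the sketch below differentiates exp(t2^2 / 2) four times with base R's symbolic `D()` and compares the result with the closed form (t2^4 + 6 t2^2 + 3) exp(t2^2 / 2). It does not call `dMGF` and makes no claim about how the `coef` argument splits the polynomial from the exponential factor.

```r
# Repeated symbolic differentiation of exp(t2^2 / 2) with respect to t2.
# The other components of t only contribute a constant factor, omitted here.
d4 <- quote(exp(t2^2 / 2))
for (k in 1:4) d4 <- D(d4, "t2")

t2 <- 0.2
symbolic <- eval(d4)
closed_form <- (t2^4 + 6 * t2^2 + 3) * exp(t2^2 / 2)
all.equal(symbolic, closed_form)
```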
# Calculation of the above example
t <- rep(.2, 7)
tab <- data.frame(j = 2, exponent = 4)
dMGF(tab, t = t)
dMGF(tab, t = t, coef = FALSE)
Obtain necessary parameters to build a graphical test using the third/fourth derivatives of cumulant generating function.
mt3_get_param(p, bigt = seq(-1, 1, by = 0.05)/sqrt(p), l = NULL)
mt4_get_param(p, bigt = seq(-1, 1, by = 0.05)/sqrt(p), l = NULL)
p | dimension. |
bigt | array containing the values at which the derivatives are evaluated; default is seq(-1, 1, by = 0.05)/sqrt(p). |
l | linear transformation of the vector of distinct third/fourth derivatives; default is their average. |
p: dimension.
lT: number of distinct third/fourth order derivatives.
sTtTs: two-dimensional array, each element of which contains the covariance matrix of the vector of derivatives, as computed by mt3_covTtTs() or mt4_covTtTs().
l.sTtTs: covariance matrix of the linear combination of distinct derivatives, as computed by mt3_covLtLs() or mt4_covLtLs().
m.supLT: the Monte Carlo estimate of the expected value of the supremum of the Gaussian process, see covLtLs().
mt3_get_param returns the necessary parameters for the 2D plot relying on third derivatives.
mt4_get_param returns the necessary parameters for the 2D plot relying on fourth derivatives.
See also: covZtZs(), covLtLs(), covTtTs()
p <- 2
mt3 <- mt3_get_param(p, bigt = seq(-1, 1, .4)/sqrt(p))
names(mt3)
mt4 <- mt4_get_param(p, bigt = seq(-1, 1, .4)/sqrt(p))
names(mt4)
The leave-one-out method gives an approximately independent sample from the standard multivariate normal distribution, which in turn produces a sample from the standard univariate normal distribution.
Multi.to.Uni(x)
x | multivariate data matrix. |
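Let $\bar{x}_{-k}$ and $S_{-k}$ be the sample mean and the sample variance-covariance
matrix obtained by using all but the $k$-th data point $x_k$. Then the vectors
$S_{-k}^{-1/2}(x_k - \bar{x}_{-k})$, $k = 1, \dots, n$, are approximately
independently distributed as $N_p(0, I_p)$. Thus all
entries in the data matrix so constructed can be treated as
univariate samples of size $np$ from $N(0, 1)$.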
A data frame containing the univariate data and their index in the multivariate data.
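The leave-one-out idea can be sketched in a few lines of base R: each observation is centered and scaled using the mean and covariance of the remaining observations. The helper name `loo_standardize` and the choice of the symmetric inverse square root are assumptions for illustration; `Multi.to.Uni` may use a different factorization and returns a data frame rather than a matrix.

```r
# Hypothetical sketch of leave-one-out standardization (not the package's code).
loo_standardize <- function(x) {
  x <- as.matrix(x)
  out <- matrix(NA_real_, nrow(x), ncol(x))
  for (k in seq_len(nrow(x))) {
    rest <- x[-k, , drop = FALSE]              # all but the k-th observation
    eig <- eigen(cov(rest), symmetric = TRUE)
    # Symmetric inverse square root of the leave-one-out covariance.
    inv_root <- eig$vectors %*% diag(1 / sqrt(eig$values), ncol(x)) %*% t(eig$vectors)
    out[k, ] <- inv_root %*% (x[k, ] - colMeans(rest))
  }
  out
}

set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)
y <- loo_standardize(x)
qqnorm(as.vector(y)); abline(0, 1)
```

Pooling all entries of `y` into one vector gives the univariate sample that a normal Q-Q plot can then assess.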
set.seed(1)
x <- MASS::mvrnorm(100, mu = rep(0, 5), diag(5))
df <- Multi.to.Uni(x)
qqnorm(df$x.new); abline(0, 1)
The algorithm uses gradient descent to find the maximum of the squared sample skewness, of the sample kurtosis, or of their average over all univariate linear transformations of the multivariate data.
linear_transform(
  x,
  l0 = rep(1, ncol(x)),
  method = "both",
  epsilon = 1e-10,
  iter = 5000,
  stepsize = 0.001
)
x | multivariate data matrix. |
l0 | starting point for the projection algorithm; default is rep(1, ncol(x)). |
method | character string, one of "skewness", "kurtosis" or "both"; default is "both". |
epsilon | bound on the error of the optimal solution; default is 1e-10. |
iter | number of iterations of the projection algorithm; default is 5000. |
stepsize | gradient descent step size; default is 0.001. |
max_result: the maximum value after linear transformation.
x_uni: univariate data after transformation.
vector_k: vector of the "best" linear transformation.
error: error of the projection algorithm.
iteration: number of iterations.
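A rough sketch of such a search, restricted to the squared skewness, is given below. It uses a central-difference gradient in place of an analytic one and renormalizes the direction after every step; both choices, and the names `skew2`, `num_grad` and `ascend_skew`, are assumptions for illustration and not the package's implementation. The returned element names only mirror those listed above.

```r
# Squared sample skewness of the projection x %*% l (moment-based definition).
skew2 <- function(l, x) {
  y <- as.vector(x %*% l)
  z <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
  mean(z^3)^2
}

# Central-difference gradient; a stand-in for an analytic gradient.
num_grad <- function(f, l, ..., h = 1e-6) {
  vapply(seq_along(l), function(j) {
    e <- replace(numeric(length(l)), j, h)
    (f(l + e, ...) - f(l - e, ...)) / (2 * h)
  }, numeric(1))
}

# Hypothetical sketch: gradient steps with renormalization to unit length,
# keeping the best direction seen so far.
ascend_skew <- function(x, l0 = rep(1, ncol(x)), stepsize = 0.01, iter = 500) {
  l <- l0 / sqrt(sum(l0^2))
  best <- list(vector_k = l, max_result = skew2(l, x))
  for (i in seq_len(iter)) {
    l <- l + stepsize * num_grad(skew2, l, x = x)
    l <- l / sqrt(sum(l^2))
    val <- skew2(l, x)
    if (val > best$max_result) best <- list(vector_k = l, max_result = val)
  }
  best
}

set.seed(1)
x <- cbind(rexp(200), rnorm(200))  # first column is skewed by construction
start <- skew2(rep(1, 2) / sqrt(2), x)
fit <- ascend_skew(x)
c(start = start, found = fit$max_result)
```

Because skewness does not change when the direction is rescaled, constraining the direction to unit length loses nothing and keeps the iterates from drifting.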
set.seed(1)
x <- MASS::mvrnorm(100, mu = rep(0, 2), diag(2))
linear_transform(x, method = "skewness")$max_result
linear_transform(x, method = "kurtosis")$max_result
linear_transform(x, method = "both")$max_result
Taylor expansion implies that the vector of derivatives of the sample CGF
can be approximated
by a linear combination of the vector of derivatives of the sample MGF.
matrix_A returns the coefficients of the corresponding
linear combinations.
mt3_matrix_A(t)
mt4_matrix_A(t)
t | vector in $\mathbb{R}^p$. |
mt3_matrix_A returns the coefficient matrix relating to the use of third derivatives.
mt4_matrix_A returns the coefficient matrix relating to the use of fourth derivatives.
p <- 3
t <- rep(.2, p)
A3 <- mt3_matrix_A(t)
dim(A3)
A3[1:5, 1:5]
A4 <- mt4_matrix_A(t)
dim(A4)
A4[1:5, 1:5]
Given the dimension $p$, returns a data frame containing the positions of
all derivatives of the
estimator of the moment generating function,
up to the third/fourth order.
mt3_rev_pos(j1, j2, j3, p)
mt3_pos(p)
mt4_pos(p)
j1 | index of the first variable. |
j2 | index of the second variable; should be at least j1. |
j3 | index of the third variable; should be at least j2. |
p | dimension. |
The estimator of multivariate moment generating function is
The chain containing all derivatives up to the third order is
and
where is the number of
different from 0.
Similar notation applies when fourth derivatives are used.
mt3_rev_pos returns the position of this particular derivative in the chain of all derivatives, up to third order.
mt3_pos returns an array containing all positions with respect to the index $(j_1, j_2, j_3)$.
mt4_pos returns an array containing all positions with respect to the index $(j_1, j_2, j_3, j_4)$.
mt3_rev_pos(1, 2, 2, p = 3)
p <- 3
mt3_pos(p)
mt4_pos(p)
The cumulant generating function of a normally distributed
random variable has all derivatives of order three and higher equal to 0.
Hence, plots of the empirical third/fourth order derivatives with large values
or high slope indicate non-normality.
Multivariate_CGF_PLot estimates, and provides a confidence region for, the
average (or any linear combination) of the third/fourth derivatives of the empirical
cumulant function at the points $t = t^* 1_p$. Plots are faster to obtain
for dimensions whose confidence regions
and other necessary parameters are available in
mt3_lst_param.rda and mt4_lst_param.rda.
Higher dimensions are computationally expensive.
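The vanishing higher-order derivatives follow directly from the closed form of the normal CGF. For a univariate $X \sim N(\mu, \sigma^2)$ (the multivariate case is analogous, with a quadratic form in $t$):

```latex
K(t) = \log E\left[e^{tX}\right] = \mu t + \tfrac{1}{2}\sigma^2 t^2,
\qquad
K'(t) = \mu + \sigma^2 t, \quad
K''(t) = \sigma^2, \quad
K^{(k)}(t) = 0 \ \text{ for all } k \ge 3 .
```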
d3hCGF_plot(x, alpha = 0.05)
d4hCGF_plot(x, alpha = 0.05)
x | data matrix of size $n \times p$. |
alpha | significance level (default is 0.05). |
d3hCGF_plot returns the plot relying on third derivatives.
d4hCGF_plot returns the plot relying on fourth derivatives.
set.seed(1234)
p <- 3
x <- MASS::mvrnorm(500, rep(0, p), diag(p))
d3hCGF_plot(x)
d4hCGF_plot(x)
Sample skewness and Sample Kurtosis.
kurtosis(x)
skewness(x)
x | univariate data sample. |
Sample kurtosis is
Sample skewness is
kurtosis returns the sample kurtosis.
skewness returns the sample skewness.
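For orientation, one common moment-based definition of the two statistics is sketched below. The function names are hypothetical, and the package's `skewness` and `kurtosis` may use a different normalization (for example n - 1 denominators or excess kurtosis), so the values need not coincide.

```r
# Hypothetical moment-based versions; the package may normalize differently.
skewness_sketch <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_sketch <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # non-excess form: population value 3 under normality
}

y <- c(-1, 0, 1)
skewness_sketch(y)  # 0
kurtosis_sketch(y)  # 1.5
```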
set.seed(123)
y <- rnorm(100)
kurtosis(y)
set.seed(123)
x <- rnorm(100)
skewness(x)
Plots the empirical third/fourth derivatives of the cumulant generating function together with a confidence region. Non-normality is indicated by either a violation of the probability bands or curves with high slope.
dhCGF_plot1D(x, alpha = 0.05, method)
x | univariate data. |
alpha | significance level (default is 0.05). |
method | string, "T3" for third derivatives or "T4" for fourth derivatives. |
Plots
Ghosh S (1996). “A New Graphical Tool to Detect Non-Normality.” Journal of the Royal Statistical Society: Series B (Methodological), 58(4), 691-702. doi:10.1111/j.2517-6161.1996.tb02108.x.
set.seed(123)
x <- rnorm(100)
dhCGF_plot1D(x, method = "T3")
dhCGF_plot1D(x, method = "T4")
The score function of a univariate normal distribution is a straight line, so a non-linear graph of the score function estimator is evidence of non-normality.
Outliers are detected using the 2-sigma bands method.
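The linearity can be seen from the normal density itself. Taking the score as the negative derivative of the log density (a sign convention assumed here; the opposite sign is also in use), for $X \sim N(\mu, \sigma^2)$:

```latex
\log f(x) = -\frac{(x - \mu)^2}{2\sigma^2} - \tfrac{1}{2}\log\left(2\pi\sigma^2\right),
\qquad
\psi(x) = -\frac{d}{dx}\log f(x) = \frac{x - \mu}{\sigma^2},
```

which is a straight line in $x$ with slope $1/\sigma^2$.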
cox(x, P = NULL, lambda = 0.5, x.dist = NULL)
score_plot1D(x, P = NULL, lambda = 0.5, x.dist = NULL, ori.index = NULL)
x | univariate data. |
P | vector of weights; default is NULL. |
lambda | smoothing parameter; default is 0.5. |
x.dist | the minimum distance between two data points in vector x; default is NULL. |
ori.index | original index of vector x; default is NULL. |
To avoid singularity of the coefficient matrices in the spline method, points
closer together than x.dist are merged, and the weight of the
representative point is updated to the sum of the weights of the
discarded points.
Under the null hypothesis, an unbiased estimator of the score function at a
given data point is
and if is the estimate score from function
cox
at
the point , then
Hence points outside the 2-sigma bands are outliers.
cox returns the estimate of the score function:
x: the updated univariate data, if merging happens.
a: score value estimated at x.
P: updated weights, if merging happens.
slt: index of merged data points (NULL if x.dist = NULL).
score_plot1D returns the score function together with 2-sigma bands for outlier detection:
plot: plot of the estimated score function and its bands.
outlier: index of outliers.
Ng PT (1994). “Smoothing Spline Score Estimation.” SIAM Journal on Scientific Computing, 15(5), 1003-1025. doi:10.1137/0915061.
set.seed(1)
x <- rnorm(100, 2, 4)
re <- cox(sort(x))
plot(re$x, re$a, xlab = "x", ylab = "Estimated Score", main = "Estimator of score function")
abline(0, 1)
set.seed(1)
x <- rnorm(100, 2, 4)
score_plot1D(sort(x))