Missing data are common in applications and real-life datasets, especially in economics, political science and the social sciences, either because some entities do not want to share certain statistics or because the data are not collected. Technical reasons, such as machine failures or non-responses, can also cause missing data. Missing data can significantly affect the conclusions drawn from a dataset, and the uncertainty about those conclusions increases with the amount of missing data.
We now perform the same analysis starting from a complete dataset of unknown distribution. The EM algorithm relies on the assumption that the underlying distribution is known, which is not the case for the other imputation methods (or methods for inference from incomplete data).
The dataset itself is not the object of interest and will not be described; the goal is simply to test the methods on data for which we cannot make any a priori assumption about the distribution. We extracted a subset of a wine quality dataset, keeping 300 observations of four variables (fixed acidity, volatile acidity, citric acid and pH). The dataset is available here.
We apply the Shapiro-Wilk test for normality which rejected the assumption of normality with a high certainty (the p-value of order
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# import data
df = pd.read_csv('data/winequality-white.csv', sep=';')
# extract a subset of the dataset: four variables, 300 observations
df = df[['fixed acidity', 'volatile acidity', 'citric acid', 'pH']]
df = df.iloc[1000:1300]
# test for normality
# Shapiro-Wilk is a univariate test: the values of all four columns are pooled into one sample
stat, p_value = shapiro(df.to_numpy().ravel())
# print the p-value and the decision at the 5% level
print("Shapiro p-value:", p_value)
print("Normality rejected (at 5%)" if p_value < 0.05 else "Normality not rejected (at 5%)")
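Because Shapiro-Wilk is a univariate test, one way to look at the variables separately is to run it column by column. The sketch below does this on synthetic stand-in data rather than the wine data; the helper name `shapiro_by_column` and the two example columns are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro


def shapiro_by_column(frame):
    """Run the Shapiro-Wilk test on each column (hypothetical helper).

    Returns a dict mapping column name -> (statistic, p_value).
    """
    results = {}
    for name in frame.columns:
        stat, p_value = shapiro(frame[name].to_numpy())
        results[name] = (stat, p_value)
    return results


# Synthetic stand-in data: one Gaussian column and one clearly skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gaussian": rng.normal(size=300),
    "skewed": rng.exponential(size=300),
})

result = shapiro_by_column(demo)
for name, (stat, p_value) in result.items():
    verdict = "rejected" if p_value < 0.05 else "not rejected"
    print(f"{name}: p-value = {p_value:.3g}, normality {verdict} (at 5%)")
```

A per-column test only checks the marginals; rejecting normality for any single column is enough to reject joint Gaussianity, but not rejecting it for every column does not establish it.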
#the real mean and covariance are seen here as the empirical mean and covariance of the complete data
X = df.values
real_mean = np.mean(X, axis=0)
real_cov = np.cov(X, rowvar=False)
#plotting results with different percentages of missing data and missing data mechanisms
plot_all_differences_combined(X, real_mean, real_cov, verbose =True, plot_title = 'Comparison of imputation methods for different missing data mechanisms on the Wine dataset')
Quite surprisingly, the imputation obtained from the EM algorithm is of the same quality as that of the other methods, which do not rely on the Gaussianity assumption (not satisfied here). For the mean estimation, we see a significant difference between MCAR, for which the estimate is stable for every method and every percentage of missing data, and MNAR and MAR, for which the results are unstable and vary from one method to another.
For the covariance, the Iterative method outperforms the others for all missing data mechanisms. While the Median, KNN and Iterative methods seem to converge, the EM algorithm does not: its normalized MSE is of order 1, which indicates convergence to a wrong value.
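To make the "order 1" remark concrete, here is a minimal sketch of a normalized MSE. The exact normalization used in the plots is not shown in this section, so the definition below (squared error divided by the squared Frobenius norm of the true parameter) and the name `normalized_mse` are assumptions.

```python
import numpy as np


def normalized_mse(estimate, truth):
    """Squared error relative to the squared norm of the truth (assumed definition)."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sum((estimate - truth) ** 2) / np.sum(truth ** 2)


true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])

# A perfect estimate gives 0; an all-zero estimate gives exactly 1
print(normalized_mse(true_cov, true_cov))
print(normalized_mse(np.zeros_like(true_cov), true_cov))
```

Under this definition, a value of order 1 means the error is as large as the quantity being estimated, that is, no better than estimating the covariance by zero.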
The multivariate Student-t distribution is useful for modeling datasets with heavy tails and is often used in finance. Assuming Gaussianity for data that actually follow a Student-t distribution, as is frequently done in finance, can lead to poor estimates. We therefore analyse this case.
The multivariate Student-t has the following parameters:
- Mean vector $\mu$: a $d$-dimensional vector representing the mean of the distribution.
- Scale matrix $\Sigma$: a positive definite $d \times d$ matrix.
- Degrees of freedom $\nu > 2$: a scalar value that determines the shape of the distribution's tails. As $\nu$ increases, the Student-t distribution approaches the normal distribution.
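Its density function is given by:

$$f(x) = \frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) (\nu\pi)^{d/2} |\Sigma|^{1/2}} \left[1 + \frac{1}{\nu}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right]^{-\frac{\nu + d}{2}}$$

where $\Gamma$ is the gamma function.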
# Generation of the multivariate t-distributed data
n = 200 # number of samples
d = 10 # number of features
df = 4 # degrees of freedom
mean = np.random.randn(d) # mean vector
scale_matrix = 2*np.random.rand(d,d) - 2*np.random.rand(d,d)
scale_matrix = scale_matrix.T.dot(scale_matrix) # make Scale matrix symmetric positive definite
real_cov = scale_matrix * df / (df - 2) # real covariance matrix
print(np.linalg.eigvals(scale_matrix)) # check eigenvalues
X = generate_student_t(df, mean = mean, scale_matrix = scale_matrix, n_samples = n)
plot_all_differences_combined(X, real_mean = mean, real_cov = real_cov, plot_title = "Comparison of the imputation of Student's-t data and estimation of parameters across different missingness levels and mechanisms", verbose =False)
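The helper `generate_student_t` is not shown in this section. As a rough illustration of how such data can be drawn, the sketch below uses the standard representation of a multivariate Student-t vector as a Gaussian vector divided by the square root of an independent scaled chi-square variable. The function name `sample_multivariate_t` and its signature are assumptions and may differ from the actual helper. The sketch also checks empirically that the covariance is $\Sigma \nu / (\nu - 2)$, the relation used for `real_cov` above.

```python
import numpy as np


def sample_multivariate_t(df, mean, scale_matrix, n_samples, rng=None):
    """Draw samples as mean + Z / sqrt(W / df), with Z ~ N(0, scale_matrix)
    and W ~ chi2(df) independent of Z (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    mean = np.asarray(mean, dtype=float)
    z = rng.multivariate_normal(np.zeros(mean.size), scale_matrix, size=n_samples)
    w = rng.chisquare(df, size=n_samples)
    return mean + z / np.sqrt(w / df)[:, None]


# Small 2-dimensional check with illustrative parameter values
nu = 10
mu = np.array([1.0, -2.0])
scale = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

samples = sample_multivariate_t(nu, mu, scale, n_samples=200_000, rng=0)
empirical_cov = np.cov(samples, rowvar=False)
theoretical_cov = scale * nu / (nu - 2)

print("empirical covariance:\n", np.round(empirical_cov, 2))
print("theoretical covariance:\n", np.round(theoretical_cov, 2))
```

The check uses $\nu = 10$ rather than $\nu = 4$ because the sample covariance converges slowly when the fourth moment is infinite, which is the case for $\nu \le 4$.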
For Student's-t data, EM converges like the other methods, but it shows no clear advantage given its higher computational cost: the other methods perform just as well.
We now generate Student's-t data that passes normality tests. Indeed, as $\nu \to \infty$, the Student-t distribution approaches the normal distribution.
# Generation of Student's-t distributed data that looks like normal data
n = 300 # number of samples
d = 10 # number of features
df = 12 # degrees of freedom. As df -> inf, Student's-t -> Normal
mean = np.random.randn(d) # mean vector
scale_matrix = 2*np.random.rand(d,d) - 2*np.random.rand(d,d) # scale matrix
scale_matrix = scale_matrix.T.dot(scale_matrix) # make scale matrix symmetric positive definite
real_cov_quasi_normal = scale_matrix * df / (df - 2) # real covariance matrix
X_quasi_normal = generate_student_t(df, mean = mean, scale_matrix = scale_matrix, n_samples = n, quasi_normal=True)
plot_all_differences_combined(X_quasi_normal, mean, real_cov_quasi_normal, plot_title = "Comparison of the imputation of quasi normal data and estimation of parameters across different missingness levels and mechanisms", verbose = False)
In this question, we discuss the different types of missingness and how to generate them from a complete dataset.
Consider a data set
Consider the response matrix
In the code, the matrix
There exist different types of missing data mechanisms, which are grouped into the following categories.
Observations are said to be missing completely at random (MCAR) if
In the code, the mask corresponding to
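As a minimal sketch of MCAR generation, each entry can be masked independently with the same probability, regardless of the data values. The function name `mcar_mask` and the convention that `True` marks a missing entry are assumptions made for this illustration.

```python
import numpy as np


def mcar_mask(shape, p_miss, rng=None):
    """Boolean mask where each entry is missing (True) independently
    with probability p_miss (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    return rng.random(shape) < p_miss


# Illustrative complete data
X_demo = np.random.default_rng(0).normal(size=(1000, 4))

mask = mcar_mask(X_demo.shape, p_miss=0.3, rng=1)
X_missing = X_demo.copy()
X_missing[mask] = np.nan  # missing entries are encoded as NaN

print("fraction missing:", mask.mean())
```

Since the mask is drawn without looking at `X_demo`, the missingness is independent of both observed and unobserved values.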
Observations are said to be missing at random (MAR) if
To generate such missingness, one needs again to generate a mask
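One possible way to build a MAR mask is sketched below: a subset of variables is kept fully observed, and the missing probabilities of the remaining variables follow a logistic model whose inputs are the fully observed variables. This is a simplified illustration and not the referenced code; the names `mar_mask` and `fit_intercept`, the random weights, and the bisection used to reach the target missing rate are all assumptions.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_intercept(score, p_miss, n_iter=60):
    """Bisection on the intercept so that the mean missing probability is p_miss."""
    lo, hi = -50.0, 50.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sigmoid(score + mid).mean() > p_miss:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mar_mask(X, p_miss, n_obs_vars, rng=None):
    """MAR mask (True = missing); n_obs_vars columns stay fully observed."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    obs_idx = rng.choice(d, size=n_obs_vars, replace=False)
    miss_idx = np.setdiff1d(np.arange(d), obs_idx)
    # Standardize the fully observed variables used as logistic inputs
    inputs = (X[:, obs_idx] - X[:, obs_idx].mean(axis=0)) / X[:, obs_idx].std(axis=0)
    mask = np.zeros((n, d), dtype=bool)
    for j in miss_idx:
        score = inputs @ rng.normal(size=n_obs_vars)  # random logistic weights
        probs = sigmoid(score + fit_intercept(score, p_miss))
        mask[:, j] = rng.random(n) < probs
    return mask, obs_idx


X_demo = np.random.default_rng(0).normal(size=(2000, 5))
mask, obs_idx = mar_mask(X_demo, p_miss=0.3, n_obs_vars=2, rng=1)
miss_idx = np.setdiff1d(np.arange(X_demo.shape[1]), obs_idx)

print("fully observed columns:", obs_idx)
print("missing rate in the other columns:", mask[:, miss_idx].mean())
```

The missingness depends only on columns that are never masked, which is what makes the mechanism MAR rather than MNAR.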
If missingness is neither MCAR nor MAR, it is said to be missing not at random (MNAR). There are several ways to generate such data. The first is a self-masked model, which applies the logistic model to every variable, so that every variable can potentially have missing values.
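A self-masked mechanism can be sketched as follows: the probability that an entry is missing is a logistic function of that entry's own (standardized) value. This is a simplified illustration, not the referenced code; the name `self_masked_mnar_mask`, the fixed `slope` parameter and the bisection on the intercept are assumptions.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_intercept(score, p_miss, n_iter=60):
    """Bisection on the intercept so that the mean missing probability is p_miss."""
    lo, hi = -50.0, 50.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sigmoid(score + mid).mean() > p_miss:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def self_masked_mnar_mask(X, p_miss, slope=2.0, rng=None):
    """MNAR mask (True = missing): each value drives its own missingness."""
    rng = np.random.default_rng(rng)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    mask = np.zeros(X.shape, dtype=bool)
    for j in range(X.shape[1]):
        score = slope * Z[:, j]  # larger values are more likely to be missing
        probs = sigmoid(score + fit_intercept(score, p_miss))
        mask[:, j] = rng.random(X.shape[0]) < probs
    return mask


X_demo = np.random.default_rng(0).normal(size=(2000, 3))
mask = self_masked_mnar_mask(X_demo, p_miss=0.3, rng=1)

print("fraction missing:", mask.mean())
print("mean of masked values (column 0):", X_demo[mask[:, 0], 0].mean())
print("mean of observed values (column 0):", X_demo[~mask[:, 0], 0].mean())
```

With a positive slope, the masked values are on average larger than the observed ones, so the observed part of each column gives a biased picture of its distribution.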
The second one is to select a certain proportion of variables which will be used as inputs for the logistic model and the remaining variables will have missing probabilities according to the logistic model. Then a MCAR mechanism is applied to the input variables. After this transformation one indeed has a dependence of the missingness of the two groups of variables and hence the mask
The code used to generate missing data, and its description, were taken from https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values