Robust estimation of precision matrices under cellwise contamination

Garth Tarr, Samuel Müller, Neville C. Weber
DOI: https://doi.org/10.1016/j.csda.2015.02.005
2015-01-09
Abstract: There is a great need for robust techniques in data mining and machine learning contexts where many standard techniques such as principal component analysis and linear discriminant analysis are inherently susceptible to outliers. Furthermore, standard robust procedures assume that less than half the observation rows of a data matrix are contaminated, which may not be a realistic assumption when the number of observed features is large. This work looks at the problem of estimating covariance and precision matrices under cellwise contamination. We consider using a robust pairwise covariance matrix as an input to various regularisation routines, such as the graphical lasso, QUIC and CLIME. To ensure the input covariance matrix is positive semidefinite, we use a method that transforms a symmetric matrix of pairwise covariances to the nearest covariance matrix. The result is a potentially sparse precision matrix that is resilient to moderate levels of cellwise contamination. Since this procedure is not based on subsampling it scales well as the number of variables increases.
Methodology
What problem does this paper attempt to address?
Standard techniques in data mining and machine learning, such as principal component analysis (PCA) and linear discriminant analysis (LDA), are sensitive to outliers, and the problem is more acute in high-dimensional data. Standard robust estimators also assume that fewer than half of the observation rows in the data matrix are contaminated. That assumption may be unrealistic when the number of features is large, because under cellwise contamination even a small proportion of contaminated cells can leave most rows with at least one contaminated entry. The paper therefore addresses how to estimate covariance and precision matrices robustly under cellwise contamination.

The proposed approach uses a robust pairwise covariance matrix as the input to regularisation procedures such as the graphical lasso, QUIC and CLIME. Because a matrix assembled from pairwise estimates need not be positive semidefinite, it is first transformed to the nearest covariance matrix. The result is a potentially sparse precision matrix that remains resilient to moderate levels of cellwise contamination. Since the procedure does not rely on subsampling, it scales well as the number of variables increases. The paper also reports a detailed simulation study that evaluates several precision matrix estimators across different scenarios and contamination levels using a range of performance measures. The study indicates that the pairwise approach can handle higher levels of cellwise contamination than existing robust estimators, which represents progress in dealing with this type of contamination.
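The pipeline described above (pairwise robust covariances, then a projection to a valid covariance matrix, then a precision matrix) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the function names are hypothetical, the scale estimator is the MAD combined through the Gnanadesikan-Kettenring identity, the "nearest covariance matrix" step is approximated by eigenvalue clipping with a small floor `eps`, and a plain matrix inverse stands in for the regularisation routines (graphical lasso, QUIC, CLIME) so that the example depends only on NumPy.

```python
import numpy as np


def mad_scale(x):
    """Median absolute deviation, scaled to be consistent at the normal."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))


def pairwise_robust_cov(X):
    """Symmetric matrix of pairwise robust covariances.

    Uses the Gnanadesikan-Kettenring identity
        cov(U, V) = (scale(U + V)^2 - scale(U - V)^2) / 4
    on standardised columns, with the MAD as the (assumed) robust scale.
    The result is symmetric but not guaranteed positive semidefinite.
    """
    n, p = X.shape
    scales = np.array([mad_scale(X[:, j]) for j in range(p)])
    S = np.diag(scales ** 2)
    for j in range(p):
        for k in range(j + 1, p):
            u = X[:, j] / scales[j]
            v = X[:, k] / scales[k]
            r = (mad_scale(u + v) ** 2 - mad_scale(u - v) ** 2) / 4.0
            S[j, k] = S[k, j] = r * scales[j] * scales[k]
    return S


def nearest_psd(S, eps=1e-6):
    """Clip eigenvalues below eps (assumed stand-in for the projection step).

    With eps = 0 this is the Frobenius-norm projection of a symmetric
    matrix onto the positive semidefinite cone; eps > 0 keeps the result
    invertible.
    """
    S = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, eps)) @ V.T


# Illustrative data: 200 rows, 5 variables, about 5% of cells contaminated.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
mask = rng.random(X.shape) < 0.05
X[mask] += 20.0

S = pairwise_robust_cov(X)       # pairwise estimates, possibly indefinite
Sigma = nearest_psd(S)           # valid covariance matrix
Theta = np.linalg.inv(Sigma)     # unregularised precision matrix
```

In practice `Sigma` would be passed to a sparse precision estimator instead of being inverted directly. Because each entry is estimated from only two columns at a time, a contaminated cell affects a single row and column of the matrix instead of invalidating an entire observation.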