Multivariate Statistics: Classical Foundations and Modern Machine Learning

English | 2025 | ISBN: 978-1032758794 | 466 Pages | PDF, EPUB | 67 MB

This book explores multivariate statistics from both traditional and modern perspectives. The first section covers core topics like multivariate normality, MANOVA, discrimination, PCA, and canonical correlation analysis. The second section includes modern concepts such as gradient boosting, random forests, variable importance, and causal inference.

A key theme is leveraging classical multivariate statistics to explain advanced topics and prepare for contemporary methods. For example, linear models provide a foundation for understanding regularization with AIC and BIC, leading to a deeper analysis of regularization through generalization error and the VC theorem. Discriminant analysis introduces the weighted Bayes rule, which leads into modern classification techniques for class-imbalanced machine learning problems. Steepest descent serves as a precursor to matching pursuit and gradient boosting. Axis-aligned trees like CART, a classical tool, set the stage for more recent methods like super greedy trees.

Another central theme is training error. Introductory courses often caution that reducing training error too aggressively can lead to overfitting. At the same time, training error, also referred to as empirical risk, is a foundational concept in statistical learning theory. In regression, training error corresponds to the residual sum of squares, and minimizing it results in the least squares solution, which can lead to overfitting. Regardless of this concern, empirical risk plays a pivotal role in evaluating the potential for effective learning. The principle of empirical risk minimization demonstrates that minimizing training error can be advantageous when paired with regularization. This idea is further examined through techniques such as penalization, matching pursuit, gradient boosting, and super greedy tree constructions.
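As a rough illustration of the training-error theme (a sketch, not taken from the book), the snippet below compares unpenalized least squares with ridge regression on simulated data. The function names, the simulated data, and the penalty value `lam=5.0` are all assumptions chosen for illustration, and ridge stands in here for penalization in general.

```python
import numpy as np

def training_error(X, y, beta):
    """Empirical risk under squared loss: residual sum of squares divided by n."""
    residuals = y - X @ beta
    return float(residuals @ residuals) / len(y)

def least_squares(X, y):
    """Minimize the residual sum of squares (unpenalized empirical risk)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def ridge(X, y, lam):
    """Minimize RSS + lam * ||beta||^2, one simple form of penalization."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (hypothetical): 30 observations, 20 predictors,
# only the first three of which affect the response.
rng = np.random.default_rng(0)
n, p = 30, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(size=n)

beta_ols = least_squares(X, y)
beta_ridge = ridge(X, y, lam=5.0)

err_ols = training_error(X, y, beta_ols)
err_ridge = training_error(X, y, beta_ridge)

print(f"training error, least squares: {err_ols:.4f}")
print(f"training error, ridge:         {err_ridge:.4f}")
```

Least squares always attains the smaller training error on the data it was fit to, while the ridge solution accepts a somewhat larger training error in exchange for smaller coefficients. Whether that trade improves prediction on new data depends on the problem and on the choice of penalty.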

Key Features:

  • Covers both classical and contemporary multivariate statistics.
  • Each chapter includes a carefully selected set of exercises that vary in degree of difficulty and are both applied and theoretical.
  • The book can also serve as a reference for researchers due to the diverse topics covered, including new material on super greedy trees, rule-based variable selection, and machine learning for causal inference.
  • Extensive treatment of trees, providing a comprehensive and unified approach to understanding them in terms of partitions and empirical risk minimization.
  • New content on random forests, including random forest quantile classifiers for class-imbalanced problems, multivariate random forests, subsampling for confidence regions, and super greedy forests. An entire chapter is dedicated to random survival forests, featuring new material on random hazard forests, which extend survival forests to time-varying covariates.