A machine learning classifier for assigning individual patients with systemic sclerosis to intrinsic molecular subsets

JM Franks, V Martyanov, G Cai, Y Wang… - Arthritis & …, 2019 - Wiley Online Library
Arthritis & Rheumatology, 2019 - Wiley Online Library
Objective
High‐throughput gene expression profiling of tissue samples from patients with systemic sclerosis (SSc) has identified 4 “intrinsic” gene expression subsets: inflammatory, fibroproliferative, normal‐like, and limited. Prior methods required agglomerative clustering of many samples. Classifying individual patients in clinical trials or for diagnostic purposes requires supervised methods that can assign single samples to molecular subsets. We undertook this study to introduce a novel machine learning classifier as a robust, accurate intrinsic subset predictor.
Methods
Three independent gene expression cohorts were curated and merged to create a data set covering 297 skin biopsy samples from 102 unique patients and controls, which was used to train a machine learning algorithm. We performed external validation using 3 independent SSc cohorts, including a gene expression data set generated by an independent laboratory on a different microarray platform. In total, 413 skin biopsy samples from 213 individuals were analyzed in the training and testing cohorts.
Results
Repeated cross‐fold validation with a multinomial elastic net identified consistent and discriminative markers and yielded an average classification accuracy of 87.1%, with high sensitivity and specificity. In external validation, the classifier achieved an average accuracy of 85.4%. Reanalyzing data from a previous study, we identified subsets of patients that represent the canonical inflammatory, fibroproliferative, and normal‐like subsets.
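To make the general technique concrete, the following is a minimal sketch of a multinomial elastic net evaluated with repeated cross‐validation and then applied to a single sample. It is not the authors' implementation: the data are synthetic, scikit‐learn is a swapped‐in library choice, and the hyperparameters (`l1_ratio`, `C`, fold and repeat counts), the feature count, and the assignment of integer labels to subset names are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The four intrinsic subsets named in the abstract. Mapping them to the
# integer labels 0..3 below is arbitrary and purely illustrative.
SUBSETS = ["inflammatory", "fibroproliferative", "normal-like", "limited"]

# Synthetic stand-in for an expression matrix (samples x genes).
# This is NOT SSc data; sizes are assumptions chosen to run quickly.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Multinomial logistic regression with an elastic net penalty
# (a mix of L1 and L2). The "saga" solver supports this penalty.
# Newer scikit-learn releases may emit a deprecation warning for `penalty`.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga",
        l1_ratio=0.5, C=1.0, max_iter=5000,
    ),
)

# Repeated stratified k-fold: 5 folds x 3 repeats = 15 accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

# Fit on all data, then assign ONE new sample to a subset, which is the
# single-sample use case that clustering-based methods cannot serve.
model.fit(X, y)
single_sample = X[:1]  # shape (1, n_features)
predicted_subset = SUBSETS[int(model.predict(single_sample)[0])]

# Features with a nonzero coefficient for at least one class act as the
# selected markers; the L1 component can shrink others exactly to zero.
coef = model[-1].coef_  # shape (n_classes, n_features)
nonzero_features = int(np.count_nonzero(np.any(coef != 0, axis=0)))

print(f"mean CV accuracy: {mean_accuracy:.3f}")
print(f"predicted subset for one sample: {predicted_subset}")
print(f"features with nonzero weight: {nonzero_features}")
```

With real cohorts that contain several biopsies per patient, one would likely want folds split by patient rather than by sample so that biopsies from the same individual do not appear in both training and test folds; the abstract does not state how this was handled.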
Conclusion
We developed a highly accurate classifier that assigns individual patient samples to SSc molecular subsets. The method can be used in SSc clinical trials to assign an intrinsic subset to individual samples. Our method provides a robust data‐driven approach to aid clinical decision‐making and interpretation of heterogeneous molecular information in SSc patients.