Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Datasets

Abstract

The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as to obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using Open Source programs, which we have made available to the community. These programs are the rfFC package [https://r-forge.r-project.org/R/?group_id=1725] for the R Statistical Programming Language, along with a Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for Heat Map generation.

Metadata

Item Type:	Article
Authors/Creators:	Marchese Robinson, RL; Palczewska, A; Palczewski, JA (https://orcid.org/0000-0003-0235-8746); Kidley, N
Copyright, Publisher and Additional Information:	© 2017 American Chemical Society. This document is the Accepted Manuscript version of a Published Work that appeared in final form in Journal of Chemical Information and Modeling, copyright © American Chemical Society after peer review and technical editing by the publisher. To access the final edited and published work see https://doi.org/10.1021/acs.jcim.6b00753. Uploaded in accordance with the publisher's self-archiving policy.
Keywords:	quantitative structure-activity relationships; model interpretation; Machine Learning; Heat Map; Random Forest; Partial Least Squares; Support Vector Machines; Support Vector Regression
Dates:	Published: 28 August 2017; Published (online): 17 July 2017; Accepted: 17 July 2017
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Environment (Leeds) > School of Geography (Leeds) > Centre for Spatial Analysis & Policy (Leeds); The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	19 Jul 2017 09:11
Last Modified:	17 Jul 2018 00:38
Status:	Published
Publisher:	American Chemical Society
Identification Number:	10.1021/acs.jcim.6b00753
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:119210
