Fernández del Castillo, Alberto, Garibay, Marycarmen Verduzco, Díaz-Vázquez, Diego et al. (5 more authors) (2024) Improving river water quality prediction with hybrid machine learning and temporal analysis. Ecological Informatics, 82. 102655. ISSN 1574-9541
Abstract
River systems provide multiple ecosystem services to society globally, but these are already degraded or threatened in many areas of the world due to water quality issues linked to diffuse and point-source pollutant inputs. Water quality evaluation is essential to develop remediation and management strategies. Computational tools such as machine learning based predictive models have been developed to improve monitoring network capabilities. Model performance is reduced when datasets containing redundant information are used for training, whereas selecting the most representative and variable water quality scenarios could result in higher precision. This study analyzed historical water quality behavior in the Santiago River, Mexico, to identify the most variable and representative data available to train machine learning models (Adaptive Neuro Fuzzy Inference System – ANFIS, Artificial Neural Network – ANN, and Support Vector Machine – SVM). Thirteen monitoring sites were clustered according to their water quality variability from 2009 to 2022. Subsequently, a Time Series Analysis (TSA) was used to select the most representative monitoring station from each cluster. Data from 6 of the 13 monitoring sites were retained for the Best Training Subset (BTS), which was used to train restricted models that performed with similar (ANN and SVM) or higher (ANFIS) prediction accuracy (in terms of RMSE, MAE, MSE and R²) for both training and testing. This study provides evidence that water quality data can contain redundant information that does not improve machine learning model performance and can instead lead to overtraining. Combined analytical approaches can maximize the representativeness and variability of data selected for machine learning applications, leading to improved prediction.
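The abstract compares models using RMSE, MAE, MSE and R². As a minimal sketch of how these four scores are commonly computed from paired observed and predicted values, the function below uses only the Python standard library. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study; R² is assumed here to be the coefficient of determination (1 minus the ratio of residual to total sum of squares), which may differ from the definition the authors used.

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE and R2 for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]

    # Mean squared error and mean absolute error over all pairs.
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - SS_res / SS_tot (assumed definition).
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Hypothetical index values for illustration only; not data from the study.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 69.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Computing the same scores separately on the training and testing partitions is one way to spot the overtraining the abstract mentions: a model that scores much better on training data than on testing data is likely fitting redundant detail.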
Metadata
Item Type: | Article |
---|---|
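Authors/Creators: | Fernández del Castillo, Alberto; Garibay, Marycarmen Verduzco; Díaz-Vázquez, Diego et al. (5 more authors) |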
Copyright, Publisher and Additional Information: | © 2024 The Authors. This is an open access article under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Keywords: | Water Quality Index, Highly polluted river, Time series analysis, Cluster analysis, Monitoring network, Data Science |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Environment (Leeds) > School of Geography (Leeds) > River Basin Processes & Management (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 31 May 2024 12:06 |
Last Modified: | 29 Jul 2024 14:42 |
Status: | Published |
Publisher: | Elsevier |
Identification Number: | 10.1016/j.ecoinf.2024.102655 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:212981 |