Linton, A.-G. orcid.org/0000-0001-6541-160X, Dimitrova, V.G., Downing, A. et al. (2 more authors) (2025) Weakly supervised text classification on free-text comments in patient-reported outcome measures. Frontiers in Digital Health, 7. 1345360. ISSN 2673-253X
Abstract
Background: Free-text comments in patient-reported outcome measures (PROMs) data provide insights into health-related quality of life (HRQoL). However, these comments are typically analysed using manual methods, such as content analysis, which are labour-intensive and time-consuming. Machine learning methods for analysing such comments are largely unsupervised, so their outputs require post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable method for classifying domain-specific text data, especially when limited labelled data are available. In this paper, we applied five WSTC techniques to PROMs comment data to explore the extent to which they can be used to identify HRQoL themes reported by patients with prostate and colorectal cancer.
Methods: The main HRQoL themes and associated keywords were identified from a scoping review. These themes and keywords were then used to classify comments from two national PROMs datasets: colorectal cancer (n = 5,634) and prostate cancer (n = 59,768). Classification was done using five keyword-based WSTC methods (anchored CorEx, BERTopic, Guided LDA, WeSTClass, and X-Class). To evaluate these methods, we assessed their performance both overall and by theme. Domain experts reviewed the interpretability of the methods using the keywords extracted from each method during training.
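To illustrate what "keyword-based" weak supervision means in its simplest form, the following minimal sketch assigns themes to a comment by matching seed keywords. It is not one of the five methods evaluated in the paper, which go beyond literal matching; the seed keywords, the `SEED_KEYWORDS` dictionary, and the `weak_labels` function are hypothetical and chosen only for illustration. Only the theme names come from the abstract.

```python
import re

# Hypothetical seed keywords per theme. The paper's actual keyword lists
# (derived from its scoping review) are not reproduced here.
SEED_KEYWORDS = {
    "Physical Function": {"pain", "fatigue", "bowel"},
    "Psychological and Emotional Function": {"anxious", "worried", "depressed"},
    "Social Function": {"family", "friends", "wife"},
}


def weak_labels(comment):
    """Return the sorted list of themes whose seed keywords occur in the comment."""
    # Lowercase the comment and split it into alphabetic tokens.
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    # A theme is assigned when at least one of its seed keywords is present.
    return sorted(
        theme for theme, seeds in SEED_KEYWORDS.items() if tokens & seeds
    )


print(weak_labels("I feel worried and the pain keeps me from seeing friends."))
```

Literal matching like this misses synonyms and misspellings, which is one reason WSTC methods typically expand or contextualise the seed terms rather than rely on them directly.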
Results: Based on the 12 papers identified in the scoping review, we determined six main themes and corresponding keywords to label PROMs comments using WSTC methods. These themes were: Comorbidities, Daily Life, Health Pathways and Services, Physical Function, Psychological and Emotional Function, and Social Function. The performance of the methods varied across themes and between the datasets. While the best-performing model for both datasets, CorEx, attained weighted F1 scores of 0.57 (colorectal cancer) and 0.61 (prostate cancer), the methods achieved F1 scores of up to 0.92 (Social Function) on individual themes. By evaluating the keywords extracted from the trained models, we saw that the methods that can utilise expert-driven seed terms and extrapolate from limited data performed best.
Conclusions: Overall, evaluating these WSTC methods provided insight into their applicability for analysing PROMs comments. The classification results illustrated both the potential and the limitations of keyword-based WSTC in labelling PROMs comments when labelled data are limited.
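The weighted F1 scores quoted in the abstract average per-theme F1 scores in proportion to how often each theme occurs. The sketch below shows that calculation for the simple case of one label per comment; whether the paper's evaluation was single-label or multi-label is not stated in the abstract, so this setup, the `weighted_f1` function, and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1 scores (single-label case)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true instances per label
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy example with made-up labels, not data from the study.
y_true = ["Physical Function", "Physical Function", "Social Function", "Daily Life"]
y_pred = ["Physical Function", "Social Function", "Social Function", "Daily Life"]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow label frequency, a high F1 on a single theme can coexist with a much lower weighted F1 overall when other, more frequent themes are classified poorly.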
Metadata
Item Type: | Article |
---|---|
Copyright, Publisher and Additional Information: | © 2025 Linton, Dimitrova, Downing, Wagland and Glaser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
Keywords: | free-text, text classification, patient-reported data, short text, weakly supervised, natural language processing, PROMS, patient-generated data |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Medicine and Health (Leeds) > School of Medicine (Leeds); The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 20 May 2025 10:03 |
Last Modified: | 20 May 2025 10:03 |
Published Version: | https://www.frontiersin.org/journals/digital-healt... |
Status: | Published |
Publisher: | Frontiers Media |
Identification Number: | 10.3389/fdgth.2025.1345360 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:226783 |