Al-Gawwam, S., Zaitcev, A., Eissa, M.R. orcid.org/0000-0002-5584-5815 et al. (2 more authors) (2026) Multi-time scale feature extraction and attention networks for automatic depression level prediction. Applied Soft Computing, 186. 114052. ISSN: 1568-4946
Abstract
Depression impairs functioning across personal and professional domains, and early detection is essential for timely intervention. Existing clinical assessments rely on specialists, limiting accessibility and scalability. This paper proposes an automated, video-based approach that estimates depression severity directly from full-length interviews. Facial markers evolve over micro- to macro-timescales, so focusing solely on short clips risks missing long-range cues. The paper introduces a Multi-Timescale Feature Extraction and Channel-Temporal Attention network (MSFE-CTA) that learns dependencies across milliseconds, seconds, and minutes from complete recordings. The MSFE module employs stacks of Inception-TCN blocks with logarithmically scaled dilations to efficiently capture long-range structure, while the CTA module integrates dilated channel attention with multi-kernel depthwise temporal attention to highlight salient features. Window-level predictions are aggregated into a video-level score without requiring manual annotations at inference. Evaluated on the AVEC2013 and AVEC2014 datasets, MSFE-CTA achieves MAE/RMSE of 5.75/6.23 and 5.72/6.91, respectively, with only 0.85 M parameters and 1.85 GFLOPs. To assess generalizability across benchmarks, the framework was also evaluated on AVEC2017 and AVEC2019 using the official splits, reaching MAE/RMSE of 4.85/5.20 and 5.30/6.44, respectively. Ablation studies confirm that multi-timescale extraction and channel-temporal attention both contribute to accuracy, and that dilated operations outperform fixed-scale or squeeze-and-excitation alternatives. The results demonstrate state-of-the-art performance at substantially lower computational cost, enabling practical, full-video depression assessment at standard frame rates. The method is robust to short occlusions through median aggregation and may support scalable screening in clinical and community settings.
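The abstract states that window-level predictions are combined into a video-level score by median aggregation and that results are reported as MAE/RMSE. The following is a minimal sketch of that idea only, not the authors' implementation: the function names and the example window scores are hypothetical, and the abstract does not specify how windows are formed or weighted.

```python
import math
from statistics import mean, median


def aggregate_video_score(window_scores):
    """Collapse per-window predictions into one video-level score (median)."""
    if not window_scores:
        raise ValueError("need at least one window prediction")
    return median(window_scores)


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted video-level scores."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted video-level scores."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


# Hypothetical window-level outputs for one video. Two windows (40.0 and 0.0)
# stand in for predictions corrupted by a short occlusion.
windows = [12.0, 13.0, 11.0, 12.0, 40.0, 0.0, 12.0]
video_score = aggregate_video_score(windows)
```

In this illustrative case the median stays at the value shared by the uncorrupted windows, whereas the arithmetic mean would be pulled toward the outliers, which is one plausible reading of why median aggregation helps with brief occlusions.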
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | |
| Copyright, Publisher and Additional Information: | © 2026 The Authors. Except as otherwise noted, this author-accepted version of a journal article published in Applied Soft Computing is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
| Keywords: | Information and Computing Sciences; Human-Centred Computing; Depression; Brain Disorders; Mental Health |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > School of Electrical and Electronic Engineering |
| Date Deposited: | 30 Apr 2026 08:50 |
| Last Modified: | 30 Apr 2026 08:50 |
| Status: | Published |
| Publisher: | Elsevier BV |
| Refereed: | Yes |
| Identification Number: | 10.1016/j.asoc.2025.114052 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:240616 |
Download
Filename: Applied_Soft_Computing_journal-7.pdf
Licence: CC-BY 4.0
