Liu, X., Zhou, S., Lei, T. et al. (3 more authors) (2023) First-person video domain adaptation with multi-scene cross-site datasets and attention-based methods. IEEE Transactions on Circuits and Systems for Video Technology, 33 (12). pp. 7774-7788. ISSN 1051-8215
Abstract
Unsupervised Domain Adaptation (UDA) can transfer knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person video action recognition is an under-explored problem, with a lack of benchmark datasets and limited consideration of first-person video characteristics. Existing benchmark datasets provide videos with a single activity scene, e.g., a kitchen, and similar global video statistics. However, multiple activity scenes and different global video statistics are essential for developing robust UDA networks for real-world applications. To this end, we first introduce two first-person video domain adaptation datasets: ADL-7 and GTEA_KITCHEN-6. To the best of our knowledge, they are the first to provide multi-scene and cross-site settings for the UDA problem in first-person video action recognition, promoting diversity. They provide five more domains in addition to the original three from existing datasets, enriching data for this area. They are also compatible with existing datasets, ensuring scalability. First-person videos pose unique challenges, i.e., actions tend to occur in hand-object interaction areas. Therefore, networks paying more attention to such areas can benefit common feature learning in UDA. Attention mechanisms can endow networks with the ability to allocate resources adaptively to the important parts of the inputs and fade out the rest. Hence, we introduce channel-temporal attention modules to capture channel-wise and temporal-wise relationships and model the inter-dependencies important to this characteristic. Moreover, we propose a Channel-Temporal Attention Network (CTAN) to integrate these modules into existing architectures. CTAN outperforms baselines on the new datasets and one existing dataset, EPIC-8.
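The channel-wise and temporal-wise gating the abstract describes can be illustrated with a minimal NumPy sketch. This is not the paper's CTAN module: the weight matrices here are random stand-ins for learned parameters, and `channel_temporal_gate` is a hypothetical function name. It only shows the general pattern of squeezing a video feature tensor along one axis to compute a gate for the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_temporal_gate(feat, rng=None):
    """Illustrative channel- and temporal-wise gating for a video feature
    tensor of shape (T, C). Hypothetical sketch, not the paper's CTAN."""
    T, C = feat.shape
    rng = np.random.default_rng(0) if rng is None else rng
    # Channel attention: squeeze over time, gate each channel with a sigmoid.
    ch_desc = feat.mean(axis=0)                        # (C,) temporal pooling
    W_ch = rng.standard_normal((C, C)) * 0.1           # stand-in for learned weights
    ch_gate = 1.0 / (1.0 + np.exp(-(ch_desc @ W_ch)))  # (C,) in (0, 1)
    # Temporal attention: squeeze over channels, softmax-normalize over frames.
    t_desc = feat.mean(axis=1)                         # (T,) channel pooling
    w_t = rng.standard_normal(T) * 0.1                 # stand-in for learned weights
    t_gate = softmax(t_desc * w_t)                     # (T,), sums to 1
    # Apply both gates; scale by T so the temporal softmax keeps magnitude.
    return feat * ch_gate[None, :] * t_gate[:, None] * T

feat = np.ones((8, 16))       # 8 frames, 16 channels
out = channel_temporal_gate(feat)
print(out.shape)              # (8, 16)
```

The key design point the abstract highlights is that both gates are computed from the features themselves, so frames and channels tied to hand-object interaction regions can receive larger weights while the rest is faded out.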
Metadata
| Field | Value |
|---|---|
| Item Type: | Article |
| Authors/Creators: | Liu, X.; Zhou, S.; Lei, T.; et al. (3 more authors) |
| Copyright, Publisher and Additional Information: | © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy. |
| Keywords: | Action recognition; unsupervised domain adaptation; first-person vision; channel-temporal attention |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 06 Jun 2023 14:48 |
| Last Modified: | 04 Oct 2024 12:05 |
| Status: | Published |
| Publisher: | Institute of Electrical and Electronics Engineers |
| Refereed: | Yes |
| Identification Number: | 10.1109/TCSVT.2023.3281671 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:199718 |