Liu, X., Zhou, S., Lei, T. et al. (3 more authors) (2023) First-person video domain adaptation with multi-scene cross-site datasets and attention-based methods. IEEE Transactions on Circuits and Systems for Video Technology, 33 (12). pp. 7774-7788. ISSN 1051-8215
Abstract
Unsupervised Domain Adaptation (UDA) can transfer knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person video action recognition is an under-explored problem, with a lack of benchmark datasets and limited consideration of first-person video characteristics. Existing benchmark datasets provide videos with a single activity scene, e.g., a kitchen, and similar global video statistics. However, multiple activity scenes and different global video statistics are essential for developing robust UDA networks for real-world applications. To this end, we first introduce two first-person video domain adaptation datasets: ADL-7 and GTEA_KITCHEN-6. To the best of our knowledge, they are the first to provide multi-scene and cross-site settings for the UDA problem in first-person video action recognition, promoting diversity. They provide five more domains in addition to the original three from existing datasets, enriching data for this area. They are also compatible with existing datasets, ensuring scalability. First-person videos pose unique challenges, i.e., actions tend to occur in hand-object interaction areas. Therefore, networks paying more attention to such areas can benefit common feature learning in UDA. Attention mechanisms can endow networks with the ability to allocate resources adaptively to the important parts of the inputs and fade out the rest. Hence, we introduce channel-temporal attention modules to capture channel-wise and temporal-wise relationships and model the inter-dependencies important to this characteristic. Moreover, we propose a Channel-Temporal Attention Network (CTAN) to integrate these modules into existing architectures. CTAN outperforms baselines on the new datasets and one existing dataset, EPIC-8.
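The channel-wise and temporal-wise gating the abstract describes can be illustrated with a minimal NumPy sketch. This is not the paper's CTAN module: the weight matrices here are random stand-ins for learned parameters, and `channel_temporal_gate` is a hypothetical function name. It only shows the general pattern of squeezing a video feature tensor along one axis to compute a gate for the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_temporal_gate(feat, rng=None):
    """Illustrative channel- and temporal-wise gating for a video feature
    tensor of shape (T, C). Hypothetical sketch, not the paper's CTAN."""
    T, C = feat.shape
    rng = np.random.default_rng(0) if rng is None else rng
    # Channel attention: squeeze over time, gate each channel with a sigmoid.
    ch_desc = feat.mean(axis=0)                        # (C,) temporal pooling
    W_ch = rng.standard_normal((C, C)) * 0.1           # stand-in for learned weights
    ch_gate = 1.0 / (1.0 + np.exp(-(ch_desc @ W_ch)))  # (C,) in (0, 1)
    # Temporal attention: squeeze over channels, softmax-normalize over frames.
    t_desc = feat.mean(axis=1)                         # (T,) channel pooling
    w_t = rng.standard_normal(T) * 0.1                 # stand-in for learned weights
    t_gate = softmax(t_desc * w_t)                     # (T,), sums to 1
    # Apply both gates; scale by T so the temporal softmax keeps magnitude.
    return feat * ch_gate[None, :] * t_gate[:, None] * T

feat = np.ones((8, 16))       # 8 frames, 16 channels
out = channel_temporal_gate(feat)
print(out.shape)              # (8, 16)
```

The key design point the abstract highlights is that both gates are computed from the features themselves, so frames and channels tied to hand-object interaction regions can receive larger weights while the rest is faded out.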
Metadata
| Field | Value |
|---|---|
| Item Type: | Article |
| Authors/Creators: | Liu, X.; Zhou, S.; Lei, T.; et al. (3 more authors) |
| Copyright, Publisher and Additional Information: | © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy. |
| Keywords: | Action recognition; unsupervised domain adaptation; first-person vision; channel-temporal attention |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 06 Jun 2023 14:48 |
| Last Modified: | 04 Oct 2024 12:05 |
| Status: | Published |
| Publisher: | Institute of Electrical and Electronics Engineers |
| Refereed: | Yes |
| Identification Number: | 10.1109/TCSVT.2023.3281671 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:199718 |