Peng, B. and Chen, B. (2025) Bayesian prototypical pruning for transformers in human–robot collaboration. Mathematics, 13 (9). 1411. ISSN 2227-7390
Abstract
Action representations are essential for developing mutual cognition toward efficient human–AI collaboration, particularly in human–robot collaborative (HRC) workspaces. Understanding human intentions with video Transformers has therefore become an emerging research direction in robotics. Despite their remarkable success in capturing long-range dependencies, Transformers are overparameterized, and local redundancy across video frames adds to their inference latency. Recently, token pruning has emerged as a computationally efficient solution that selectively removes input tokens with minimal impact on task performance. However, existing sparse coding methods typically rely on an exhaustive threshold search, leading to intensive hyperparameter tuning. In this paper, Bayesian Prototypical Pruning (ProtoPrune), a novel end-to-end Bayesian framework, is proposed for token pruning in video understanding. To improve robustness, ProtoPrune leverages prototypical contrastive learning for fine-grained action representations, bringing sub-action-level supervision to the video token pruning task. With variational dropout, our method bypasses the exhaustive threshold search altogether. Experiments show that the proposed method achieves a pruning rate of 37.2% while retaining 92.9% of task performance with the Uniformer and ActionCLIP backbones, significantly improving computational efficiency. A convergence analysis ensures the stability of our method. The proposed efficient video understanding method offers a theoretically grounded and hardware-friendly solution for deploying video Transformers in real-world HRC environments.
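To make the variational-dropout pruning idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each token position receives a multiplicative Gaussian gate in the style of Molchanov et al.'s variational dropout, training adds the gate's KL term to the task loss, and at inference tokens whose learned dropout rate exceeds a fixed cutoff are removed, so no sparsity threshold has to be searched. The class name, parameterization, and cutoff value are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class VariationalTokenGate(nn.Module):
    """Hypothetical per-token gate using variational dropout.

    Each token position learns a dropout-noise variance alpha; positions
    whose log-alpha exceeds a fixed cutoff are pruned at inference.
    """

    def __init__(self, num_tokens: int, log_alpha_threshold: float = 3.0):
        super().__init__()
        # log of the per-token dropout-noise variance alpha
        self.log_alpha = nn.Parameter(torch.full((num_tokens,), -3.0))
        self.log_alpha_threshold = log_alpha_threshold

    def kl(self) -> torch.Tensor:
        # Approximate KL(q || p) under the log-uniform prior
        # (constants from Molchanov et al., 2017).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = (k1 * torch.sigmoid(k2 + k3 * la)
                  - 0.5 * torch.log1p(torch.exp(-la)) - k1)
        return -neg_kl.sum()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        if self.training:
            # Reparameterized multiplicative noise: w = 1 + sqrt(alpha) * eps,
            # shared across the feature dimension of each token.
            std = torch.exp(0.5 * self.log_alpha)        # (num_tokens,)
            eps = torch.randn_like(tokens[..., :1])      # (batch, num_tokens, 1)
            return tokens * (1.0 + std.view(1, -1, 1) * eps)
        # Inference: physically remove tokens whose dropout rate is too high,
        # using a fixed cutoff rather than a searched threshold.
        keep_idx = (self.log_alpha < self.log_alpha_threshold).nonzero(as_tuple=True)[0]
        return tokens.index_select(1, keep_idx)
```

In such a sketch the training objective would be the task loss plus a weighted KL term, e.g. `loss = task_loss + beta * gate.kl()`; in ProtoPrune the task supervision additionally comes from prototypical contrastive learning at the sub-action level, as described in the abstract.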
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Peng, B.; Chen, B. |
Copyright, Publisher and Additional Information: | © 2025 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
Keywords: | spatial–temporal modeling; sparse coding; human–robot collaboration; action recognition; inference optimization |
Dates: | Published: 2025 |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > School of Electrical and Electronic Engineering |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 28 Apr 2025 10:05 |
Last Modified: | 28 Apr 2025 10:05 |
Status: | Published |
Publisher: | MDPI |
Refereed: | Yes |
Identification Number: | 10.3390/math13091411 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:225815 |