Fang, H., Lee, J.-U., Moosavi, N.S. orcid.org/0000-0002-8332-307X et al. (1 more author) (2023) Transformers with learnable activation functions. In: Vlachos, A. and Augenstein, I., (eds.) Findings of the Association for Computational Linguistics: EACL 2023. 17th Conference of the European Chapter of the Association for Computational Linguistics, 02-06 May 2023, Dubrovnik, Croatia. Association for Computational Linguistics , pp. 2382-2398. ISBN 9781959429470
Abstract
Activation functions can have a significant impact on reducing the topological complexity of input data and therefore on improving a model's performance. However, the choice of activation function is seldom discussed or explored in Transformer-based language models. In common practice, an activation function such as the Gaussian Error Linear Unit (GELU) is chosen beforehand and then remains fixed from pre-training to fine-tuning. In this paper, we investigate the impact of activation functions on Transformer-based models by utilizing rational activation functions (RAFs). In contrast to fixed activation functions (FAFs), RAFs are capable of learning the optimal activation functions from data. Our experiments show that the RAF-based Transformer model (RAFT) achieves better performance than its FAF-based counterpart. For instance, we find that RAFT outperforms the FAF-based baseline on the GLUE benchmark by 5.71 points when using only 100 training examples, and by 2.05 points on SQuAD with all available data. Analyzing the shapes of the learned RAFs further unveils that they vary across layers and tasks, opening a promising way to better analyze and understand large, pre-trained language models.
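The core idea the abstract describes — replacing a fixed activation like GELU with a rational function whose coefficients are trained along with the rest of the model — can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization: the coefficient values are hypothetical, and the denominator uses a common "safe" Padé-style form (1 plus an absolute value) so it can never reach zero.

```python
def rational_activation(x, p, q):
    """Evaluate a rational activation R(x) = P(x) / Q(x).

    P(x) = p[0] + p[1]*x + p[2]*x^2 + ...        (numerator polynomial)
    Q(x) = 1 + |q[0]*x + q[1]*x^2 + ...|         (safe denominator, never zero)

    In a RAF-based model the coefficients p and q would be learnable
    parameters, updated by gradient descent alongside the network weights
    instead of staying fixed like GELU.
    """
    num = sum(c * x**i for i, c in enumerate(p))
    den = 1.0 + abs(sum(c * x**(i + 1) for i, c in enumerate(q)))
    return num / den

# Hypothetical initial coefficients (for illustration only).
p = [0.0, 0.5, 0.3]
q = [0.1]
ys = [rational_activation(x, p, q) for x in (-1.0, 0.0, 1.0)]
```

Because each Transformer layer can hold its own coefficient vectors, the learned activation shape is free to differ per layer and per task, which is exactly the kind of variation the abstract reports observing.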
Metadata
| Item Type: | Proceedings Paper |
| --- | --- |
| Authors/Creators: | Fang, H.; Lee, J.-U.; Moosavi, N.S.; et al. (1 more author) |
| Editors: | Vlachos, A.; Augenstein, I. |
| Copyright, Publisher and Additional Information: | © 2023 Association for Computational Linguistics (ACL). This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). |
| Dates: | Published: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 12 May 2023 10:17 |
| Last Modified: | 07 Jun 2023 14:44 |
| Published Version: | https://aclanthology.org/2023.findings-eacl.181 |
| Status: | Published |
| Publisher: | Association for Computational Linguistics |
| Refereed: | Yes |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:199095 |
Available Versions of this Item
- Transformers with learnable activation functions. (deposited 07 Jun 2023 14:42)
  - Transformers with learnable activation functions. (deposited 12 May 2023 10:17) [Currently Displayed]