Salle, A. and Villavicencio, A. orcid.org/0000-0002-3731-9168 (2022) Understanding the effects of negative (and positive) pointwise mutual information on word vectors. Journal of Experimental and Theoretical Artificial Intelligence, 35 (8). pp. 1161-1199. ISSN 0952-813X
Abstract
Despite the recent popularity of contextual word embeddings, static word embeddings still dominate lexical semantic tasks, making their study of continued relevance. A widely adopted family of such static word embeddings is derived by explicitly factorising the Pointwise Mutual Information (PMI) weighting of the co-occurrence matrix. As unobserved co-occurrences lead PMI to negative infinity, a common workaround is to clip negative PMI at 0. However, it is unclear what information is lost by collapsing negative PMI values to 0. To answer this question, we isolate and study the effects of negative (and positive) PMI on the semantics and geometry of models adopting factorisation of different PMI matrices. Word and sentence-level evaluations show that only accounting for positive PMI in the factorisation strongly captures both semantics and syntax, whereas using only negative PMI captures little of semantics but a surprising amount of syntactic information. Results also reveal that incorporating negative PMI induces stronger rank invariance of vector norms and directions, as well as improved rare word representations.
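As an illustration of the clipping workaround described in the abstract, the following minimal sketch computes PMI and its positive-clipped variant (PPMI) with NumPy. It is not the authors' code: the function names and the toy co-occurrence counts are hypothetical, and the truncated-SVD step is only one common way to factorise such a matrix, assumed here for illustration.

```python
import numpy as np


def pmi_matrix(counts):
    """PMI(w, c) = log(P(w, c) / (P(w) * P(c))) from raw co-occurrence counts."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()            # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)   # word marginals (rows)
    p_c = p_wc.sum(axis=0, keepdims=True)   # context marginals (columns)
    # Unobserved pairs have P(w, c) = 0, so log(0) yields -inf by design.
    with np.errstate(divide="ignore"):
        return np.log(p_wc) - np.log(p_w * p_c)


def ppmi_matrix(counts):
    """Clip negative PMI (including -inf) at 0, the common workaround."""
    return np.maximum(pmi_matrix(counts), 0.0)


# Toy counts, invented for illustration (rows: words, columns: contexts).
toy_counts = np.array([[4, 0, 1],
                       [0, 3, 1],
                       [1, 1, 2]])

pmi = pmi_matrix(toy_counts)
ppmi = ppmi_matrix(toy_counts)

# Assumed factorisation: rank-k truncated SVD of the PPMI matrix.
k = 2
U, S, Vt = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * np.sqrt(S[:k])    # one k-dimensional vector per word
```

In this toy matrix the pair at row 0, column 1 is never observed, so its PMI is negative infinity, while the pair at row 0, column 2 is observed but less often than chance, so its PMI is finite and negative. Clipping maps both to 0, which is the collapse of information the abstract refers to.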
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Salle, A.; Villavicencio, A. (ORCID: 0000-0002-3731-9168) |
Copyright, Publisher and Additional Information: | © 2022 Informa UK Limited, trading as Taylor & Francis Group. This is an author-produced version of a paper accepted for publication in Journal of Experimental and Theoretical Artificial Intelligence. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | word embedding; lexical semantics; pointwise mutual information |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder: Engineering and Physical Sciences Research Council (EPSRC); Grant number: EP/T02450X/1 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Mar 2022 11:09 |
Last Modified: | 10 Jul 2024 14:09 |
Status: | Published |
Publisher: | Taylor & Francis |
Refereed: | Yes |
Identification Number: | 10.1080/0952813X.2022.2072004 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:184560 |