Improving unsupervised keyphrase extraction by modeling hierarchical multi-granularity features

Abstract

Existing unsupervised keyphrase extraction methods typically emphasize the importance of the candidate keyphrase itself, ignoring other important factors such as the influence of uninformative sentences. We hypothesize that the salient sentences of a document are particularly important because they are the most likely to contain keyphrases, especially in long documents. To our knowledge, our work is the first attempt to exploit sentence salience for unsupervised keyphrase extraction by modeling hierarchical multi-granularity features. Specifically, we propose a novel position-aware graph-based unsupervised keyphrase extraction model, which has two variants. The pipeline model first extracts salient sentences from the document and then extracts keyphrases from those sentences. Unlike the pipeline model, which models multi-granularity features in a two-stage paradigm, the joint model accounts for both sentence and phrase representations of the source document simultaneously via hierarchical graphs. Concretely, the sentence nodes are introduced as an inductive bias, injecting sentence-level information for determining the importance of candidate keyphrases. We compare our model against strong baselines on three benchmark datasets: Inspec, DUC 2001, and SemEval 2010. Experimental results show that the simple pipeline-based approach achieves promising results, indicating that the keyphrase extraction task benefits from the salient sentence extraction task. The joint model, which mitigates the potential accumulated error of the pipeline model, gives the best performance and achieves new state-of-the-art results while generalizing better to data from different domains and of different lengths. In particular, on the SemEval 2010 dataset, which consists of long documents, our joint model outperforms the strongest baseline, UKERank, by 3.48%, 3.69% and 4.84% in terms of F1@5, F1@10 and F1@15, respectively. We also conduct qualitative experiments to validate the effectiveness of our model components.
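
As a rough illustration of the two-stage pipeline idea described above (salient sentences first, keyphrases from those sentences second), the following toy sketch may help. It is not the authors' implementation: the word-overlap similarity, the position weighting, the frequency-based phrase score, and all function names are illustrative assumptions, and candidate phrases are assumed to be supplied by an upstream step.

```python
import re
from collections import Counter


def words(text):
    """Lowercased alphabetic tokens; a crude stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


def overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    sa, sb = set(words(a)), set(words(b))
    return len(sa & sb) / len(sa | sb) if sa and sb else 0.0


def salient_sentences(sentences, keep):
    """Stage 1: indices of the `keep` most central sentences, in document order."""
    scores = []
    for i, s in enumerate(sentences):
        # Degree centrality in a sentence-similarity graph.
        centrality = sum(overlap(s, t) for j, t in enumerate(sentences) if j != i)
        # Assumed position weighting: earlier sentences count slightly more.
        scores.append(centrality * (1.0 + 1.0 / (i + 1)))
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return sorted(top)


def pipeline_keyphrases(doc, keep=2, top_k=3):
    """Stage 2: score only the candidates found in sentences kept by stage 1.

    `doc` is a list of (sentence, candidate_phrases) pairs; candidate
    extraction (e.g. noun-phrase chunking) is assumed to happen upstream.
    """
    sentences = [s for s, _ in doc]
    kept = salient_sentences(sentences, keep)
    freq = Counter(w for i in kept for w in words(sentences[i]))
    scored = {}
    for i in kept:
        for phrase in doc[i][1]:
            ws = words(phrase)
            if ws:
                # Hypothetical score: mean word frequency within kept sentences.
                scored[phrase] = sum(freq[w] for w in ws) / len(ws)
    return sorted(scored, key=lambda p: (-scored[p], p))[:top_k]


# Toy document: the second sentence is off-topic and should be filtered out.
doc = [
    ("Graph ranking extracts keyphrases from documents.",
     ["graph ranking", "keyphrases", "documents"]),
    ("The weather was nice.", ["weather"]),
    ("Keyphrases from long documents need graph ranking.",
     ["keyphrases", "long documents", "graph ranking"]),
]
print(pipeline_keyphrases(doc))
```

Because stage 2 only sees sentences that survive stage 1, a mistake in sentence selection cannot be undone later; this is the accumulated-error risk that the joint model is said to mitigate.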

Metadata

Item Type:	Article
Authors/Creators:	Zhang, Z. (https://orcid.org/0000-0002-8860-0881); Liang, X.; Zuo, Y.; Lin, C. (https://orcid.org/0000-0003-3454-2468)
Copyright, Publisher and Additional Information:	© 2023 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Keywords:	Unsupervised keyphrase extraction; Graph-based ranking algorithm; Hierarchical multi-granularity features
Dates:	Accepted: 13 March 2023; Published (online): 3 April 2023; Published: July 2023
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	19 Sep 2025 13:52
Last Modified:	19 Sep 2025 13:52
Status:	Published
Publisher:	Elsevier BV
Refereed:	Yes
Identification Number:	10.1016/j.ipm.2023.103356
Related URLs:	Author
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:231912
