Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets

Abstract

Objectives

Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction, that is, extracting medical information of all types from a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

Methods

We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking.
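
The filtering step described above can be illustrated with a minimal sketch. This is not the MedType implementation: the `Candidate` class, the `filter_by_type` and `link` functions, the placeholder concept IDs, and the scores are all hypothetical, and the fallback to the unfiltered list when no candidate matches is an assumption made for the example. It only shows how a predicted semantic type can prune an overgenerated candidate list before the final concept is picked.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    """A candidate concept proposed by a toolkit (hypothetical structure)."""
    concept_id: str      # placeholder ID, not a real vocabulary identifier
    name: str
    semantic_type: str
    score: float         # toolkit's own ranking score for this candidate


def filter_by_type(candidates: Iterable[Candidate],
                   predicted_types: Set[str]) -> List[Candidate]:
    """Keep only candidates whose semantic type was predicted for the mention.

    Assumption: if the filter would discard every candidate, fall back to the
    unfiltered list rather than returning nothing.
    """
    candidates = list(candidates)
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    return kept if kept else candidates


def link(candidates: Iterable[Candidate],
         predicted_types: Set[str]) -> Optional[Candidate]:
    """Stage 2 of entity linking: pick the best-scoring surviving candidate."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None
```

For a mention such as "cold" in "the patient has a cold", a toolkit might rank a temperature-related concept above the disease concept; restricting the candidates to the predicted type (here, a disease type) lets the lower-ranked but correct concept win.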

Results

Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions

Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.

Metadata

Item Type:	Article
Authors/Creators:	Vashishth, S.; Newman-Griffis, D. (https://orcid.org/0000-0002-0473-4226); Joshi, R.; Dutt, R.; Rosé, C.P.
Copyright, Publisher and Additional Information:	© 2021 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Keywords:	Natural language processing; Information extraction; Medical concept normalization; Medical entity linking; Distant supervision; Entity typing
Dates:	Accepted: 31 July 2021; Published (online): 12 August 2021; Published: September 2021
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	07 Dec 2022 13:52
Last Modified:	07 Dec 2022 13:52
Status:	Published
Publisher:	Elsevier BV
Refereed:	Yes
Identification Number:	10.1016/j.jbi.2021.103880
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:194186
