Preiss, J. orcid.org/0000-0002-2158-5832 (2023) Avoiding background knowledge: literature based discovery from important information. In: BMC Bioinformatics. 15th International Conference on Data and Text Mining in Biomedical Informatics (DTMBIO 2021), 22 Oct 2021, Online. Springer Science and Business Media LLC, p. 570.
Abstract
Background
Automatic literature based discovery attempts to uncover new knowledge by connecting existing facts: information extracted from existing publications in the form of A→B and B→C relations can be simply connected to deduce A→C. However, using this approach, the quantity of proposed connections is often too vast to be useful. It can be reduced by using subject→(predicate)→object triples as the A→B relations, but too many proposed connections remain for manual verification.
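The connection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `propose_hidden_pairs` and the toy relations are assumptions made here, and relations are simplified to plain (source, target) pairs without predicates.

```python
from collections import defaultdict


def propose_hidden_pairs(relations):
    """Return A->C candidates implied by A->B and B->C but not already stated.

    `relations` is an iterable of (source, target) pairs. The name and the
    pair-based representation are hypothetical, chosen for illustration.
    """
    known = set(relations)

    # Index every source term by the targets it is directly linked to.
    successors = defaultdict(set)
    for source, target in known:
        successors[source].add(target)

    candidates = set()
    for a, b in known:
        # Follow each A->B link one further step through B->C.
        for c in successors.get(b, ()):
            # Skip self-links and connections the literature already states.
            if c != a and (a, c) not in known:
                candidates.add((a, c))
    return candidates


# Toy relations, invented for illustration only.
relations = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
]
print(propose_hidden_pairs(relations))
```

Because every A→B link is joined with every B→C link, the number of candidates grows quickly with the number of extracted relations, which is the volume problem the abstract describes.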
Results
Based on the hypothesis that only a small number of the subject–predicate–object triples extracted from a publication represent the paper's novel contribution(s), we explore using BERT embeddings to identify these important triples, and then perform literature based discovery using only them. To build a training set, the method exploits the availability of full texts of publications in the CORD-19 dataset, making use of the fact that a novel contribution is likely to be mentioned in both the abstract and the body of a paper; the resulting tool can nonetheless be applied to papers with only abstracts available. Candidate hidden knowledge pairs generated from unfiltered triples are compared with those built from important triples only, using a variety of timeslicing gold standards.
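A timeslicing evaluation can be sketched as follows, under stated assumptions. Candidate pairs are generated from literature published before a cut-off date and scored against connections that first appear after it. The function name `timeslice_precision`, the optional `background_pairs` argument, and the toy data are hypothetical; removing background pairs from the gold standard is only one possible way to avoid rewarding background knowledge and is not claimed to be the construction used in the paper.

```python
def timeslice_precision(candidates, future_pairs, background_pairs=frozenset()):
    """Precision of candidate pairs against a timeslicing gold standard.

    candidates       -- pairs proposed from pre-cut-off literature
    future_pairs     -- pairs first observed in post-cut-off literature
    background_pairs -- pairs treated as background knowledge (an assumption
                        of this sketch); they are removed from the gold
                        standard so that proposing them earns no credit
    """
    candidates = set(candidates)
    if not candidates:
        return 0.0
    gold = set(future_pairs) - set(background_pairs)
    return len(candidates & gold) / len(candidates)


# Toy data, invented for illustration only.
candidates = {("a", "c"), ("a", "d"), ("b", "d"), ("b", "e")}
future_pairs = {("a", "c"), ("b", "d"), ("x", "y")}
background_pairs = {("b", "d")}

plain = timeslice_precision(candidates, future_pairs)
strict = timeslice_precision(candidates, future_pairs, background_pairs)
print(plain, strict)
```

In this toy case the stricter gold standard lowers the score for the same candidate set, which illustrates why the choice of gold standard affects how systems compare.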
Conclusions
The quantity of proposed knowledge pairs is reduced by a factor of 10³, and we show that when the gold standard is designed to avoid rewarding background knowledge, the precision obtained increases by up to a factor of 10. We argue that the gold standard needs to be carefully considered, and we release as-yet-undiscovered candidate knowledge pairs based on important triples alongside this work.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Preiss, J. |
Copyright, Publisher and Additional Information: | © The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
Keywords: | Literature based discovery; Machine learning; Subject–predicate–object triples; Timeslicing gold standard; Knowledge Discovery; Knowledge |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 21 Mar 2023 14:42 |
Last Modified: | 21 Mar 2023 14:42 |
Status: | Published |
Publisher: | Springer Science and Business Media LLC |
Refereed: | Yes |
Identification Number: | 10.1186/s12859-022-04892-8 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:197496 |