Zhang, Z. orcid.org/0000-0002-8587-8618, Nuzzolese, A.G. and Gentile, A.L. (2017) Entity deduplication on ScholarlyData. In: Blomqvist, E., Maynard, D., Gangemi, A., Hoekstra, R., Hitzler, P. and Hartig, O., (eds.) The Semantic Web. ESWC: European Semantic Web Conference, 28 May - 01 Jun 2017, Portorož, Slovenia. Springer International Publishing, pp. 85-100. ISBN 9783319580678
Abstract
ScholarlyData is the new, and currently the largest, reference linked dataset of the Semantic Web community about papers, people, organisations, and events related to its academic conferences. Originally started from the Semantic Web Dog Food (SWDF), it addressed multiple issues in data representation and maintenance by (i) adopting a novel data model and (ii) establishing an open source workflow to support the addition of new data from the community. Nevertheless, the major issue with the current dataset is the presence of multiple URIs for the same entities, typically for persons and organisations. In this work we: (i) perform entity deduplication on the whole dataset, using supervised classification methods; (ii) devise a protocol to choose the most representative URI for an entity and deprecate duplicated ones, while ensuring backward compatibility for them; and (iii) incorporate the automatic deduplication step in the general workflow to reduce the creation of duplicate URIs when adding new data. Our early experiment focused on the person and organisation URIs, and the results show significant improvement over state-of-the-art solutions. On the entire dataset, we managed to consolidate over 100 pairs of duplicate person URIs and over 800 pairs of duplicate organisation URIs, together with their associated triples (over 1,800 and 5,000 respectively), hence significantly improving the overall quality and connectivity of the data graph. We believe that this work, integrated into the ScholarlyData data publishing workflow, serves as a major step towards the creation of clean, high-quality scholarly linked data on the Semantic Web.
Metadata
Item Type: | Proceedings Paper |
---|---|
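Authors/Creators: | Zhang, Z.; Nuzzolese, A.G.; Gentile, A.L. |
Editors: | Blomqvist, E.; Maynard, D.; Gangemi, A.; Hoekstra, R.; Hitzler, P.; Hartig, O. |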
Copyright, Publisher and Additional Information: | © 2017 Springer International Publishing. |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 27 Nov 2019 15:45 |
Last Modified: | 27 Nov 2019 15:45 |
Status: | Published |
Publisher: | Springer International Publishing |
Refereed: | Yes |
Identification Number: | 10.1007/978-3-319-58068-5_6 |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153931 |