Yang, H. and Garibaldi, J.M. (2015) Automatic detection of protected health information from clinical narratives. Journal of Biomedical Informatics, 58. S. S30-S38. ISSN 1532-0464
Abstract
This paper presents a natural language processing (NLP) system that was designed to participate in the 2014 i2b2 de-identification challenge. The challenge task aims to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub-categories. A hybrid model was proposed which combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in PHI categories. Our proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of various PHI categories. Our system achieved promising accuracy on the challenge test data, with an overall micro-averaged F-measure of 93.6%, and was the winner of this de-identification challenge.
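For readers unfamiliar with the reported metric, the following is a minimal sketch of how a micro-averaged F-measure is conventionally computed: true positives, false positives, and false negatives are pooled across all PHI categories before precision and recall are taken. The category names and counts below are hypothetical illustrations, not figures from the paper, and the helper names `Counts` and `micro_f1` are assumptions made for this example rather than part of the authors' system or the challenge's evaluation script.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-category confusion counts (hypothetical container for this sketch)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def micro_f1(per_category: dict[str, Counts]) -> float:
    """Pool counts over all categories, then compute precision, recall and F1."""
    tp = sum(c.tp for c in per_category.values())
    fp = sum(c.fp for c in per_category.values())
    fn = sum(c.fn for c in per_category.values())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two PHI categories, for illustration only.
example = {
    "NAME": Counts(tp=90, fp=10, fn=10),
    "DATE": Counts(tp=180, fp=20, fn=20),
}
score = micro_f1(example)
```

Because the counts are pooled first, frequent categories contribute more to the score than rare ones, which is the usual distinction between micro- and macro-averaging.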
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Yang, H.; Garibaldi, J.M. |
Copyright, Publisher and Additional Information: | © 2015 Elsevier Inc. Made available under a Creative Commons license. http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Keywords: | Protected Health Information (PHI); De-identification; Hybrid model; Natural language processing; Clinical text mining |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 30 Mar 2017 13:36 |
Last Modified: | 30 Mar 2017 13:36 |
Published Version: | https://doi.org/10.1016/j.jbi.2015.06.015 |
Status: | Published |
Publisher: | Elsevier |
Refereed: | Yes |
Identification Number: | 10.1016/j.jbi.2015.06.015 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:108935 |