Principle-Based Approach for the De-Identification of Code-Mixed Electronic Health Records

Wang, CK; Wang, FD; Lee, YQ; Chen, PT; Wang, BH; Su, CH; Kuo, JCC; CHI-SHIN WU; YI-LING CHIEN; Dai, HJ; Tseng, VS; Hsu, WL

doi:10.1109/ACCESS.2022.3148396

Journal

IEEE ACCESS

Journal Volume

10

Pages

22875

Date Issued

2022

Author(s)

Wang, CK

Wang, FD

Lee, YQ

Chen, PT

Wang, BH

Su, CH

Kuo, JCC

CHI-SHIN WU

YI-LING CHIEN

Dai, HJ

Tseng, VS

Hsu, WL

DOI

10.1109/ACCESS.2022.3148396

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/627053

URL

https://api.elsevier.com/content/abstract/scopus_id/85124199713

Abstract

Code-mixing is a phenomenon in which at least two languages are combined in a hybrid manner in the context of a single conversation. The use of mixed language is widespread in multilingual and multicultural countries and poses significant challenges for the development of automated language processing tools. In Taiwan's electronic health record (EHR) systems, unstructured EHR texts are usually written in a mixture of English and Chinese, which makes the de-identification and synthesis of protected health information (PHI) more difficult. We explore this problem by applying several state-of-the-art pre-trained mono- and multilingual language models, and we propose a principle-based approach (PBA) for PHI recognition and resynthesis on a code-mixed EHR corpus annotated with 6 main categories and 25 subcategories of PHI. The PBA defines a hierarchical principle slot schema that encodes knowledge of code-mixed PHIs and uses the slots to learn from the training set, assembling principles that recognize PHI mentions and synthesize surrogates simultaneously. In addition, a semantic disambiguation process resolves ambiguous PHI categories during de-identification and dynamically extends the knowledge encoded in the PBA during knowledge augmentation. The experimental results show that the proposed method achieves higher micro- and macro-F-scores than the mono- and multilingual language models fine-tuned on our code-mixed corpus.
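The abstract reports both micro- and macro-F-scores. As a minimal sketch of how the two averages differ, the Python snippet below computes each from per-category counts. The category names and counts are illustrative assumptions only; they are not taken from the paper or its corpus, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one PHI category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; returns 0.0 when nothing was predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(tp, fp, fn)


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    if not per_category:
        return 0.0
    return sum(f1(*c) for c in per_category.values()) / len(per_category)


# Hypothetical counts, chosen only to show the effect of a rare category.
toy_counts: Dict[str, Counts] = {
    "NAME": (90, 5, 5),
    "DATE": (40, 10, 10),
    "ID": (2, 1, 3),
}

if __name__ == "__main__":
    print(f"micro-F1 = {micro_f1(toy_counts):.4f}")
    print(f"macro-F1 = {macro_f1(toy_counts):.4f}")
```

Because the micro average pools counts, frequent categories dominate it, whereas the macro average weights every category equally, so a weak rare category lowers it more. Reporting both, as the abstract does, shows performance on common and rare PHI types alike.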

Subjects

Hospitals; Training; Electronic medical records; Task analysis; Semantics; Knowledge engineering; Electrical engineering; Electronic health record; data anonymization; code-mixing; principle; named entity recognition; deep learning

SDGs

SDG3

SDG4

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Type

journal article
