CrowNER at ROCLING 2023 MultiNER-Health Task: Enhancing NER Task with GPT Paraphrase Augmentation on Sparsely Labeled Data

Authors: Wang, Yin Chieh; Kuo, Feng Yu; Chi, Te Yu; Chen, Sheh; Wu, Wen Hong; Wu, Han Chun; Yang, Te Lun; JYH-SHING JANG
Dates: 2024-03-07; 2024-03-07; 2023-01-01
ISBN: 9789869576963
URL: https://scholars.lib.ntu.edu.tw/handle/123456789/640559

Abstract: In this research, we utilized the training dataset from the ROCLING 2023 shared task on Chinese Multi-genre Named Entity Recognition in the Healthcare Domain, which comprises the Chinese HealthNER Corpus (Lee and Lu, 2021) and the ROCLING 2022 CHNER Dataset (Lee et al., 2022), along with the test set (Lee et al., 2023). The objective was to address the named entity recognition task within the Chinese healthcare domain. Our initial step involved preprocessing the training dataset. We identified instances in the training set where sentences with identical structural patterns exhibited ambiguities and errors in named entity definitions. Prioritizing data validation, we manually excluded erroneous entries. In specialized domains such as medicine, domain-specific terminologies and proprietary names are often defined within sentences as merged labels, rather than separate ones. Thus, we employed the 'Entity Relationship Construction and Merging Strategies' approach to consolidate related named entities. Subsequently, we computed the frequencies of sentence and entity occurrences. We extracted sparsely labeled data and applied two techniques for data augmentation: GPT paraphrase, and entity replacement that preserves sentence structure. These steps resulted in an augmented training set. Finally, we conducted fine-tuning experiments on various state-of-the-art BERT-based models to obtain a model suitable for the ROCLING Shared Task.

Keywords: Data augmentation | Entity Relationship Construction and Merging Strategies | GPT 3.5 | GPT paraphrase
Type: conference paper
Scopus ID: 2-s2.0-85184838112
Scopus URL: https://api.elsevier.com/content/abstract/scopus_id/85184838112
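
The abstract mentions counting entity occurrences to find sparsely labeled data and then replacing entities while preserving sentence structure. The sketch below illustrates one way such a step could look. It is not the authors' implementation: the BIO tagging scheme, the character-level tokens, the type names DISE and DRUG, the example sentences, and the helper names entity_spans, type_counts, build_lexicon and replace_entities are all assumptions made for illustration. The GPT paraphrase step is not covered.

```python
import random
from collections import Counter, defaultdict

# Assumption: a sentence is a list of (token, BIO-tag) pairs.

def entity_spans(sentence):
    """Return (start, end, type) spans read from BIO tags."""
    spans, start, etype = [], None, None
    for i, (_, tag) in enumerate(sentence):
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(sentence), etype))
    return spans

def type_counts(corpus):
    """Count how often each entity type occurs in the corpus."""
    return Counter(t for s in corpus for _, _, t in entity_spans(s))

def build_lexicon(corpus):
    """Map each entity type to the sorted list of mentions seen for it."""
    lexicon = defaultdict(set)
    for s in corpus:
        for start, end, t in entity_spans(s):
            lexicon[t].add(tuple(tok for tok, _ in s[start:end]))
    return {t: sorted(mentions) for t, mentions in lexicon.items()}

def replace_entities(sentence, lexicon, rng):
    """Swap each entity for another mention of the same type.

    Non-entity tokens are left untouched, so the sentence structure is kept.
    """
    out = list(sentence)
    # Work right to left so earlier span indices stay valid.
    for start, end, t in reversed(entity_spans(sentence)):
        current = tuple(tok for tok, _ in sentence[start:end])
        candidates = [m for m in lexicon.get(t, []) if m != current]
        if not candidates:
            continue  # nothing to swap in for this type
        new = rng.choice(candidates)
        out[start:end] = [
            (tok, ("B-" if k == 0 else "I-") + t) for k, tok in enumerate(new)
        ]
    return out

# Hypothetical toy corpus (not taken from the shared-task data).
S1 = [("他", "O"), ("有", "O"), ("糖", "B-DISE"), ("尿", "I-DISE"), ("病", "I-DISE")]
S2 = [("她", "O"), ("有", "O"), ("高", "B-DISE"), ("血", "I-DISE"), ("壓", "I-DISE")]
S3 = [("服", "O"), ("用", "O"), ("阿", "B-DRUG"), ("斯", "I-DRUG"),
      ("匹", "I-DRUG"), ("靈", "I-DRUG")]
CORPUS = [S1, S2, S3]

if __name__ == "__main__":
    counts = type_counts(CORPUS)
    sparse = [t for t, n in counts.items() if n <= 1]  # threshold is arbitrary
    print("sparse types:", sparse)
    print(replace_entities(S1, build_lexicon(CORPUS), random.Random(0)))
```

In this toy setup the replacement pool is drawn from the training corpus itself. A real pipeline would presumably need a richer source of same-type mentions for the sparse types, since a type seen only once has no alternative mention to swap in.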