CrowNER at ROCLING 2023 MultiNER-Health Task: Enhancing NER Task with GPT Paraphrase Augmentation on Sparsely Labeled Data
Journal
ROCLING 2023 - Proceedings of the 35th Conference on Computational Linguistics and Speech Processing
ISBN
9789869576963
Date Issued
2023-01-01
Author(s)
Abstract
In this research, we used the training data from the ROCLING 2023 shared task on Chinese Multi-genre Named Entity Recognition in the Healthcare Domain, which comprises the Chinese HealthNER Corpus (Lee and Lu, 2021) and the ROCLING 2022 CHNER Dataset (Lee et al., 2022), along with the test set (Lee et al., 2023). The objective was to address named entity recognition in the Chinese healthcare domain. We first preprocessed the training set. We found sentences with identical structural patterns whose named entity annotations were ambiguous or erroneous; prioritizing data validation, we manually excluded the erroneous entries. In specialized domains such as medicine, domain-specific terminology and proprietary names are often annotated within a sentence as a single merged label rather than as separate labels. We therefore applied the 'Entity Relationship Construction and Merging Strategies' approach to consolidate related named entities. Next, we computed the frequencies of sentence and entity occurrences, extracted the sparsely labeled data, and applied two data augmentation techniques: GPT paraphrasing, and entity replacement that preserves sentence structure. These steps produced an augmented training set. Finally, we fine-tuned several state-of-the-art BERT-based models to obtain a model suitable for the ROCLING Shared Task.
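The frequency counting, sparse-data extraction, and structure-preserving entity replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the segment-based data format, the function names, the `threshold` parameter, and the labels `DISE` and `BODY` are all assumptions made for illustration, and the GPT paraphrase step is omitted because it depends on an external service.

```python
from collections import Counter
import random

# Assumed toy format (not from the paper): a sentence is a list of
# (text, label) segments; label is None for text outside any entity.

def entity_type_counts(sentences):
    """Count how often each entity label occurs across all sentences."""
    return Counter(
        label for sent in sentences for _, label in sent if label is not None
    )

def sparse_types(counts, threshold):
    """Labels seen fewer than `threshold` times (hypothetical sparsity rule)."""
    return {label for label, n in counts.items() if n < threshold}

def build_lexicon(sentences):
    """Map each label to the set of entity strings observed with it."""
    lexicon = {}
    for sent in sentences:
        for text, label in sent:
            if label is not None:
                lexicon.setdefault(label, set()).add(text)
    return lexicon

def replace_entities(sentence, lexicon, rng):
    """Swap each entity for another entity of the same label.

    Non-entity segments and the label sequence are left untouched,
    so the sentence structure is preserved.
    """
    out = []
    for text, label in sentence:
        if label is None:
            out.append((text, label))
            continue
        candidates = sorted(lexicon[label] - {text})
        # Keep the original entity when no alternative exists.
        out.append((rng.choice(candidates) if candidates else text, label))
    return out

def augment(sentences, threshold, seed=0):
    """Append one entity-replaced copy of every sentence that contains
    at least one sparsely represented entity label."""
    rng = random.Random(seed)
    rare = sparse_types(entity_type_counts(sentences), threshold)
    lexicon = build_lexicon(sentences)
    extra = [
        replace_entities(sent, lexicon, rng)
        for sent in sentences
        if any(label in rare for _, label in sent)
    ]
    return sentences + extra
```

In this sketch, sparsity is judged per entity label against a fixed count; the paper may define sparsely labeled data differently, for example by sentence pattern frequency.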
Subjects
Data augmentation | Entity Relationship Construction and Merging Strategies | GPT 3.5 | GPT paraphrase
Type
conference paper