Defense Date

2024

Document Type

Thesis

Degree Name

Master of Science

Department

Computer Science

First Advisor

Dr. Bridget McInnes

Abstract

With the ever-growing volume of textual data, Named Entity Recognition (NER) is a vital task in Natural Language Processing (NLP), the field concerned with enabling computers to understand and manipulate human language. NER enables the extraction of information from unstructured text, and accurate extraction is crucial for applications ranging from information retrieval to question-answering systems. To keep NER models robust to shifts in data distribution and capable of recognizing new entity types, one may expand the capabilities of an existing model. Continual learning, a paradigm within machine learning, studies how to learn new information incrementally without forgetting previously learned knowledge. A central concern in continual learning is catastrophic forgetting, where training a neural network on new information significantly degrades its performance on previously learned information. Reannotating existing data for new entity types and then retraining a model is costly and time-consuming, motivating better strategies. Generating synthetic data to combat forgetting has been studied in continual learning for vision models and, to a limited extent, for NER models using long short-term memory (LSTM) generators or inverted models. One way to achieve this is to use generative large language models to create synthetic data. This work builds the foundation for a generative-replay approach: we evaluate the efficacy of using OpenAI's GPT-4 model to generate synthetic data that supplements the training of NER systems. We aim to answer the following questions: Can synthetic data be generated that mimics the format of authentic NER training data? Is the synthetic data similar to the authentic data? Does adding synthetic data improve model performance? Is synthetic data alone enough to achieve performance on par with a baseline? How do different prompting strategies for generating synthetic data affect model performance? We conducted experiments on the 2018 TAC SRIE dataset with a DeBERTa-V3-based model using broadcast linear and softmax classification layers. We successfully generated synthetic data with GPT-4 under two prompting strategies and found that supplementing authentic data with synthetic data improved performance, even when only small amounts were added. This work contributes a novel finding on NER and synthetic data generation with generative large language models and lays the foundation for a novel generative-replay approach to continual NER.
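To make the pipeline the abstract describes concrete, here is a minimal sketch of its two stages: prompting GPT-4 for synthetic NER training data in a token-per-line format, and instantiating a DeBERTa-V3-based token classifier. It uses the OpenAI Python client and Hugging Face transformers; the label set, prompt wording, and entity types are illustrative assumptions, not the thesis's exact configuration or prompting strategies.

```python
# Sketch of the two stages described in the abstract. The tag set and
# prompt below are hypothetical placeholders, not the thesis's setup.

from openai import OpenAI
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-DRUG", "I-DRUG", "B-DOSE", "I-DOSE"]  # hypothetical tags


def generate_synthetic_examples(n: int) -> str:
    """Ask GPT-4 for n sentences annotated in CoNLL-style format
    (one token per line, a tab, then a tag; blank line between sentences),
    mimicking the format of authentic NER training data."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        f"Generate {n} sentences from drug-safety text, annotated for NER "
        "in CoNLL format: one token per line, a tab, then one of these "
        "tags: " + ", ".join(LABELS) + ". Separate sentences with a blank "
        "line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# A DeBERTa-V3-based token classifier; the pretrained checkpoint name is
# an assumption, and the thesis's broadcast linear layer is replaced here
# by the stock token-classification head.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=len(LABELS)
)
```

The generated text would then be parsed back into token/tag pairs and mixed with the authentic training set, which is where the synthetic-supplementation experiments the abstract reports would plug in.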

Rights

© Charles Cutler, May 2024

Is Part Of

VCU University Archives

Is Part Of

VCU Theses and Dissertations

Date of Submission

5-7-2024

Included in

Data Science Commons