An exploration of semi-supervised text classification
Original version
Lien, H. (2021) An exploration of semi-supervised text classification (Master's thesis). University of Agder, Grimstad.
Abstract
Obtaining labeled data to train natural language machine learning algorithms is often expensive and time-consuming, whereas unlabeled data is usually free and easy to obtain. Supervised learning often requires a large amount of labeled data to achieve good text classification performance. Semi-supervised learning (SSL) for text classification is an exciting area of research. SSL exploits both unlabeled and labeled data to achieve better classification performance than labeled data alone, and it is particularly useful when labeled data is limited. This thesis explores the impact of different parameters on SSL with unsupervised pre-training and supervised fine-tuning for a text classification task. Key to this work is the study of hyperparameters, including the amount of preprocessing data and model size. The experiments cover smaller and larger models, including feed-forward, recurrent, and seq2seq models. The thesis uses SSL performance as its metric: the difference in a model's text classification performance when trained with the SSL approach compared with the supervised learning approach. SSL performance is therefore an intuitive measure for investigating the benefits of SSL.
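The SSL performance metric described in the abstract can be illustrated with a minimal sketch. The function name `ssl_performance`, its parameter names, and the example scores are assumptions made for illustration; the abstract does not state which underlying classification score (for example accuracy) the thesis uses, and the numbers below are not results from the thesis.

```python
def ssl_performance(ssl_score: float, supervised_score: float) -> float:
    """Return the difference between two classification scores.

    ssl_score: score of a model pre-trained on unlabeled data and then
        fine-tuned on labeled data (assumed to be, e.g., accuracy).
    supervised_score: score of the model trained on labeled data only.
    A positive result means the SSL approach scored higher.
    """
    return ssl_score - supervised_score


# Hypothetical scores, for illustration only (not taken from the thesis).
gain = ssl_performance(ssl_score=0.5, supervised_score=0.25)
print(gain)
```

Under this reading, a positive value indicates that pre-training on unlabeled data helped, a value near zero indicates no benefit, and a negative value indicates that the SSL approach scored lower than supervised learning alone.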
Description
Master's thesis in Information and Communication Technology (IKT590)