Purpose To develop and compare models that classify the topic of a research paper from its abstract text.
Methods Abstracts from 120,000 arXiv papers were collected, and classification models were developed using ensemble algorithms and BERT. For the ensemble algorithms, features were extracted with TF-IDF, LDA, and Doc2Vec, yielding seven feature sets. In total, 22 models were built from combinations of feature sets and algorithms, and their performance was compared.
Results The BERT model achieved the highest performance, with an accuracy of 0.848 and an F1-score of 0.808. Among the ensemble algorithms, LightGBM performed particularly well, and TF-IDF vectorization, which directly reflects word importance, proved to be an effective feature representation.
Conclusion A model that automatically classifies paper topics from their text lets researchers quickly access the latest information and identify research of interest. This improves access to information across research fields and may help researchers in diverse domains gain new insights.
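
The Results attribute part of the ensemble models' performance to TF-IDF directly reflecting word importance. As a minimal sketch of that idea, the following computes TF-IDF weights for a few toy abstracts using only the Python standard library. The whitespace tokenizer, the smoothed IDF formula, the function name `tfidf`, and the example sentences are all assumptions made for illustration; the abstract does not state which TF-IDF variant or preprocessing was actually used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (not taken from the paper): raw term count
    multiplied by a smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    """
    # Naive lowercase whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Toy abstracts, invented for this example.
abstracts = [
    "we train a neural network for image classification",
    "we prove a bound for a graph coloring problem",
    "we train a language model for text classification",
]
vectors = tfidf(abstracts)
```

Under this weighting, a term that appears in every abstract (such as "we") receives the minimum IDF, while a term confined to one abstract (such as "neural") is weighted more heavily, which is the sense in which the representation encodes word importance. In a full pipeline, vectors of this kind would be passed to a classifier such as LightGBM.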