A Study on Text Preprocessing and Embedding for the Text Analysis of Japan’s IT Policy
Kim Minho (김민호), Choi Sang Ok (최상옥)
DOI 10.37793/ITPR.31.1.3

This paper studies how to convert Japan’s IT strategy, written in natural language, into analytical data and build a text network from it, and aims to identify text preprocessing methods and word embedding algorithms suited to text analysis of Japan’s IT policy. To evaluate the preprocessing methods and embedding algorithms, we measured multi-class classification metrics on the IT strategy texts after preprocessing. The results show that, owing to the characteristics of Japan’s IT policy texts, the classification metrics were lower when morphological analysis, stopword removal, and word encoding were not performed. Japan’s IT strategy contains a large number of words written in Japanese Kanji. When policy texts spanning long periods were integrated, differences in how Kanji were encoded across texts caused the computer to fail to recognize identical words, which led to errors. The metrics were likewise low when morphological analysis and stopword removal were skipped. We attribute these outcomes to the characteristics of word embedding algorithms: if morphological analysis, stopword removal, and word encoding are not performed properly, the center word does not interact appropriately with its surrounding words, and the model is trained inadequately. In addition, the Skip-gram algorithm performed relatively better than the CBOW algorithm. For Japan’s IT strategy, Skip-gram appears to capture the semantic similarity between words better than CBOW, which in turn yields higher classification metrics. These findings highlight the importance of choosing a word embedding algorithm appropriate to the text type.
In summary, text analysis of IT policy documents written in Japanese should select and apply preprocessing methods and embedding algorithms that take into account the linguistic characteristics of the text and the fact that it consists of Kanji-Kana mixed sentences (Japanese Kanji combined with hiragana and katakana).
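The abstract does not give the authors' implementation, so the following is only a minimal sketch of the ideas it names. It assumes the "word encoding" step can be approximated by Unicode NFKC normalization (which unifies width and compatibility variants, though not every Kanji variant), that tokens have already been produced by a morphological analyzer, and that stopwords are a small set of particles. The stopword list, function names, and sample tokens are all hypothetical. The sketch also shows how Skip-gram and CBOW build different training examples from the same context window.

```python
import unicodedata

# Hypothetical stopword list (common particles); the paper's list is not given.
STOPWORDS = {"の", "を", "に", "は", "が"}


def normalize_token(token):
    """Unify width/compatibility variants with Unicode NFKC (an assumption)."""
    return unicodedata.normalize("NFKC", token)


def preprocess(tokens):
    """Normalize each token, then drop stopwords."""
    normalized = (normalize_token(t) for t in tokens)
    return [t for t in normalized if t not in STOPWORDS]


def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center, context) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, center))
    return examples


# Tokens assumed to come from a morphological analyzer (not shown here).
# "ＩＴ" is full-width and "ﾃﾞｰﾀ" is half-width katakana.
raw = ["ＩＴ", "の", "ﾃﾞｰﾀ", "を", "活用"]
clean = preprocess(raw)

print(clean)
print(skipgram_pairs(clean))
print(cbow_examples(clean))
```

Without the preprocessing step, the only neighbors of the half-width "ﾃﾞｰﾀ" in a window of 1 would be the particles "の" and "を", and the token would not match "データ" in other documents. This is one way to picture the abstract's claim that the center word fails to interact appropriately with its surrounding words.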

Ⅰ. Introduction
Ⅱ. Literature Review
Ⅳ. Experimental Results
V. Conclusion
References
[Source: Naver Academic]