Korean Text Classification Using RandomForest and XGBoost: Focusing on Seoul Metropolitan Civil Complaint Data

Ji-eun Ha, Hyun-chul Shin, Zoon-ky Lee

Korea Big Data Society (한국빅데이터학회), December 2017

Journal of the Korea Big Data Society (한국빅데이터학회지), Vol. 2, No. 2, pp. 95-104 (10 pages)

UCI I410-ECN-0102-2019-500-001350405

Abstract

In 2014, the Seoul Metropolitan Government launched the "서울특별시 응답소" service, a civil complaint response service intended to answer citizens promptly. Each complaint received is assigned a category based on its content and routed to the department in charge; automating this step would reduce time and labor costs. This study collected 17,700 complaint cases covering the seven years from June 1, 2010 to May 31, 2017, and compared XGBoost, a model that has recently drawn attention, against the established RandomForest model to assess their suitability for Korean text classification. XGBoost achieved generally higher accuracy than RandomForest. After upsampling and downsampling were applied to the same sample, the accuracy of RandomForest became unstable, whereas XGBoost maintained stable accuracy overall.
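The abstract does not say how upsampling and downsampling were carried out. As a minimal sketch, assuming plain random resampling per class, the two operations could look like the following. The function name `resample`, its parameters, and the toy complaint snippets are hypothetical illustrations and are not taken from the paper or its data.

```python
import random
from collections import Counter, defaultdict


def resample(texts, labels, mode, seed=0):
    """Balance classes by random resampling (an assumed method, not the paper's).

    mode="up":   grow every class to the size of the largest class by
                 drawing extra items with replacement.
    mode="down": shrink every class to the size of the smallest class by
                 drawing a subset without replacement.
    """
    if mode not in ("up", "down"):
        raise ValueError("mode must be 'up' or 'down'")
    rng = random.Random(seed)

    # Group the texts by their category label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "up" else min(sizes)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if mode == "up":
            # Keep every original item, then add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        else:
            # Keep only a random subset of the original items.
            chosen = rng.sample(items, target)
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy, made-up complaint snippets and categories for illustration only.
texts = ["pothole on road", "broken streetlight", "road crack",
         "bus delay", "noisy construction", "road flooding"]
labels = ["road", "facility", "road", "transport", "environment", "road"]

up_texts, up_labels = resample(texts, labels, mode="up")
down_texts, down_labels = resample(texts, labels, mode="down")
print(Counter(up_labels))
print(Counter(down_labels))
```

In this toy input the "road" category has three items and every other category has one, so upsampling yields three items per category and downsampling yields one. In a comparison such as the one described, resampling would normally be applied to the training split only, so that the held-out accuracy is still measured on the original class distribution.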

Keywords

Civil Complaint Classification

Ⅰ. Introduction
Ⅱ. Theoretical Background
Ⅲ. Research Method
Ⅳ. Data Analysis and Results
Ⅴ. Conclusion and Implications
References

[Source: Naver Academic (네이버학술정보)]