Comparison of resampling methods for handling imbalance in binary data classification problems

Geun U Park; Inkyung Jung

Korean Statistical Society, The Korean Journal of Applied Statistics

KCI-indexed

Comparison of resampling methods for dealing with imbalanced data in binary classification problem

박근우 (Geun U Park), 정인경 (Inkyung Jung)

Korean Statistical Society, 2019.06

The Korean Journal of Applied Statistics, Vol. 32, No. 3, pp. 349-374 (26 pages)

UCI I410-ECN-0102-2019-300-001407558


Abstract

In binary classification, results can be poor when the classes are severely imbalanced, that is, when one class outnumbers the other by a large proportion. Approaches such as transforming the training data have been actively studied to address this problem. In this study, we compared resampling methods, one family of approaches for handling imbalance in binary classification, with the aim of finding a way to detect the minority class more effectively. Through simulation, we compared a total of 20 methods spanning over-sampling, under-sampling, and combinations of over- and under-sampling. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as the classifiers. In the simulation, the resampling method that achieved the highest sensitivity while keeping accuracy at 0.5 or above was random under sampling (RUS). The next most sensitive method was the over-sampling method ADASYN (adaptive synthetic sampling approach). These results indicate that RUS was suitable for finding minority class values. Applying the methods to several real data sets gave results similar to those of the simulation.
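To make the two quantities in the abstract concrete, the following is a minimal sketch of random under sampling and of the sensitivity and accuracy measures, written with the Python standard library only. It is not the authors' code: the toy data, the function names, and the 1-D nearest-neighbour classifier are all assumptions made for illustration (the study itself used logistic regression, support vector machine, and random forest classifiers).

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    n_min = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        keep.extend(rng.sample(idx, n_min))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity_and_accuracy(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); accuracy = share of correct predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return sensitivity, correct / len(pairs)


def knn_predict(X_train, y_train, x, k=5):
    """Toy 1-D k-nearest-neighbour vote (stand-in classifier, not from the paper)."""
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i][0] - x[0]))[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data: 190 majority rows (label 0) and 10 minority rows (label 1).
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0)] for _ in range(190)] + [[rng.gauss(1.5, 1.0)] for _ in range(10)]
y = [0] * 190 + [1] * 10

X_bal, y_bal = random_under_sample(X, y, seed=1)
print(Counter(y_bal))  # both classes now have 10 rows

# Train on raw vs. under-sampled data; for brevity both are scored on the full toy set.
for name, (Xt, yt) in {"raw": (X, y), "RUS": (X_bal, y_bal)}.items():
    preds = [knn_predict(Xt, yt, x) for x in X]
    sens, acc = sensitivity_and_accuracy(y, preds)
    print(f"{name}: sensitivity={sens:.2f}, accuracy={acc:.2f}")
```

Scoring on the same rows used for training is a shortcut taken here for brevity; a real comparison would evaluate on held-out data, and the keyword list suggests a library such as imbalanced-learn rather than hand-written resampling.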

Keywords

imbalanced-learn

imbalanced binary data

1. Introduction
2. Resampling methods
3. Simulation study
4. Real data analysis
5. Conclusion and discussion
References

