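[Special Feature. The Trends 21 Corpus: Sharing and Diffusion]: A Study of Correlations Between/Among Nodes and Related Words Using Trend Similarities: Statistical Methods and their Applications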

홍정하

Research Institute of Korean Studies, Korea University · 민족문화연구 (Korean Cultural Studies)

KCI listed

A Study of Correlations Between/Among Nodes and Related Words Using Trend Similarities: Statistical Methods and their Applications

홍정하 (Jung Ha Hong)

Research Institute of Korean Studies, Korea University, 2014.08

민족문화연구 (Korean Cultural Studies), Vol. 64, pp. 25-58 (34 pages)

UCI I410-ECN-0102-2015-900-000221541

Abstract

The Trends 21 Corpus (물결 21) is a morphologically annotated corpus of news articles published by four major Korean newspapers from 2000 to 2012. Several recent studies have searched the corpus for keywords that reflect trends. Although some of that work has paid partial attention to trend similarity along the way, statistical approaches for analyzing such similarity systematically are hard to find. This paper introduces statistical methods that effectively capture trend similarities between and among nodes and their related words, and discusses their applicability. To this end, we analyze, for each of the nodes ‘politics’ (정치) and ‘economy’ (경제), the 50 related words that show the strongest tendency to co-occur with the node within the same document of the Trends 21 Corpus. Statistical distances between and among nodes and related words are computed as centered Pearson distances over the diachronic changes in their t-scores or relative frequencies, and their correlations are statistically classified using the minimum spanning tree and the Reingold-Tilford tree. The paper (i) shows that a statistical approach to trend similarity provides a basis for detailed and systematic observation of the correlations between and among nodes and related words, (ii) argues that measuring trend similarity from relative frequencies is more effective than measuring it from t-scores, in that the resulting overall correlations are more reasonable, and (iii) makes the case for why related research should take an interest in exploring trend similarities.
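The distance and classification steps named in the abstract can be illustrated with a minimal sketch. It assumes the common definition of the centered Pearson distance as 1 - r, where r is the Pearson correlation of two yearly series; the abstract does not spell out the formula, so this is an assumption rather than the paper's confirmed method. The word labels and yearly values below are hypothetical toy data, not figures from the Trends 21 Corpus, and Prim's algorithm is used here only as one standard way to obtain a minimum spanning tree. The Reingold-Tilford layout step is not covered.

```python
from math import sqrt


def centered_pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y.

    Assumed definition: 0 means identical trend shape, 2 means opposite trend.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]  # centering: subtract each series' mean
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    r = sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)
    return 1.0 - r


def minimum_spanning_tree(series):
    """Prim's algorithm over the complete graph of pairwise distances.

    `series` maps a word label to its yearly values.
    Returns a list of (distance, from_label, to_label) edges.
    """
    labels = list(series)
    in_tree = [labels[0]]  # start from the first label (e.g. the node word)
    edges = []
    while len(in_tree) < len(labels):
        # Pick the cheapest edge linking the tree to a word not yet in it.
        best = min(
            (
                (centered_pearson_distance(series[u], series[v]), u, v)
                for u in in_tree
                for v in labels
                if v not in in_tree
            ),
            key=lambda edge: edge[0],
        )
        edges.append(best)
        in_tree.append(best[2])
    return edges


# Hypothetical yearly relative frequencies (toy values for illustration only).
trends = {
    "node": [1.0, 2.0, 3.0, 4.0, 5.0],
    "word_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # same shape as node, larger scale
    "word_b": [5.0, 4.0, 3.0, 2.0, 1.0],   # opposite trend
    "word_c": [1.0, 3.0, 2.0, 5.0, 4.0],   # loosely rising
}

mst = minimum_spanning_tree(trends)
for distance, source, target in mst:
    print(f"{source} - {target}: {distance:.3f}")
```

Because the series are centered and normalized, a word whose frequency rises in step with the node sits close to it in the tree even when its absolute frequency differs, which is the sense in which the distance captures trend shape rather than magnitude.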

Keywords

node (중심어)