Estimating the Number of Korean Words Based on Corpus

Kim Sung Ki (김성기), Han Geun shik (한근식)

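Korea Information Processing Society (한국정보처리학회), 1998.01

Transactions of the Korea Information Processing Society (정보처리학회논문지), Vol. 5, No. 7, pp. 1774-1782 (9 pages)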

UCI I410-ECN-0102-2009-000-007507982


Abstract

Estimating the total number of words used in a language is very difficult. Large corpora, collections of written texts, utterances, and other samples considered representative of a language, have recently been built, which makes it possible to estimate the total number of words in a language from a corpus. This paper presents a method for estimating the total number of Korean words from the words that appear in a Korean corpus, and applies it to obtain an estimate. It also estimates the total number of Korean personal names, which account for the largest share of proper nouns in Korean. Both estimates use a generalized linear model based on frequencies. Using a corpus of 10 million eojeol (space-delimited word phrases), the total number of Korean words is estimated at 1,062,392 and the number of Korean personal names at 1,493,003.
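The abstract does not spell out the model, so the following is only a rough sketch of one way a frequency-based generalized linear model can extrapolate from observed words to unseen ones, not the authors' actual procedure. It assumes a Poisson log-linear model, log E[n_r] = a + b*r, over the frequency spectrum (n_r is the number of distinct words seen exactly r times) and reads off the unseen class as exp(a). The function names, the `max_r` cutoff, and the choice of covariate are all hypothetical.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency r to n_r, the number of distinct words seen exactly r times."""
    word_counts = Counter(tokens)
    return dict(Counter(word_counts.values()))


def fit_poisson_loglinear(spectrum, max_r=10, iterations=50):
    """Fit log E[n_r] = a + b*r for r = 1..max_r (assumed model and cutoff)."""
    rs = list(range(1, max_r + 1))
    ys = [spectrum.get(r, 0) for r in rs]
    positive = [(r, math.log(y)) for r, y in zip(rs, ys) if y > 0]
    if len(positive) < 2:
        raise ValueError("need at least two non-empty frequency classes")
    # Starting values: ordinary least squares on the log counts.
    n = len(positive)
    mean_r = sum(r for r, _ in positive) / n
    mean_l = sum(l for _, l in positive) / n
    sxx = sum((r - mean_r) ** 2 for r, _ in positive)
    b = sum((r - mean_r) * (l - mean_l) for r, l in positive) / sxx
    a = mean_l - b * mean_r
    # Newton's method (Fisher scoring) on the Poisson log-likelihood.
    for _ in range(iterations):
        mus = [math.exp(a + b * r) for r in rs]
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum(r * (y - mu) for r, y, mu in zip(rs, ys, mus))
        h00 = sum(mus)
        h01 = sum(r * mu for r, mu in zip(rs, mus))
        h11 = sum(r * r * mu for r, mu in zip(rs, mus))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


def estimate_total_types(spectrum, max_r=10):
    """Observed distinct words plus the extrapolated unseen class n_0 = exp(a)."""
    a, _ = fit_poisson_loglinear(spectrum, max_r=max_r)
    return sum(spectrum.values()) + math.exp(a)
```

In use, `frequency_spectrum` would be applied to the tokenized corpus and its result passed to `estimate_total_types`. The estimate is sensitive to the assumed model form and to `max_r`, so it should be read as an illustration of the idea rather than a reproduction of the reported figures.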

Keywords

word

Korean

corpus

