Constructing Korean Patent Retrieval Datasets to Improve Deep Learning-Based Patent Retrieval Performance: An Automated Methodology

이동욱; 심우철; 박진우; 이봉건

doi:10.34122/jip.2026.21.1.151

KCI-listed journal

Constructing Korean Patent Retrieval Datasets to Improve Deep Learning-Based Patent Retrieval Performance: An Automated Methodology

이동욱 (Dong-uk Lee), 심우철 (Woo-chul Sim), 박진우 (Jin-woo Park), 이봉건 (Bong-gun Lee)

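Korea Institute of Intellectual Property (한국지식재산연구원), March 2026

The Journal of Intellectual Property (지식재산연구), Vol. 21, No. 1, pp. 151-180 (30 pages)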

* This material is freely available at the publisher's request.

Abstract

Recent deep learning-based patent retrieval research faces limits on model performance because large-scale datasets are difficult to construct and Korean-language datasets are scarce. To address these limitations, this study proposes a methodology for automatically building a large-scale patent retrieval dataset from Korean patent documents. The method uses the claim comparison tables in office action notices to automatically extract semantically related pairs of technical components between a patent application and its cited prior art. It then extracts, from both the patent application and the cited prior art documents, the sentences most similar to each technical component. To do so, it combines Korean patent XML parsing with a KorPatBERT-based CPC classification model and uses a hybrid similarity measure that integrates sentence embedding-based semantic similarity with lexical similarity. With this methodology, a large-scale, high-quality dataset approximately 19 times larger than a manually constructed expert dataset was built, and its quality was validated through large-scale experiments simulating a real-world retrieval environment. Experimental results show that retrieval models trained on the automatically constructed dataset achieved Top-70 accuracy comparable to or better than that of models trained on the expert-built dataset. The study thus presents a practical, cost-effective way to construct large-scale, high-quality Korean patent retrieval datasets, and its significance lies in securing both improved performance for Korean patent retrieval models and practical applicability.
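The hybrid similarity and the Top-k evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the lexical measure, the way the two scores are combined, or the exact definition of Top-70 accuracy, so everything below is an assumption for illustration: Jaccard overlap of whitespace tokens stands in for lexical similarity, a weighted sum with a hypothetical weight `alpha` combines the two scores, and Top-k accuracy is taken to be the share of queries whose relevant document appears among the first k results. Sentence embeddings are passed in as plain lists of floats, since the embedding model is not named here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(text_a, text_b):
    """Token-set overlap; an assumed stand-in for the paper's lexical similarity."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_similarity(vec_a, vec_b, text_a, text_b, alpha=0.5):
    """Weighted sum of semantic and lexical similarity; alpha is a hypothetical weight."""
    return alpha * cosine(vec_a, vec_b) + (1 - alpha) * jaccard(text_a, text_b)


def most_similar_sentence(component, component_vec, sentences, sentence_vecs, alpha=0.5):
    """Return (index, score) of the sentence that best matches a technical component."""
    scores = [
        hybrid_similarity(component_vec, vec, component, sent, alpha)
        for sent, vec in zip(sentences, sentence_vecs)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def top_k_accuracy(rankings, relevant, k):
    """Share of queries whose relevant document id appears in the first k ranked ids."""
    if not relevant:
        return 0.0
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel in ranked[:k])
    return hits / len(relevant)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real embeddings would come from a sentence encoder.
    component = "a heat sink attached to the battery module"
    sentences = [
        "the battery module includes a heat sink",
        "the display panel emits light",
    ]
    vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
    print(most_similar_sentence(component, [1.0, 0.0, 0.0], sentences, vectors))
```

A weighted sum is only one way to combine the two signals. It keeps the contribution of each score easy to inspect, which helps when tuning `alpha` against a small set of expert-labelled pairs.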

Keywords

Automatic Dataset Construction

CPC Classification

Patent Retrieval

Semantic Similarity

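1. Introduction
2. Research Background
3. Related Work
4. Proposed Methodology for Automatically Constructing Korean Patent Retrieval Datasets
5. Experiments
6. Evaluation and Analysis
7. Conclusion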
