A Study on Keyword Extraction From a Single Document Using Term Clustering

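This study proposes an algorithm that extracts keywords from a single document using term clustering. Term clustering was performed on single documents divided into passages, using both first-order similarity and second-order distributional similarity; the best performance was obtained when second-order distributional similarity was applied to 50-word passages. Next, to extract keywords from a single document using the term clustering results, various keyword extraction formulas were derived from combinations of simple frequency and relative frequency and then applied; the best results were obtained under the paragraph frequency and term frequency × inverse paragraph frequency conditions. These results confirm that the proposed algorithm can effectively extract keywords from a single document in terms of the two conditions that good keywords should satisfy: topicality and an even frequency distribution.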

keywords: Term Clustering, Keyword Extraction, Single Document, Second-order Similarity, Text Mining
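The difference between the two similarity measures can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the definition of second-order similarity as the cosine between two terms' first-order similarity profiles, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def split_passages(tokens, size=50):
    """Cut a token list into consecutive passages of `size` terms."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def term_vectors(passages):
    """Map each term to its vector of counts across the passages."""
    vocab = sorted({t for p in passages for t in p})
    counts = [Counter(p) for p in passages]
    return {t: [c[t] for c in counts] for t in vocab}

def cosine(u, v):
    """Cosine similarity (an assumed measure, not taken from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def first_order(vectors):
    """Compare two terms directly by their passage distributions."""
    terms = list(vectors)
    return {a: [cosine(vectors[a], vectors[b]) for b in terms] for a in terms}

def second_order(first):
    """Compare two terms by their first-order similarity profiles."""
    terms = list(first)
    return {a: [cosine(first[a], first[b]) for b in terms] for a in terms}

# Toy document with 2-term passages: "a" and "b" never share a passage,
# but both co-occur with "x".
passages = split_passages(["x", "a", "x", "b"], size=2)
first = first_order(term_vectors(passages))
second = second_order(first)
```

In this toy input the first-order similarity between "a" and "b" is zero, while their second-order similarity is positive because both terms are similar to "x". The resulting similarity matrix could then be passed to any clustering routine.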

Abstract

This study proposes a new keyword extraction algorithm that applies term clustering to a single document. A single document is divided into multiple passages, and two ways of calculating the similarity between two terms are investigated: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance is achieved with 50-term passages and second-order distributional similarity. Based on the results of the first experiment, second-order distributional similarity was then applied to various keyword extraction methods that use statistical information about terms. In the second experiment, paragraph frequency and term frequency × inverse paragraph frequency were found to improve the overall performance of keyword extraction. These results show that the algorithm fulfills the necessary conditions that good keywords should have.
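The two statistics named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's extraction formula: the inverse paragraph frequency is assumed to be log(N / PF) by analogy with inverse document frequency, and `pick_keywords`, which takes the top-scoring term of each cluster, is a hypothetical helper.

```python
import math
from collections import Counter

def term_statistics(passages):
    """Return paragraph frequency and TF x IPF for every term.

    IPF is assumed to be log(N / PF), where N is the number of passages.
    """
    n = len(passages)
    tf = Counter(t for p in passages for t in p)       # total occurrences
    pf = Counter(t for p in passages for t in set(p))  # passages containing the term
    return {
        t: {"pf": pf[t], "tf_ipf": tf[t] * math.log(n / pf[t])}
        for t in tf
    }

def pick_keywords(clusters, stats, key="pf"):
    """Hypothetical selection step: the best-scoring term of each cluster."""
    return [max(cluster, key=lambda t: stats[t][key]) for cluster in clusters]

passages = [["a", "b"], ["a", "c"], ["a", "c"], ["c", "c"]]
stats = term_statistics(passages)
by_pf = pick_keywords([["a", "b"], ["c"]], stats, key="pf")
by_tf_ipf = pick_keywords([["a", "b"], ["c"]], stats, key="tf_ipf")
```

Paragraph frequency favors terms spread evenly over the document, whereas TF × IPF favors terms concentrated in a few passages, so the two criteria can select different representatives from the same cluster, as they do for the first cluster in this toy input.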

Vol.44 No.3

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지)