
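Detection Algorithm for Duplicate and Near-Duplicate News Web Pages Based on Feature Word Groups
Cite this article: CHENG Peng-sen, AN Jun-xiu. Detection Algorithm for Duplicate and Near-Duplicate News Web Pages Based on Feature Word Groups[J]. Journal of Chengdu University of Information Technology, 2012, 27(4): 374-379.
Authors: CHENG Peng-sen  AN Jun-xiu
Affiliations: 1. School of Computer Science, Chengdu University of Information Technology, Chengdu, Sichuan 610225
2. School of Software Engineering, Chengdu University of Information Technology, Chengdu, Sichuan 610225
Funding: Supported by the Soft Science Program of the Science and Technology Department of Sichuan Province (2011ZR0058) and the Natural Science and Technology Development Fund of Chengdu University of Information Technology (CSRF201002)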
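Abstract: News pages are among the worst sources of redundant information on the Internet. Redundant pages increase the processing load on search engines and degrade the user experience, so redundant news pages need to be deduplicated. Following the natural-language grammar of news reports, the algorithm segments a report into words and, from each of 7 part-of-speech categories, takes the word with the highest frequency in that category to form the report's feature word group. A word-level inverted index is built to retrieve and compare feature word groups across pages, and a class-level inverted index is built to identify duplicate and near-duplicate pages and manage them by class. The implementation reuses existing modules of the search engine system, so no new module is introduced and the system stays simple. Experiments show that the algorithm is effective: on the tested pages, recall reached 93.5% and precision reached 88.4%. Weaknesses in the fine-grained classification of redundant pages substantially limited further gains in precision.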

Keywords: computer applications  web page deduplication  part-of-speech classification  feature word group

The Duplicate News Web Page Detection Algorithm Based on Feature Words Group
CHENG Peng-sen, AN Jun-xiu. The Duplicate News Web Page Detection Algorithm Based on Feature Words Group[J]. Journal of Chengdu University of Information Technology, 2012, 27(4): 374-379.
Authors: CHENG Peng-sen  AN Jun-xiu
Institution: 1. School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China; 2. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
Abstract: News pages are a major source of redundant content on the Internet. Redundant pages add to the processing burden of search engines and degrade the user experience, so duplicate news pages need to be eliminated. The algorithm decomposes a news report into words according to its grammar and builds a feature words group by picking the highest-frequency word from each of 7 part-of-speech categories. A word-level inverted index is used to retrieve feature words groups and compare them across web pages, and a class inverted index is used to detect and manage duplicate and near-duplicate pages. The implementation reuses existing modules of the search engine, which avoids introducing new modules and keeps the system simple. In our experiments the algorithm proved effective: recall on the tested pages reached 93.5% and precision reached 88.4%. Shortcomings in the fine-grained classification and identification of redundant pages largely limited further improvement in precision.
Keywords: computer application  elimination of duplicated web pages  part-of-speech classification  feature words group
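
The pipeline summarized in the abstract (pick the top-frequency word per part-of-speech category, then compare feature word groups through a word-level inverted index) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several things are assumptions: the input is taken to be already segmented and tagged as (word, tag) pairs; the abstract does not list the seven categories, so `POS_CATEGORIES` is a placeholder tag set; the function names and the 0.7 near-duplicate threshold in `label_pair` are hypothetical. The class-level index used to manage detected groups is not modelled.

```python
from collections import Counter, defaultdict

# Placeholder tag set: the abstract does not name the seven categories.
POS_CATEGORIES = ("n", "v", "a", "d", "t", "ns", "nr")


def feature_word_group(tagged_tokens, categories=POS_CATEGORIES):
    """Pick the most frequent word in each part-of-speech category."""
    counts = {c: Counter() for c in categories}
    for word, pos in tagged_tokens:
        if pos in counts:
            counts[pos][word] += 1
    group = set()
    for pos, counter in counts.items():
        if counter:
            # Break frequency ties alphabetically so the result is deterministic.
            word, _ = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
            group.add((pos, word))
    return frozenset(group)


def build_word_index(groups):
    """Word-level inverted index: feature -> ids of pages whose group has it."""
    index = defaultdict(set)
    for page_id, group in groups.items():
        for feature in group:
            index[feature].add(page_id)
    return index


def shared_feature_counts(page_id, groups, index):
    """Count features shared with every other page, looked up via the index."""
    shared = Counter()
    for feature in groups[page_id]:
        for other in index[feature]:
            if other != page_id:
                shared[other] += 1
    return shared


def label_pair(group_a, group_b, near_threshold=0.7):
    """Label a pair; the 0.7 cut-off is an assumed value, not from the paper."""
    if not group_a or not group_b:
        return "distinct"
    if group_a == group_b:
        return "duplicate"
    ratio = len(group_a & group_b) / max(len(group_a), len(group_b))
    return "near-duplicate" if ratio >= near_threshold else "distinct"


# Toy, hand-tagged pages (illustrative data only).
pages = {
    "a": [("flood", "n"), ("flood", "n"), ("hit", "v"), ("city", "n")],
    "b": [("flood", "n"), ("hit", "v"), ("flood", "n")],
    "c": [("match", "n"), ("win", "v")],
}
groups = {pid: feature_word_group(tokens) for pid, tokens in pages.items()}
index = build_word_index(groups)
print(shared_feature_counts("a", groups, index))
print(label_pair(groups["a"], groups["b"]))
```

Looking candidates up through the inverted index means a page is only compared with pages that share at least one feature word, rather than with every page in the collection.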
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.