






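The preprocessing framework and methods proposed in this thesis have been applied to the "Tianwang" (天網) search engine and to an automatic Web page classification system. The improvement in the quality of these application systems after preprocessing verifies the effectiveness of the preprocessing approach. It is easy to see that such a preprocessing stage makes it possible to build a well-organized, purified, easier-to-use information layer on top of any Web page collection (including the World Wide Web).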

With the rapid expansion of the Web, its content has become richer and richer. People increasingly use the Web to find the information they want because of its convenience and its abundance of information. To make better use of Web information, technologies that automatically reorganize and manipulate Web pages are being pursued, such as Web information retrieval, Web page classification, and other Web mining work. However, the Web contains a great deal of noise, such as noisy content within a Web page (local noise) and near-replica Web pages across the Web (global noise). This noise lowers the quality of the information on the Web and consequently seriously degrades the quality of Web information systems. In addition, metadata of Web pages is widely used in Web information systems but is not described explicitly in the pages. Some of these problems never arise in traditional work.
In this thesis, we propose a new preprocessing framework and a corresponding approach to meet the common requirements of several typical Web information systems. The framework has three parts: Web page cleaning, replica removal, and metadata extraction. After the preprocessing stage, redundant Web pages have been deleted, and the retained Web pages have been purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. The first six are metadata, while the last two are content data. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information.
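As an illustration only, the eight DocView elements listed above could be held in a small record type. The sketch below is not the thesis's implementation: the class layout, the field names, the Python types, and the `kind` tag used to separate content data from metadata are all assumptions made for this example.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DocView:
    """Hypothetical container for the eight DocView elements.

    Field names and types are assumptions; the abstract only names the elements.
    """
    # Metadata elements (six of the eight).
    identifier: str
    page_type: str             # the "type" element; concrete labels are assumed
    classification_code: str   # the "content classification code" element
    title: str
    keywords: List[str]
    abstract: str
    # Content-data elements (the last two), tagged so they can be told apart.
    topic_content: str = field(metadata={"kind": "content"})
    relevant_hyperlinks: List[str] = field(metadata={"kind": "content"})


def content_fields() -> List[str]:
    """Names of the elements that carry content data."""
    return [f.name for f in fields(DocView) if f.metadata.get("kind") == "content"]


def metadata_fields() -> List[str]:
    """Names of the elements that carry metadata."""
    return [f.name for f in fields(DocView) if f.metadata.get("kind") != "content"]
```

A record built this way keeps the split between metadata and content data explicit, so an indexer could, for example, weight `title` and `keywords` differently from `topic_content` without re-parsing the raw page.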
The preprocessing framework and approach have been applied to our search engine [TW] and our Web page classification system. The clear improvement in these applications shows the practicability of the framework and verifies the validity of the approach. After such a preprocessing stage, a well-formed, purified, easily manipulated information layer can be set up on top of any Web page collection (including the WWW) for Web information systems.

Keywords: World Wide Web, Data preprocessing, Data cleaning, Near-replica detection, Metadata extraction

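Contents

Chapter 1 Introduction
1.1 Research Background
1.2 Research Content of This Thesis
1.3 Contributions of This Thesis
1.4 Organization of This Thesis
Chapter 2 Related Work
2.1 Search Engines
2.2 Automatic Web Page Classification
2.3 Information Extraction
2.4 Metadata Extraction
Chapter 3 Problems Facing Web Information Systems and Their Common Requirements
Chapter 4 Preprocessing Methods and Techniques
4.1 Preprocessing Framework and Description of Results
4.1.1 Preprocessing Framework
4.1.2 Description of Preprocessing Results
4.2 Web Page Representation
4.2.1 Tag Tree Representation of Web Pages
4.2.2 Quantitative Representation of Web Pages
4.3 Web Page Cleaning
4.3.1 Web Page Type Identification
4.3.2 Cleaning Topic Pages
4.3.3 Cleaning Directory Pages
4.3.4 Cleaning Image Pages
4.3.5 Time and Space Efficiency of Web Page Cleaning
4.4 Near-Replica Web Page Detection
4.4.1 Near-Replica Detection Algorithm
4.4.2 Performance Analysis
4.5 Web Page Metadata Extraction
4.5.1 Description of the Metadata Extraction Process
4.5.2 Main Text Extraction
4.5.3 Keyword Extraction
4.5.4 Content Category Identification
4.5.5 Title Extraction
4.5.6 Abstract Extraction
4.5.7 Topic-Relevant Hyperlink Extraction
4.6 Chapter Summary
Chapter 5 Applications and Evaluation
5.1 Application and Evaluation of Web Page Cleaning in the Automatic Web Page Classification System
5.1.1 Application
5.1.2 Evaluation Criteria
5.1.3 Evaluation Results and Analysis
5.2 Application and Evaluation of Near-Replica Removal in the Search Engine
5.2.1 Experimental Design
5.2.2 Evaluation Criteria
5.2.3 Evaluation Results and Analysis
5.3 Application and Evaluation of Web Page Metadata in the Search Engine's Indexing Process
5.3.1 Retrieval Efficiency Evaluation
5.3.2 Retrieval Precision Evaluation
5.4 Chapter Summary
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References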

[ACMP] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 2001
[APE] Allison Woodruff, Paul M. Aoki, Eric Brewer, Paul Gauthier, and Lawrence A. Rowe. An Investigation of Documents from the World Wide Web. In Proceedings of the 5th International World Wide Web Conference, pages 963--979, Paris, France, May 1996.

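[Fabrizio] Fabrizio Sebastiani. A tutorial on automated text categorization.
[FSC] Feng Shicong (馮是聰). Research on automatic classification of Chinese Web pages and its application in search engines. PhD dissertation, Peking University.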
[Google] Google Inc. http://www.google.com .
[HCB] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HCBG] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HD98] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.
[HITS] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[HMC] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, May 1997.
[JW] Cowie, Jim and Lehnert, Wendy. Information Extraction. Communications of the ACM, January 1996/Vol. 39, No. 1, pp 80 – 91.
[LD] Lewis D et al. Training algorithms for linear text classifiers. In Proceedings of the Nineteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp.298-306
[LG98] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, Apr. 1998.
[LH02] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD, 2002.
[LS] L. Xiaoli and S. Zhongzhi. Innovating web page classification through reducing noise. Journal of Computer Science and Technology, 17(1), January 2002.
[Manber94] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.
[PR] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[Ralph97] Grishman, Ralph. Information Extraction: Techniques and Challenges. Lecture Notes In Artificial Intelligence, Vol. 1299, pp 10 – 27, Springer-Verlag, Berlin Heidelberg, 1997. ISBN 3-540-63438-X
[SB] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[SCAM] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[SM99] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.