大規(guī)模可擴(kuò)展索引技術(shù)的研究和系統(tǒng)實(shí)現(xiàn).doc

約69頁(yè)DOC格式手機(jī)打開(kāi)展開(kāi)

大規(guī)?？蓴U(kuò)展索引技術(shù)的研究和系統(tǒng)實(shí)現(xiàn), 碩士論文 69頁(yè)共計(jì)28835字摘要隨著互聯(lián)網(wǎng)的發(fā)展，原始的數(shù)據(jù)庫(kù)系統(tǒng)無(wú)法滿足大數(shù)據(jù)量相關(guān)性檢索的需求。從而基于倒排表的索引系統(tǒng)越來(lái)越多的應(yīng)用在各項(xiàng)服務(wù)中。但是索引系統(tǒng)和數(shù)據(jù)庫(kù)系統(tǒng)一樣，有著較為復(fù)雜的內(nèi)部邏輯和外部行為，如何創(chuàng)建我們需要的索引系統(tǒng)，如何優(yōu)化我們的索引系統(tǒng)，是困擾很...
編號(hào):150-34500大小:1.00M
分類: 論文>計(jì)算機(jī)論文

內(nèi)容介紹

此文檔由會(huì)員 bfxqt 發(fā)布

大規(guī)模可擴(kuò)展索引技術(shù)的研究和系統(tǒng)實(shí)現(xiàn) 碩士論文

69頁(yè)共計(jì)28835字
摘要
隨著互聯(lián)網(wǎng)的發(fā)展，原始的數(shù)據(jù)庫(kù)系統(tǒng)無(wú)法滿足大數(shù)據(jù)量相關(guān)性檢索的需求。從而基于倒排表的索引系統(tǒng)越來(lái)越多的應(yīng)用在各項(xiàng)服務(wù)中。但是索引系統(tǒng)和數(shù)據(jù)庫(kù)系統(tǒng)一樣，有著較為復(fù)雜的內(nèi)部邏輯和外部行為，如何創(chuàng)建我們需要的索引系統(tǒng)，如何優(yōu)化我們的索引系統(tǒng)，是困擾很多索引系統(tǒng)構(gòu)建者和使用者的難題。
本文的研究范疇是用于信息檢索的索引系統(tǒng)，通過(guò)一個(gè)真實(shí)的索引系統(tǒng)——Paradise索引系統(tǒng)，本文從三個(gè)方面進(jìn)行分析和研究：對(duì)索引系統(tǒng)進(jìn)行功能模塊上的分析；對(duì)索引系統(tǒng)開(kāi)發(fā)和使用中的性能問(wèn)題的研究和分析；對(duì)一個(gè)實(shí)際系統(tǒng)的系統(tǒng)實(shí)現(xiàn)的詳細(xì)。具體為：
1) 索引系統(tǒng)的模塊分析
本文詳細(xì)分析了作為一個(gè)復(fù)雜系統(tǒng)的索引系統(tǒng)，其創(chuàng)建和使用都受到很多條件的制約。本文分析了索引系統(tǒng)的常見(jiàn)的需求，比如如何對(duì)原始的文檔集合進(jìn)行分析，如何設(shè)計(jì)索引內(nèi)部文檔的表示能力，索引如何創(chuàng)建，如何存儲(chǔ)等，劃分了一系列基本的功能模塊。
2) 索引系統(tǒng)的性能分析
因?yàn)樗饕到y(tǒng)的目的是快速的響應(yīng)檢索需求，所以效率問(wèn)題一直是索引技術(shù)的核心問(wèn)題。在模塊功能分析的基礎(chǔ)之上，本文進(jìn)一步分析了索引創(chuàng)建和檢索中常見(jiàn)的性能問(wèn)題，提出了基本的解決方案。同時(shí)，對(duì)于如何對(duì)索引系統(tǒng)進(jìn)行整體的和局部的量化分析，引入了DQ法則，嘗試給出一個(gè)指導(dǎo)實(shí)踐的經(jīng)驗(yàn)公式。
3) Paradise索引系統(tǒng)的實(shí)現(xiàn)分析
對(duì)于問(wèn)題的分析，需要一個(gè)具體的系統(tǒng)進(jìn)行實(shí)踐。在深入研究天網(wǎng)搜索引擎已有的索引系統(tǒng)和相關(guān)索引系統(tǒng)基礎(chǔ)上，同時(shí)在大量閱讀了相關(guān)專業(yè)文獻(xiàn)之后，我們進(jìn)行了分析和研究，設(shè)計(jì)實(shí)現(xiàn)了863課題支持的Paradise項(xiàng)目的索引系統(tǒng)。本文以系統(tǒng)的基本模塊和重要接口為核心，分析了系統(tǒng)的基本框架能力以及如何進(jìn)一步對(duì)系統(tǒng)進(jìn)行擴(kuò)充。
目錄
摘要 3
Abstract 4
目錄 5
第一章緒論 6
1.1 索引系統(tǒng)背景 6
1.1.1 互聯(lián)網(wǎng)服務(wù)系統(tǒng) 6
1.1.2 Data / Query 法則 7
1.1.3 需求快速查詢的場(chǎng)景 7
1.2 和數(shù)據(jù)庫(kù)系統(tǒng)的比較 8
1.2.1 數(shù)據(jù)庫(kù)系統(tǒng)的能力 8
1.2.2 索引系統(tǒng)和數(shù)據(jù)庫(kù)系統(tǒng)的異同 8
1.3 索引系統(tǒng)的簡(jiǎn)單用例 9
1.3.1 索引系統(tǒng)基本模塊 10
1.3.2 索引系統(tǒng)基本流程 10
1.4 本文的主要工作 11
1.4.1 本文的主要研究點(diǎn) 11
1.4.2 本文的主要?jiǎng)?chuàng)新點(diǎn) 11
1.5 本文組織結(jié)構(gòu) 11
第二章索引系統(tǒng)分析 12
2.1 索引核心模塊分析 12
2.1.1 分析模塊 12
2.1.2 文檔表示模塊 12
2.1.3 存儲(chǔ)模塊 15
2.1.4 索引創(chuàng)建 15
2.1.5 存儲(chǔ)鏡像的邏輯視圖 18
2.1.6 索引檢索 19
2.2 大規(guī)模索引專有模塊 20
2.2.1 壓縮模塊 20
2.2.2 緩存模塊 26
2.3 動(dòng)態(tài)索引 28
2.3.1 版本控制 29
2.3.2 內(nèi)存索引 31
2.4 分布式索引 31
2.4.1 可用性和可靠性與DQ法則 32
2.4.2 索引切分和索引冗余 34
2.4.3 針對(duì)可用性和效率的解決方案 35
2.4 本章小結(jié) 35
第三章索引優(yōu)化 36
3.1 索引創(chuàng)建的優(yōu)化 36
3.1.1 創(chuàng)建時(shí)期內(nèi)存壓縮 36
3.1.2 動(dòng)態(tài)索引二路歸并的塊合并時(shí)機(jī) 37
3.1.3 大索引多路歸并方法 38
3.2 索引檢索效率的優(yōu)化 39
3.2.1 檢索時(shí)效率分析 39
3.2.2 使用塊狀壓縮和跳查來(lái)降低CPU使用 39
3.2.3 使用緩存機(jī)制來(lái)降低IO 43
3.3 索引常數(shù) 44
3.3.1 單機(jī)的服務(wù)能力上限 44
3.3.2 整體服務(wù)優(yōu)化 45
3.3.3 單機(jī)索引常數(shù) 46
3.4 本章小結(jié) 46
第4章 Paradise索引系統(tǒng)框架 48
4.1 系統(tǒng)目的 48
4.1.1 面向搜索引擎服務(wù) 48
4.1.2 可擴(kuò)展性，可實(shí)驗(yàn)性 48
4.2 基本模塊詳細(xì)分析 48
4.2.1 分析模塊 49
4.2.2 文檔表述模塊 50
4.2.3 存儲(chǔ)模塊 51
4.2.4 壓縮模塊 52
4.2.5 緩存模塊 53
4.2.6 索引模塊 54
4.3 索引創(chuàng)建流程用例分析 57
4.4 索引提供的檢索接口用例分析 59
第五章總結(jié)和展望 63
5.1 本文總結(jié) 63
5.2 進(jìn)一步工作 63
參考文獻(xiàn) 65
致謝 68

關(guān)鍵詞：信息檢索，索引系統(tǒng)，索引優(yōu)化，倒排表
參考文獻(xiàn)
[1] 彭波. (2004). 搜索引擎檢索系統(tǒng)的效率優(yōu)化與效果評(píng)估研究. 北京大學(xué).
[2] Zobel, J. and Moffat, A, Inverted files for text search engines, ACM Computing Surveys (CSUR), 2006, ACM New York, NY, USA
[3] Witten, I. H., Moffat, A. & Bell, T. C. (1999), Managing Gigabytes: Compressing and Indexing Documents and Images, second edn, Morgan Kaufmann, San Francisco, California.
[4] Zhu, M. and Shi, S. and Li, M. and Wen, J.R., Effective top-k computation in retrieving structured documents with term-proximity support, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007, ACM New York, NY, USA
[5] Strohman, T., DYNAMIC COLLECTIONS IN INDRI, Technical report, Center for Intelligent Information Retrieval, 2005
[6] Strohman, T. and Croft, W.B. Low Latency Index Maintenance in Indri, OSIR 2006
[7] Frakes, W.B. and Baeza-Yates, R., Information retrieval: data structures and algorithms, 1992, Prentice-Hall, Inc. Upper Saddle River, NJ, USA
[8] Brown, E.W. and Callan, J.P. and Croft, W.B. Fast incremental indexing for full-text information retrieval, Proceedings of the 20th International Conference on Very Large Data Bases, 1994
[9] Scholer, F. and Williams, H.E. and Yiannis, J. and Zobel, J., Compression of inverted indexes For fast query evaluation, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002, ACM Press New York, NY, USA
[10] Zobel, J. and Moffat, A. Adding Compression to a Full-text Retrieval System, Software - Practice and Experience, 1995
[11] Williams, H.E. and Zobel, J., Compressing Integers for Fast File Access, 1999, Br Computer Soc
[12] Scholer, F. and Williams, H.E. and Yiannis, J. and Zobel, J., Compression of inverted indexes For fast query evaluation, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002, ACM Press New York, NY, USA
[13] Zukowski, M. and Heman, S. and Nes, N. and Boncz, P., Super-Scalar RAM-CPU Cache Compression, Proceedings of the International Conference of Data Engineering (IEEE ICDE), Atlanta, GA, USA, 2006
[14] Baeza-Yates, R. and Gionis, A. and Junqueira, F. and Murdock, V. and Plachouras, V. and Silvestri, F., The impact of caching on search engines, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, ACM Press New York, NY, USA
[15] Long, X. and Suel, T., Three-Level Caching for Efficient Query Processing in Large Web Search Engines, World Wide Web, 2006, Springer
[16] Lempel, R. and Moran, S., Predictive caching and prefetching of query results in search engines, Proceedings of the 12th international conference on World Wide Web, 2003, ACM Press New York, NY, USA
[17] Saraiva, P.C. and de Moura, E.S. and Ziviani, N. and Meira, W. and Fonseca, R. and Riberio-Neto, B., Rank-preserving two-level caching for scalable search engines, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001, ACM Press New York, NY, USA
[18] Eric W. Brown, Fast evaluation of structured queries for information retrieval,SIGIR’95:Proceedings of the 18th annual international ACM SIGIRconference onResearch and development in information retrieval (NewYork, NY, USA),ACM Press, 1995, pp. 30–38.A
[19] Nicholas Lester, Justin Zobel, and Hugh E. Williams, In-place versus re-build versus re-merge: index maintenance strategies for text retrieval systems, Proceedings of the 27th conference on Australasian computer science, Australian Computer Society, Inc., 2004, pp. 15–23.A
[20] N. Lester, A. Moffat, and J. Zobel, "Fast on-line index construction by geometric partitioning",Proceedings of the ACM CIKM Conference on Information and Knowledge Management, A. Chowdhury, N. Fuhr, M. Ronthaler, H.-J. Schek, and W. Teiken (eds), Bremen, Germany, November 2005. pp. 776-783.
[21] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. Hybrid Index Maintenance for Growing Text Collections. Proceedings of the 29th ACM Conference on Research and Development on Information Retrieval (SIGIR 2006). Seattle, USA, August 2006.
[22] Zobel, J., Moffat, A. & Ramamohanarao, K. (1998), “Inverted files versus signature files for text indexing”, ACM Transactions on Database Systems 23(4), 453–490.
[[29] Carmel, D. and Cohen, D. and Fagin, R. and Farchi, E. and Herscovici, M. and Maarek, Y.S. and Soffer, A., Static index pruning for information retrieval systems, 2001, ACM Press New York, NY, USA
[30] Carmel, D. and Amitay, E. and Herscovici, M. and Maarek, Y. and Petruschka, Y. and Soffer, A., Juru at TREC 10-Experiments with Index Pruning, Proceedings of NIST TREC, 2001
[31] S Büttcher, CLA Clarke, A document-centric approach to static index pruning in text retrieval systems, Proceedings of the 15th ACM international conference on Information and knowledge management, 2006, ACM Press New York, NY, USA
[32] Performance of Compressed Inverted List Caching in Search Engines, 2008, World Wide Web.
[33] Brewer, E.A., Lessons from Giant-Scale Services, IEEE INTERNET COMPUTING, 2001
[34] Guo, R. and Cheng, X. and Xu, H. and Wang, B., Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007

国产精品婷婷久久久久久,国产精品美女久久久浪潮av,草草国产,人妻精品久久无码专区精东影业

大規(guī)模可擴(kuò)展索引技術(shù)的研究和系統(tǒng)實(shí)現(xiàn).doc

內(nèi)容介紹

TA們正在看...

相關(guān)文檔

官方微信

支付寶紅包