
Posted by member bshhty.

The paper runs 25 pages and close to 20,000 words.


Implementation of String Matching Algorithms Based on Character Editing


Abstract
With the rapid development of information technology and the widespread use of data generation and data acquisition devices, the volume of data people collect is growing exponentially. Yet extracting information from these massive data sets has not become correspondingly easier. One reason is that data quality has declined sharply and no longer meets the needs of applications.
This paper explains why research on data quality is necessary, surveys the current hot topics in that research, and focuses on improving data quality through record linkage. Record linkage is carried out with two matching techniques, the edit distance algorithm and the Jaro-Winkler algorithm, and the paper describes the principles and implementation of both. Character-edit-based string matching is handled by computing the similarity of two records, which makes it possible to detect duplicate and near-duplicate records and so link the data. The paper closes with an outlook on the role of matching techniques in data quality research.

Keywords: data quality; record linkage; matching; edit distance; Levenshtein algorithm; Jaro-Winkler algorithm
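
The edit distance named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own implementation, which is not reproduced on this page. It is the standard dynamic-programming formulation of Levenshtein distance, and the function names `levenshtein` and `levenshtein_similarity`, along with the choice to normalize by the longer string's length, are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to a score in [0, 1]; 1.0 means identical.
    Dividing by the longer length is an assumed convention."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two substitutions and one insertion). A record linkage step could then treat two fields as matching when the similarity exceeds a chosen threshold.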
Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
Chapter 2 Edit Distance
2.1 The idea behind the Levenshtein algorithm
2.2 Principle of the Levenshtein algorithm
2.3 Implementation of the algorithm
2.3.1 The Levenshtein algorithm
2.3.2 Implementing the Levenshtein algorithm
2.4 Notes on correctness
2.5 Supplementary notes on the Levenshtein algorithm
Chapter 3 Jaro-Winkler Distance
3.1 The Jaro algorithm
3.1.1 Principle of the Jaro algorithm
3.1.2 Implementing the Jaro algorithm
3.2 The Jaro-Winkler algorithm
3.2.1 Principle of Jaro-Winkler
3.2.2 Implementing Jaro-Winkler
3.2.3 Supplementary notes on the algorithm
Conclusion
Acknowledgements
References
Appendix
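
Chapter 3 in the outline covers the Jaro and Jaro-Winkler measures. The thesis's code is not shown on this page, so the following is only a minimal Python sketch of the commonly published definitions. The function names, the prefix scale of 0.1, and the prefix cap of 4 characters are assumptions taken from the usual formulation, not from the thesis.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], based on matching characters
    within a sliding window and on transposed matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in order; mismatched positions come in pairs
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

For the classic pair "MARTHA" and "MARHTA", all six characters match with one transposition, so `jaro` gives about 0.9444, and the shared prefix "MAR" raises `jaro_winkler` to about 0.9611. The prefix boost reflects the observation that errors in names tend to occur later in the string.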