A Text Classification Algorithm Based on Semi-Supervised Learning.doc

About 51 pages, DOC format

ID: 30-194732; size: 1.24M
Category: Papers > Communications/Electronics Papers


Uploaded by member lanxin520

A Text Classification Algorithm Based on Semi-Supervised Learning

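Abstract

With the emergence of the Internet, a large volume of textual information has come to exist in computer-readable form, and organizing it by traditional manual methods is time-consuming, laborious, and yields unsatisfactory results. Text classification, a key technology for processing and organizing large amounts of text data, uses machines to analyze and organize text, freeing users from tedious document handling and greatly improving the utilization of information. Text classification is the applied technique of analyzing the content of a text and assigning it, according to some strategy, to one or more suitable categories. As the technical foundation for fields such as information filtering, information retrieval, search engines, text databases, and digital libraries, text classification has broad application prospects.
This thesis first introduces the background of text classification, the semi-supervised algorithms used for it, and several of its key techniques. Then, because high classification accuracy requires a large labeled training set while labeled documents are scarce, semi-supervised learning algorithms that learn from unlabeled documents have become a research focus in text classification; the thesis therefore concentrates on semi-supervised classification algorithms. Finally, it designs a prototype text classification system, tests it on several standard data sets to ensure the reliability of the classification results, and evaluates its classification performance. The experiments show that when enough labeled documents are available the proposed algorithm performs comparably to other algorithms, and when labeled documents are very few it outperforms the existing alternatives.
Keywords: text classification; semi-supervised learning; clustering; EM; KNN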

ABSTRACT

With the emergence of the Internet, a large volume of text has come to exist in computer-readable form, and organizing this information by traditional manual methods is time-consuming, laborious, and unsatisfactory in its results. As a key technology for organizing and processing large amounts of document data, text classification uses machines to analyze and organize text, freeing users from tedious document processing and greatly improving the utilization of information. Text classification is a supervised learning task of assigning natural-language text documents to one or more predefined categories according to their contents. As the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries, and similar fields, text classification has broad application prospects.
This thesis first introduces the background of text classification, the semi-supervised algorithms used for text classification, and several key technologies of text classification. Second, considering the tension between the need for a large labeled training set to obtain high classification accuracy and the scarcity of labeled documents, it focuses on improving semi-supervised classification algorithms. Finally, it designs a document classification system; to ensure the accuracy of classification, several standard data sets are used for testing and for evaluating classification performance. The experiments show that the proposed method outperforms existing methods when the labeled data set is extremely small, and is comparable to other existing algorithms when sufficient labeled data is available.
Keywords: text classification; semi-supervised learning; clustering; EM; KNN

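1 Introduction
1.1 Background
1.2 Organization of the Thesis
2 Semi-Supervised Learning
2.1 Concept and Significance of Semi-Supervised Learning
2.2 Research Progress in Semi-Supervised Learning
2.3 Semi-Supervised Learning Methods
2.3.1 Co-training
2.3.2 Self-training
2.3.3 Semi-Supervised Support Vector Machines (S3VMs)
2.3.4 Graph-Based Methods
2.4 Chapter Summary
3 Text Classification
3.1 Concept and Significance of Text Classification
3.2 Research on Text Classification in China and Abroad
3.3 Key Techniques of Text Classification
3.3.1 Text Feature Generation
3.3.2 Feature Selection and Dimensionality Reduction
3.3.3 Weight Calculation
3.3.4 Text Classification Techniques
3.3.5 Performance Evaluation of Text Classification Techniques
3.4 Chapter Summary
4 Semi-Supervised Text Classification Based on EM and KNN
4.1 Introduction
4.2 Related Work
4.2.1 Cluster Analysis
4.2.2 The EM Algorithm
4.2.3 The KNN Algorithm
4.3 The Semi-Supervised Text Classification Algorithm Based on EM and KNN
4.3.1 Problem Description
4.3.2 Algorithm Idea
4.3.3 Cluster Analysis Based on the EM Algorithm
4.3.4 Classification Based on the KNN Algorithm
4.3.5 Algorithm Steps
4.4 Algorithm Efficiency Analysis
4.5 Chapter Summary
5 Experiments and Analysis
5.1 Implementing the EM-KNN Algorithm
5.1.1 Experimental Platform
5.1.2 Algorithm Implementation and Flowchart
5.2 Experimental Results and Analysis
5.3 Summary
Conclusion
References
Acknowledgments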
