Automatic Chinese Text Correction for Movie and TV Search, Implemented in Java

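1. Background

  This week a project required me to correct mistyped movie titles entered in the search box, in order to improve the search hit rate and the user experience. I looked into automatic correction of Chinese text (more precisely, proofreading) and got a first version working, which I am recording here.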

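2. Introduction

  Proofreading and correcting Chinese input means that the system flags the text as wrong when uncommon or incorrect characters are entered. The simplest example is the red underline that appears while typing in Word. There are currently two main approaches:

(1)  Dictionary-based segmentation: the Chinese string under analysis is matched against the entries of a very large "machine dictionary", and the match succeeds if an entry is found. This method is easy to implement and works well when the input strings are nouns or names from one or a few specific domains;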

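(2)  Statistics-based segmentation: the usual choice is the N-Gram language model, which is really an (N-1)-order Markov model. A brief description of the model:

$$P(X_1 X_2 \cdots X_m) = \prod_{i=1}^{m} P(X_i \mid X_1 \cdots X_{i-1})$$

The formula above is the product rule for conditional probability (the identity that also underlies Bayes' formula). It says that the probability of the string X1X2……Xm is the product of the conditional probabilities of each character given the characters before it. To simplify the computation, assume that the occurrence of character Xi depends only on the N-1 characters immediately before it. The formula then becomes:

$$P(X_1 X_2 \cdots X_m) \approx \prod_{i=1}^{m} P(X_i \mid X_{i-N+1} \cdots X_{i-1})$$

This is the (N-1)-order Markov model. Once the probability has been computed it is compared with a threshold; if it is below the threshold, the string is flagged as misspelled.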
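To make the thresholding idea concrete, here is a minimal sketch of a character-level 2-gram scorer. It is an illustration only, not the project's code: the class name `BigramScorer`, its methods, and the threshold value are all hypothetical, and probabilities are estimated by plain relative frequency without smoothing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a 2-gram (first-order Markov) scorer.
// Works on Java chars, so it assumes the text stays within the Basic Multilingual Plane.
class BigramScorer {
    private final Map<String, Integer> unigram = new HashMap<>();
    private final Map<String, Integer> bigram = new HashMap<>();
    private int totalChars = 0;

    /** Counts single characters and adjacent character pairs in one training string. */
    void train(String text) {
        for (int i = 0; i < text.length(); i++) {
            unigram.merge(text.substring(i, i + 1), 1, Integer::sum);
            totalChars++;
            if (i + 1 < text.length()) {
                bigram.merge(text.substring(i, i + 2), 1, Integer::sum);
            }
        }
    }

    /** P(x1..xm) is approximated by P(x1) * product of P(xi | xi-1), using relative frequencies. */
    double probability(String text) {
        if (text.isEmpty() || totalChars == 0) return 0.0;
        double p = unigram.getOrDefault(text.substring(0, 1), 0) / (double) totalChars;
        for (int i = 1; i < text.length(); i++) {
            int prev = unigram.getOrDefault(text.substring(i - 1, i), 0);
            int pair = bigram.getOrDefault(text.substring(i - 1, i + 1), 0);
            p *= (prev == 0) ? 0.0 : pair / (double) prev;
        }
        return p;
    }

    /** Flags the string when its estimated probability falls below the threshold. */
    boolean looksMisspelled(String text, double threshold) {
        return probability(text) < threshold;
    }

    public static void main(String[] args) {
        BigramScorer scorer = new BigramScorer();
        scorer.train("非誠勿擾");
        scorer.train("非常完美");
        double threshold = 1e-6; // hypothetical cut-off; would need tuning on real data
        System.out.println(scorer.looksMisspelled("非誠勿擾", threshold)); // a title seen in training
        System.out.println(scorer.looksMisspelled("非誠無擾", threshold)); // contains an unseen pair
    }
}
```

Without smoothing, a single unseen pair drives the whole product to zero, which is convenient for a closed vocabulary such as movie titles but too strict for open text.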

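3. Implementation

The input strings in my project are mostly the names of movies and TV series, plus variety shows and anime, so the scope of the corpus is fairly stable. For that reason I combine a 2-Gram (bigram) language model with dictionary-based segmentation.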
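For the dictionary half of this combination, the sketch below shows forward maximum matching, a textbook way of segmenting a string against a word list. It is included only to illustrate the idea: the project itself uses ICTCLAS for segmentation, and the class name `MaxMatchSegmenter` and the sample dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of forward maximum matching against a dictionary.
class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /**
     * At each position takes the longest dictionary word that starts there.
     * A character that starts no dictionary word becomes a one-character token.
     */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--; // shrink the candidate until it is a dictionary word or a single character
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("非誠勿擾", "非常", "完美"));
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(dict);
        System.out.println(segmenter.segment("非誠勿擾完美"));
    }
}
```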

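First, the approach:

Segment the corpus -> compute the probability of each two-token entry (approximated by the entry's frequency in the corpus sample) -> split the input string and find its longest and second-longest contiguous strings -> match the longest and second-longest strings against the movie titles in the corpus -> on a partial match, report a spelling error and return the corrected strings (so the dictionary matters a lot)

Note: segmentation uses the ICTCLAS Java API.

The code:

Create the class ChineseWordProofread.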

3.1 Initialize the segmentation package and segment the movie corpus

    // Initializes ICTCLAS and selects the part-of-speech tag set.
    public ICTCLAS2011 initWordSegmentation(){
        ICTCLAS2011 wordSeg = new ICTCLAS2011();
        try{
            String argu = "F:\\Java\\workspace\\wordProofread"; //set your project path
            System.out.println("ICTCLAS_Init");
            if (!ICTCLAS2011.ICTCLAS_Init(argu.getBytes("GB2312"),0))
            {
                System.out.println("Init Fail!");
                //return null;
            }

            /*
             * Select the part-of-speech tag set
             *     ID    tag set
             *     1     ICT level-1 tag set
             *     0     ICT level-2 tag set
             *     2     PKU level-2 tag set
             *     3     PKU level-1 tag set
             */
            wordSeg.ICTCLAS_SetPOSmap(2);

        }catch (Exception ex){
            System.out.println("words segmentation initialization failed");
            System.exit(-1);
        }
        return wordSeg;
    }

    // Segments the file argu1 and writes the result to the file argu2.
    // wordSeg is not declared in this listing; it is assumed to be a field of the class.
    public boolean wordSegmentate(String argu1,String argu2){
        boolean ictclasFileProcess = false;
        try{
            //segment the whole file
            ictclasFileProcess = wordSeg.ICTCLAS_FileProcess(argu1.getBytes("GB2312"), argu2.getBytes("GB2312"), 0);

            //ICTCLAS2011.ICTCLAS_Exit();

        }catch (Exception ex){
            System.out.println("file process segmentation failed");
            System.exit(-1);
        }
        return ictclasFileProcess;
    }

3.2 Compute the frequency of each entry (token)

    // Needs java.io.BufferedReader, FileReader, FileNotFoundException, IOException
    // and java.util.HashMap, Map. totalTokensCount is a field of the class.
    // Counts every token and every pair of adjacent tokens in the segmented corpus file.
    public Map<String,Integer> calculateTokenCount(String afterWordSegFile){
        Map<String,Integer> wordCountMap = new HashMap<String,Integer>();
        // try-with-resources closes the reader, and a missing file no longer
        // leads to a NullPointerException on the first readLine()
        try (BufferedReader movieBR = new BufferedReader(new FileReader(afterWordSegFile))) {
            String wordsline = null;
            while ((wordsline=movieBR.readLine()) != null){
                String[] words = wordsline.trim().split(" ");
                for (int i=0;i<words.length;i++){
                    // single token
                    Integer wordCount = wordCountMap.get(words[i]);
                    wordCountMap.put(words[i], wordCount==null ? 1 : wordCount+1);
                    totalTokensCount += 1;

                    // pair of adjacent tokens, stored as their concatenation
                    if (i < words.length-1){
                        String wordStr = words[i] + words[i+1];
                        Integer wordStrCount = wordCountMap.get(wordStr);
                        wordCountMap.put(wordStr, wordStrCount==null ? 1 : wordStrCount+1);
                        totalTokensCount += 1;
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.out.println("movie_result.txt file not found");
            e.printStackTrace();
        } catch (IOException e) {
            System.out.println("read movie_result.txt file failed");
            e.printStackTrace();
        }

        return wordCountMap;
    }
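The same counting can be written more compactly with `Map.merge` (Java 8 and later). The following is a sketch under stated assumptions, not a replacement for the method above: it takes the segmented lines from memory instead of a file, and the class name `TokenCounter` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, in-memory variant of the counting step.
class TokenCounter {
    /** Counts every token and every adjacent token pair (stored as their concatenation). */
    static Map<String, Integer> count(List<String> segmentedLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : segmentedLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;            // skip blank lines
            String[] words = trimmed.split("\\s+");     // tolerate repeated spaces
            for (int i = 0; i < words.length; i++) {
                counts.merge(words[i], 1, Integer::sum);
                if (i + 1 < words.length) {
                    counts.merge(words[i] + words[i + 1], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("非誠 勿擾", "非誠 勿擾 十二月 合集");
        System.out.println(count(lines));
    }
}
```

Because a pair is stored as the plain concatenation of its two tokens, a pair and a single token with the same text share one counter. That matches the listing above; a separator between the two halves would keep them apart if that ever matters.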

3.3 Find the correct tokens in the string under analysis

The listing posted for this step is a copy of calculateTokenCount from section 3.2, so the method that belongs here, getCorrectTokens, is not shown. Section 3.4 calls it as getCorrectTokens(sInputResult): it receives the input string split into single characters and returns a List<String> of the tokens considered correct. The full implementation is in the GitHub source mentioned at the end of this post.
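Section 3.4 calls a method `getCorrectTokens(String[])` that returns a `List<String>`. As a rough idea of what such a step could look like, here is a sketch that is explicitly not the author's implementation. It assumes the count map contains single characters and adjacent character pairs, keeps only characters seen in the corpus, and joins neighbours into one token when their pair was also seen. The class name `CorrectTokenSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NOT the original getCorrectTokens.
class CorrectTokenSketch {
    /**
     * Keeps characters that occur in the counts and joins neighbouring characters
     * into one token while each adjacent pair also occurs in the counts.
     */
    static List<String> correctTokens(String[] chars, Map<String, Integer> counts) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        String prev = null; // previous known character, or null after an unknown one
        for (String c : chars) {
            boolean known = counts.getOrDefault(c, 0) > 0;
            boolean linked = known && prev != null && counts.getOrDefault(prev + c, 0) > 0;
            if (!linked && run.length() > 0) {
                tokens.add(run.toString()); // the current run ends here
                run.setLength(0);
            }
            if (known) run.append(c);
            prev = known ? c : null;
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : new String[] {"非", "誠", "勿", "擾", "非誠", "誠勿", "勿擾"}) {
            counts.put(key, 1);
        }
        System.out.println(correctTokens(new String[] {"非", "誠", "無", "擾"}, counts));
    }
}
```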

3.4 Get the longest and second-longest contiguous strings (each may be a single character)

    public String[] getMaxAndSecondMaxSequnce(String[] sInputResult){
        List<String> correctTokens = getCorrectTokens(sInputResult);
        //TODO
        System.out.println(correctTokens);
        String[] maxAndSecondMaxSeq = new String[2];
        if (correctTokens.size() == 0) return null;   // no correct token at all
        else if (correctTokens.size() == 1){
            // a single correct token fills both slots
            maxAndSecondMaxSeq[0]=correctTokens.get(0);
            maxAndSecondMaxSeq[1]=correctTokens.get(0);
            return maxAndSecondMaxSeq;
        }

        String maxSequence = correctTokens.get(0);
        String maxSequence2 = correctTokens.get(correctTokens.size()-1);
        String littleword = "";
        for (int i=1;i<correctTokens.size();i++){
            String token = correctTokens.get(i);
            if (token.length() > maxSequence.length()){
                maxSequence = token;
            } else if (token.length() == maxSequence.length()){

                //select the word with greater probability for single-word
                if (token.length()==1){
                    if (probBetweenTowTokens(token) > probBetweenTowTokens(maxSequence)) {
                        maxSequence2 = token;
                    }
                }
                //select words with smaller probability for multi-word, because the smaller has more self information
                else if (token.length()>1){
                    if (probBetweenTowTokens(token) <= probBetweenTowTokens(maxSequence)) {
                        maxSequence2 = token;
                    }
                }

            } else if (token.length() > maxSequence2.length()){
                maxSequence2 = token;
            } else if (token.length() == maxSequence2.length()){
                if (probBetweenTowTokens(token) > probBetweenTowTokens(maxSequence2)){
                    maxSequence2 = token;
                }
            }
        }
        //TODO
        System.out.println(maxSequence+" : "+maxSequence2);
        //delete the sub-word from a string
        if (maxSequence2.length() == maxSequence.length()){
            int maxseqvaluableTokens = maxSequence.length();
            int maxseq2valuableTokens = maxSequence2.length();
            float min_truncate_prob_a = 0;
            float min_truncate_prob_b = 0;
            String aword = "";
            String bword = "";
            for (int i=0;i<correctTokens.size();i++){
                String token = correctTokens.get(i);
                float tokenprob = probBetweenTowTokens(token);
                // token is a proper part of maxSequence
                if ((!maxSequence.equals(token)) && maxSequence.contains(token)){
                    if (tokenprob >= min_truncate_prob_a){
                        min_truncate_prob_a = tokenprob;
                        aword = token;
                    }
                }
                // token is a proper part of maxSequence2
                else if ((!maxSequence2.equals(token)) && maxSequence2.contains(token)){
                    if (tokenprob >= min_truncate_prob_b){
                        min_truncate_prob_b = tokenprob;
                        bword = token;
                    }
                }
            }
            //TODO
            System.out.println(aword+" VS "+bword);
            System.out.println(min_truncate_prob_a+" VS "+min_truncate_prob_b);
            if (aword.length()>0 && min_truncate_prob_a < min_truncate_prob_b){
                maxseqvaluableTokens -= 1;
                littleword = maxSequence.replace(aword,"");
            }else {
                maxseq2valuableTokens -= 1;
                if (maxSequence.contains(maxSequence2.replace(bword, ""))){
                    littleword = maxSequence2;
                }
                else littleword = maxSequence2.replace(bword,"");
            }

            // the second slot always becomes the truncated word;
            // the first slot switches to maxSequence2 only if that one kept more valuable tokens
            if (maxseqvaluableTokens < maxseq2valuableTokens){
                maxSequence = maxSequence2;
            }
            maxSequence2 = littleword;
        }
        maxAndSecondMaxSeq[0] = maxSequence;
        maxAndSecondMaxSeq[1] = maxSequence2;

        return maxAndSecondMaxSeq;
    }
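Stripped of the probability tie-breaking and the sub-word truncation, the core of this step is "pick the two longest tokens". The sketch below shows only that core, as a simplification for illustration. The class name `TopTwoByLength` is hypothetical, ties are resolved in favour of the earlier token, and an empty array is returned where the method above returns null.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simplification: longest and second-longest token by length only.
class TopTwoByLength {
    /** Returns {longest, secondLongest}; a single token fills both slots; no tokens gives an empty array. */
    static String[] pick(List<String> tokens) {
        String first = null;
        String second = null;
        for (String token : tokens) {
            if (first == null || token.length() > first.length()) {
                second = first;   // the old longest becomes the runner-up
                first = token;
            } else if (second == null || token.length() > second.length()) {
                second = token;
            }
        }
        if (first == null) return new String[0];
        if (second == null) second = first;
        return new String[] {first, second};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pick(Arrays.asList("非誠", "擾", "十二月"))));
    }
}
```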

3.5 Return the list of corrections

    // hasError and movieName are fields of the class.
    // hasError: 0 = no error, 1 = no spelling error but no such movie, otherwise spelling error.
    public List<String> proofreadAndSuggest(String sInput){
        List<String> correctedList = new ArrayList<String>();
        List<String> crtTempList = new ArrayList<String>();

        //TODO
        long startProcess = System.currentTimeMillis();
        // split the input into single characters
        char[] str2char = sInput.toCharArray();
        String[] sInputResult = new String[str2char.length];
        for (int t=0;t<str2char.length;t++){
            sInputResult[t] = String.valueOf(str2char[t]);
        }
        String[] MaxAndSecondMaxSequnce = getMaxAndSecondMaxSequnce(sInputResult);

        // display errors and suggest correct movie name
        if (hasError !=0){
            // getMaxAndSecondMaxSequnce returns null when no correct token was found,
            // so check for null before reading the length
            if (MaxAndSecondMaxSequnce != null && MaxAndSecondMaxSequnce.length>1){
                String maxSequence = MaxAndSecondMaxSequnce[0];
                String maxSequence2 = MaxAndSecondMaxSequnce[1];
                for (int j=0;j<movieName.size();j++){
                    String movie = movieName.get(j);

                    //select movie
                    if (maxSequence2.equals("")){
                        if (movie.contains(maxSequence)) correctedList.add(movie);
                    }
                    else {
                        if (movie.contains(maxSequence) && movie.contains(maxSequence2)){
                            // titles containing both strings are the preferred candidates
                            crtTempList.add(movie);
                        }
                        else if (movie.contains(maxSequence)) correctedList.add(movie);
                    }
                }

                // if any title matched both strings, suggest only those
                if (crtTempList.size()>0){
                    correctedList.clear();
                    correctedList.addAll(crtTempList);
                }

                //TODO
                if (hasError ==1) System.out.println("No spelling error, but this movie was not found. Did you mean: "+correctedList.toString()+" ?");
                //TODO
                else System.out.println("Spelling error. Did you mean: "+correctedList.toString()+" ?");
            } //TODO
            else System.out.println("There are spelling errors and no correct token in the input, so no suggestion can be made. Please check it again.");

        } //TODO
        else System.out.println("No spelling error");

        //TODO
        long elapsetime = System.currentTimeMillis() - startProcess;
        System.out.println("process work elapsed "+elapsetime+" ms");
        // as in the original, ICTCLAS is shut down at the end of every call
        ICTCLAS2011.ICTCLAS_Exit();

        return correctedList;
    }
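The title-selection rule in this method can be isolated into a small pure function, which makes it easier to test. The sketch below is a restatement under that assumption: the class name `TitleSuggester` and the sample titles are hypothetical, and the segmentation, error flags and console output are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical extraction of the title-selection rule.
class TitleSuggester {
    /**
     * Prefers titles that contain both strings. If there are none (or secondSeq is empty),
     * falls back to the titles that contain maxSeq.
     */
    static List<String> suggest(List<String> titles, String maxSeq, String secondSeq) {
        List<String> both = new ArrayList<>();
        List<String> onlyMax = new ArrayList<>();
        for (String title : titles) {
            if (!title.contains(maxSeq)) continue;
            if (!secondSeq.isEmpty() && title.contains(secondSeq)) {
                both.add(title);
            } else {
                onlyMax.add(title);
            }
        }
        return both.isEmpty() ? onlyMax : both;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("非誠勿擾", "非誠勿擾十二月合集", "非常完美");
        System.out.println(suggest(titles, "非誠", "十二月"));
    }
}
```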

3.6 Display the proofreading result

    public static void main(String[] args) {

        String argu1 = "movie.txt";          //movies name file
        String argu2 = "movie_result.txt";   //words after segmenting name of all movies

        SimpleDateFormat sdf=new SimpleDateFormat("HH:mm:ss");
        String startInitTime = sdf.format(new java.util.Date());
        System.out.println(startInitTime+" ---start initializing work---");
        ChineseWordProofread cwp = new ChineseWordProofread(argu1,argu2);

        String endInitTime = sdf.format(new java.util.Date());
        System.out.println(endInitTime+" ---end initializing work---");

        Scanner scanner = new Scanner(System.in);
        while(true){
            System.out.print("Enter a movie title: ");

            // stop cleanly at end of input instead of throwing NoSuchElementException
            if (!scanner.hasNext()) break;
            String input = scanner.next();

            if (input.equals("EXIT")) break;

            cwp.proofreadAndSuggest(input);

        }
        scanner.close();
    }

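Results on my machine (the screenshot is not preserved here):

One last point: I did very little cleanup of the corpus, so the output contains many entries that are valid matches but redundant. For example, 非誠勿擾 also returns 《非誠勿擾十二月合集》 and similar titles. This can be fixed by preprocessing the movie corpus.

Also, this model is not suited to large-scale online data, such as auto-correction (or smart suggestions) in a search engine. Even for detecting and correcting errors in the titles of movies, TV series, anime and variety shows, it still has plenty of room for improvement. If you have ideas, please leave a comment and let me know: open source, open minds, good fun. Finally, the source code is on GitHub: download the ZIP, unpack it, create a project named wordproofread in Eclipse, copy all the unpacked files into that project, and it will run.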
