Similarity Algorithms (1): Edit Distance
2008-08-26 10:46
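Anyone who works in natural language processing will know this concept. Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform a source string (s) into a target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and it is also a common way to count how many changes were made to an original text.
The edit distance algorithm was first proposed by the Russian scientist Levenshtein, so it is also called Levenshtein distance.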
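To make the WER use case concrete, here is a minimal Java sketch of word error rate: the same edit distance, computed over word tokens instead of characters and divided by the number of reference words. The class and method names (`WordErrorRate`, `wordDistance`, `wer`) are invented for this illustration, and splitting on whitespace is an assumption that suits space-delimited text only; unsegmented Chinese would need a word segmenter first. It keeps only two rows of the matrix to save memory.

```java
final class WordErrorRate {

    // Levenshtein distance over word tokens, keeping only two rows of the matrix.
    static int wordDistance(String[] ref, String[] hyp) {
        int[] prev = new int[hyp.length + 1];
        int[] curr = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // empty reference prefix: j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            curr[0] = i; // empty hypothesis prefix: i deletions
            for (int j = 1; j <= hyp.length; j++) {
                int cost = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[hyp.length];
    }

    // WER = word-level edit distance / number of reference words.
    // Assumes a non-empty, whitespace-delimited reference.
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        return (double) wordDistance(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        // One substitution (sat -> sit) and one deletion (the): 2 edits over 6 reference words.
        System.out.println(wer("the cat sat on the mat", "the cat sit on mat"));
    }
}
```

Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.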
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,
If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who introduced the metric in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
The Levenshtein distance algorithm has been used in:
- Spell checking
- Speech recognition
- DNA analysis
- Plagiarism detection
The Algorithm
Steps
Step | Description |
---|---|
1 | Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns. |
2 | Initialize the first row to 0..n. Initialize the first column to 0..m. |
3 | Examine each character of s (i from 1 to n). |
4 | Examine each character of t (j from 1 to m). |
5 | If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1. |
6 | Set cell d[i,j] of the matrix (column i, row j) equal to the minimum of: a. The cell immediately to the left plus 1: d[i-1,j] + 1. b. The cell immediately above plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. |
7 | After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]. |
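Steps 2 to 7 can also be written as one recurrence. This is only a restatement of the table in the same symbols, with 1-based characters $s_i$ and $t_j$; the bracket $[s_i \neq t_j]$ is notation introduced here for the cost in step 5 (1 if the characters differ, 0 otherwise).

```latex
\begin{aligned}
d_{i,0} &= i, \qquad 0 \le i \le n, \\
d_{0,j} &= j, \qquad 0 \le j \le m, \\
d_{i,j} &= \min
  \begin{cases}
    d_{i-1,j} + 1, \\
    d_{i,j-1} + 1, \\
    d_{i-1,j-1} + [\, s_i \neq t_j \,],
  \end{cases}
  \qquad 1 \le i \le n,\ 1 \le j \le m, \\
\mathrm{LD}(s,t) &= d_{n,m}.
\end{aligned}
```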
Example
This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".
Steps 1 and 2
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | | | | | |
| A | 2 | | | | | |
| M | 3 | | | | | |
| B | 4 | | | | | |
| O | 5 | | | | | |
| L | 6 | | | | | |
Steps 3 to 6 When i = 1
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | | | | |
| A | 2 | 1 | | | | |
| M | 3 | 2 | | | | |
| B | 4 | 3 | | | | |
| O | 5 | 4 | | | | |
| L | 6 | 5 | | | | |
Steps 3 to 6 When i = 2
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | | | |
| A | 2 | 1 | 1 | | | |
| M | 3 | 2 | 2 | | | |
| B | 4 | 3 | 3 | | | |
| O | 5 | 4 | 4 | | | |
| L | 6 | 5 | 5 | | | |
Steps 3 to 6 When i = 3
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 | | |
| A | 2 | 1 | 1 | 2 | | |
| M | 3 | 2 | 2 | 1 | | |
| B | 4 | 3 | 3 | 2 | | |
| O | 5 | 4 | 4 | 3 | | |
| L | 6 | 5 | 5 | 4 | | |
Steps 3 to 6 When i = 4
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 | 3 | |
| A | 2 | 1 | 1 | 2 | 3 | |
| M | 3 | 2 | 2 | 1 | 2 | |
| B | 4 | 3 | 3 | 2 | 1 | |
| O | 5 | 4 | 4 | 3 | 2 | |
| L | 6 | 5 | 5 | 4 | 3 | |
Steps 3 to 6 When i = 5
| | | G | U | M | B | O |
|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 1 | 2 | 3 | 4 |
| M | 3 | 2 | 2 | 1 | 2 | 3 |
| B | 4 | 3 | 3 | 2 | 1 | 2 |
| O | 5 | 4 | 4 | 3 | 2 | 1 |
| L | 6 | 5 | 5 | 4 | 3 | 2 |
Step 7
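The distance is in the lower right-hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = 2 changes).
In my own application I have to process Chinese text, where each Chinese character occupies two bytes in memory. If the algorithm above is applied naively, byte by byte, it makes small errors that are easy to overlook. For example, the characters "啊" and "阿" have one byte in common and one byte that differs, so dividing the byte-level change count by 2 gives half a character. When counting changes in mixed Chinese and English text, you therefore first have to determine whether the two units being compared are Chinese characters or Latin letters, and then fill in a cost matrix; for a Chinese character, the cost-matrix entries for its two adjacent bytes must be given the same value. See the code for the details:
Java, of course, does not have this problem, because a Java char is already two bytes wide.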
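    // Wrapper class added so the methods compile on their own; the name is arbitrary.
    public class EditDistance {

        /** Returns the Levenshtein distance between str1 and str2, compared char by char. */
        public int getDistance(String str1, String str2) {
            int rows = str1.length() + 1;
            int cols = str2.length() + 1;
            int[][] matrix = new int[rows][cols];

            // First column: reducing a prefix of str1 to the empty string takes i deletions.
            for (int i = 0; i < rows; i++) {
                matrix[i][0] = i;
            }
            // First row: building a prefix of str2 from the empty string takes j insertions.
            for (int j = 0; j < cols; j++) {
                matrix[0][j] = j;
            }

            for (int i = 1; i < rows; i++) {
                for (int j = 1; j < cols; j++) {
                    // A substitution costs nothing when the two chars already match.
                    int cost = (str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1;
                    matrix[i][j] = min(matrix[i - 1][j] + 1,          // deletion
                                       matrix[i][j - 1] + 1,          // insertion
                                       matrix[i - 1][j - 1] + cost);  // match or substitution
                }
            }
            display(matrix); // debug output: prints the whole matrix
            return matrix[rows - 1][cols - 1];
        }

        private int min(int i1, int i2, int i3) {
            return Math.min(Math.min(i1, i2), i3);
        }

        /** Prints the matrix row by row, cells separated by spaces. */
        private void display(int[][] matrix) {
            for (int[] row : matrix) {
                for (int cell : row) {
                    System.out.print(cell + " ");
                }
                System.out.println();
            }
        }
    }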
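The half-character problem described before the listing can be reproduced in a few lines. The sketch below runs the same recurrence over plain `int` sequences, once on the two bytes of each character and once on Java `char` values. The byte values are hard-coded on the assumption that the text is GBK-encoded ("啊" as `B0 A1`, "阿" as `B0 A2`); the class name `ByteVsCharDistance` is invented for this illustration.

```java
final class ByteVsCharDistance {

    // Levenshtein distance over arbitrary int sequences (full-matrix version).
    static int distance(int[] a, int[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // Assumed GBK encodings of U+554A and U+963F, hard-coded to avoid a charset lookup.
        int[] firstBytes = {0xB0, 0xA1};
        int[] secondBytes = {0xB0, 0xA2};
        // Only one of the two bytes differs, so "edits / 2" reports half a character.
        System.out.println(distance(firstBytes, secondBytes) / 2.0); // prints 0.5

        // Compared as Java chars, each character is one unit and the count is whole.
        int[] firstChars = "\u554A".chars().toArray();
        int[] secondChars = "\u963F".chars().toArray();
        System.out.println(distance(firstChars, secondChars)); // prints 1
    }
}
```

Comparing whole characters instead of bytes is what makes the Java listing immune to this particular error.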