[Repost] Dynamic Programming Algorithm (DPA) for Edit-Distance
2010-04-12 22:21
Reposted from: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/
The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.
The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
change a letter,
insert a letter or
delete a letter
The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
  d('', '') = 0                -- '' = empty string
  d(s, '')  = d('', s) = |s|   -- i.e. length of s
  d(s1+ch1, s2+ch2)
    = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
           d(s1+ch1, s2) + 1,
           d(s1, s2+ch2) + 1 )
The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2 at a cost of 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, costing d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, costing d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
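To make the cost of that routine concrete, here is a minimal sketch of the ternary recursion, written directly from the recurrence relations. The function name editDistanceNaive is hypothetical (it is not part of the original page), and the code is meant only to illustrate why the approach is impractical: each call on two non-empty strings spawns three further calls.

```javascript
// Naive ternary recursion, transcribed from the recurrence relations.
// editDistanceNaive is a hypothetical name used for illustration only;
// the running time grows exponentially with the string lengths.
function editDistanceNaive(s1, s2) {
  if (s1.length === 0) return s2.length;   // d('', s) = |s|
  if (s2.length === 0) return s1.length;   // d(s, '') = |s|

  var rest1 = s1.slice(0, -1);             // s1 without its last character ch1
  var rest2 = s2.slice(0, -1);             // s2 without its last character ch2
  var ch1 = s1.charAt(s1.length - 1);
  var ch2 = s2.charAt(s2.length - 1);

  return Math.min(
    editDistanceNaive(rest1, rest2) + (ch1 === ch2 ? 0 : 1), // match or change
    editDistanceNaive(rest1, s2) + 1,                        // delete ch1
    editDistanceNaive(s1, rest2) + 1                         // insert ch2
  );
}
```

Calling editDistanceNaive('sport', 'sort') is quick, but the same subproblems are recomputed many times over, which is what the matrix formulation avoids.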
Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:
  m[i,j] = d(s1[1..i], s2[1..j])

  m[0,0] = 0
  m[i,0] = i,  i=1..|s1|
  m[0,j] = j,  j=1..|s2|

  m[i,j] = min(m[i-1,j-1] + if s1[i]=s2[j] then 0 else 1 fi,
               m[i-1, j] + 1,
               m[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|
m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,] and on the entries of row m[i,] already computed to the left. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n^2), much better than exponential!
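Because each row needs only the row before it, the distance alone can be computed while keeping just two rows in memory. The following is a minimal sketch under that assumption; editDistanceTwoRows is a hypothetical name, not part of the original page, and this variant cannot recover the alignment, since a trace-back needs the whole matrix.

```javascript
// Row-by-row edit distance keeping only two rows of the matrix.
// editDistanceTwoRows is a hypothetical name used for illustration only.
function editDistanceTwoRows(s1, s2) {
  var prev = [];                            // holds row m[i-1,]
  var i, j;
  for (j = 0; j <= s2.length; j++) {
    prev[j] = j;                            // row m[0,]: m[0,j] = j
  }
  for (i = 1; i <= s1.length; i++) {
    var curr = [i];                         // m[i,0] = i
    for (j = 1; j <= s2.length; j++) {
      var diag = prev[j - 1] +
                 (s1.charAt(i - 1) === s2.charAt(j - 1) ? 0 : 1);
      curr[j] = Math.min(diag,              // match or change
                         prev[j] + 1,       // deletion
                         curr[j - 1] + 1);  // insertion
    }
    prev = curr;                            // row i becomes the previous row
  }
  return prev[s2.length];
}
```

The time complexity is still O(|s1|*|s2|); only the space drops, from the full matrix to O(|s2|).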
Appended algorithm:
<SCRIPT LANGUAGE="JavaScript">
<!--
// Expects the page to contain a form named DPAform with the fields
// str1, str2 (input strings) and displayArea (output text area).

function DPA(s1, s2)
 { var m = new Array();
   var i, j;

   for(i=0; i < s1.length + 1; i++)
      m[i] = new Array();                  // i.e. 2-D array

   m[0][0] = 0;                            // boundary conditions

   for(j=1; j <= s2.length; j++)
      m[0][j] = m[0][j-1] + 1;             // boundary conditions

   for(i=1; i <= s1.length; i++)           // outer loop
    { m[i][0] = m[i-1][0] + 1;             // boundary conditions

      for(j=1; j <= s2.length; j++)        // inner loop
       { var diag = m[i-1][j-1];
         if( s1.charAt(i-1) != s2.charAt(j-1) ) diag++;

         m[i][j] = Math.min( diag,                       // match or change
                   Math.min( m[i-1][j] + 1,              // deletion
                             m[i][j-1] + 1 ) );          // insertion
       }//for j
    }//for i

   traceBack('', '', '', m, s1.length, s2.length, s1, s2);
   return m[s1.length][s2.length];
 }//DPA


function traceBack(row1, row2, row3, m, i, j, s1, s2)
// recover the alignment of s1 and s2
 { if(i > 0 && j > 0)
    { var diag = m[i-1][j-1], diagCh = '|';

      if( s1.charAt(i-1) != s2.charAt(j-1) ) { diag++; diagCh = ' '; }

      if( m[i][j] == diag )//LAllison comp sci monash uni au
         traceBack(s1.charAt(i-1)+row1, diagCh+row2, s2.charAt(j-1)+row3,
                   m, i-1, j-1, s1, s2);    // change or match
      else if( m[i][j] == m[i-1][j] + 1 )   // delete
         traceBack(s1.charAt(i-1)+row1, ' '+row2, '-'+row3,
                   m, i-1, j, s1, s2);
      else
         traceBack('-'+row1, ' '+row2, s2.charAt(j-1)+row3,
                   m, i, j-1, s1, s2);      // insertion
    }
   else if(i > 0)                           // s2 exhausted: delete the rest of s1
      traceBack(s1.charAt(i-1)+row1, ' '+row2, '-'+row3, m, i-1, j, s1, s2);
   else if(j > 0)                           // s1 exhausted: insert the rest of s2
      traceBack('-'+row1, ' '+row2, s2.charAt(j-1)+row3, m, i, j-1, s1, s2);
   else // i==0 and j==0
      document.DPAform.displayArea.value += row1+'\n'+row2+'\n'+row3+'\n';
 }//traceBack


function DPAdr()
 { var s1 = document.DPAform.str1.value;
   var s2 = document.DPAform.str2.value;

   if(s1.length > 22)//nosilla l inu hsanom essc dna awu sc
    { document.DPAform.displayArea.value = 's1 too long'; return; }
   if(s2.length > 22)
    { document.DPAform.displayArea.value = 's2 too long'; return; }

   document.DPAform.displayArea.value = '';
   var ds1s2 = DPA(s1, s2);
   document.DPAform.displayArea.value += 'd(s1,s2)=' + ds1s2 + '\n';
 }//DPAdr
// -->
</SCRIPT>