
[Repost] Dynamic Programming Algorithm (DPA) for Edit-Distance

2010-04-12 22:21
Reposted from: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

change a letter,

insert a letter or

delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0               -- '' = empty string
d(s, '')  = d('', s) = |s|  -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )


The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2 at a cost of 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, giving d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, giving d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
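To make the cost concrete, the recurrence can be transcribed almost literally into JavaScript. This is only an illustrative sketch; the name `naiveEditDistance` is an assumption and does not appear on the original page. Each call spawns up to three further calls, which is where the exponential running time comes from.

```javascript
// Direct transcription of the recurrence relations.
// Exponential time: for illustration only, not for real use.
// naiveEditDistance is a hypothetical name, not part of the original page.
function naiveEditDistance(s1, s2) {
  if (s1.length === 0) return s2.length;   // d('', s) = |s|
  if (s2.length === 0) return s1.length;   // d(s, '') = |s|

  // Split each string into "everything but the last character" and the last character.
  var head1 = s1.slice(0, -1), ch1 = s1.charAt(s1.length - 1);
  var head2 = s2.slice(0, -1), ch2 = s2.charAt(s2.length - 1);

  return Math.min(
    naiveEditDistance(head1, head2) + (ch1 === ch2 ? 0 : 1), // match or change
    naiveEditDistance(s1, head2) + 1,                        // insert ch2
    naiveEditDistance(head1, s2) + 1                         // delete ch1
  );
}
```

For the examples at the top of the page, `naiveEditDistance('computer', 'commuter')` and `naiveEditDistance('sport', 'sort')` both evaluate to 1.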

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0,0] = 0
m[i,0] = i,  i=1..|s1|
m[0,j] = j,  j=1..|s2|

m[i,j] = min( m[i-1,j-1] + if s1[i]=s2[j] then 0 else 1 fi,
              m[i-1, j] + 1,
              m[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|


m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,] and on the entries already computed to the left in row m[i,] itself. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n^2), much better than exponential!
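Because each row needs only the row before it, the distance alone can be computed while keeping just two rows in memory. The following is a minimal sketch under that assumption; `editDistanceTwoRows` is a hypothetical name, not part of the original page. It returns only the distance: recovering an alignment still needs the full matrix.

```javascript
// Row-by-row edit distance keeping only two rows of the matrix.
// editDistanceTwoRows is a hypothetical name, not part of the original page.
function editDistanceTwoRows(s1, s2) {
  var prev = [], curr = [], i, j;

  for (j = 0; j <= s2.length; j++) prev[j] = j;      // row m[0,]: m[0,j] = j

  for (i = 1; i <= s1.length; i++) {
    curr[0] = i;                                     // boundary: m[i,0] = i
    for (j = 1; j <= s2.length; j++) {
      var diag = prev[j - 1] + (s1.charAt(i - 1) === s2.charAt(j - 1) ? 0 : 1);
      curr[j] = Math.min(diag,                       // match or change
                         prev[j] + 1,                // deletion
                         curr[j - 1] + 1);           // insertion
    }
    var tmp = prev; prev = curr; curr = tmp;         // current row becomes previous
  }
  return prev[s2.length];                            // m[|s1|,|s2|]
}
```

The running time is still O(|s1|*|s2|); only the space drops, from O(|s1|*|s2|) to O(|s2|).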

Attached algorithm:

<SCRIPT LANGUAGE="JavaScript">
<!--
function DPA(s1, s2)
{ var m = new Array();
  var i, j;
  for(i=0; i < s1.length + 1; i++) m[i] = new Array(); // i.e. 2-D array

  m[0][0] = 0; // boundary conditions

  for(j=1; j <= s2.length; j++)
    m[0][j] = m[0][j-1] + 1; // boundary conditions

  for(i=1; i <= s1.length; i++)                  // outer loop
  { m[i][0] = m[i-1][0] + 1; // boundary conditions

    for(j=1; j <= s2.length; j++)                // inner loop
    { var diag = m[i-1][j-1];
      if( s1.charAt(i-1) != s2.charAt(j-1) ) diag++;

      m[i][j] = Math.min( diag,                  // match or change
                Math.min( m[i-1][j] + 1,         // deletion
                          m[i][j-1] + 1 ) );     // insertion
    }//for j
  }//for i

  traceBack('', '', '', m, s1.length, s2.length, s1, s2);
  return m[s1.length][s2.length];
}//DPA

function traceBack(row1, row2, row3, m, i, j, s1, s2)
// recover the alignment of s1 and s2
{ if(i > 0 && j > 0)
  { var diag = m[i-1][j-1],  diagCh = '|';
    if( s1.charAt(i-1) != s2.charAt(j-1) ) { diag++; diagCh = ' '; }

    if( m[i][j] == diag )//LAllison comp sci monash uni au
      traceBack(s1.charAt(i-1)+row1, diagCh+row2, s2.charAt(j-1)+row3,
                m, i-1, j-1, s1, s2);    // change or match
    else if( m[i][j] == m[i-1][j] + 1 )  // delete
      traceBack(s1.charAt(i-1)+row1, ' '+row2, '-'+row3,
                m, i-1, j, s1, s2);
    else
      traceBack('-'+row1, ' '+row2, s2.charAt(j-1)+row3,
                m, i, j-1, s1, s2);      // insertion
  }
  else if(i > 0)   // s2 exhausted: delete the rest of s1
    traceBack(s1.charAt(i-1)+row1, ' '+row2, '-'+row3, m, i-1, j, s1, s2);
  else if(j > 0)   // s1 exhausted: insert the rest of s2
    traceBack('-'+row1, ' '+row2, s2.charAt(j-1)+row3, m, i, j-1, s1, s2);
  else // i==0 and j==0
    document.DPAform.displayArea.value += row1+'\n'+row2+'\n'+row3+'\n';
}//traceBack

function DPAdr()
// driver: read the two strings from the form, run DPA, show the result
{ var s1 = document.DPAform.str1.value;
  var s2 = document.DPAform.str2.value;
  if(s1.length > 22)//nosilla l inu hsanom essc dna awu sc
  { document.DPAform.displayArea.value = 's1 too long'; return; }
  if(s2.length > 22)
  { document.DPAform.displayArea.value = 's2 too long'; return; }

  document.DPAform.displayArea.value = '';
  var ds1s2 = DPA(s1, s2);
  document.DPAform.displayArea.value += 'd(s1,s2)=' + ds1s2 + '\n';
}//DPAdr

// -->
</SCRIPT>
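The listing above writes its output into a form named `DPAform`, so it only runs inside the original web page. As a rough sketch of the same idea without any page dependency, the version below builds the matrix and walks back from m[|s1|,|s2|] with a loop instead of recursion, returning the three alignment rows. The names `editMatrix` and `align` and the shape of the returned object are assumptions made for this example, not part of the original code.

```javascript
// Page-independent sketch of the matrix computation and traceback.
// editMatrix and align are hypothetical names, not part of the original page.
function editMatrix(s1, s2) {
  var m = [], i, j;
  for (i = 0; i <= s1.length; i++) {
    m[i] = [i];                                        // m[i,0] = i
    for (j = 1; j <= s2.length; j++) {
      m[i][j] = (i === 0) ? j : Math.min(              // m[0,j] = j
        m[i - 1][j - 1] + (s1.charAt(i - 1) === s2.charAt(j - 1) ? 0 : 1),
        m[i - 1][j] + 1,
        m[i][j - 1] + 1);
    }
  }
  return m;
}

function align(s1, s2) {
  var m = editMatrix(s1, s2);
  var i = s1.length, j = s2.length;
  var row1 = '', row2 = '', row3 = '';

  while (i > 0 || j > 0) {
    var same = i > 0 && j > 0 && s1.charAt(i - 1) === s2.charAt(j - 1);
    if (i > 0 && j > 0 && m[i][j] === m[i - 1][j - 1] + (same ? 0 : 1)) {
      // match ('|') or change (' ')
      row1 = s1.charAt(i - 1) + row1;
      row2 = (same ? '|' : ' ') + row2;
      row3 = s2.charAt(j - 1) + row3;
      i--; j--;
    } else if (i > 0 && (j === 0 || m[i][j] === m[i - 1][j] + 1)) {
      // deletion of s1[i]
      row1 = s1.charAt(i - 1) + row1; row2 = ' ' + row2; row3 = '-' + row3;
      i--;
    } else {
      // insertion of s2[j]
      row1 = '-' + row1; row2 = ' ' + row2; row3 = s2.charAt(j - 1) + row3;
      j--;
    }
  }
  return { distance: m[s1.length][s2.length], rows: [row1, row2, row3] };
}
```

Like the original `traceBack`, this prefers a match or change over a deletion, and a deletion over an insertion, so it reports one optimal alignment out of possibly several. For example, `align('sport', 'sort')` gives distance 1 with the rows `sport`, `| |||` and `s-ort`.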