
Week 2-5 Spelling similarity: edit distance

2015-11-07 01:29

Spelling similarity

Typos

Variants in spelling

Edit operations

Insertion

Deletion

Substitution

Multiple edits

Levenshtein method

Based on dynamic programming

Insertions, deletions and substitutions usually have a cost of 1

Example

We want to calculate the edit distance between "strength" and "trend".



Definitions

s1(i): ith character in string s1

s2(j): jth character in string s2

D(i,j): edit distance between a prefix of s1 of length i and a prefix of s2 of length j

t(i,j): cost of aligning the ith character in string s1 with the jth character in string s2

Recursive dependencies

D(i, 0) = i
D(0, j) = j
D(i, j) = min{D( i - 1, j ) + 1,
D( i, j - 1 ) + 1,
D( i - 1, j - 1 ) + t( i, j )}


Simple edit distance

t( i, j ) = 0 iff s1( i ) = s2( j )
t( i, j ) = 1 otherwise
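The recurrence and the simple cost t(i, j) can be translated almost literally into a memoized recursive function. This is a minimal sketch, not a reference implementation: the names edit_distance and D are chosen here to mirror the definitions above, and the recursion is only suitable for short strings because of Python's recursion limit.

```python
from functools import lru_cache


def edit_distance(s1: str, s2: str) -> int:
    """Simple edit distance: insertions, deletions and substitutions cost 1."""

    @lru_cache(maxsize=None)
    def D(i: int, j: int) -> int:
        # D(i, j): distance between the first i chars of s1 and the first j chars of s2
        if j == 0:
            return i  # delete the i remaining characters of s1
        if i == 0:
            return j  # insert the j remaining characters of s2
        # t(i, j) = 0 iff s1(i) = s2(j); strings are 0-indexed in Python
        t = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(
            D(i - 1, j) + 1,      # deletion
            D(i, j - 1) + 1,      # insertion
            D(i - 1, j - 1) + t,  # match or substitution
        )

    return D(len(s1), len(s2))


print(edit_distance("strength", "trend"))  # 4
```

For the example pair, one cheapest alignment deletes "s", substitutes "g" with "d", and deletes the trailing "t" and "h", which gives four edits.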


Initialization



Recursion
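The initialization and recursion steps can also be sketched as a bottom-up table fill. The function name edit_table is an assumption made for illustration; it returns the whole table D so the first row, first column and filled cells can be inspected for the "strength" / "trend" example.

```python
def edit_table(s1: str, s2: str) -> list[list[int]]:
    """Return the full (len(s1)+1) x (len(s2)+1) table of prefix distances."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: D(i, 0) = i and D(0, j) = j
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    # Recursion: each cell depends on its upper, left and upper-left neighbours
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


table = edit_table("strength", "trend")
for row in table:
    print(row)
# The bottom-right cell holds the edit distance of the two full strings.
print(table[-1][-1])  # 4
```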









Other costs

Damerau modification

Swaps of two adjacent characters also have a cost of 1 (people are likely to swap adjacent characters).

Lev( cats, cast ) = 2

Dam( cats, cast ) = 1
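One common way to add the swap operation is the restricted variant usually called optimal string alignment (OSA), which extends the Levenshtein table with one extra case. The sketch below uses that variant as an assumption; the notes do not say which formulation of the Damerau distance is meant, and the unrestricted version needs a more involved algorithm. The name osa_distance is chosen here for illustration.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Levenshtein distance plus adjacent transpositions at cost 1 (OSA variant)."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
            # Extra case: the last two characters of both prefixes are swapped
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[m][n]


print(osa_distance("cats", "cast"))  # 1
```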

Other edit distance

Dist( sit down, sit clown ) = 1?

This models errors common in optical character recognition (OCR), e.g. "d" is likely to be misread as "cl".

Dist( qeather, weather ) = 1, Dist( leather, weather ) = 2?

This models spelling errors introduced by "fat fingers": q and w are adjacent on the keyboard, whereas l and w are far apart (so that substitution is unlikely to be a typo).
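The keyboard example can be sketched by replacing the 0/1 cost t(i, j) with a position-aware substitution cost. Everything specific here is an assumption for illustration: a simplified QWERTY grid that ignores the physical stagger of the rows, a cost of 1 for neighbouring keys and 2 for all other substitutions, and the names sub_cost and keyboard_distance. The OCR case ("d" versus "cl") would need multi-character substitutions and is not covered.

```python
# Simplified QWERTY layout: a key's position is (row index, column index).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def sub_cost(a: str, b: str) -> int:
    """0 for identical characters, 1 for neighbouring keys, 2 otherwise."""
    if a == b:
        return 0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 1
    return 2


def keyboard_distance(s1: str, s2: str) -> int:
    """Edit distance with unit insert/delete cost and keyboard-aware substitution."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[m][n]


print(keyboard_distance("qeather", "weather"))  # 1
print(keyboard_distance("leather", "weather"))  # 2
```

With these assumed costs a far-apart substitution is no cheaper than a deletion followed by an insertion, which matches the intuition that it is unlikely to be a single slip of the finger.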

Edit distance is also used in bioinformatics to compare genetic sequences and amino acid sequences.