Week 2-5 Spelling similarity: edit distance
2015-11-07 01:29
Spelling similarity
Typos
Variants in spelling
Edit operations
Insertion
Deletion
Substitution
Multiple edits
Levenshtein method
Based on dynamic programming
Insertions, deletions and substitutions usually have a cost of 1
Example
We want to calculate the edit distance between "strength" and "trend".
Definitions
s1(i): ith character in string s1
s2(j): jth character in string s2
D(i,j): edit distance between a prefix of s1 of length i and a prefix of s2 of length j
t(i,j): cost of aligning the ith character in string s1 with the jth character in string s2
Recursive dependencies
D(i, 0) = i
D(0, j) = j
D(i, j) = min{ D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i, j) }
Simple edit distance
t(i, j) = 0 iff s1(i) = s2(j)
t(i, j) = 1 otherwise
Initialization
Recursion
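The initialization and recursion steps can be sketched in Python as below. This is a minimal illustration of the recurrence defined above, not code from the course; the function name edit_distance and the table name D are chosen here to mirror the notation, and unit costs are assumed throughout.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit costs (illustrative sketch)."""
    m, n = len(s1), len(s2)
    # D[i][j] = edit distance between the first i characters of s1
    # and the first j characters of s2.
    D = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: D(i, 0) = i and D(0, j) = j.
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    # Recursion: fill the table row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # t(i, j): 0 if the characters match, 1 otherwise.
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # substitution or match
            )
    return D[m][n]


print(edit_distance("strength", "trend"))  # prints 4
```

For "strength" and "trend" the result is 4: delete s, substitute g with d, then delete t and h.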
Other costs
Damerau modification
Swaps of two adjacent characters also have a cost of 1 (people are likely to swap adjacent characters when typing)
Lev( cats, cast ) = 2
Dam( cats, cast ) = 1
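One way to reproduce these two values is the sketch below. Note that it implements the restricted variant often called optimal string alignment (each substring may be edited at most once), which is a common simplification of the full Damerau-Levenshtein distance; the function name osa_distance is an assumption made here for illustration.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Edit distance with adjacent transpositions at cost 1
    (optimal string alignment variant, illustrative sketch)."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # substitution or match
            )
            # Extra case: the last two characters of each prefix are swapped.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[m][n]


print(osa_distance("cats", "cast"))  # prints 1
```

Without the transposition case the same table gives 2 for this pair, matching Lev( cats, cast ) = 2.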
Other edit distances
Dist( sit down, sit clown ) = 1?
Models the errors common in optical character recognition (OCR), e.g. d is likely to be misread as cl
Dist( qeather, weather ) = 1, Dist( leather, weather ) = 2?
Models spelling errors introduced by fat fingers, e.g. q and w are adjacent on the keyboard, whereas l and w are far apart (so typing l for w is unlikely to be a typo)
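The keyboard example can be sketched by replacing the fixed substitution cost t(i, j) with a cost function. Everything specific here is an assumption for illustration: the tiny ADJACENT_KEYS set, the cost of 2 for non-adjacent keys, and the names keyboard_sub_cost and weighted_distance. The OCR case (d read as cl) is a one-to-two character substitution and is not covered by this sketch.

```python
# Hypothetical, deliberately incomplete set of neighbouring QWERTY keys.
ADJACENT_KEYS = {
    frozenset("qw"), frozenset("we"), frozenset("er"),
    frozenset("as"), frozenset("sd"),
}


def keyboard_sub_cost(a: str, b: str) -> int:
    """Assumed costs: 0 for a match, 1 for adjacent keys, 2 otherwise."""
    if a == b:
        return 0
    return 1 if frozenset((a, b)) in ADJACENT_KEYS else 2


def weighted_distance(s1: str, s2: str, sub_cost=keyboard_sub_cost) -> int:
    """Edit distance with unit insert/delete costs and a pluggable
    substitution cost (illustrative sketch)."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                                   # deletion
                D[i][j - 1] + 1,                                   # insertion
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),  # substitution
            )
    return D[m][n]


print(weighted_distance("qeather", "weather"))  # prints 1
print(weighted_distance("leather", "weather"))  # prints 2
```

With these assumed costs, a slip onto a neighbouring key is cheaper than replacing a distant letter, which is the behaviour the example above asks for.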
Edit distance is also used in bioinformatics to compare genetic sequences and amino acid sequences.