

2014-08-05 02:17


kitten → sitten (substitute k→s)
sitten → sittin (substitute e→i)
sittin → sitting (insert g at the end)
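The three edits above turn kitten into sitting, so the distance between the two words is at most 3, and the standard dynamic-programming recurrence confirms that no shorter sequence exists. The following is a minimal sketch of that recurrence, assuming unit cost for insertion, deletion and substitution. The class name `LevenshteinMatrixSketch` and its `distance` method are illustrative names, not part of any library.

```java
// Illustrative sketch: full-matrix Levenshtein distance with unit costs.
// LevenshteinMatrixSketch and distance(...) are hypothetical names.
class LevenshteinMatrixSketch {

    static int distance(CharSequence s, CharSequence t) {
        int n = s.length();
        int m = t.length();
        // d[i][j] = distance between the first i chars of s and the first j chars of t
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

This version allocates the whole (n+1) x (m+1) table, which is the simplest form to read but also the most memory-hungry one.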

The Russian scientist Vladimir Levenshtein proposed this concept in 1965.






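The StringUtils class in Apache Commons Lang (package commons-lang; the latest release at the time of writing is commons-lang3-3.3.2) keeps only the previous row of results instead of the whole matrix. This reduces memory use and avoids running out of memory on long strings. For details, see StringUtils.getLevenshteinDistance(CharSequence s, CharSequence t) in the source of org.apache.commons.lang3.StringUtils.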

// Misc
/**
 * <p>Find the Levenshtein distance between two Strings.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>The previous implementation of the Levenshtein distance algorithm
 * was from <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
 *
 * <p>Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError
 * which can occur when my Java implementation is used with very large strings.<br>
 * This implementation of the Levenshtein distance algorithm
 * is from <a href="http://www.merriampark.com/ldjava.htm">http://www.merriampark.com/ldjava.htm</a></p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","")               = 0
 * StringUtils.getLevenshteinDistance("","a")              = 1
 * StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
 * StringUtils.getLevenshteinDistance("frog", "fog")       = 1
 * StringUtils.getLevenshteinDistance("fly", "ant")        = 3
 * StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
 * StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
 * StringUtils.getLevenshteinDistance("hello", "hallo")    = 1
 * </pre>
 *
 * @param s  the first String, must not be null
 * @param t  the second String, must not be null
 * @return result distance
 * @throws IllegalArgumentException if either String input {@code null}
 * @since 3.0 Changed signature from getLevenshteinDistance(String, String) to
 * getLevenshteinDistance(CharSequence, CharSequence)
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    /*
       The difference between this impl. and the previous is that, rather
       than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
       we maintain two single-dimensional arrays of length s.length() + 1.  The first, d,
       is the 'current working' distance array that maintains the newest distance cost
       counts as we iterate through the characters of String s.  Each time we increment
       the index of String t we are comparing, d is copied to p, the second int[].  Doing so
       allows us to retain the previous cost counts as required by the algorithm (taking
       the minimum of the cost count to the left, up one, and diagonally up and to the left
       of the current cost count being calculated).  (Note that the arrays aren't really
       copied anymore, just switched...this is clearly much better than cloning an array
       or doing a System.arraycopy() each time  through the outer loop.)

       Effectively, the difference between the two implementations is this one does not
       cause an out of memory condition when calculating the LD over two very large strings.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; //'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; //placeholder to assist in swapping p and d

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    for (i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (j = 1; j <= m; j++) {
        t_j = t.charAt(j - 1);
        d[0] = j;

        for (i = 1; i <= n; i++) {
            cost = s.charAt(i - 1) == t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return p[n];
}
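To make the memory argument concrete, the sketch below counts how many int cells each approach allocates. It is only back-of-the-envelope arithmetic, assuming 4 bytes per int and ignoring array headers; `MemoryFootprintSketch`, `matrixCells` and `twoRowCells` are hypothetical names introduced for illustration.

```java
// Illustrative arithmetic only: cells allocated by the full matrix versus
// the two rolling rows. All names here are hypothetical.
class MemoryFootprintSketch {

    // full matrix: (n + 1) rows by (m + 1) columns
    static long matrixCells(int n, int m) {
        return (long) (n + 1) * (m + 1);
    }

    // two arrays sized by the shorter string, as after the swap in the code above
    static long twoRowCells(int n, int m) {
        return 2L * (Math.min(n, m) + 1);
    }

    public static void main(String[] args) {
        int n = 100_000;
        int m = 100_000;
        System.out.println("matrix bytes:  " + matrixCells(n, m) * Integer.BYTES);
        System.out.println("two-row bytes: " + twoRowCells(n, m) * Integer.BYTES);
    }
}
```

For two strings of 100,000 characters each, the matrix would need roughly 40 GB, while the two rows need less than 1 MB.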

If you need the edit path as well as the Levenshtein distance, the only option is to keep the full matrix and then backtrack through it to recover the path.


package cn.com.sp.align.model;

/**
 * One edit operation. The three members are:
 *     index: the position of the operation
 *     targetStr: the replacement text (delete: replace with the empty string "";
 *         replace: replace with the target character; add: replace with the target string)
 *     operateType: the kind of operation, defined by the enum OperateEnum. The type could be
 *         inferred from the replacement text, but storing it avoids that check at every step.
 */
public class OperateObj {
    public enum OperateEnum {
        add, delete, replace;
    }

    private int index = 0;

    private String targetStr = "";

    private OperateEnum operateType;

    public OperateObj(int index, String targetStr, OperateEnum operateType) {
        this.index = index;
        this.targetStr = targetStr;
        this.operateType = operateType;
    }

    public int getIndex() {
        return index;
    }

    public void setIndex(int index) {
        this.index = index;
    }

    public String getTargetStr() {
        return targetStr;
    }

    public void setTargetStr(String targetStr) {
        this.targetStr = targetStr;
    }

    public OperateEnum getOperateType() {
        return operateType;
    }

    public void setOperateType(OperateEnum operateType) {
        this.operateType = operateType;
    }
}

The concrete implementation below computes the Levenshtein distance while keeping every cell of the matrix. The implementing class is StringUtils_SP:

package cn.com.sp.align.levenshtein;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

import cn.com.sp.align.model.OperateObj;
import cn.com.sp.align.model.OperateObj.OperateEnum;

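public class StringUtils_SP {

    /**
     * Computes the Levenshtein distance between s and t and appends the edit
     * operations, found by backtracking from the end of both strings, to operateList.
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t, List<OperateObj> operateList) {

        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        // note: no operations are recorded when either string is empty
        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        // full matrix, kept so that the edit path can be recovered afterwards
        int distance[][] = new int[s.length()+1][t.length()+1];

        for(int i=0; i<s.length()+1; ++i){
            distance[i][0] = i;
        }

        for(int j=1; j<t.length()+1; ++j){
            distance[0][j] = j;
        }

        int cost = 0;
        for(int i=1; i<s.length()+1; ++i){
            for(int j=1; j<t.length()+1; ++j){
                int tempCost = Math.min(distance[i-1][j]+1, distance[i][j-1]+1);
                if(s.charAt(i-1) == t.charAt(j-1)){
                    cost = 0;
                }else{
                    cost = 1;
                }
                distance[i][j] = Math.min(distance[i-1][j-1]+cost, tempCost);
            }
        }

        // backtrack from the bottom-right cell to recover the operations
        int i = s.length(), j = t.length();
        int minDistance = distance[i][j];
        while(i>0 && j>0){
            if(distance[i][j-1]+1 == minDistance){
                // insertion: keep s[i-1] and put t[j-1] right after it
                OperateObj operateObj = new OperateObj(i-1, s.charAt(i-1)+""+t.charAt(j-1), OperateEnum.add);
                operateList.add(operateObj);
                minDistance = distance[i][j-1];
                j -= 1;
            }else if(distance[i-1][j]+1 == minDistance){
                // deletion of s[i-1]
                OperateObj operateObj = new OperateObj(i-1, "", OperateEnum.delete);
                operateList.add(operateObj);
                minDistance = distance[i-1][j];
                i -= 1;
            }else if(distance[i-1][j-1]+1 == minDistance){
                // substitution of s[i-1] by t[j-1]
                OperateObj operateObj = new OperateObj(i-1, t.charAt(j-1)+"", OperateEnum.replace);
                operateList.add(operateObj);
                minDistance = distance[i-1][j-1];
                i -= 1;
                j -= 1;
            }else{
                // characters match, nothing to record
                i -= 1;
                j -= 1;
            }
        }

        // t is exhausted: delete what is left at the front of s
        while(i>0){
            OperateObj operateObj = new OperateObj(i-1, "", OperateEnum.delete);
            operateList.add(operateObj);
            minDistance = distance[i-1][j];
            i -= 1;
        }

        // s is exhausted: insert what is left of t in front of index 0.
        // Every operation replaces the single character at its index, so each
        // insertion must re-emit the character that currently sits at index 0.
        char head = s.charAt(0);
        while(j>0){
            OperateObj operateObj = new OperateObj(i, t.charAt(j-1)+""+head, OperateEnum.add);
            operateList.add(operateObj);
            head = t.charAt(j-1);
            minDistance = distance[i][j-1];
            j -= 1;
        }

        // the distance is the bottom-right cell of the matrix
        return distance[s.length()][t.length()];
    }

    // Test driver
    public static void main(String[] args){
        String s = "中华人民共和国";
        String t = "中化人名和国";

        ArrayList<OperateObj> operateList = new ArrayList<OperateObj>();

        // the printed label "编辑距离为" means "the edit distance is"
        System.out.println("编辑距离为 : "+StringUtils_SP.getLevenshteinDistance(s, t, operateList));

        // apply the operations in the order they were recorded (right to left in s)
        String operateStr = s;
        for (int i = 0; i < operateList.size(); ++i) {
            OperateObj operateObj = operateList.get(i);

            // original character (index, operation type) -> replacement text
            System.out.println(s.charAt(operateObj.getIndex())+"("+operateObj.getIndex()+","+operateObj.getOperateType()+") -> "+operateObj.getTargetStr());

            operateStr = operateStr.substring(0, operateObj.getIndex()) + operateObj.getTargetStr() + operateStr.substring(operateObj.getIndex() + 1);
        }
        // for this input, operateStr ends up equal to t
    }
}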

编辑距离为 : 3
共(4,delete) ->
民(3,replace) -> 名
华(1,replace) -> 化
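Each recorded operation means "replace the single character at index with targetStr", and the backtracking emits the operations from right to left, so applying them in list order never shifts an index that is still to be used. The sketch below replays three hand-written operations of that shape for kitten and sitting. The class `ApplyOperationsSketch` and its helpers `applyAt` and `applyAll` are hypothetical, and plain arrays stand in for the OperateObj list.

```java
// Illustrative sketch: replaying "replace the character at index" operations.
// ApplyOperationsSketch, applyAt and applyAll are hypothetical names.
class ApplyOperationsSketch {

    // Replace the single character at index with target. An empty target
    // deletes the character; a two-character target keeps the old character
    // and inserts a new one next to it.
    static String applyAt(String current, int index, String target) {
        return current.substring(0, index) + target + current.substring(index + 1);
    }

    // Apply the operations in the given order (expected: right to left).
    static String applyAll(String source, int[] indexes, String[] targets) {
        String current = source;
        for (int k = 0; k < indexes.length; k++) {
            current = applyAt(current, indexes[k], targets[k]);
        }
        return current;
    }

    public static void main(String[] args) {
        // add g after n (index 5), replace e with i (index 4), replace k with s (index 0)
        int[] indexes = {5, 4, 0};
        String[] targets = {"ng", "i", "s"};
        System.out.println(applyAll("kitten", indexes, targets)); // prints sitting
    }
}
```

Applying the same operations from left to right would not work in general, because an insertion or deletion near the start changes the position of every later character.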
