String Edit Distance
2012-06-14 11:20
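I originally wrote this post on my Sina blog and have now moved it over here, so call it my first post on this site.
The problem statement is below, followed by my implementation. The code is still being tested and is offered only for study and discussion!
Write a program that computes the edit distance between two strings. The edit distance is defined and computed as follows: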
Given two strings A and B, edit A into B with the minimum number of edit operations:
a) Replace a letter with another letter
b) Insert a letter
c) Delete a letter
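E.g.
A = interestingly     _i__nterestingly
B = bioinformatics    bioinformatics__
                      1011011011001111
Edit distance = 11
Here '_' is a space, and a 1 in the third row marks a column where the two aligned characters differ, i.e. one edit operation. There are eleven such columns.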
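To make the unit-cost definition concrete, here is a minimal sketch of the usual dynamic-programming computation. The function name levenshtein is my own and is not part of the class given later in this post; it assumes replace, insert and delete each cost 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the implementation in this post):
// unit-cost edit distance, where replace, insert and delete each cost 1.
int levenshtein(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size(), m = b.size();
    // prev[j]: distance between the first i-1 letters of a and the first j letters of b.
    // cur[j]:  the same for the first i letters of a.
    std::vector<int> prev(m + 1), cur(m + 1);
    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = static_cast<int>(j);              // j inserts turn "" into b[0..j)

    for (std::size_t i = 1; i <= n; ++i) {
        cur[0] = static_cast<int>(i);               // i deletes turn a[0..i) into ""
        for (std::size_t j = 1; j <= m; ++j) {
            const int replaceCost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({ prev[j - 1] + replaceCost,  // keep or replace
                                prev[j] + 1,                // delete a[i-1]
                                cur[j - 1] + 1 });          // insert b[j-1]
        }
        std::swap(prev, cur);
    }
    return prev[m];
}
```

Calling levenshtein("interestingly", "bioinformatics") returns 11, which agrees with the example above. Only two rows of the table are kept, since each cell depends only on the current and the previous row.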
Instead of minimizing the number of edit operations, we can associate a cost function with the operations and minimize the total cost. This cost is called the edit distance. In computational biology, people prefer string alignment to string editing, and they use a similarity function, instead of a cost function, to evaluate the goodness of the alignment.
Example of a similarity function: match = 2; mismatch, insert and delete = -1 each.
Consider two strings, ACAATCC and AGCATGC.
One of their alignments is
A_CAATCC
AGCA_TGC
In the above alignment, a space ('_') is introduced into each string. There are 5 matches, 1 mismatch, 1 insert, and 1 delete, so the alignment has similarity score 5 × 2 - 3 = 7.
Note that the above alignment has the maximum score. Such an alignment is called an optimal alignment. The string alignment problem asks for the alignment with the maximum similarity score. It is also called the global alignment problem.
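As a quick check of the arithmetic, the sketch below scores an alignment that is already written out with '_' for spaces. The name alignmentScore is hypothetical, and the scores (match 2, everything else -1) are the example similarity function from above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: scores a finished alignment column by column.
// Assumed similarity function: match = 2; mismatch, insert, delete = -1.
int alignmentScore(const std::string& top, const std::string& bottom)
{
    if (top.size() != bottom.size())
        throw std::invalid_argument("aligned strings must have equal length");

    int score = 0;
    for (std::size_t k = 0; k < top.size(); ++k) {
        const bool topGap = (top[k] == '_');
        const bool bottomGap = (bottom[k] == '_');
        if (topGap && bottomGap)
            throw std::invalid_argument("a column cannot hold two spaces");
        if (topGap || bottomGap)
            score += -1;                    // insert or delete
        else if (top[k] == bottom[k])
            score += 2;                     // match
        else
            score += -1;                    // mismatch
    }
    return score;
}
```

alignmentScore("A_CAATCC", "AGCA_TGC") gives 7, while the gap-free alignment of ACAATCC with AGCATGC (4 matches, 3 mismatches) gives only 5.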
Needleman-Wunsch algorithm
Consider two strings S[1..n] and T[1..m]. Define V(i, j) to be the score of the optimal alignment between S[1..i] and T[1..j].
Basis (d(x, y) is the similarity score of aligning x with y, and _ is a space):
V(0, 0) = 0
V(0, j) = V(0, j-1) + d(_, T[j]): insert j times
V(i, 0) = V(i-1, 0) + d(S[i], _): delete i times
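Recurrence, for i > 0 and j > 0 (the rule the implementation in this post applies to each cell):
V(i, j) = max of
    V(i-1, j-1) + d(S[i], T[j])    (match or mismatch)
    V(i-1, j) + d(S[i], _)         (delete)
    V(i, j-1) + d(_, T[j])         (insert)
Example (I): [figure not preserved]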
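The basis above, together with the usual three-way maximum for the inner cells, can be sketched as follows. This is a compact illustration, not the class I give next: the names d and needlemanWunschScore are my own, and the scores are assumed to be the example similarity function (match 2, everything else -1).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Assumed similarity function: match = 2; mismatch, insert, delete = -1.
// '_' stands for a space.
int d(char x, char y)
{
    if (x == '_' || y == '_') return -1;    // insert or delete
    return (x == y) ? 2 : -1;               // match or mismatch
}

// Hypothetical helper: returns V(n, m), the optimal global alignment score.
int needlemanWunschScore(const std::string& S, const std::string& T)
{
    const std::size_t n = S.size(), m = T.size();
    std::vector<std::vector<int>> V(n + 1, std::vector<int>(m + 1, 0));

    // Basis: first row is j inserts, first column is i deletes.
    for (std::size_t j = 1; j <= m; ++j)
        V[0][j] = V[0][j - 1] + d('_', T[j - 1]);
    for (std::size_t i = 1; i <= n; ++i)
        V[i][0] = V[i - 1][0] + d(S[i - 1], '_');

    // Inner cells: best of match/mismatch, delete, insert.
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            V[i][j] = std::max({ V[i - 1][j - 1] + d(S[i - 1], T[j - 1]),
                                 V[i - 1][j] + d(S[i - 1], '_'),
                                 V[i][j - 1] + d('_', T[j - 1]) });
    return V[n][m];
}
```

needlemanWunschScore("ACAATCC", "AGCATGC") returns 7, the score of the optimal alignment shown earlier. The strings are 0-indexed in C++, so S[i] in the formulas becomes S[i - 1] in the code.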
The implementation follows:
Header file StrEditDistance.h
#pragma once
#include <string>

class CStrEditDistance
{
public:
    CStrEditDistance(std::string& vStrRow, std::string& vStrColumn);
    ~CStrEditDistance(void);

    int getScore() { return m_Score; }
    int getEditDis() { return m_EditDis; }
    void setEditDis(int vDis) { m_EditDis = vDis; }
    void setScore(int vScore) { m_Score = vScore; }

private:
    void process(const std::string& vStrRow, const std::string& vStrColumn);

    // Returns the largest of a, b and c.
    int getMaxValue(int a, int b, int c)
    {
        if (a < b)
        {
            if (b < c)
                return c;
            return b;
        }
        else
        {
            if (b > c)
                return a;
            return a < c ? c : a;
        }
    }

private:
    int m_EditDis;
    int m_Score;
};
Source file StrEditDistance.cpp
#include "StrEditDistance.h"
#include <iostream>
#include <iomanip>

#define MATCH 2
#define MISS_MATCH -1
#define INSERT -1
#define DELETE -1

CStrEditDistance::CStrEditDistance(std::string& vStrRow, std::string& vStrColumn)
{
    process(vStrRow, vStrColumn);
}

CStrEditDistance::~CStrEditDistance(void)
{
}

/***********************************************************************/
// FUNCTION: fills the score table, then traces back through it to count the edits
void CStrEditDistance::process(const std::string& vStrRow, const std::string& vStrColumn)
{
    int editDis = 0;                                    // edit distance
    int row = static_cast<int>(vStrColumn.length());
    int column = static_cast<int>(vStrRow.length());
    const int sizeR = row + 1;
    const int sizeC = column + 1;

    int **pScore = new int*[sizeR];                     // two-dimensional table
    for (int i = 0; i <= row; i++)
        pScore[i] = new int[sizeC];

    // Initialise the first row and the first column
    for (int c = 0; c <= column; c++)
        pScore[0][c] = 0 - c;
    for (int r = 0; r <= row; r++)
        pScore[r][0] = 0 - r;

    // Starting from v(1,1), fill the table column by column
    for (int c = 1; c <= column; c++)
    {
        for (int r = 1; r <= row; r++)
        {
            // v(i,j) is the largest of A: v(i-1,j)+insert, B: v(i,j-1)+delete, C: v(i-1,j-1)+d(i,j)
            int valueMatch;
            if (vStrColumn[r-1] == vStrRow[c-1])
                valueMatch = MATCH;
            else
                valueMatch = MISS_MATCH;
            int A = pScore[r-1][c] + INSERT;
            int B = pScore[r][c-1] + DELETE;
            int C = pScore[r-1][c-1] + valueMatch;
            pScore[r][c] = getMaxValue(A, B, C);
        }
    }

    // Compute the edit distance: trace back from v(row, column) and count every
    // column of the alignment that is not a match. Note that this counts the edits
    // in one optimal alignment under the scores above, which can exceed the minimum
    // number of unit-cost edits (e.g. xxa vs ayy gives 4, while 3 replacements suffice).
    int r = row, c = column;
    while (r > 0 && c > 0)
    {
        if (pScore[r][c] + 1 == pScore[r-1][c])         // came from above: gap
        {
            editDis++;
            r--;
        }
        else if (pScore[r][c] + 1 == pScore[r][c-1])    // came from the left: gap
        {
            editDis++;
            c--;
        }
        else if (pScore[r][c] + 1 == pScore[r-1][c-1])  // diagonal: mismatch
        {
            editDis++;
            r--;
            c--;
        }
        else                                            // diagonal: match
        {
            r--;
            c--;
        }
    }
    // Whatever is left of either string has to be inserted or deleted
    if (r > 0 && c == 0)
        editDis += r;
    else if (c > 0 && r == 0)
        editDis += c;

    std::cout << std::endl;
    //----------------DEBUG-------------------//
    // Print the intermediate table
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= column; j++)
            std::cout << std::setw(2) << pScore[i][j] << " ";
        std::cout << std::endl;
    }
    std::cout << std::endl;

    // Store the edit distance and the score
    setEditDis(editDis);
    setScore(pScore[row][column]);

    for (int i = 0; i <= row; i++)                      // free the memory
    {
        delete[] pScore[i];                             // rows were allocated with new[]
        pScore[i] = NULL;
    }
    delete[] pScore;
}
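The traceback in process() only counts edits. The same walk can also rebuild the two aligned strings, which makes the result easier to check by eye. Below is a self-contained sketch of that idea. The function traceAlignment and its constants are hypothetical and not part of CStrEditDistance; it prefers the diagonal move on ties, so it may pick a different optimal alignment than the one process() walks.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns one optimal global alignment of S and T,
// with '_' marking spaces. Assumed scores: match 2, mismatch -1, gap -1.
std::pair<std::string, std::string> traceAlignment(const std::string& S,
                                                   const std::string& T)
{
    const int kMatch = 2, kMismatch = -1, kGap = -1;
    const int n = static_cast<int>(S.size());
    const int m = static_cast<int>(T.size());

    // Fill the score table exactly as in the Needleman-Wunsch basis and recurrence.
    std::vector<std::vector<int>> V(n + 1, std::vector<int>(m + 1, 0));
    for (int j = 1; j <= m; ++j) V[0][j] = V[0][j - 1] + kGap;
    for (int i = 1; i <= n; ++i) V[i][0] = V[i - 1][0] + kGap;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            const int diag = (S[i - 1] == T[j - 1]) ? kMatch : kMismatch;
            V[i][j] = std::max({ V[i - 1][j - 1] + diag,
                                 V[i - 1][j] + kGap,
                                 V[i][j - 1] + kGap });
        }

    // Walk back from V(n, m), emitting one alignment column per step.
    std::string top, bottom;
    int i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            V[i][j] == V[i - 1][j - 1] + ((S[i - 1] == T[j - 1]) ? kMatch : kMismatch)) {
            top += S[i - 1]; bottom += T[j - 1]; --i; --j;   // match or mismatch
        } else if (i > 0 && V[i][j] == V[i - 1][j] + kGap) {
            top += S[i - 1]; bottom += '_'; --i;             // S[i] against a space
        } else {
            top += '_'; bottom += T[j - 1]; --j;             // space against T[j]
        }
    }
    // The columns were collected last-to-first.
    std::reverse(top.begin(), top.end());
    std::reverse(bottom.begin(), bottom.end());
    return std::make_pair(top, bottom);
}
```

For ACAATCC and AGCATGC the two returned strings have equal length, reduce to the inputs once the '_' characters are removed, and score 7 under the similarity function above. Which of the optimal alignments comes out depends on the order in which the three moves are tested.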
The main.cpp file:
#include "StrEditDistance.h"
#include <cstdlib>
#include <iostream>
#include <string>

#define MAX_SIZE 100

int main(int, char **)
{
    char sRow[MAX_SIZE];
    char sColumn[MAX_SIZE];

    std::cout << "input row string : ";
    std::cin.getline(sRow, MAX_SIZE);
    std::cout << "input column str : ";
    std::cin.getline(sColumn, MAX_SIZE);

    std::string strRow(sRow), strColumn(sColumn);

    // The constructor runs the whole computation; a stack object needs no delete.
    CStrEditDistance strEdit(strRow, strColumn);
    std::cout << "The score is : " << strEdit.getScore() << std::endl;
    std::cout << "The edit distance is : " << strEdit.getEditDis() << std::endl;

    std::system("pause");   // Windows only: keeps the console window open
    return 0;
}