perl -正向最大匹配 转自Sighan 提供的FMM程序
2012-03-20 09:52
239 查看
#!/usr/bin/perl -w
# GBK编码,参数一 词典 参数二 待分文本
#转自Sighan 提供的FMM程序 ,不是原创
#所以把人家的声明都放在下面了
###########################################################################
# #
# SIGHAN #
# Copyright (c) 2003 #
# All Rights Reserved. #
# #
# Permission is hereby granted, free of charge, to use and distribute #
# this software and its documentation without restriction, including #
# without limitation the rights to use, copy, modify, merge, publish, #
# distribute, sublicense, and/or sell copies of this work, and to #
# permit persons to whom this work is furnished to do so, subject to #
# the following conditions: #
# 1. The code must retain the above copyright notice, this list of #
# conditions and the following disclaimer. #
# 2. Any modifications must be clearly marked as such. #
# 3. Original authors' names are not deleted. #
# 4. The authors' names are not used to endorse or promote products #
# derived from this software without specific prior written #
# permission. #
# #
# SIGHAN AND THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL WARRANTIES #
# WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF #
# MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL SIGHAN NOR THE #
# CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL #
# DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA #
# OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER #
# TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR #
# PERFORMANCE OF THIS SOFTWARE. #
# #
###########################################################################
# #
# Author: Richard Sproat (rws@research.att.com) #
# #
###########################################################################
$USAGE = "Usage:\t$0 dictionary\n\t";
if (@ARGV < 1) {print "$USAGE\n"; exit;}
%dict = ();
$maxwlen = 0;
open (S, $ARGV[0]) or die "$ARGV[0]: $!\n";
while (<S>) {
chop;
$dict{$_} = 1;
my $l = length($_);
$maxwlen = $l if $l > $maxwlen;
}
close (S);
shift @ARGV;
$n = 0;
while (<>) {
chop;
s/\s*//g;
my $text = $_;
#print "$text\t";
while ($text ne "") {
$sub = substr($text, 0, $maxwlen);
while ($sub ne "") {
if ($dict{$sub}) {
print "$sub\/";
for (my $i = 0; $i < length($sub); ++$i) {
$text =~ s/^.//;
}
last;
}
$sub =~ s/.$//;
}
if ($sub eq "") {
if ($text =~ /^([\x21-\x7e])/) {
print "$1\/";
$text =~ s/^.//;
}
elsif ($text =~ /^([^\x21-\x7e].)/) {
print "$1\/";
$text =~ s/^..//;
}
else { ## shouldn't happen
print STDERR "Oops: shouldn't be here: $n\n";
print "$1\/";
$text =~ s/^.//;
}
}
}
print "\n";
++$n;
}
exit(0);
# GBK编码,参数一 词典 参数二 待分文本
#转自Sighan 提供的FMM程序 ,不是原创
#所以把人家的声明都放在下面了
###########################################################################
# #
# SIGHAN #
# Copyright (c) 2003 #
# All Rights Reserved. #
# #
# Permission is hereby granted, free of charge, to use and distribute #
# this software and its documentation without restriction, including #
# without limitation the rights to use, copy, modify, merge, publish, #
# distribute, sublicense, and/or sell copies of this work, and to #
# permit persons to whom this work is furnished to do so, subject to #
# the following conditions: #
# 1. The code must retain the above copyright notice, this list of #
# conditions and the following disclaimer. #
# 2. Any modifications must be clearly marked as such. #
# 3. Original authors' names are not deleted. #
# 4. The authors' names are not used to endorse or promote products #
# derived from this software without specific prior written #
# permission. #
# #
# SIGHAN AND THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL WARRANTIES #
# WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF #
# MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL SIGHAN NOR THE #
# CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL #
# DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA #
# OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER #
# TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR #
# PERFORMANCE OF THIS SOFTWARE. #
# #
###########################################################################
# #
# Author: Richard Sproat (rws@research.att.com) #
# #
###########################################################################
$USAGE = "Usage:\t$0 dictionary\n\t";
if (@ARGV < 1) {print "$USAGE\n"; exit;}
%dict = ();
$maxwlen = 0;
open (S, $ARGV[0]) or die "$ARGV[0]: $!\n";
while (<S>) {
chop;
$dict{$_} = 1;
my $l = length($_);
$maxwlen = $l if $l > $maxwlen;
}
close (S);
shift @ARGV;
$n = 0;
while (<>) {
chop;
s/\s*//g;
my $text = $_;
#print "$text\t";
while ($text ne "") {
$sub = substr($text, 0, $maxwlen);
while ($sub ne "") {
if ($dict{$sub}) {
print "$sub\/";
for (my $i = 0; $i < length($sub); ++$i) {
$text =~ s/^.//;
}
last;
}
$sub =~ s/.$//;
}
if ($sub eq "") {
if ($text =~ /^([\x21-\x7e])/) {
print "$1\/";
$text =~ s/^.//;
}
elsif ($text =~ /^([^\x21-\x7e].)/) {
print "$1\/";
$text =~ s/^..//;
}
else { ## shouldn't happen
print STDERR "Oops: shouldn't be here: $n\n";
print "$1\/";
$text =~ s/^.//;
}
}
}
print "\n";
++$n;
}
exit(0);
相关文章推荐
- 【分词】正向最大匹配中文分词算法
- 中文分词引擎 java 实现 — 正向最大、逆向最大、双向最大匹配法
- python正向最大匹配分词和逆向最大匹配分词
- 正向最大匹配和反向最大匹配
- 一个简单最大正向匹配(Maximum Matching)MM中文分词算法的实现
- 分词学习(1)--正向最大匹配分词
- 一个简单最大正向匹配(Maximum Matching)MM中文分词算法的实现
- python中文分词教程之前向最大正向匹配算法详解
- <正向/反向>最大匹配算法(Java)
- 中文分词算法之最大正向匹配算法(Python版)
- 基于正向最大匹配算法的分词算法
- 搜索引擎--范例:中英文混杂分词算法的实现--正向最大匹配算法的原理和实现
- 中文分词 (机械传统方法 )正向最大匹配
- 中文分词——正向最大匹配法
- PHP实现的最大正向匹配算法示例
- 用正向最大匹配法进行中文分词
- C#汉字转拼音,自动识别多音字,带声调,提供正向、逆向、双向分词算法的小程序
- 最大正向匹配算法 PHP实现
- java实现正向最大匹配分词
- 用正向和逆向最大匹配算法进行中文分词(续)