
Building a Machine Translation System with Moses, with Experiment Notes

2011-02-22 17:14

I. Setting up the experiment environment:

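On Ubuntu you can simply download and install deb packages. The packages needed are SRILM, GIZA++ and mkcls, plus some scripts available from
http://www.statmt.org/wmt08/scripts.tgz

On other systems you need to download the source code of these tools and compile it yourself.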

The steps for building the environment from source are outlined below; if you installed from deb packages you can skip them:

1. Installing SRILM:

Download the SRILM source archive from http://www.speech.sri.com/projects/srilm/download.html and unpack it.

First, check SRILM's dependencies, which are:

1) A template-capable ANSI-C/C++ compiler, gcc version 3.4.3 or higher.

2) GNU make, to control compilation and installation.

3) GNU gawk, required for many of the utility scripts.

4) GNU gzip to unpack the distribution, and to allow SRILM programs to handle ".Z" and ".gz" compressed datafiles (highly recommended).

5) bzip2 to handle ".bz2" compressed files (optional).

6) p7zip to handle "7-zip" compressed files (optional).

7) The Tcl embeddable scripting language library (only required for some of the test executables).

8) csh Unix shell

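If the required tools are not all installed, the SRILM build will fail. Use the which command to check whether each one is present. For example, if which make prints /usr/bin/make, GNU make is already installed. Anything missing can be installed with apt-get install *** (the corresponding package).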
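The per-tool check can be done in one pass. The following is a minimal sketch, not part of the original post: missing_tools is a hypothetical helper name, and the command names (for instance 7z for p7zip and tclsh as a stand-in for the Tcl library) are assumptions that may differ on your system.

```shell
# List which of the given tools cannot be found on PATH.
# missing_tools is a hypothetical helper name used only in this sketch.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can locate the tool
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed for the dependency list; tclsh stands in for Tcl.
missing=$(missing_tools gcc g++ make gawk gzip bzip2 7z tclsh csh)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
else
    echo "All listed tools were found."
fi
```

Finding tclsh does not prove that the Tcl development headers are installed, so treat the result as a first check only.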

Next, edit Makefile and common/Makefile.machine.i686:

1) Edit Makefile

Find the following two lines:

# SRILM = /home/speech/stolcke/project/srilm/devel
Below it, add a new line giving the SRILM installation path:
SRILM = $(PWD)

MACHINE_TYPE := $(shell $(SRILM)/sbin/machine-type)
Comment this line out with a leading # and add a new line:
MACHINE_TYPE := i686
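For repeated installs, the two Makefile edits can be scripted. This is only a sketch under the assumption that the Makefile contains the two lines quoted above; patch_srilm_makefile is a hypothetical helper name, and it keeps an untouched copy as Makefile.bak so the edit can be re-run or undone.

```shell
# patch_srilm_makefile MAKEFILE [MACHINE]
# Adds "SRILM = $(PWD)" after the commented SRILM example line and pins
# MACHINE_TYPE, commenting out the auto-detection line.
patch_srilm_makefile() {
    makefile=$1
    machine=${2:-i686}
    # Keep the pristine file once; always patch from that copy.
    [ -f "$makefile.bak" ] || cp "$makefile" "$makefile.bak" || return 1
    awk -v machine="$machine" '
        /^# *SRILM *=/      { print; print "SRILM = $(PWD)"; next }
        /^MACHINE_TYPE *:=/ { print "# " $0
                              print "MACHINE_TYPE := " machine; next }
                            { print }
    ' "$makefile.bak" > "$makefile"
}
```

Called from the SRILM top-level directory as patch_srilm_makefile Makefile i686, it should reproduce the manual edit; compare the result with Makefile.bak before building.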

2) Edit srilm/common/Makefile.machine.i686

cd common/

cp Makefile.machine.i686 Makefile.machine.i686.bak

vi Makefile.machine.i686

Change the three lines under "# Use the GNU C compiler" (line 15) to:

GCC_FLAGS = -mtune=pentium3 -Wreturn-type -Wimplicit

CC = gcc $(GCC_FLAGS)

CXX = g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

Change the two lines under "# Tcl support (standard in Linux)" (line 51) to:

TCL_INCLUDE = -I/usr/include/tcl8.5

TCL_LIBRARY = -L/usr/lib/tcl8.5

If you have a different Tcl version, adjust these paths accordingly.

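Then compile SRILM:

cd ..

make World

If all goes well, SRILM now builds. If something goes wrong, the most likely cause is a missing dependency; go back to the dependency list above and check again.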

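Finally, go into the srilm/test directory and run the tests:

A build that finishes without errors is not necessarily a correct build, so you must run the test suite that SRILM provides.

1) Add the directories that hold the compiled SRILM tools to PATH:

export PATH=$PATH:/home/fuxiaoyin/MTworkdir/srilm/bin/i686:/home/fuxiaoyin/MTworkdir/srilm/bin

2) Enter test and run the tests:

make test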

The terminal output looks like this:

*** Running test adapt-marginals ***

real 0m11.701s
user 0m11.507s
sys 0m0.148s

adapt-marginals: stdout output IDENTICAL.

adapt-marginals: stderr output IDENTICAL.

If most of the results are IDENTICAL and only a few are DIFFERS, SRILM has been built successfully! ^_^

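3) An export only lasts for the current login session; after switching users or rebooting you have to export the path again. To make it permanent, the following setting is needed. (This step is not strictly necessary; I did not do it on my own machine.)

In /etc/profile, find the following block:

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE

for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        if [ "$PS1" ]; then
            . $i
        else
            . $i >/dev/null 2>&1
        fi
    fi
done

export PATH=$PATH:/home/fuxiaoyin/MTworkdir/srilm/bin/i686:/home/fuxiaoyin/MTworkdir/srilm/bin

The final export PATH line is the statement to insert (it was shown in red in the original post). Place it after done rather than inside the loop, so that the directories are appended only once. After saving, run the tests again.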

2. Installing the translation-model training tools GIZA++ and mkcls:

1) Download and unpack GIZA++ in the mtworkdir directory:

    cd /home/52nlp/mtworkdir

    wget http://ling.umd.edu/~redpony/software/giza++.gcc41.tar.gz
    tar -zxvf giza++.gcc41.tar.gz

Unpacking creates the GIZA++-v2/ directory.

2) Compile GIZA++:

    cd GIZA++-v2

    make

3) Download, unpack and compile mkcls:

    cd ..    (back to the mtworkdir directory)

    wget http://ling.umd.edu/~redpony/software/mkcls.gcc41.tar.gz
    tar -zxvf mkcls.gcc41.tar.gz

    cd mkcls-v2

    make

This step usually goes through without problems.

4) Create a bin directory and copy the GIZA++ and mkcls tools into it:

    cd ..

    mkdir -p bin

    cp GIZA++-v2/GIZA++ bin/

    cp GIZA++-v2/snt2cooc.out bin/

    cp mkcls-v2/mkcls bin/

3. Installing Moses:

1) Create a directory and check out Moses with svn:

    mkdir -p moses

    svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses

2) Once the checkout has finished, compile:

    cd moses

    ./regenerate-makefiles.sh

    ./configure --with-srilm=.../mtworkdir/srilm

    make -j 4

    cd ..

Note: --with-srilm must be given an absolute path.

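3) Install the Moses training scripts

Create the directory for the training scripts:

    mkdir -p bin/moses-scripts

Edit the Makefile:

    vi moses/scripts/Makefile

Change lines 13 and 14 to:

    TARGETDIR=.../mtworkdir/bin/moses-scripts    (absolute path of the moses-scripts directory)

    BINDIR=.../mtworkdir/bin    (absolute path of the bin directory)

Build:

    cd moses/scripts/

    make release

    cd ../..

Before using the scripts, set the environment variable:

    export SCRIPTS_ROOTDIR=.../mtworkdir/bin/moses-scripts/scripts-20090113-1019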
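The directory name scripts-20090113-1019 looks like a build timestamp, so it will presumably differ on each build. A small helper can pick up whatever release directory exists. This is a sketch under that assumption; latest_scripts_dir and MT_BIN are hypothetical names, not part of Moses.

```shell
# latest_scripts_dir BASE: print the newest scripts-* directory under BASE.
# Assumes the suffix is a YYYYMMDD-HHMM timestamp, which sorts
# lexicographically in chronological order.
latest_scripts_dir() {
    base=$1
    latest=
    for d in "$base"/scripts-*/; do
        [ -d "$d" ] || continue      # the glob matched nothing
        latest=${d%/}                # globs expand sorted; keep the last one
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example (MT_BIN is a hypothetical variable holding your bin directory):
# SCRIPTS_ROOTDIR=$(latest_scripts_dir "$MT_BIN/moses-scripts") && export SCRIPTS_ROOTDIR
```

The function returns a non-zero status when no release directory is found, so the export in the example is skipped in that case.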

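II. Experiment steps:

1. Create a work-dir directory and four working directories under it:

work-dir/                 working directory; the name is up to you
    alignment-corpus/     training data
    lang-model-corpus/    language-model data
    tuning/               DEV data for MERT
    evaluation/           test data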
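The layout can be created in one go. A minimal sketch, where make_workdir is a hypothetical helper name and the root directory is whatever you pass in:

```shell
# make_workdir ROOT: create the four sub-directories used in the steps below.
make_workdir() {
    root=$1
    for sub in alignment-corpus lang-model-corpus tuning evaluation; do
        mkdir -p "$root/$sub" || return 1
    done
}
```

For example, make_workdir work-dir creates the tree shown above in the current directory; running it again on an existing tree changes nothing.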

2. Build a Chinese word segmentation tool by calling the ICTCLAS API [see the previous blog post for details]

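3. In the alignment-corpus/ directory, segment the Chinese text, tokenize the English text, filter out overly long sentences, and lowercase:

cd ChineseSegmentation

./ChineseSeg

(These two commands perform the Chinese word segmentation; ChineseSeg is the binary of a segmentation program I wrote. clean-corpus-n.perl below reads corpus.tok.ch and corpus.tok.en, so the segmented Chinese text has to be saved as corpus.tok.ch.)

cd '.../work-dir/alignment-corpus'

tokenizer.perl -l en < '.../work-dir/alignment-corpus/e.txt' > corpus.tok.en

clean-corpus-n.perl corpus.tok ch en corpus.clean 1 40

lowercase.perl < corpus.clean.en > corpus.lowercased.en

(The Chinese side is not lowercased; copy corpus.clean.ch to corpus.lowercased.ch.)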
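To make the effect of the "1 40" arguments concrete, here is a rough stand-in for the length filter. It is a sketch, not the Moses script: length_filter is a hypothetical function that mimics the argument order of clean-corpus-n.perl and keeps a sentence pair only when both sides have between MIN and MAX whitespace-separated tokens. The real script may apply further checks that are not reproduced here.

```shell
# length_filter PREFIX L1 L2 OUT MIN MAX
# Reads PREFIX.L1 and PREFIX.L2 line by line in parallel and writes the
# surviving pairs to OUT.L1 and OUT.L2.
length_filter() {
    prefix=$1 l1=$2 l2=$3 out=$4
    : > "$out.$l1"; : > "$out.$l2"       # make sure both outputs exist
    awk -v f2="$prefix.$l2" -v o1="$out.$l1" -v o2="$out.$l2" \
        -v min="$5" -v max="$6" '
        {
            line1 = $0; n1 = NF
            if ((getline line2 < f2) <= 0) exit   # second file ended early
            n2 = split(line2, tmp, " ")           # token count of the other side
            if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
                print line1 > o1
                print line2 > o2
            }
        }
    ' "$prefix.$l1"
}
```

Because a pair is dropped or kept as a whole, the two output files stay line-aligned, which is what the alignment step later relies on.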

4. Training the English language model:

A 3-gram model is used here.

cd '.../work-dir/lang-model-corpus'

ngram-count -order 3 -interpolate -kndiscount -text '.../work-dir/lang-model-corpus/corpus.lowercased.en' -lm corpus.lm

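5. Training the translation model:

Train on the segmented Chinese and tokenized English text to produce the translation model.

cd ..

train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus alignment-corpus/corpus.lowercased -f ch -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:'.../work-dir/lang-model-corpus/corpus.lm':0

(Note: the corpus used in the steps so far is the training set, which consists of one Chinese file and one corresponding English file.)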

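6. Minimum error rate training (MERT). This step is run on the development set; mine consists of one Chinese text and four corresponding reference English translations:

First, use the method from step 3 to segment or tokenize and lowercase these five texts, producing five files in the tuning/ folder: input (the segmented Chinese development text) and ref0 to ref3 (the tokenized, lowercased English references).

mert-moses.pl tuning/input tuning/ref /usr/bin/moses model/moses.ini --working-dir tuning --rootdir /usr/share/moses/scripts

(tuning/input is the Chinese text and tuning/ref is the prefix of the English references; the command automatically looks for the files named ref0 to refn.)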
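Since the tuning run is long, it may be worth confirming beforehand that the reference files are named and sized as expected. The following sketch is not part of Moses; check_refs is a hypothetical helper that looks for PREFIX0, PREFIX1, and so on, and compares each line count with the input file.

```shell
# check_refs INPUT REF_PREFIX
# Succeeds only if at least one of REF_PREFIX0, REF_PREFIX1, ... exists
# and every one found has exactly as many lines as INPUT.
check_refs() {
    input=$1 prefix=$2
    [ -f "$input" ] || { echo "missing input file $input" >&2; return 1; }
    want=$(($(wc -l < "$input")))        # arithmetic strips wc padding
    n=0
    while [ -f "$prefix$n" ]; do
        have=$(($(wc -l < "$prefix$n")))
        if [ "$have" -ne "$want" ]; then
            echo "$prefix$n has $have lines, expected $want" >&2
            return 1
        fi
        n=$((n + 1))
    done
    if [ "$n" -eq 0 ]; then
        echo "no reference files found for prefix $prefix" >&2
        return 1
    fi
    echo "$n reference file(s) match $input"
}
```

With the layout above the call would be check_refs tuning/input tuning/ref; the same check applies to the evaluation files in step 7.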

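The MERT run takes a long time, depending on the size of the development corpus and the speed of the machine. Mine ran for 10 hours, so I simply left it running overnight~~

When it finishes successfully we have a tuned moses.ini file in tuning/, which is the core file for Chinese-to-English translation. It needs one more processing step:

reuse-weights.perl tuning/moses.ini < model/moses.ini > tuning/moses.weight-reused.ini

This gives the final translation model configuration, tuning/moses.weight-reused.ini.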

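7. Translating and scoring the test data. This is run on the test set; mine consists of one Chinese text and four corresponding reference English translations:

First, use the method from step 3 to segment or tokenize and lowercase these five texts, producing five files in the evaluation/ folder: test.ch.input (the segmented Chinese test text) and test.en.ref0 to test.en.ref3 (the tokenized, lowercased English test references).

Filter the phrase table:

filter-model-given-input.pl evaluation/corpus tuning/moses.weight-reused.ini evaluation/test.ch.input

Decode with moses:

moses -config evaluation/corpus/moses.ini -input-file evaluation/test.ch.input > evaluation/test.ch.output

Compute the BLEU score:

multi-bleu.perl evaluation/test.en.ref < evaluation/test.ch.output > result.txt

The final score is written to result.txt, which contains the BLEU value of the system on the test set. If it exceeds the baseline we expected, the system is doing well~~

My implementation reached a BLEU score of 16.59 for Chinese-to-English translation, which is not ideal. Still, I am recording the whole procedure here for future use~~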

This post draws on:
http://www.52nlp.cn/ubuntu-moses-platform-build-process-record
http://blog.sina.com.cn/s/blog_6cb4abf10100nw9z.html
