One failure case when training a tesseract-ocr language library

2015-05-07 14:12

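I have read several other blog posts about tesseract-ocr, and the methods they describe for training a language library are all similar. However, small mistakes along the way kept me from getting the expected result. One example is the font properties file, which in my case had a .txt extension; for details on how to set it up, see the article at http://blog.csdn.net/firehood_/article/details/8433077. Following the steps in that article, everything worked until the final step, "7. Generate the language file", where an error occurred. The batch file from that article reads: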

rem Before running this batch file, create the font_properties file in this directory



echo Run Tesseract for Training..

tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train



echo Compute the Character Set..

unicharset_extractor.exe num.font.exp0.box

mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr



echo Clustering..

cntraining.exe num.font.exp0.tr



echo Rename Files..

rename normproto num.normproto

rename inttemp num.inttemp

rename pffmtable num.pffmtable

rename shapetable num.shapetable



echo Create Tessdata..

combine_tessdata.exe num.
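The batch file assumes that its input files already exist and never verifies this. As a minimal sketch, not part of the original tutorial, a pre-flight check in Python could list the files the script expects and report any that are absent before a single tool runs. The function name `missing_inputs` and its default arguments are illustrative assumptions; the file names themselves follow the batch file above.

```python
from pathlib import Path


def missing_inputs(workdir, lang="num", font="font",
                   font_properties="font_properties"):
    """Return the expected training inputs that are absent from workdir.

    The expected names mirror the batch file: <lang>.<font>.exp0.tif,
    <lang>.<font>.exp0.box, and the file name passed to mftraining -F.
    """
    base = f"{lang}.{font}.exp0"
    expected = [f"{base}.tif", f"{base}.box", font_properties]
    root = Path(workdir)
    # Keep only the names that do not exist as regular files.
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Check the current directory, which is where the batch file runs.
    missing = missing_inputs(".")
    if missing:
        print("Missing before training:", ", ".join(missing))
    else:
        print("All expected inputs are present.")
```

Any name this check reports is a file that a later step in the batch file would fail to open.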

However, running the batch file produced the following output:

E:\tesseract\tessdata>test.bat

E:\tesseract\tessdata>rem Before running this batch file, create the font_properties file in this directory

E:\tesseract\tessdata>echo Run Tesseract for Training..

Run Tesseract for Training..

E:\tesseract\tessdata>tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

Tesseract Open Source OCR Engine v3.02 with Leptonica

Page 1 of 5

APPLY_BOXES:

Boxes read from boxfile: 10

Found 10 good blobs.

TRAINING ... Font name = font

Generated training data for 1 words

Page 2 of 5

APPLY_BOXES:

Boxes read from boxfile: 10

Found 10 good blobs.

Generated training data for 4 words

Page 3 of 5

APPLY_BOXES:

Boxes read from boxfile: 10

Found 10 good blobs.

Generated training data for 1 words

Page 4 of 5

APPLY_BOXES:

Boxes read from boxfile: 10

Found 10 good blobs.

Generated training data for 1 words

Page 5 of 5

APPLY_BOXES:

Boxes read from boxfile: 10

Found 10 good blobs.

Generated training data for 1 words

E:\tesseract\tessdata>echo Compute the Character Set..

Compute the Character Set..

E:\tesseract\tessdata>unicharset_extractor.exe num.font.exp0.box

Extracting unicharset from num.font.exp0.box

Wrote unicharset file ./unicharset.

E:\tesseract\tessdata>mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

Warning: No shape table file present: shapetable

Failed to load font_properties from font_properties

E:\tesseract\tessdata>echo Clustering..

Clustering..

E:\tesseract\tessdata>cntraining.exe num.font.exp0.tr

Reading num.font.exp0.tr ...

Clustering ...

Writing normproto ...

E:\tesseract\tessdata>echo Rename Files..

Rename Files..

E:\tesseract\tessdata>rename normproto num.normproto

E:\tesseract\tessdata>rename inttemp num.inttemp

The system cannot find the file specified.

E:\tesseract\tessdata>rename pffmtable num.pffmtable

The system cannot find the file specified.

E:\tesseract\tessdata>rename shapetable num.shapetable

The system cannot find the file specified.

E:\tesseract\tessdata>echo Create Tessdata..

Create Tessdata..

E:\tesseract\tessdata>combine_tessdata.exe num.

Combining tessdata files

Error opening unicharset file

Error combining tessdata files into num.traineddata

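This was very frustrating, but after repeated attempts I finally found the problem. The font properties file had been saved with a .txt extension, so the name font_properties passed to -F did not match any file, and mftraining stopped with "Failed to load font_properties from font_properties". Because mftraining never finished, inttemp, pffmtable and shapetable were not written, the rename commands failed, and combine_tessdata.exe could not open the unicharset file. Changing the line "mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr" in the batch file to "mftraining -F font_properties.txt -U unicharset -O num.unicharset num.font.exp0.tr" solved the problem, and the training produced the expected result.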
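The fix comes down to passing mftraining the file name that actually exists on disk. The following Python sketch illustrates that idea under stated assumptions: the helper names `find_font_properties` and `mftraining_args` are hypothetical, and the only two candidate names tried are the ones that appear in this post. It builds the command line but does not run it.

```python
from pathlib import Path


def find_font_properties(workdir, stem="font_properties"):
    """Return the font properties file name present in workdir, or None.

    Tries the bare name first, then the same name with a .txt extension,
    which is the extension the file had in this case.
    """
    root = Path(workdir)
    for candidate in (stem, stem + ".txt"):
        if (root / candidate).is_file():
            return candidate
    return None


def mftraining_args(font_properties, lang="num", font="font"):
    """Build the mftraining command line used in the batch file."""
    return [
        "mftraining",
        "-F", font_properties,
        "-U", "unicharset",
        "-O", f"{lang}.unicharset",
        f"{lang}.{font}.exp0.tr",
    ]


if __name__ == "__main__":
    name = find_font_properties(".")
    if name is None:
        print("No font properties file found in the current directory.")
    else:
        # Print the command instead of executing it.
        print(" ".join(mftraining_args(name)))
```

Renaming the file to font_properties (no extension) and leaving the batch file unchanged should work as well, since what matters is that the name given to -F matches the file on disk.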
