python-segment - segmentation and classify library written by python - Google Project Hosting
2012-05-30 17:09
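Introduction
python-segment is a word segmentation library implemented in pure Python. Its goal is to provide a usable, complete segmentation system and training environment, including a usable dictionary. For further details, see the INSTALL and README documents. The package also ships with a simple classification library; see the clsUsage document for details.
How it works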
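python-segment's dictionary records word frequencies but not parts of speech. The program works from pruning and word-frequency probabilities; it considers neither parts of speech nor Markov chains. The dictionary has two parts, single-character frequencies and multi-character word frequencies, which are counted and used separately. A dictionary generally comes in one of two forms: marshal format or txt format.
Performance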
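Measured on a virtual machine: after the dictionary is loaded, memory use (including the Python interpreter) is about 60 MB, and segmentation throughput is about 100k characters per second. Note that by default the program returns results with yield, which does not consume much memory. If you need to keep every word fragment produced, however, memory use grows substantially: in testing, a 10 MB text file (about 5 million characters) needed more than 120 MB of memory to hold the fragments.
Dictionary generation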
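Generate the frq.db file with dbmgr as follows.
    gunzip dict.tar.gz
    ./ps_dbmgr create dict.txt
You should see that frq.db has been created; this is the default dictionary file name. Note that the dictionary file format depends on the specific version, so it is best to regenerate the dictionary after switching versions.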
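To illustrate the two dictionary forms (txt and marshal), here is a minimal sketch of turning a plain-text frequency list into a marshal blob and back. The one-"word frequency"-pair-per-line layout, the `load_txt` helper, and the sample entries are assumptions made for illustration; the actual layout of dict.txt and frq.db is not described here and may differ.

```python
import marshal

# Hypothetical plain-text layout: one "word frequency" pair per line.
SAMPLE_TXT = u"中文 120\n分词 80\n词典 45\n"


def load_txt(text):
    """Parse 'word frequency' lines into a dict of word -> int count."""
    freq = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.rsplit(None, 1)  # split on the last run of whitespace
        freq[word] = int(count)
    return freq


freq = load_txt(SAMPLE_TXT)
blob = marshal.dumps(freq)      # compact binary form of the dict
restored = marshal.loads(blob)  # round-trips to an equal dict
print(restored == freq)
```

Python's marshal format is not guaranteed to stay the same across Python versions, so a blob written this way is best treated as a cache that can be rebuilt from the text form.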
Command-line usage
Suppose you have a text file, test.txt, containing Chinese plain text in any encoding.
    ./ps_cutter cutshow test.txt
cutter guesses the encoding automatically.
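One common way to guess an encoding is to try a list of candidates in order and keep the first one that decodes cleanly. The sketch below shows that idea only; `guess_decode` and its candidate list are hypothetical, and this is not necessarily how ps_cutter detects encodings.

```python
def guess_decode(data, candidates=("utf-8", "gb18030", "big5")):
    """Return (text, encoding) for the first candidate that decodes data cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("none of the candidate encodings matched")


sample = u"中文分词".encode("gb18030")  # bytes that are not valid UTF-8
text, encoding = guess_decode(sample)
print(encoding, text)
```

The order of candidates matters: strict encodings such as UTF-8 should be tried first, because permissive multi-byte encodings accept many byte sequences that were not written in them.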
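Using it from code
Suppose you have a frq.db dictionary available.
    import segment
    cut = segment.get_cutter('frq.db')
    print(list(cut.parse(u'工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作')))
Note that calling parse alone does not perform any segmentation, because parse returns a generator; the work happens only when the result is iterated, as list() does here.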
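To make the generator behaviour concrete, and to illustrate segmentation driven by word frequencies, here is a minimal sketch of maximum-probability unigram segmentation using dynamic programming. It is not python-segment's implementation: `toy_parse`, `FREQ`, and the 0.5 floor count for unknown characters are hypothetical, and the dynamic-programming search stands in for the library's pruning, whose details are not given here.

```python
import math

# Hypothetical toy frequency table; a real dictionary would be far larger.
FREQ = {u"研究": 50, u"研究生": 20, u"生命": 40, u"命": 5,
        u"的": 100, u"起源": 30, u"生": 8}
TOTAL = float(sum(FREQ.values()))
MAX_WORD_LEN = max(len(w) for w in FREQ)


def word_cost(word):
    """Negative log probability; unknown words get a small floor count (assumption)."""
    return -math.log(FREQ.get(word, 0.5) / TOTAL)


def toy_parse(text):
    """Yield the words of the lowest-cost segmentation of text (a generator)."""
    n = len(text)
    best = [0.0] + [float("inf")] * n  # best[i]: cost of the best split of text[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word not in FREQ and len(word) > 1:
                continue  # only single characters may be out-of-dictionary
            cost = best[start] + word_cost(word)
            if cost < best[end]:
                best[end], back[end] = cost, start
    # Walk the back-pointers to recover the words, then yield them in order.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    for word in reversed(words):
        yield word


result = toy_parse(u"研究生命的起源")  # a generator object; nothing computed yet
words = list(result)                   # iterating triggers the actual work
print(words)
```

With this toy table the split 研究 / 生命 / 的 / 起源 is chosen over 研究生 / 命 / 的 / 起源, because its words are jointly more probable. Keeping `words` as a list holds every fragment in memory at once, whereas looping over the generator directly lets each word be processed and then discarded.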