6 Reasons Not to Use Lucene (repost) - Hubble.net will try to address these problems
2009-03-12 18:18
Original article: 6 Reasons Not to Use Lucene
Lucene is an open-source full-text search engine toolkit. Thanks to its powerful search capabilities and ease of implementation, it has become very popular in China, to the point where any discussion of search seems to invoke Lucene. Last month the Lucene team released Java Lucene 2.3.1, and I expect many readers have already upgraded. Chinese-language writing about Lucene falls roughly into three categories:
The first is basic introductory material, typified by Che Dong's "Lucene: an introduction to the Java-based full-text retrieval engine";
The second covers Lucene's inverted-index internals and walkthroughs of its packages and implementation classes;
The third centers on Chinese word segmentation.
Every piece of software, including the great ones, has its shortcomings and its proper domain of application, and Lucene is no exception. Yet criticism of Lucene is hard to find in the Chinese community. Perhaps everyone is busy shipping projects: given Lucene's excellent reputation, nobody says a word against it, however large its flaws may be.
Today, while reading the blog of Cedric Champeau, CTO of LingWay (a vertical semantic search engine company), I came across a post titled "Why lucene isn't that good". Champeau gets straight to the point and lists six shortcomings of Lucene. Since Lingway has been using Lucene for several years, I believe his assessment is worth reading.
The 6 reasons not to use Lucene:
6. Lucene has no built-in support for clustering.
Lucene ships as an embeddable toolkit, and its core code provides no clustering support. There are three ways to cluster Lucene: (1) write your own Directory implementation; (2) use Solr; (3) use Nutch+Hadoop. With Solr you are forced to take its index server, and with Nutch you are forced to take its integrated crawler.
5. Span (proximity/range) searches are very slow.
Lucene's span searches were not part of the original design; they were added later. When a term occurs many times in a single document, search becomes very slow. That is why the author says Lucene is a high-performance full-text search engine only for basic boolean queries.
4. The scoring implementation is not pluggable, because tf/idf scoring runs through all of Lucene. Although terms can be boosted and Lucene's Query classes can be extended, complex custom scoring remains severely constrained.
3. Lucene is poorly designed.
Lucene's OO design is very weak: there are packages and classes, but almost no trace of design patterns. Is this the usual symptom of C or C++ programmers writing Java?
A. Lucene makes almost no use of interfaces; the Query classes (BooleanQuery, SpanQuery, TermQuery...) mostly inherit from an abstract superclass.
B. Lucene's iteration style is unnatural: there is no hasNext() method; next() returns a boolean and refreshes the object's context.
2. The closed API design makes extending Lucene difficult.
See point 3.
1. Lucene's search algorithms are not suited to grid computing.
Below is the original English article, Moving Lucene a step forward:
6. No built-in support for clustering. If you want to create clusters, either write your own implementation of a Directory, or use Solr, or Nutch+Hadoop. Both Solr and Nutch leverage Lucene, but they are not drop-in replacements. Lucene is embeddable, while Solr and Nutch are platforms you build on. It is not very surprising that the Hadoop idea emerged from the Lucene team: Lucene doesn't scale out. Its internals make it (very) fast in most common situations, but for large document sets you have to scale out, and since Lucene does not implement clustering at the core level, you must switch from Lucene to another search engine layer, which is not straightforward. The problem with switching to Solr or Nutch is that you carry things you probably won't need: integrated crawling in Nutch, and the indexing server in Solr.
5. Span queries are slow. This may be read as a problem specific to Lingway, where we make intensive use of span queries (the NEAR operator: "red NEAR car"), but the Lucene index structure was updated to add this feature after the fact; it was not part of the original design. The underlying implementation leads to complex algorithms that are very slow, especially when a term is repeated many times in a single large document. That's why I tend to say that Lucene is a high-performance text search engine only if you stick to basic boolean queries.
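To see why repeated terms hurt so much, here is a minimal plain-Java sketch of naive position-list proximity matching. This is not Lucene's actual span-query code; class and method names are invented for illustration. It only models the cost argument: matching two terms by comparing their position lists is quadratic in the number of occurrences, so a term repeated thousands of times in one large document explodes the work per query.

```java
import java.util.List;

// Illustrative sketch only: a naive model of NEAR matching over
// per-document term position lists, to show the cost blow-up.
public class NaiveNearMatcher {

    // Count pairs of positions (a from term1, b from term2) with |a - b| <= slop.
    // With p1 and p2 occurrences this is O(p1 * p2) comparisons, which is why
    // a term repeated many times in a single document makes span matching slow.
    public static int countNearMatches(List<Integer> pos1, List<Integer> pos2, int slop) {
        int matches = 0;
        for (int a : pos1) {
            for (int b : pos2) {
                if (Math.abs(a - b) <= slop) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // "red" at positions 1 and 10, "car" at positions 3 and 50, slop 5:
        // only the pair (1, 3) lies within distance 5.
        List<Integer> red = List.of(1, 10);
        List<Integer> car = List.of(3, 50);
        System.out.println(countNearMatches(red, car, 5)); // prints 1
    }
}
```

Real implementations walk sorted position lists to avoid the full cross product, but the work still grows with occurrence counts, which matches the slowdown the author describes.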
4. Scoring is not really pluggable. Lucene has its own implementation of a scoring algorithm, in which terms may be boosted, and it makes use of a Similarity class, but it soon shows its limitations when you want to perform complex scoring, for example based on actual matches and metadata about the query. If you want to do that, you'll have to extend the Lucene query classes. The fact is that Lucene was designed around tf/idf-like scoring algorithms, while in our situation, for linguistics-based scoring, the structure of Lucene's scoring facilities doesn't fit. We were obliged to override every Lucene query class in order to add support for our custom scoring. And that was a problem:
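For readers unfamiliar with the tf/idf family the author mentions, here is a minimal plain-Java sketch of that style of scoring. It is not Lucene's actual Similarity formula (the class name and exact damping are assumptions for illustration); it only shows the assumption baked into the engine: relevance is a function of term frequency and rarity, which is exactly what does not fit when you need linguistics-based scoring.

```java
// Minimal tf/idf-style scoring sketch in plain Java -- not Lucene's
// actual Similarity implementation, just the family of formulas the
// engine was designed around.
public class TfIdfSketch {

    // Classic tf * idf: term frequency in the document, damped by sqrt,
    // times inverse document frequency across the corpus.
    public static double score(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
        double tf = Math.sqrt(termFreqInDoc);
        double idf = Math.log((double) totalDocs / (1 + docsContainingTerm)) + 1.0;
        return tf * idf;
    }

    public static void main(String[] args) {
        // A rare term (in 2 of 1000 docs) outscores a common one (in 900
        // of 1000) at the same in-document frequency -- the core tf/idf idea.
        System.out.println(score(4, 2, 1000) > score(4, 900, 1000)); // prints true
    }
}
```

Any scoring signal that is not expressible as a per-term function like this (for example, the query metadata or linguistic features the author needs) has to be forced in by overriding the query classes themselves.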
3. Lucene is not well designed. As a software architect, I would tend to make this reason number 1. Lucene has a very poor OO design. Basically, you have packages and classes, but almost no design pattern usage. It always makes me think of an application written by C(++) developers who discovered Java and carried their bad practices with them. This is a serious point: whenever you have to customize Lucene to suit your needs (and you will have to), you'll face this problem. Here are some examples:
Almost no use of interfaces. The Query classes (for example BooleanQuery, SpanQuery, TermQuery...) are all subclasses of an abstract class. If you want to add a feature to those classes, you'll naturally want to write an interface that describes the contract for your extension, but since the abstract Query class does not implement any interface, you'll have to constantly cast your custom query objects to Query in order to use them in native Lucene calls. There are tons of examples like this (HitCollector, ...). This is also a problem when you want to use AOP and auto-proxying.
Unnatural iterator implementations. There is no hasNext() method; next() returns a boolean and refreshes the object's context. This is a pain when you want to keep track of iterated elements. I assume this was done intentionally to reduce the memory footprint, but once again it makes the algorithms both unclear and complex.
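The two iteration styles can be contrasted in a short self-contained sketch. The `LuceneStyleEnum` class below is an illustrative stand-in (not Lucene code) mimicking the shape of Lucene 2.x enumerations such as TermDocs, where a boolean `next()` both advances and mutates internal state, versus the standard `java.util.Iterator` contract.

```java
import java.util.Iterator;
import java.util.List;

// Side-by-side sketch of the two iteration styles the article contrasts.
public class IterationStyles {

    // Mimics Lucene 2.x-style enumerations: next() advances AND reports
    // whether an element exists; the "current" element lives in mutable state.
    static class LuceneStyleEnum {
        private final List<Integer> docs;
        private int index = -1;

        LuceneStyleEnum(List<Integer> docs) { this.docs = docs; }

        boolean next() { return ++index < docs.size(); }

        int doc() { return docs.get(index); }
    }

    public static int sumLuceneStyle(List<Integer> docs) {
        LuceneStyleEnum e = new LuceneStyleEnum(docs);
        int sum = 0;
        while (e.next()) {      // no hasNext(): you must consume to find out
            sum += e.doc();
        }
        return sum;
    }

    public static int sumJavaStyle(List<Integer> docs) {
        int sum = 0;
        Iterator<Integer> it = docs.iterator();
        while (it.hasNext()) {  // standard java.util.Iterator contract
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(3, 7, 11);
        System.out.println(sumLuceneStyle(docs)); // prints 21
        System.out.println(sumJavaStyle(docs));   // prints 21
    }
}
```

The boolean-returning style avoids look-ahead buffering, which supports the author's guess about memory footprint, but it forces every caller to interleave advancement and state reads, which is what makes the algorithms built on it hard to follow.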
2. A closed API which makes extending Lucene a pain. In the Lucene world, this is called a feature. The policy is to open up classes only when some user needs access to some functionality. This leads to an API where most classes are package-protected, which means you will never be able to extend them (unless you create your class in the same package, which is quite dirty for custom code), or you'll have to copy and rewrite the code. Moreover, and this is closely related to the previous point, there's a serious lack of OO design here too: some classes that should have been inner classes are not, and anonymous classes are used for complex operations exactly where you would typically need to override their behaviour. The reason invoked for closing the API is that the code has to be cleaned up and made stable before being released publicly. While the idea is honourable, once again it is a pain, because if you have some code that does not fit the mainstream Lucene idea, you'll have to constantly backport Lucene improvements to your version until your patch is accepted. And since the developers want to limit API changes for as long as possible, there's little chance your patch will ever be committed. Add some final modifiers on classes or methods and you hit the same problem. I don't think the Spring framework would have become so popular if its code had been so locked down...
1. Lucene's search algorithms are not adapted to grid computing. Lucene was written when hardware did not have much memory and multicore processors didn't exist. The index structure was therefore conceived and implemented to perform fast linear searches with a very small memory footprint. I have personally spent many hours trying to rewrite the span query algorithms to take advantage of a multithreaded context (for dual/quad core CPUs), but the iterator-based directory reading algorithms make it nearly impossible. In some rare cases you'll be able to optimize things and iterate over the index in parallel, but in most situations it will be impossible. In our case, when we have a very complex query with 50+ embedded span queries, the CPU sits almost idle while the system constantly waits on I/O, even when using a RAMDirectory.