Implementing near real time search in Solr using RankingAlgorithm
2011-09-15 18:15
Implementing NRT (Near Real Time Search) in Solr using RankingAlgorithm
By
Nagendra Nagarajayya
http://solr-ra.tgels.com
Summary
This paper describes how NRT (Near Real Time Search) can be implemented in Solr
using the RankingAlgorithm. The technical details of the NRT implementation are
discussed below.
Step 1:
Changes to DirectUpdateHandler2.java
a. code changes
The main code change is in DirectUpdateHandler2.java: a commit is no longer
needed for newly added documents to become searchable.
public int addDoc(AddUpdateCommand cmd) throws IOException {
…
The following code needs to be added to the method:
if (realtime) {
    IndexReader r = core.getNRTReader();        // (1)
    core.storeNRTReader(writer.getReader());    // (2)
    if (r != null) {
        r.close();                              // (3)
    }
}
Description:
If realtime is enabled in solrconfig.xml by adding <realtime>true</realtime>:
1. core.getNRTReader() retrieves the existing reader.
2. writer.getReader() returns a new reader that includes the newly added docs.
3. The old reader is closed.
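The swap described above can be sketched without Lucene on the classpath. This is an illustration only: StubReader and NrtReaderSwapSketch are hypothetical stand-ins for IndexReader and for the getNRTReader()/storeNRTReader() pair on SolrCore, and the AtomicReference is an assumption of the sketch rather than something the patch uses.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for Lucene's IndexReader; only close() matters here.
final class StubReader {
    final int generation;
    private boolean closed;

    StubReader(int generation) { this.generation = generation; }

    void close() { closed = true; }

    boolean isClosed() { return closed; }
}

// Hypothetical holder playing the role of SolrCore's NRT reader slot.
final class NrtReaderSwapSketch {
    private final AtomicReference<StubReader> current = new AtomicReference<StubReader>();

    StubReader get() { return current.get(); }

    // Publish the fresh reader first, then close the one it replaced.
    // Returns the replaced reader (or null) so callers can inspect it.
    StubReader publish(StubReader fresh) {
        StubReader old = current.getAndSet(fresh);
        if (old != null) {
            old.close();
        }
        return old;
    }

    public static void main(String[] args) {
        NrtReaderSwapSketch core = new NrtReaderSwapSketch();
        core.publish(new StubReader(1));                  // first add
        StubReader old = core.publish(new StubReader(2)); // second add
        System.out.println("current generation: " + core.get().generation);
        System.out.println("previous reader closed: " + old.isClosed());
    }
}
```

The sketch says nothing about searches that are still running against the reader being closed; in the patch, each searcher works on its own clone of the stored reader (Step 3).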
Other modifications to DirectUpdateHandler2.java:
protected void closeWriter() throws IOException {
    IndexReader r = core.getNRTReader();
    if (r != null) {
        r.close();
        core.storeNRTReader(null);
    }
    …
}
protected void rollbackWriter() throws IOException {
    try {
        numDocsPending.set(0);
        if (writer!=null) writer.rollback();
        IndexReader r = core.getNRTReader();
        if (r != null) {
            r.close();
            core.storeNRTReader(null);
        }
    …
}
Step 2:
Changes to SolrCore.java
a. code changes
The following instance attributes store the reader and the time of the update:
private HashMap<String, IndexReader> reader_hm = new HashMap<String, IndexReader>();
private HashMap<String, Long> update_hm = new HashMap<String, Long>();
The following methods make the new reader available to other components:
public IndexReader getNRTReader() {
    return reader_hm.get(name);
}
public long getNRTWhenTime() {
    Long l = update_hm.get(name);
    if (l == null) {
        return 0;
    }
    return l.longValue();
}
public void storeNRTReader(IndexReader ir) {
    reader_hm.put(name, ir);
    update_hm.put(name, Long.valueOf(System.currentTimeMillis()));
}
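Because the update handler writes these two maps while searchers read them, one possible hardening (not part of the patch) is to keep the reader and its timestamp together in a single immutable value inside a concurrent map. The sketch below assumes that design; NrtReaderRegistrySketch and Snapshot are hypothetical names, and the reader type is left generic so that the example compiles without Lucene.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry keyed by core name; R stands in for IndexReader.
final class NrtReaderRegistrySketch<R> {

    // Immutable pair, so a reader is never seen with another reader's timestamp.
    static final class Snapshot<R> {
        final R reader;
        final long whenMillis;

        Snapshot(R reader, long whenMillis) {
            this.reader = reader;
            this.whenMillis = whenMillis;
        }
    }

    private final ConcurrentMap<String, Snapshot<R>> byCore =
            new ConcurrentHashMap<String, Snapshot<R>>();

    // Storing null clears the entry: ConcurrentHashMap rejects null values.
    // Unlike the patch, clearing here also resets the timestamp to 0.
    void store(String coreName, R reader) {
        if (reader == null) {
            byCore.remove(coreName);
        } else {
            byCore.put(coreName, new Snapshot<R>(reader, System.currentTimeMillis()));
        }
    }

    R reader(String coreName) {
        Snapshot<R> s = byCore.get(coreName);
        return s == null ? null : s.reader;
    }

    long whenTime(String coreName) {
        Snapshot<R> s = byCore.get(coreName);
        return s == null ? 0L : s.whenMillis;
    }

    public static void main(String[] args) {
        NrtReaderRegistrySketch<String> registry = new NrtReaderRegistrySketch<String>();
        registry.store("mbartists", "reader-1");
        System.out.println(registry.reader("mbartists") + " stored at " + registry.whenTime("mbartists"));
    }
}
```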
Step 3:
Changes to SolrIndexSearcher.java
a. code changes
private IndexReader ir = null;
RankingAlgorithm uses the new IndexReader as follows:
ir1 = reader;
if (realtime) {                                          // (1)
    ir1 = core.getNRTReader();                           // (2)
    Long when = core.getNRTWhenTime();                   // (3)
    if (ir1 != null) {                                   // (4)
        if (this.when.longValue() < when.longValue()) {  // (5)
            if (ir != null) {
                ir.close();                              // (6)
            }
            ir = (IndexReader)ir1.clone();               // (7)
        }
        ir1 = ir;                                        // (8)
    } else {
        ir1 = reader;                                    // (9)
    }
}
Description:
1. Realtime is enabled.
2. Ask the core for a new IndexReader, if one is available.
3. Get the time at which that reader was stored.
4. If a reader exists,
5. compare timestamps to see whether it is a new reader;
6. if so, close the old reader
7. and clone the new reader.
8. Use the clone as the reader for the search.
9. Otherwise, use the old reader for the search.
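The refresh decision in the steps above can be sketched in isolation. FakeReader and NrtSearcherRefreshSketch are hypothetical stand-ins for IndexReader and SolrIndexSearcher. Two details are assumptions of the sketch rather than something the excerpt shows: the searcher records the timestamp of the reader it copied, and it takes a copy whenever it does not hold one yet.

```java
// Hypothetical stand-in for IndexReader: copy() plays the role of clone().
final class FakeReader {
    final int id;
    boolean closed;

    FakeReader(int id) { this.id = id; }

    FakeReader copy() { return new FakeReader(id); }

    void close() { closed = true; }
}

final class NrtSearcherRefreshSketch {
    private final FakeReader baseReader; // reader the searcher was opened with
    private FakeReader nrtCopy;          // this searcher's copy of the latest NRT reader
    private long seenWhen;               // store time of the reader nrtCopy was made from

    NrtSearcherRefreshSketch(FakeReader baseReader) {
        this.baseReader = baseReader;
    }

    // Returns the reader to search with, given what the core currently publishes.
    synchronized FakeReader readerFor(boolean realtime, FakeReader published, long publishedWhen) {
        if (!realtime || published == null) {
            return baseReader;                   // steps 1, 4 and 9
        }
        if (nrtCopy == null || seenWhen < publishedWhen) { // step 5
            if (nrtCopy != null) {
                nrtCopy.close();                 // step 6
            }
            nrtCopy = published.copy();          // step 7
            seenWhen = publishedWhen;            // remember which reader was copied
        }
        return nrtCopy;                          // step 8
    }

    public static void main(String[] args) {
        NrtSearcherRefreshSketch searcher = new NrtSearcherRefreshSketch(new FakeReader(0));
        System.out.println(searcher.readerFor(true, new FakeReader(1), 10L).id);
        System.out.println(searcher.readerFor(true, new FakeReader(2), 20L).id);
    }
}
```

Recording the timestamp after each copy is what lets repeated searches reuse the same copy until the core publishes a newer reader.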
public int maxDoc() throws IOException {
    if (realtime && ir != null) {    // (1)
        return ir.maxDoc();          // (2)
    }
    return super.maxDoc();           // (3)
}
Description:
1. If realtime is enabled and an NRT reader is available,
2. return maxDoc from the new reader.
3. Otherwise, return maxDoc from the old reader.
A new method, getWrappedReader(), returns the IndexReader instead of the
SolrIndexReader for faceting, filter queries (fq), etc.:
public IndexReader getWrappedReader() {
    if (ir != null && realtime) {
        return ir;
    }
    return reader.getWrappedReader();
}
public Document doc(int n, FieldSelector fieldSelector) throws IOException {
    try {
        if (ir != null && realtime) {
            // NRT path: note that fieldSelector is not applied here
            return ir.document(n);
        }
        return getIndexReader().document(n, fieldSelector);
    } catch(IOException t) {
        throw t;
    }
}
public Document doc(int i, Set<String> fields) throws IOException {
    Document d=null;
    if (documentCache != null) {
        d = (Document)documentCache.get(i);
        if (d!=null) return d;
    }
    if(!enableLazyFieldLoading || fields == null) {
        //d = getIndexReader().document(i);
        try {
            if (ir == null && realtime) {
                // pick up the NRT reader the first time it is needed
                IndexReader ir1 = core.getNRTReader();
                when = core.getNRTWhenTime();
                if (ir1 != null) {
                    ir = (IndexReader)ir1.clone();
                }
            }
            if (ir != null) {
                d = ir.document(i);
            } else {
                d = getIndexReader().document(i);
            }
        } catch(IOException t) {
            throw t;
        }
    } else {
        //d = getIndexReader().document(i,
        //    new SetNonLazyFieldSelector(fields));
        try {
            if (ir == null && realtime) {
                // pick up the NRT reader the first time it is needed
                IndexReader ir1 = core.getNRTReader();
                when = core.getNRTWhenTime();
                if (ir1 != null) {
                    ir = (IndexReader)ir1.clone();
                }
            }
            if (ir != null) {
                d = ir.document(i, new SetNonLazyFieldSelector(fields));
            } else {
                d = getIndexReader().document(i, new SetNonLazyFieldSelector(fields));
            }
        } catch(Throwable t) {
            throw new IOException(t);
        }
    }
    if (documentCache != null) {
        documentCache.put(i, d);
    }
    return d;
}
Step 4:
Changes to UnInvertedField.java:
a. code changes
public static UnInvertedField getUnInvertedField(String field, SolrIndexSearcher searcher) throws IOException {
    SolrCache cache = searcher.getFieldValueCache();
    if (cache == null) {
        return new UnInvertedField(field, searcher);
    }
    UnInvertedField uif = (UnInvertedField)cache.get(field);
    if (uif == null) {
        synchronized (cache) {
            uif = (UnInvertedField)cache.get(field);
            if (uif == null) {
                uif = new UnInvertedField(field, searcher);
                cache.put(field, uif);
            }
        }
    }
    /* NRT */
    if (searcher.maxDoc() > uif.index.length) {        // (1)
        /* need to make this dynamic */
        uif = new UnInvertedField(field, searcher);    // (2)
        cache.put(field, uif);                         // (3)
    }
    return uif;
}
Description:
1. Check if any docs were added
2. Create a new copy of UIF
3. Store this in the cache and return the new UIF
b. Change all getReader() method calls to getWrappedReader() in the file.
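The rebuild check in Step 4 compares the searcher's maxDoc with the length of the cached per-document array. A minimal sketch of that idea, with no Solr types, is shown here. GrowOnMaxDocCacheSketch is a hypothetical name, the int[] stands in for the UnInvertedField index array, and doing the whole lookup under one lock is a simplification chosen for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical cache: one int[] per field, with one slot per document.
final class GrowOnMaxDocCacheSketch {
    private final Map<String, int[]> cache = new HashMap<String, int[]>();
    int rebuilds; // how many times an entry was built or rebuilt

    // builder stands in for "new UnInvertedField(field, searcher)": it returns
    // an array whose length equals maxDoc at build time.
    synchronized int[] get(String field, int maxDoc, IntFunction<int[]> builder) {
        int[] index = cache.get(field);
        if (index == null || maxDoc > index.length) { // docs were added since the build
            index = builder.apply(maxDoc);
            cache.put(field, index);
            rebuilds++;
        }
        return index;
    }

    public static void main(String[] args) {
        GrowOnMaxDocCacheSketch c = new GrowOnMaxDocCacheSketch();
        c.get("a_type", 3, int[]::new); // initial build
        c.get("a_type", 3, int[]::new); // served from the cache
        c.get("a_type", 4, int[]::new); // rebuilt after one more doc
        System.out.println("rebuilds: " + c.rebuilds);
    }
}
```

A length comparison like this only notices growth; a change that leaves maxDoc unchanged would not trigger a rebuild in this sketch.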
Step 5:
Changes to SimpleFacet.java:
a. Change all getReader() method calls to getWrappedReader()
Step 6:
Changes to SolrConfig.java:
a. code changes
public boolean realtime = false;
public boolean getRealtime() {
    return realtime;
}
realtime = getBool("realtime", false);
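To show how a flag such as <realtime>true</realtime> could be read with a default of false, here is a standalone sketch that uses only the JDK's XML APIs. It is not Solr's Config implementation: RealtimeFlagSketch is a hypothetical class, and placing the element directly under the root <config> element is an assumption, since the paper only says it is added to solrconfig.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class RealtimeFlagSketch {

    // Returns true only when /config/realtime contains "true" (case-insensitive).
    // A missing element evaluates to the empty string, which parses as false.
    static boolean readRealtime(String solrconfigXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            solrconfigXml.getBytes(StandardCharsets.UTF_8)));
            String text = XPathFactory.newInstance().newXPath()
                    .evaluate("/config/realtime", doc);
            return Boolean.parseBoolean(text.trim());
        } catch (Exception e) {
            throw new IllegalStateException("could not read the realtime flag", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><realtime>true</realtime></config>";
        System.out.println("realtime enabled: " + readRealtime(xml));
    }
}
```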
Conclusion
Near real time search in Solr-RA works well: it allows searching concurrently
with indexing, without closing the IndexSearchers or clearing the caches. The NRT
implementation supports faceting, filter queries, etc. The facet counts can be
seen changing as documents are added in the screenshots below, Fig 1 and Fig 2.
Fig 1 shows a facet query for “john” against the mbartists index (from the book
Solr-14-Enterprise-Search-Server). Fig 2 shows the same query after adding a new artist to the index as follows:
curl "http://localhost:8990/solr/mbartists/update/csv?stream.file=/tmp/x.csv&encapsulator=%1f"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">163</int></lst>
</response>
cat /tmp/x.csv:
id,type,a_name,a_name_sort,a_alias,a_type,a_begin_date,a_end_date,a_member_name,a_member_id,a_release_date_latest,a_spell,a_spellPhrase,r_name,r_name_sort,r_name_facetLetter,r_a_name,r_a_id,r_attributes,r_type,r_official,r_lang,r_tracks,r_event_country,r_event_date,r_event_date_earliest,l_name,l_name_sort,l_type,l_begin_date,l_end_date,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks,t_trm_lookups,word,includes
Artist:3991866,Artist,John Ab Davis,John Ab Davis,,person,1942-12-29T00:00:00Z,1999-12-10T00:00:00Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Fig 1 shows numFound as 3256 and the facet count for “john” as 3256. Fig 2, taken
after adding a doc with curl, shows numFound as 3257 and the facet count for “john”
as 3257. The Solr query is:
http://192.168.1.126:8990/solr/mbartists/select/?q=john&facet=on&facet.field=a_name&facet.field=a_type&fl=score
Note: make sure you clear the browser cache before repeating the query.
The indexing performance observed on a 2-core Intel system with Fedora Linux 12 is
about 262 tps (new document adds). This could be improved substantially (from
about 14 secs to about 2 secs for indexing about 3900 documents) if
IndexWriter.getReader() were faster; at the moment, it takes about 70-90 ms to
get an IndexReader.
The modified source can be downloaded, along with Solr with RankingAlgorithm,
from:
http://solr-ra.tgels.com
Fig 1
Fig 2