Elasticsearch pagination (from+size, scroll/scan, search_after) explained: solving deep pagination (continuously updated)
2017-01-05 18:12
Official references
- Guide: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/scroll.html
- Reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/search-request-scroll.html
- Java client: https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.1/java-search-scrolling.html
- Scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-scroll.html
- Search After API: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-search-after.html
Shallow pagination: from + size
According to the official 5.5 reference, paginated search is done with the from + size combination: from is the offset of the first document to return, and size is the number of documents to return. from defaults to 0 and size defaults to 10.
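A minimal example (the index name `my_index` is a placeholder), requesting the third page of 10 hits:

```json
GET /my_index/_search
{
  "from": 20,
  "size": 10,
  "query": { "match_all": {} }
}
```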
When Elasticsearch serves such a request, it must determine a global order over the matching docs and sort the results accordingly.
For shallow pages this is cheap: with size=10 and a small from, each shard only produces a handful of top documents.
For deeper pages the cost grows with the offset. Suppose from=20 and size=10 on an index with 16 shards: every shard must return its top from+size documents, so the coordinating node has to collect shards * (from + size) = 16 * (20 + 10) = 480 records, sort them all, and then discard everything but the final size hits for the response.
So on a very large index (tens of millions to billions of docs), from + size cannot be used for deep pagination: the deeper the page, the easier it is to hit an OOM, and even without an OOM it burns a lot of CPU and memory.
To keep careless use of from + size from causing OOMs and destabilizing the cluster, since 2.x the official distribution enforces
index.max_result_window: 10000 as a safeguard, i.e. by default from + size may not exceed 10,000. The setting can be changed dynamically or in the config file, but it is best not to unless you understand what it implies; look at the Scroll or Search After APIs instead to meet your business needs.
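If you are certain you need a higher limit, the setting can be raised per index; a sketch (the index name and the value 20000 are only illustrative):

```json
PUT /my_index/_settings
{
  "index.max_result_window": 20000
}
```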
Exporting massive amounts of data, version 1: scan
Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It's a bit like a cursor in a traditional database. A scrolled search takes a snapshot in time: updates made while the scroll is in progress are not visible, because Elasticsearch achieves this by keeping the old data files around.
The cost of deep pagination is the global sort. Sorting by _doc effectively disables it: each scroll request then simply returns the next batch of results from every shard that still has results to return.
Two things matter: the context keepalive time (just enough for one batch) and the scroll_id (always the most recent one).
Set the scroll value to the length of time we want to keep the scroll window open, i.e. how long Elasticsearch should keep the "search context" alive.
The scroll expiry time is refreshed every time we run a scroll request, so it should be neither too long (garbage contexts accumulate) nor too short (the context times out mid-job): just enough to process one batch of data.
The first request:

```json
GET /old_index/_search?scroll=1m
{
  "query": { "match_all": {} },
  "sort": ["_doc"],
  "size": 1000
}
```

`"sort": ["_doc"]` is the most efficient sort order. The response includes a `_scroll_id`, a base64-encoded string. Subsequent requests:

```json
GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}
```
The scroll parameter tells Elasticsearch how long to keep the search context alive; it only needs to be long enough to process the previous batch of results, since each scroll request sets a new expiry time.
An open search context prevents the old segments from being deleted while they are still in use.
Note: keeping older segments alive means that more file handles (file descriptors) are needed.
To check how many search contexts are open (open_contexts):
GET _nodes/stats/indices/search
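A trimmed, illustrative response (only the relevant fields are shown; the node id and values are made up):

```json
{
  "nodes": {
    "node_id_here": {
      "indices": {
        "search": {
          "open_contexts": 0,
          "query_total": 12345
        }
      }
    }
  }
}
```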
Clear scroll API
Search contexts are automatically removed when the scroll timeout has been exceeded.
To clear them all at once (individual ids can also be cleared, though that is rarely worth the trouble): DELETE _search/scroll/_all
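For completeness, a sketch of clearing a specific scroll by id (the scroll_id is shortened here; pass the full string from your last response):

```json
DELETE /_search/scroll
{
  "scroll_id": ["cXVlcnlUaGVuRmV0Y2g7NTsxMDk5..."]
}
```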
size
When scanning, the size is applied to each shard, so the real batch size is size * number_of_primary_shards.
With a regular scroll, size is the total number of hits returned per batch.
When the scroll is exhausted
No more hits are returned: each call to the scroll API returns the next batch of results until there are none left, i.e. the hits array is empty.
When to use scrolling
Scrolling is not intended for real-time user requests, but rather for processing large amounts of data.
Snapshot-like behavior
The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.
Aggregations
If the request specifies aggs, only the initial search response will contain the aggs results.
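For example, an initial scroll request carrying a terms aggregation (the index and field names are hypothetical); the by_status buckets appear only in the first response, not in later batches:

```json
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"],
  "aggs": {
    "by_status": {
      "terms": { "field": "status" }
    }
  }
}
```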
Order-independent iteration
When you do not care about the order in which documents are returned:
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:
```json
GET /_search?scroll=1m
{
  "sort": ["_doc"]
}
```
Sliced scroll
The scroll can be split into multiple slices that can be consumed independently.
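A sketch with two parallel workers, each taking one slice (each slice behaves as an independent scroll with its own scroll_id):

```json
GET /my_index/_search?scroll=1m
{
  "slice": { "id": 0, "max": 2 },
  "query": { "match_all": {} }
}
```

The second worker sends the same request with `"id": 1`; together the two slices cover the whole result set.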
Scanning vs. standard scroll
A scanning scroll differs from a standard scroll in a few ways:
1. scanning scroll results are not sorted; they come back in the order the documents were indexed;
2. scanning scroll does not support aggregations;
3. the initial scanning scroll response contains no results in its hits list;
4. with a scanning scroll, size applies per shard: with size=3 and 5 shards, each batch returns at most 3 * 5 = 15 documents.
scan was intended as a more efficient scroll, but on newer Elasticsearch versions it brings little benefit; just use a standard scroll.
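For reference, the old scan variant was requested via a search_type parameter (deprecated in 2.1 and removed in 5.0, so this only applies to old clusters):

```json
GET /old_index/_search?search_type=scan&scroll=1m
{
  "query": { "match_all": {} },
  "size": 1000
}
```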
Example
Common questions
Does the scroll_id change between calls?
The scroll_id may change over the course of multiple calls, so always pass the most recent scroll_id in the subsequent request.
Exception: SearchContextMissingException
SearchContextMissingException[No search context found for id [721283]]. Cause: the scroll keepalive was set too short, so the context expired before the next scroll request arrived.
A look at the source (2.1.2)
The scroll_id is generated in …search.type.TransportSearchHelper#buildScrollId(…), which takes three parameters: the search type, the per-shard result info, and the query attributes.
See also TransportSearchQueryThenFetchAction.AsyncAction.finishHim().
Other references
- Elasticsearch Scroll (cursor) API explained
- Using scroll for full data traversal and deep pagination in Elasticsearch
If results must be sorted by specific criteria, try search_after, a new feature in 5.0.
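A minimal search_after sketch (the index, field names, and sort values are all placeholders): the request sorts on a unique tiebreaking key, and each follow-up request passes the sort values of the last hit of the previous page:

```json
GET /my_index/_search
{
  "size": 10,
  "sort": [
    { "timestamp": "asc" },
    { "_uid": "asc" }
  ],
  "search_after": [1480541212345, "doc#1234"]
}
```

Unlike scroll, search_after keeps no server-side state, which makes it a better fit for live user-facing pagination.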