Scan and Scroll - Elasticsearch: The Definitive Guide (Chinese Edition)

Published 2019-07-04

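The scan search type and the scroll API are used together to retrieve large numbers of documents from Elasticsearch efficiently, without paying the penalty of deep pagination.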

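scroll

A scrolled search allows us to do an initial search and then keep pulling batches of results from Elasticsearch until there are no results left. It's a bit like a cursor in a traditional database.

A scrolled search takes a snapshot in time. It doesn't see any changes made to the index after the initial search request. It does this by keeping the old data files around, so that it can preserve its view of what the index looked like when the search started.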

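scan

The costly part of deep pagination is the global sorting of results, but if we disable sorting, we can return all documents quite cheaply. To do this, we use the scan search type. Scan instructs Elasticsearch to do no sorting, but just to return the next batch of results from every shard that still has results to return.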
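To make the cost of the global sort concrete, here is a rough back-of-the-envelope sketch. It assumes the usual model of a sorted, paginated search, in which every shard builds its own top `from + size` candidates and the coordinating node then sorts all of them together. The function name and the numbers are illustrative only.

```python
def sorted_candidates(from_: int, size: int, shards: int) -> int:
    """Documents the coordinating node has to sort for ONE sorted page.

    Assumed model: each shard returns its own top (from_ + size) documents,
    and the coordinating node merges the candidates from every shard.
    """
    return shards * (from_ + size)


# Hypothetical numbers: page 1001 (10 hits per page) on an index with 5 shards.
deep_page_cost = sorted_candidates(from_=10_000, size=10, shards=5)
first_page_cost = sorted_candidates(from_=0, size=10, shards=5)
```

Under this model the work grows with the page offset, which is the overhead that scan avoids by skipping the sort entirely.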

To use scan-and-scroll, we execute a search request with search_type set to scan, and pass a scroll parameter telling Elasticsearch how long it should keep the scroll open:

GET /old_index/_search?search_type=scan&scroll=1m (1)
{
    "query": { "match_all": {}},
    "size":  1000
}

(1) Keep the scroll open for 1 minute.
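For readers scripting this step, the following sketch assembles the same request using only the Python standard library. It builds the URL and body without sending anything. The host `http://localhost:9200` and the helper name `build_scan_request` are assumptions for illustration, and the request only makes sense against an older cluster that still accepts `search_type=scan` (the search type was deprecated in Elasticsearch 2.1 and removed in 5.0).

```python
import json
from urllib.parse import urlencode


def build_scan_request(index, size=1000, scroll="1m",
                       host="http://localhost:9200"):
    """Return (method, url, body) for the initial scan-and-scroll search.

    `host` is an assumed local node; nothing is sent over the network here.
    """
    params = urlencode({"search_type": "scan", "scroll": scroll})
    url = f"{host}/{index}/_search?{params}"
    body = json.dumps({"query": {"match_all": {}}, "size": size})
    return "GET", url, body


method, url, body = build_scan_request("old_index")
```

The resulting tuple can be handed to whichever HTTP client is in use.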

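The response to this request doesn't include any hits, but does include a _scroll_id, which is a long Base-64 encoded string. Now we can pass the _scroll_id to the _search/scroll endpoint to retrieve the first batch of results: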

GET /_search/scroll?scroll=1m      (1)
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0 (2)
NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy
UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW
xfaGl0czoxOw==

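(1) Keep the scroll open for another minute.

(2) The _scroll_id can be passed in the body, in the URL, or as a query parameter.

Note that we again specify ?scroll=1m. The scroll expiry time is refreshed every time we run a scroll request, so it only needs to give us enough time to process the current batch of results, not all of the documents that match the query.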
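The refresh behaviour can be pictured with a toy model. The class below is not how Elasticsearch implements scroll contexts. It is a minimal simulation, with hypothetical names, of a deadline that is pushed forward by each scroll request and that fails once a request arrives too late.

```python
class ScrollContext:
    """Toy model of a scroll keep-alive; times are plain seconds."""

    def __init__(self, keep_alive_s: float, now: float):
        self.expires_at = now + keep_alive_s

    def refresh(self, keep_alive_s: float, now: float) -> None:
        """Simulate one scroll request arriving at time `now`."""
        if now > self.expires_at:
            raise RuntimeError("scroll context expired")
        # Each request restarts the countdown from the current time.
        self.expires_at = now + keep_alive_s


ctx = ScrollContext(keep_alive_s=60, now=0)   # initial search with scroll=1m
ctx.refresh(keep_alive_s=60, now=50)          # batch processed in 50 s: fine
ctx.refresh(keep_alive_s=60, now=100)         # 50 s later: still fine
```

In this model the total run lasts well over a minute, yet no single batch takes longer than the one-minute keep-alive, so the context never expires.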

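The response to this scroll request includes the first batch of results. Although we specified a size of 1,000, we get back many more documents. When scanning, size is applied to each shard, so each batch contains at most size * number_of_primary_shards documents.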
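As a quick worked example of that upper bound, the sketch below assumes an index with 5 primary shards. The shard count is an illustrative assumption, not something stated for old_index.

```python
def max_scan_batch(size: int, primary_shards: int) -> int:
    """Upper bound on documents per scan batch: size applies to each shard."""
    return size * primary_shards


# size=1000 as in the request above; 5 primary shards is assumed.
upper_bound = max_scan_batch(size=1000, primary_shards=5)
```

With these numbers a single batch can hold up to 5,000 documents, so client-side buffers should be sized for the product rather than for size alone.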

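Note:

The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request.

When no more hits are returned, we have processed all matching documents.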
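Putting the pieces together, the whole procedure is a loop: start the scan, keep requesting batches with the most recent _scroll_id, and stop at the first empty batch. The sketch below keeps the HTTP layer out of the picture by taking two callables, `start` and `next_batch`, which are hypothetical names. `FakeCluster` is an in-memory stand-in written only to exercise the loop. It mimics the response shape (`_scroll_id` plus `hits.hits`) but is not an Elasticsearch client.

```python
from typing import Callable, Iterator


def scan_and_scroll(start: Callable[[], dict],
                    next_batch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every hit, always passing on the latest _scroll_id."""
    scroll_id = start()["_scroll_id"]      # the initial response has no hits
    while True:
        page = next_batch(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:                       # empty batch: everything processed
            return
        yield from hits
        scroll_id = page["_scroll_id"]     # use the NEW id for the next call


class FakeCluster:
    """In-memory stand-in that rejects any scroll id except the latest one."""

    def __init__(self, docs, batch):
        self._docs, self._batch = list(docs), batch
        self._pos, self._counter = 0, 0

    def _new_id(self) -> str:
        self._counter += 1
        return f"scroll-{self._counter}"

    def start(self) -> dict:
        return {"_scroll_id": self._new_id(), "hits": {"hits": []}}

    def next_batch(self, scroll_id: str) -> dict:
        if scroll_id != f"scroll-{self._counter}":
            raise ValueError("stale scroll id")
        hits = self._docs[self._pos:self._pos + self._batch]
        self._pos += len(hits)
        return {"_scroll_id": self._new_id(), "hits": {"hits": hits}}


cluster = FakeCluster([{"_id": str(i)} for i in range(5)], batch=2)
all_hits = list(scan_and_scroll(cluster.start, cluster.next_batch))
```

In real use, `start` and `next_batch` would wrap the two HTTP requests shown earlier. Keeping them as parameters makes the loop easy to test without a running cluster.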

Tip:

Some of the official Elasticsearch clients provide scan-and-scroll helpers that offer an easy wrapper around this functionality.

