Practical considerations - Elasticsearch 权威指南中文版


Practical considerations

发布于 2019-07-04 字数 4317 浏览 766 评论 0

=== Practical Considerations

Parent-child joins can be a useful technique for managing relationships when
index-time performance(((“parent-child relationship”, “performance and”))) is more important than search-time performance, but it
comes at a significant cost. Parent-child queries can be 5 to 10 times slower
than the equivalent nested query!

==== Memory Use

At the time of going to press, the parent-child ID map is still held in
memory.(((“parent-child relationship”, “memory usage”)))(((“memory usage”, “parent-child ID map”))) There are plans to change the map to use doc values instead, which
will be a big memory saving. Until that happens, you need to be aware of the
following: the string _id field of every parent document has to be held in memory, and
every child document requires 8 bytes (a long value) of memory. Actually,
it’s a bit less thanks to compression, but this gives you a rough idea.

You can check how much memory is being used by the parent-child cache by
consulting (((“indices-stats API”)))the indices-stats API (for a summary at the index level) or the
node-stats API (for a summary at the node level):


GET /_nodes/stats/indices/id_cache?human

Returns memory use of the ID cache summarized by node in a human-friendly format.

==== Global Ordinals and Latency

Parent-child uses <<global-ordinals,global ordinals>> to speed(((“global ordinals”)))(((“parent-child relationship”, “global ordinals and latency”))) up joins.
Regardless of whether the parent-child map uses an in-memory cache or on-disk
doc values, global ordinals still need to be rebuilt after any change to the

The more parents in a shard, the longer global ordinals will take to build.
Parent-child is best suited to situations where there are many children for
each parent, rather than many parents and few children.

Global ordinals, by default, are built lazily: the first parent-child query or
aggregation after a refresh will trigger building of global ordinals. This
can introduce a significant latency spike for your users. You can use
<<eager-global-ordinals,eager_global_ordinals>> to(((“eager global ordinals”))) shift the cost of
building global ordinals from query time to refresh time, by mapping the
_parent field as follows:


PUT /company { “mappings”: { “branch”: {}, “employee”: { “_parent”: { “type”: “branch”, “fielddata”: { “loading”: “eager_global_ordinals” } } } } }

Global ordinals for the `_parent` field will be built before a new segment
becomes visible to search.

With many parents, global ordinals can take several seconds to build. In this
case, it makes sense to increase the refresh_interval so(((“refresh_interval setting”))) that refreshes
happen less often and global ordinals remain valid for longer. This will
greatly reduce the CPU cost of rebuilding global ordinals every second.

==== Multigenerations and Concluding Thoughts

The ability to join multiple generations (see <>) sounds
attractive until (((“grandparents and grandchildren”)))(((“parent-child relationship”, “multi-generations”)))you think of the costs involved:

  • The more joins you have, the worse performance will be.
  • Each generation of parents needs to have their string _id fields stored in
    memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you,
consider this advice (((“parent-child relationship”, “guidelines for using”)))about parent-child relationships:

  • Use parent-child relationships sparingly, and only when there are many more children than parents.
  • Avoid using multiple parent-child joins in a single query.
  • Avoid scoring by using the has_child filter, or the has_child query with
    score_mode set to none.
  • Keep the parent IDs short, so that they require less memory.

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.




需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。