The team at Elasticsearch is committed to continuously improving both Elasticsearch and Apache Lucene to protect your data. As with any distributed system, Elasticsearch is complex, and every component can run into edge cases that need to be handled correctly. This article discusses the ongoing effort to improve the robustness and resiliency of Elasticsearch in the face of hardware and software failures.

Improving Elasticsearch resiliency: https://www.elastic.co/cn/videos/improving-elasticsearch-resiliency

Elasticsearch and resiliency: https://www.elastic.co/cn/elasticon/conf/2016/sf/elasticsearch-and-resiliency

Reference: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

This article is not Elasticsearch release documentation; it focuses only on the project's progress in resiliency and stability.

V5.0.0 Use two-phase commit for cluster state publishing

A master node in Elasticsearch continuously monitors the cluster nodes and removes any node from the cluster that doesn't respond to its pings in a timely fashion. If the master is left with too few nodes, it will step down and a new master election will start.

When a network partition causes a master node to lose many followers, there is a short window in time until the node loss is detected and the master steps down. During that window, the master may erroneously accept and acknowledge cluster state changes. To avoid this, a new phase was introduced to cluster state publishing in which the proposed cluster state is sent to all nodes but is not yet committed. Only once enough nodes actively acknowledge the change is it committed and commit messages sent to the nodes. See #13062.

In short: when a partial network outage cuts the master off from the other nodes, there is a short window before those nodes detect the loss and elect a new master, during which the old master may erroneously accept changes. Improvement: a new publishing phase was introduced, so a cluster state change is only committed after enough nodes have actively acknowledged it.
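The publish/commit split can be illustrated with a minimal sketch. This is hypothetical code, not Elasticsearch's internals: `send` is an assumed transport callback and the node list stands in for the master-eligible nodes.

```python
# Minimal sketch (not Elasticsearch's internal code): a two-phase cluster state
# publish where a change is only committed after a quorum of nodes has
# acknowledged the proposed (still uncommitted) state.

def publish_cluster_state(proposed_state, nodes, send):
    """send(node, message) -> bool; returns True if the node acknowledged."""
    quorum = len(nodes) // 2 + 1

    # Phase 1: send the proposed, uncommitted state and collect acknowledgements.
    acks = sum(1 for node in nodes
               if send(node, {"phase": "publish", "state": proposed_state}))

    if acks < quorum:
        # Not enough nodes reachable: the change must not be committed or
        # acknowledged to the client; the master should step down instead.
        raise RuntimeError("failed to reach a quorum; cluster state change rejected")

    # Phase 2: the change is committed and commit messages are sent to the nodes.
    for node in nodes:
        send(node, {"phase": "commit", "state_version": proposed_state["version"]})
    return True

# Toy usage: three nodes, all reachable.
nodes = ["node-1", "node-2", "node-3"]
publish_cluster_state({"version": 42, "routing": {}}, nodes, send=lambda n, m: True)
```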

V5.0.0 Make index creation resilient to index closing and full cluster crashes

Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that a primary cannot be assigned if the cluster dies before enough shards have been allocated (#9126). The same happens if an index is closed before enough shard copies were started, making it impossible to reopen the index (#15281).

Allocation IDs (#14739) solve this issue by tracking allocated shard copies in the cluster. This makes it possible to safely recover an index in the presence of a single shard copy. Allocation IDs can also distinguish the situation where an index has been created but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh shard will be allocated upon reopening the index.

In short: recovering an index used to require a quorum (more than half; with an exception for 2 copies, where whichever copy is available is used) of shard copies before a primary could be allocated, so if the cluster died before enough copies had been allocated, the primary could never be assigned. The same problem occurred when an index was closed before enough shard copies had started. Improvement: allocation IDs were introduced to track shard copies in the cluster.
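The allocation IDs that the cluster tracks as "in sync" are visible through the cluster state API. A small sketch, assuming a local cluster and a hypothetical index name my-index:

```python
# Inspect the allocation IDs Elasticsearch tracks as "in sync" for each shard.
import requests

resp = requests.get("http://localhost:9200/_cluster/state/metadata/my-index")
resp.raise_for_status()
index_meta = resp.json()["metadata"]["indices"]["my-index"]

# in_sync_allocations maps each shard number to the allocation IDs of the
# shard copies that are considered safe to promote to primary.
for shard, allocation_ids in index_meta["in_sync_allocations"].items():
    print(f"shard {shard}: in-sync allocation IDs = {allocation_ids}")
```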

V5.0.0 Do not allow stale shards to automatically be promoted to primary

In some scenarios, after the loss of all valid copies, a stale replica shard could be automatically assigned as a primary, preferring old data to no data at all (#14671). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather temporarily unavailable.

Allocation IDs (#14739) solve this issue by tracking non-stale shard copies in the cluster and using this tracking information to allocate primary shards. When all shard copies are lost or only stale ones are available, Elasticsearch will wait for one of the good shard copies to reappear. In case all good copies are lost, a manual override command can be used to allocate a stale shard copy.

Improvement: allocation IDs were introduced to track which shard copies are not stale.
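The "manual override command" mentioned above is the cluster reroute API's allocate_stale_primary command. A sketch, where the index name, shard number, and node name are assumptions for illustration:

```python
# Force a stale shard copy to become primary, explicitly accepting data loss.
import requests

command = {
    "commands": [
        {
            "allocate_stale_primary": {
                "index": "my-index",
                "shard": 0,
                "node": "node-1",
                # Required safety flag: you accept that acknowledged writes
                # present only on the lost good copies may be gone.
                "accept_data_loss": True,
            }
        }
    ]
}
resp = requests.post("http://localhost:9200/_cluster/reroute", json=command)
print(resp.json())
```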

V5.0.0 Safe primary relocations

When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the cluster state and believes itself to be the active primary while the relocation source has not yet processed the cluster state update and still believes itself to be the active primary. This means that an index request routed to the new primary is not replicated to the old primary (as it has been deactivated from the point of view of the new primary). If a subsequent read request is routed to the old primary, it cannot see the indexed document. See #15900.

In the reverse situation, where the cluster state update that completes primary relocation is applied first on the relocation source and then on the relocation target, each node believes the other to be the active primary. This leads to indexing requests chasing the primary, being sent back and forth between the two nodes and potentially driving both of them out of memory. See #12573.

In short: after a primary relocation completes, the cluster state update is not applied on all nodes at the same moment, so for a short interval the source and target can disagree about which of them is the active primary. Depending on the order in which the update is applied, acknowledged writes can become invisible to reads, or indexing requests can bounce between the two nodes until both run out of memory. Improvement: primary relocation now performs an explicit handoff between the old and new primary, so the two nodes never both act as the active primary at the same time.
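The invariant behind the fix can be sketched in a few lines of hypothetical code (this is an illustration of the idea, not Elasticsearch's implementation): the relocation source stops acting as primary before the relocation target starts, so at no point do both copies accept writes as primary.

```python
# Hypothetical sketch of an explicit primary handoff between shard copies.

class ShardCopy:
    def __init__(self, name):
        self.name = name
        self.active_primary = False

def relocate_primary(source: ShardCopy, target: ShardCopy):
    # Step 1: the source stops acknowledging new writes (clients retry later).
    source.active_primary = False
    # Step 2: only after the source has stepped down does the target take over.
    target.active_primary = True
    # Invariant: at most one copy is the active primary at any point in time.
    assert not (source.active_primary and target.active_primary)

old, new = ShardCopy("node-A"), ShardCopy("node-B")
old.active_primary = True
relocate_primary(old, new)
print(old.active_primary, new.active_primary)  # False True
```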

V5.0.0 Loss of documents during network partition

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window depends on the type of partition. This window is extremely small if a socket is broken. More adversarial partitions, for example silently dropping requests without breaking the socket, can take longer (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in split-brain before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master (#7572). To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. See #14252.

In short: a node hosting a primary shard is cut off from the cluster; writes reach the isolated primary but cannot reach the replicas, and if one of those replicas is later promoted to primary, the acknowledged writes are lost. Improvement: the primary now waits for the master to acknowledge replica shard failures before acknowledging the write to the client.
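The fix itself is internal to the replication layer, but a related user-visible knob is the wait_for_active_shards parameter, which makes an index request wait until the requested number of shard copies is active before the write is attempted (a pre-flight check, not a replication guarantee). A minimal sketch, assuming a local cluster and a hypothetical my-index:

```python
# Require the primary plus at least one replica to be active before indexing.
import requests

doc = {"user": "alice", "message": "hello"}
resp = requests.put(
    "http://localhost:9200/my-index/_doc/1",
    params={"wait_for_active_shards": "2"},  # primary + one replica
    json=doc,
)
print(resp.status_code, resp.json())
```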

V5.0.0 Port the Jepsen tests for loss of acknowledged writes to our testing framework

We have increased our test coverage to include scenarios tested by Jepsen that demonstrate loss of acknowledged writes, as described in the Elasticsearch-related blogs. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce new error conditions. You can view these changes on the 5.0 branch of the DiscoveryWithServiceDisruptionsIT class, where the testAckedIndexing test was specifically added to check that we don't lose acknowledged writes in various failure scenarios.

V6.3.0 Inconsistencies between primary and replica shards caused by document deletions

Certain combinations of delays in performing activities related to the deletion of a document could result in the operations on that document being interpreted differently on different shard copies. This could lead to a divergence in the number of documents held in each copy.

Deleting an unacknowledged document that was concurrently being inserted using an auto-generated ID was erroneously sensitive to the order in which those operations were processed on each shard copy. Thanks to the introduction of sequence numbers (#10708) it is now possible to detect these out-of-order operations, and this issue was fixed in #28787.

Re-creating a document a specific interval after it was deleted could result in that document's tombstone having been cleaned up on some, but not all, copies when processing the indexing operation that re-creates it. This resulted in varying behaviour across the shard copies. The problematic interval is set by the index.gc_deletes setting, which is 60 seconds by default (see the sketch after this section). Again, sequence numbers (#10708) give us the machinery to detect these conflicting activities, and this issue was fixed in #28790.

Under certain rare circumstances a replica might erroneously interpret a stale tombstone for a document as fresh, resulting in a concurrent indexing operation for that same document behaving differently on this replica than on the primary. This is fixed in #29619. Triggering this issue required the following activities all to occur in a short time window, in a specific order on the primary and a different specific order on the replica:

- a document is deleted twice
- another document is indexed with the same ID as the first document
- another document is indexed with a completely different, auto-generated ID
- two refreshes

Improvement: sequence numbers were introduced for operations, making it possible to detect these conflicting concurrent operations.
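The index.gc_deletes setting mentioned above controls how long a deleted document's tombstone is retained (60 seconds by default) and can be inspected and changed per index. A sketch, assuming a local cluster and a hypothetical my-index:

```python
# Inspect and adjust the tombstone retention window for an index.
import requests

# Show the current value (include_defaults exposes the 60s default if unset).
settings = requests.get(
    "http://localhost:9200/my-index/_settings",
    params={"include_defaults": "true", "filter_path": "**.gc_deletes"},
).json()
print(settings)

# Extend the tombstone retention window, e.g. to 10 minutes.
resp = requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index": {"gc_deletes": "10m"}},
)
print(resp.json())
```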

v7.0.0 Repeated network partitions can cause cluster state updates to be lost

During a network partition, cluster state updates (like mapping changes or shard assignments) are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster is still recovering from the previous one and the old master falls on the minority side, it may be that a new master is elected which has not yet caught up. If that happens, cluster state updates can be lost.

This problem is mostly fixed by #20384 (v5.0.0), which takes committed cluster state updates into account during master election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be that the in-flight update is lost. If the now-isolated master can still acknowledge the cluster state update to the client, this amounts to the loss of an acknowledged change.

Fixing this last scenario was one of the goals of #32006 and its sub-issues. See in particular #32171 and the TLA+ formal model used to verify these changes.

In short: repeated network partitions could cause committed cluster state updates to be lost. Taking committed cluster state updates into account during master election (#20384) greatly reduced the chance of this rare problem but did not fully eliminate it; the remaining scenario was the target of #32006 and its sub-issues (v7.0.0).

v7.0.0 Replicas can fall out of sync when a primary shard fails

When a primary shard fails, a replica shard will be promoted to be the primary shard. If there is more than one replica shard, it is possible for the remaining replicas to be out of sync with the new primary shard. This is caused by operations that were in-flight when the primary shard failed and may not have been processed on all replica shards. These discrepancies are not repaired on primary promotion but instead delayed until replica shards are relocated (e.g., from hot to cold nodes); this means that the length of time in which replicas can be out of sync with the primary shard is unbounded.

Sequence numbers (#10708) provide a mechanism for identifying the discrepancies between shard copies at the document level, which makes it possible to efficiently sync up the remaining replicas with the newly promoted primary shard.

In short: v7.0.0 provides a mechanism, sequence numbers, to identify at the document level where a replica diverges from the primary and to resync it.
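The sequence-number progress of each shard copy can be observed from the outside via shard-level index stats, which report the max sequence number and the local and global checkpoints per copy. A sketch, assuming a local cluster and a hypothetical my-index:

```python
# Compare sequence-number progress across the copies of each shard.
import requests

stats = requests.get(
    "http://localhost:9200/my-index/_stats",
    params={"level": "shards"},
).json()

for shard_id, copies in stats["indices"]["my-index"]["shards"].items():
    for copy in copies:
        role = "primary" if copy["routing"]["primary"] else "replica"
        seq = copy["seq_no"]  # max_seq_no, local_checkpoint, global_checkpoint
        print(shard_id, role, seq)
```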

v7.0.0 Documents indexed during a network partition cannot be uniquely identified

When a primary has been partitioned away from the cluster, there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won't acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide either to fail any replicas which failed to index the operations on the primary, or to tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The _version field of these reads may not uniquely identify the document's version if the new primary has already accepted writes for the same document (see #19269).

The sequence-number infrastructure (#10708) introduced more precise ways of tracking primary changes. This new infrastructure therefore provides a way to uniquely identify documents using their primary term and sequence number fields, even in the presence of network partitions, and has been used to replace the _version field in operations that require uniquely identifying the document, such as optimistic concurrency control.

In short: documents are now uniquely identified by their primary term and sequence number instead of _version; these fields stay unambiguous even across network partitions and have replaced _version in operations that need to uniquely identify a document, such as optimistic concurrency control.
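Optimistic concurrency control with primary term and sequence number is exposed through the if_seq_no and if_primary_term parameters on write requests. A sketch, assuming a local cluster, a hypothetical my-index, and Elasticsearch 6.7+:

```python
# Conditional update using the _seq_no/_primary_term returned by a prior write.
import requests

base = "http://localhost:9200/my-index/_doc/1"

# Index a document and record the sequence number and primary term assigned to it.
created = requests.put(base, json={"counter": 1}).json()
seq_no, primary_term = created["_seq_no"], created["_primary_term"]

# The conditional write is rejected with 409 Conflict if the document has been
# changed (different seq_no/primary_term) since it was read.
resp = requests.put(
    base,
    params={"if_seq_no": seq_no, "if_primary_term": primary_term},
    json={"counter": 2},
)
print(resp.status_code, resp.json())
```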

Ongoing: Improve the request retry mechanism when a node disconnects

If the node holding a primary shard is disconnected for whatever reason, the coordinating node retries the request on the same or a new primary shard. In certain rare conditions, where the node disconnects and immediately reconnects, it is possible that the original request has already been successfully applied but has not been reported, resulting in duplicate requests. This is particularly true when retrying bulk requests, where some actions may have completed and some may not have.

An optimization which disabled the existence check for documents indexed with auto-generated IDs could result in the creation of duplicate documents. This optimization has been removed. #9468 (STATUS: DONE, v1.5.0)

Further issues remain with the retry mechanism (a client-side mitigation is sketched after this list):

- Unversioned index requests could increment the _version twice, obscuring a created status.
- Versioned index requests could return a conflict exception, even though they were applied correctly.
- Update requests could be applied twice.

See #9967. (STATUS: ONGOING)

In short: if the node holding the primary shard disconnects and immediately reconnects, the original request may already have been applied successfully but not reported, so the retry applies it again; this is especially visible with bulk requests, where some actions may have completed and others not. The optimization that skipped the existence check for documents indexed with auto-generated IDs was removed because it could create duplicate documents.
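One common client-side mitigation for retry-induced duplicates (a general pattern, not the fix tracked in #9967) is to use deterministic, client-generated document IDs together with op_type=create, so that a retried write which already succeeded fails with 409 instead of producing a second document. A sketch, assuming a local cluster, a hypothetical my-index, and the Elasticsearch 7.x _create endpoint:

```python
# Make retried index requests idempotent with client-generated IDs + create-only writes.
import requests

doc_id = "order-12345"  # deterministic, client-generated ID
doc = {"status": "paid"}

resp = requests.put(f"http://localhost:9200/my-index/_create/{doc_id}", json=doc)
if resp.status_code == 409:
    # The first attempt already went through; treat the retry as a success.
    print("document already indexed, retry is a no-op")
else:
    resp.raise_for_status()
    print(resp.json())
```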

Ongoing: OOM resiliency

The family of circuit breakers has greatly reduced the occurrence of OOM exceptions, but it is still possible to cause a node to run out of heap space. The following issues have been identified (a configuration sketch of the related settings follows this list):

- Set a hard limit on from/size parameters #9311. (STATUS: DONE, v2.1.0)
- Prevent combinatorial explosion in aggregations from causing OOM #8081. (STATUS: DONE, v5.0.0)
- Add the byte size of each hit to the request circuit breaker #9310. (STATUS: ONGOING)
- Limit the size of individual requests and also add a circuit breaker for the total memory used by in-flight request objects #16011. (STATUS: DONE, v5.0.0)

Other safeguards are tracked in the meta-issue #11511.
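Two of these safeguards are user-configurable: index.max_result_window is the hard from/size limit, and network.breaker.inflight_requests.limit caps the memory held by in-flight request objects. A sketch with illustrative values (not recommendations), assuming a local cluster and a hypothetical my-index:

```python
# Adjust the from/size hard limit and the in-flight requests circuit breaker.
import requests

# Hard limit on from + size for searches against an index (default 10000).
requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index": {"max_result_window": 10000}},
)

# Cluster-wide circuit breaker for memory used by in-flight request objects.
requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"network.breaker.inflight_requests.limit": "90%"}},
)
```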

Ongoing: Relocating shards are omitted by the reporting infrastructure

Indices stats and indices segments requests reach out to all nodes that hold shards of the index. Shards that have relocated away from a node while the stats request is in flight cause that part of the request to fail and are simply ignored in the overall stats result. See #13719.