Describe the bug
The benchmark cluster for branch release-1.3.0
experienced multiple Out Of Memory (OOM) errors.
This is a potential regression, although the issue has likely existed for longer. Note that the resources for the benchmark project were reduced recently; see #8268.
Occurrences
zeebe-2 @ 2021-12-27 ~11:21:45
Only a small dip in processing throughput

GC briefly spiked and then dropped

Simultaneously, JVM memory usage increased from a maximum of ~200MB to spikes above 500MB, and direct buffer pool memory usage doubled in this short window from ~400MB to ~860MB.
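For reference, these direct buffer pool numbers correspond to the JVM's standard buffer pool metrics. A minimal sketch for reading them directly on a broker, assuming plain Java and no Zeebe-specific tooling:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectBufferUsage {
  public static void main(String[] args) {
    // The JVM exposes "direct" and "mapped" buffer pools via JMX.
    for (BufferPoolMXBean pool :
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      System.out.printf(
          "pool=%s count=%d used=%dMB capacity=%dMB%n",
          pool.getName(),
          pool.getCount(),
          pool.getMemoryUsed() / (1024 * 1024),
          pool.getTotalCapacity() / (1024 * 1024));
    }
  }
}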

During this time, RocksDB memory usage was similar to before, at ~500MB per partition.

Install requests were frequently sent 🤔

It had just transitioned to INACTIVE and closed the database when it started to transition to FOLLOWER.
Soon after it opened the database, the pod stopped.
2021-12-27 11:21:22.684 CET "Transition to INACTIVE on term 12 completed"
2021-12-27 11:21:22.734 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'."
2021-12-27 11:21:22.784 CET "Committed new snapshot 289647968-12-833412599-833412170"
2021-12-27 11:21:22.785 CET "Deleting previous snapshot 289590585-12-833268476-833246498"
2021-12-27 11:21:22.787 CET "Scheduling log compaction up to index 289647968"
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}"
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '289645353') and replace with received snapshot (index '289647968')"
2021-12-27 11:21:22.816 CET "Transition to FOLLOWER on term 12 requested."
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ExporterDirector"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing SnapshotDirector"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing StreamProcessor"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing QueryService"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ZeebeDb"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStream"
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStorage"
2021-12-27 11:21:22.817 CET "Preparing transition from INACTIVE on term 12 completed"
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 starting"
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 - transitioning LogStorage"
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning LogStream"
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-2-LogStream-2{status=HEALTHY}]"
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning ZeebeDb"
2021-12-27 11:21:22.818 CET "Partition-2 recovered, marking it as healthy"
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-2-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-27 11:21:22.818 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}"
2021-12-27 11:21:22.915 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'."
2021-12-27 11:21:22.915 CET "Transition to FOLLOWER on term 12 - transitioning QueryService"
2021-12-27 11:21:22.916 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@649caa67)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2afb9843, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@6c21e905, configuration: Configuration(false)]"
2021-12-27 11:21:22.916 CET "Transition to FOLLOWER on term 12 - transitioning StreamProcessor"
2021-12-27 11:21:22.965 CET "request [POST http://elasticsearch-master:9200/_bulk] returned 1 warnings: [299 Elasticsearch-7.16.2-2b937c44140b6559905130a8650c64dbd0879cfb "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security."]"
2021-12-27 11:21:25.069 CET ++ hostname -f
zeebe-2 @ 2021-12-28 ~09:19:45, followed by zeebe-1 @ 2021-12-28 ~09:25:15
Just before the OOM, the starter and worker restarted, which might explain the drop in processing throughput.

Zeebe-2 restarted at ~09:19:45, so the OOM should have happened just before that.

Zeebe 2
If we filter on that pod alone, we see that it was actually briefly processing as leader just before the OOM.

GC is much quieter here before the OOM. JVM memory usage is about 600MB, and direct buffer pool memory has just increased to ~860MB again (just like before). RocksDB is still stable at ~500MB per partition; no screenshot added.

Zeebe 2 did not produce any interesting logs, as far as I could tell.
Zeebe 1
Zeebe-1 also does some processing as leader shortly before its OOM, ~5 min after zeebe-2 crashed.

Zeebe-1 looks a lot like zeebe-2 in the memory decomposition. Note the increase in direct buffer pool memory just before the OOM, as in the other cases.

Partitions fully recovered, but about 1m30s after a snapshot was committed, an actor appears blocked, which means its health tick is no longer being updated. Directly after this, the pod dies. A sketch of what the health-tick check amounts to follows the log excerpt below.
2021-12-28 09:22:16.229 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:22:16.229 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:16.667 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=HEALTHY}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:22:16.668 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 09:22:16.668 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Broker-1-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:21.703 CET "Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654."
2021-12-28 09:22:21.907 CET "Finished taking temporary snapshot, need to wait until last written event position 1034640293 is committed, current commit position is 1034640235. After that snapshot will be committed."
2021-12-28 09:22:21.933 CET "Current commit position 1034640293 >= 1034640293, committing snapshot FileBasedTransientSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654, checksum=890364779, metadata=FileBasedSnapshotMetadata{index=359974668, term=18, processedPosition=1034639769, exporterPosition=1034638654}}."
2021-12-28 09:22:21.941 CET "Committed new snapshot 359974668-18-1034639769-1034638654"
2021-12-28 09:22:21.942 CET "Deleting previous snapshot 359646996-17-1033697878-1033694056"
2021-12-28 09:22:21.947 CET "Scheduling log compaction up to index 359974668"
2021-12-28 09:22:21.951 CET "raft-partition-partition-3 - Deleting log up from 359633252 up to 359947572 (removing 21 segments)"
2021-12-28 09:22:32.628 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.408 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.848 CET "Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:23:41.852 CET "Partition-1 failed, marking it as unhealthy: Broker-1{status=HEALTHY}"
2021-12-28 09:23:41.852 CET "Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, Broker-1-Exporter-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-1-LogStream-2{status=HEALTHY}, Broker-1-SnapshotDirector-2{status=HEALTHY}, Broker-1-StreamProcessor-2{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:23:41.861 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:24:11.884 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-StreamProcessor-3{status=UNHEALTHY, issue='actor appears blocked'}, ZeebePartition-3{status=HEALTHY}, Broker-1-Exporter-3{status=HEALTHY}, raft-partition-partition-3{status=HEALTHY}, Broker-1-LogStream-3{status=HEALTHY}, Broker-1-SnapshotDirector-3{status=HEALTHY}]"
2021-12-28 09:24:39.044 CET ++ hostname -f
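For context on the 'actor appears blocked' message above: the check boils down to a watchdog that flags an actor whose periodic health tick has stopped advancing within some timeout. A minimal, hypothetical sketch of that idea (class and method names are illustrative, not Zeebe's actual implementation):

import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration: the actor bumps a timestamp on every tick;
// a watchdog running on another thread reports the actor as blocked when
// the tick stops advancing for longer than the timeout.
public class HealthTickWatchdog {
  private final AtomicLong lastTickNanos = new AtomicLong(System.nanoTime());
  private final Duration timeout;

  public HealthTickWatchdog(Duration timeout) {
    this.timeout = timeout;
  }

  // Called by the actor itself as part of its normal processing loop.
  public void onTick() {
    lastTickNanos.set(System.nanoTime());
  }

  // Polled independently; a blocked actor can no longer call onTick().
  public boolean appearsBlocked() {
    return System.nanoTime() - lastTickNanos.get() > timeout.toNanos();
  }

  public static void main(String[] args) throws InterruptedException {
    HealthTickWatchdog watchdog = new HealthTickWatchdog(Duration.ofSeconds(2));
    watchdog.onTick();                              // actor is alive and ticking
    System.out.println(watchdog.appearsBlocked());  // false
    Thread.sleep(3_000);                            // simulate a blocked actor
    System.out.println(watchdog.appearsBlocked());  // true
  }
}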
zeebe-2 @ 2021-12-28 ~22:50:00
Again only a small dip in processing throughput (nice and quick failover 🚀 )
Zeebe-2 was leader and processing before OOM

Interestingly, the logs just before this restart of zeebe-2 are practically identical to the logs of zeebe-2 on the first OOM (the day before, on the 27th).
Zeebe-2 had just transitioned to INACTIVE and closed the database. It was transitioning to FOLLOWER again, and just after it opened the database it was transitioning the StreamProcessor, which is the same transition it OOM-ed at the day before.
2021-12-28 22:49:37.461 CET "Transition to INACTIVE on term 16 completed"
2021-12-28 22:49:37.537 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.624 CET "Committed new snapshot 397383357-16-1142860979-1142860461"
2021-12-28 22:49:37.625 CET "Deleting previous snapshot 397252741-16-1142453760-1142669843"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.631 CET "Scheduling log compaction up to index 397383357"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Delete existing log (lastIndex '397286333') and replace with received snapshot (index '397383357')"
2021-12-28 22:49:37.670 CET "Transition to FOLLOWER on term 16 requested."
2021-12-28 22:49:37.670 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ExporterDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing SnapshotDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing QueryService"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ZeebeDb"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStorage"
2021-12-28 22:49:37.671 CET "Preparing transition from INACTIVE on term 16 completed"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 starting"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 - transitioning LogStorage"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning LogStream"
2021-12-28 22:49:37.672 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-2-LogStream-1{status=HEALTHY}]"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning ZeebeDb"
2021-12-28 22:49:37.672 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 22:49:37.673 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Broker-2-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 22:49:37.673 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.837 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.838 CET "Transition to FOLLOWER on term 16 - transitioning QueryService"
2021-12-28 22:49:37.840 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@2465e772)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2495802c, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@263ffbf0, configuration: Configuration(false)]"
2021-12-28 22:49:37.841 CET "Transition to FOLLOWER on term 16 - transitioning StreamProcessor"
2021-12-28 22:49:39.701 CET ++ hostname -f
If you look at the logs from before that time, the partition keeps transitioning back and forth between FOLLOWER and INACTIVE for a long period (at least multiple hours). It's in a loop:
2021-12-28 16:36:14.030 CET partition-3 "Transition to LEADER on term 21 requested."
2021-12-28 16:36:14.127 CET partition-3 "Transition to LEADER on term 21 completed"
2021-12-28 16:36:19.476 CET partition-2 "Transition to LEADER on term 19 requested."
2021-12-28 16:36:19.590 CET partition-2 "Transition to LEADER on term 19 completed"
2021-12-28 16:44:13.078 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:44:13.084 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:44:13.301 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:44:13.514 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:54:13.701 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:54:13.705 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:54:13.987 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:54:14.206 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:59:14.028 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:59:14.032 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:59:14.294 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:59:14.541 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:04:14.683 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:04:14.687 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:04:15.346 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:04:15.545 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:09:15.002 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:09:15.006 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:09:15.233 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:09:15.492 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:14:15.248 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:14:15.253 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:14:15.631 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:14:15.891 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:19:15.953 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:19:15.956 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:19:16.219 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:19:16.428 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:24:15.936 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:24:15.940 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:24:16.216 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:24:16.425 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:29:17.013 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:29:17.016 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:29:17.265 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:29:17.482 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:34:17.042 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:34:17.046 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:34:17.380 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:34:17.585 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:39:17.481 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:39:17.484 CET partition-1 "Transition to INACTIVE on term 16 completed"
.... and so on, until 22:50:00
This also happened the day before: https://cloudlogging.app.goo.gl/7qpb4Rammh11eqYh6
Hypothesis
Looking at the above cases, it seems that a partition gets stuck in a transition loop between FOLLOWER and INACTIVE. Perhaps we have a memory leak in transitions.
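To make that concrete: if each INACTIVE-to-FOLLOWER transition allocates per-role resources (for example direct buffers) and the corresponding close path does not release them, a partition looping through these transitions every few minutes would accumulate exactly the kind of direct buffer growth seen in the graphs above. The following is a purely illustrative sketch of that failure mode, not Zeebe's actual transition code:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models a partition that allocates a direct buffer on
// every transition to FOLLOWER but never frees it on transition to INACTIVE.
public class LeakyTransitionDemo {
  private final List<ByteBuffer> perRoleBuffers = new ArrayList<>();

  void transitionToFollower() {
    // e.g. buffers for replication or stream processing
    perRoleBuffers.add(ByteBuffer.allocateDirect(8 * 1024 * 1024));
  }

  void transitionToInactive() {
    // The hypothesized bug: buffers from the previous FOLLOWER role are
    // never released here, so every loop iteration leaks 8MB of direct
    // memory until the direct memory limit (or the container limit) is hit.
  }

  public static void main(String[] args) {
    LeakyTransitionDemo partition = new LeakyTransitionDemo();
    for (int i = 0; ; i++) {
      partition.transitionToInactive();
      partition.transitionToFollower();
      System.out.printf("transition loop %d, ~%d MB direct memory retained%n",
          i, partition.perRoleBuffers.size() * 8);
      // Eventually fails with java.lang.OutOfMemoryError: Direct buffer memory.
    }
  }
}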
Labels: kind/bug, scope/broker, severity/high, area/reliability, release/1.3.9, release/8.0.3, release/8.1.0-alpha2, release/8.1.0