In Mimir, we're occasionally seeing "empty ring" errors right after a process starts (e.g. a querier). The issue started after the migration to memberlist.
Possible root cause
I think the issue is caused by the ring client implementation not guaranteeing that it waits for the initial ring state before switching to the Running state. Below I share some thoughts about the code.
The ring client service is expected to switch to the Running state only after it has initialized its internal state with the ring data structure. This is why it calls r.KVClient.Get() in the
When using Consul or etcd as the backend, r.KVClient.Get() is guaranteed to return the state of the ring, but I think this guarantee has been lost in the memberlist implementation, which could return a zero-value data structure.
The memberlist client Get() is implemented here:
It waits until the backend KV client is running. But does waiting for it to be running guarantee that the ring data structure has been populated by then? I don't think so.
KV.starting() only initialises memberlist; it doesn't join the cluster:
The memberlist cluster is joined only in KV.running(), but that's too late, because at that point our code assumes the ring data structure is already populated: