Improve Cache Density and Cost Estimate
Hi Authzed folks - apologies in advance for this wall of text. 🙂
I noticed a few weeks ago that the cache cost functions are not accurate if the cost represents bytes (which I believe it does). For example, the cost of a `checkResultEntry` is set to just 8 bytes, the size of that struct when empty. But that cost doesn't include the memory pointed to by `checkResultEntry.response`, which could be much larger.
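A quick illustration of the gap, with a stand-in type mirroring the shape described above (not SpiceDB's actual `checkResultEntry`):

```go
package main

import (
	"fmt"
	"unsafe"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// entry mirrors the described shape: a small struct holding a pointer to a
// potentially large protobuf message.
type entry struct {
	response *wrapperspb.StringValue
}

func main() {
	e := entry{response: wrapperspb.String("a response far larger than one pointer")}
	fmt.Println(unsafe.Sizeof(e))       // 8: just the pointer field
	fmt.Println(proto.Size(e.response)) // the message's own wire size
}
```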
As I worked to improve the cache cost functions, I found a way to fit 2x more cache items into the same amount of memory: instead of caching the Go structs, cache the protobuf-marshaled bytes.
The improved cache cost functions help keep the physical memory used by the cache much closer to the configured max cost.
I'd be happy to open some PRs for these changes, but wanted to post my findings here and see which of the changes you'd like (if any).
Cache Density
I experimented with storing the marshaled bytes of protobuf messages rather than the Go objects directly.
There are two main advantages to this:
- Calculating the cost of a `[]byte` is quite simple (see the sketch after this list). Most importantly, the cost function does not need to change as the protobuf message changes: protobuf takes care of those details.
- Second, the cache can store more items per MB of space used. In one test (below), the cache fit 2.12x as many items per MB! However, later tests with more accurate cost functions improved cache density by a more modest 50-70%. All tests were on a single local instance of SpiceDB, so a load test at scale is warranted.
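Here is a minimal sketch of the approach, using a plain map as a stand-in for the real ristretto cache and a wrapper message as a stand-in for the actual cached response type:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// put marshals the message and stores the raw bytes; the reported cost is
// simply the byte length (key and allocator overhead are discussed later).
func put(cache map[string][]byte, key string, msg proto.Message) (int64, error) {
	data, err := proto.Marshal(msg)
	if err != nil {
		return 0, err
	}
	cache[key] = data
	return int64(len(data)), nil
}

// get unmarshals into a caller-supplied message, so every caller receives
// an independent copy and the cached bytes stay immutable.
func get(cache map[string][]byte, key string, out proto.Message) (bool, error) {
	data, ok := cache[key]
	if !ok {
		return false, nil
	}
	return true, proto.Unmarshal(data, out)
}

func main() {
	cache := map[string][]byte{}
	cost, _ := put(cache, "check:doc1#view@user1", wrapperspb.String("allowed"))
	fmt.Println("cost:", cost)

	out := &wrapperspb.StringValue{}
	if ok, _ := get(cache, "check:doc1#view@user1", out); ok {
		fmt.Println("value:", out.GetValue())
	}
}
```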
Below are the results for two tests run on a single SpiceDB instance serving check requests. Total profiled space covers the whole application, while cache profiled space includes just the stacks related to caching. In this test, the cost function was still poor, but it does show that using marshaled bytes significantly improves cache density.
| test | total profiled space | cache profiled space | cache calculated cost | key count | keys / cache profiled MB |
| --- | --- | --- | --- | --- | --- |
| protobuf structs | 69.16 MB | 54.85 MB | 32 MB | 142,857 | 2,605 |
| marshaled []byte | 77.02 MB | 61.0 MB | 30.1 MB | 337,311 | 5,529 |
Of course, marshaling isn't free. However, the existing code already calls `proto.Clone()` on every cache write, and as that is replaced with the call to `proto.Marshal()`, the relative cost may not be significant. Still, a test to check the impact on CPU during a load test is warranted.
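For a first-order comparison, a pair of micro-benchmarks along these lines (with a stand-in message, not the actual dispatch response type) would show the relative cost of the two calls:

```go
package cache_test

import (
	"testing"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// msg stands in for a representative cached response.
var msg = wrapperspb.String("a representative cached response payload")

// BenchmarkClone measures the existing per-write cost: a deep copy.
func BenchmarkClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = proto.Clone(msg)
	}
}

// BenchmarkMarshal measures the proposed per-write cost: serialization.
func BenchmarkMarshal(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, _ = proto.Marshal(msg)
	}
}
```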
Cache Cost Function
Now, the long story.
Background
As stated above, the cache was using more memory than the 'max cost' setting because the cost of each cached item was being set to the size of a pointer (8 bytes) rather than the size of the memory referenced by a pointer.
The first attempt at improving the cost function made the situation better, but there was still a substantial difference between the configured cache size and the total memory used. Below are flamegraphs of in-use space for a local SpiceDB instance, taken after running a 15 minute load test of check requests. Between the 1 byte and 32 MB caches, memory increased 59 MB, 184% of the increase in cache size. Between 32 and 64 MB, memory increased 70 MB, 219% of the increase in cache size.
Flamegraphs (in-use space, single local instance):
- 1 byte cache
- 32 MB cache
- 64 MB cache
Aside on Profiling
In the flamegraphs above, the in-use bytes within `ristretto.(*Cache).processItems` are very close to the allocated cache size. Also, the bytes allocated within `caching.(*Dispatcher).DispatchCheck` grow proportionally with the cache size.
Initially I thought this meant the `DispatchCheck()` function was responsible for leaking memory. However, I no longer think that is the case.
Heap profiles work by sampling allocations. When a sample is taken, the stack responsible for the allocation is added to the profile. So, seeing `DispatchCheck()` in the flamegraph doesn't mean that `DispatchCheck()` is responsible for keeping bytes from GC, only that it was responsible for originally allocating those bytes.
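A toy example (not SpiceDB code) makes the distinction concrete: in a heap profile of this program, the in-use bytes are attributed to `alloc`, even though the package-level `retained` slice is what keeps them live:

```go
package main

import "fmt"

// retained is the true reason the bytes stay live across GC cycles.
var retained [][]byte

// alloc is where the allocation happens, so the heap profile attributes
// the in-use bytes to this function's stack.
func alloc() []byte {
	return make([]byte, 1<<20)
}

func main() {
	for i := 0; i < 64; i++ {
		retained = append(retained, alloc())
	}
	fmt.Println("retained MB:", len(retained))
}
```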
Reviewing the SpiceDB code, this makes sense: `DispatchCheck()` creates the object that is stored in the cache (via `proto.Clone()`), but then it is the cache that keeps that object from GC.
When ristretto stores an item, it allocates a wrapper struct, which explains why it is also in the profile.
Given this, the best way to measure memory used by the cache is to sum `ristretto.(*Cache).processItems` and `proto.Clone`. Doing so for the examples above gives 113 MB for the 64 MB cache (176% of the configured size) and 59 MB for the 32 MB cache (184% of the configured size).
Size Classes
One of the main breakthroughs I had was learning about size classes in Go. Size classes are predefined object sizes (8, 16, 24, 32, 48, etc.). When allocating a 'small' object, Go takes the number of required bytes and allocates the next size class at least as large as what is required. This is done to make GC tracking more efficient for small objects (see the 'One more thing' section of the article linked below).
So, a cost function that returns only the bytes required for an object will systematically under-report the actual cost in memory!
This article indicates that `append()` is aware of size classes and can be used to find them at run time. This code demonstrates: https://go.dev/play/p/lRaSqzunZ73
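A minimal version of that trick: `append` rounds its growth up to a size class, so the resulting capacity reveals what the allocator actually hands back:

```go
package main

import "fmt"

// sizeClassOf reports the bytes actually allocated for a request of n bytes:
// append's growth logic rounds the new capacity up to the next size class.
func sizeClassOf(n int) int {
	return cap(append([]byte(nil), make([]byte, n)...))
}

func main() {
	for _, n := range []int{1, 9, 25, 33, 49, 100} {
		fmt.Printf("request %3d bytes -> allocated %3d bytes\n", n, sizeClassOf(n))
	}
}
```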
After accounting for size classes, I was able to write a cost function that exactly matched the allocated bytes, as reported by `memstats.TotalAlloc`.
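The validation itself can be done in-process. A sketch, assuming GC noise is kept out of the measurement window (exact numbers can vary with runtime internals):

```go
package main

import (
	"fmt"
	"runtime"
)

var sink []byte

// measureAlloc returns how many bytes f allocated, per the runtime's own
// cumulative counter. TotalAlloc only ever grows, so a GC between the two
// reads does not hide allocations.
func measureAlloc(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	n := measureAlloc(func() { sink = make([]byte, 100) })
	fmt.Println(n) // 112 on 64-bit: 100 rounded up to its size class
}
```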
Keys Count Too
Still, even accounting for size classes, the cost function was not controlling memory like I wanted. How could my tests show a perfect match to the reported allocated memory, but still allow the cache to grow beyond max cost? The answer is fairly simple: cache keys are stored too, and take up memory. After including keys in the cost function, I got the following results (caching `[]byte`):
| test | total profiled space | cache profiled space | cache computed space | key count | keys/cache profiled MB |
| --- | --- | --- | --- | --- | --- |
| 8MB cache | 33.1 MB | 16.2 MB | 8 MB | 42,094 | 2,598 |
| 16MB cache | 40.4 MB | 24.3 MB | 16 MB | 84,097 | 3,460 |
| 32MB cache | 63.8 MB | 44.4 MB | 32 MB | 168,152 | 3,787 |
The difference in cache profiled space between the 8 MB and 16 MB max cost was 8.1 MB, almost exactly the 8 MB increase! Between 16 MB and 32 MB it was 20.1 MB, which is off by about 26%.
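Putting the pieces together, the cost function for a `[]byte` entry ends up looking roughly like this; the overhead constants are illustrative assumptions, not ristretto's exact internals:

```go
package main

import "fmt"

// sizeClassOf rounds n up to Go's allocation size class (same helper as above).
func sizeClassOf(n int) int {
	return cap(append([]byte(nil), make([]byte, n)...))
}

// Assumed per-entry overheads; the real numbers depend on ristretto's
// internal item struct and how it stores key hashes.
const (
	sliceHeaderBytes = 24 // ptr + len + cap on a 64-bit platform
	keyBytes         = 16 // e.g. two uint64 key/conflict hashes
)

// entryCost charges for the key, the slice header, and the value's actual
// allocation rounded up to its size class.
func entryCost(value []byte) int64 {
	return int64(keyBytes + sliceHeaderBytes + sizeClassOf(len(value)))
}

func main() {
	fmt.Println(entryCost(make([]byte, 100))) // 16 + 24 + 112 = 152
}
```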
Final Cost Function (protobuf structs, not bytes)
This test was run with a cost function that accounted for keys and size classes. No changes were made to the objects stored in the cache for this test.
| test | total profiled space | cache profiled space | cache computed space | key count | keys/cache profiled MB |
| --- | --- | --- | --- | --- | --- |
| no cache (1 byte) | 15.6 MB | 0 MB | 0 MB | 0 | 0 |
| 16MB cache | 34.8 MB | 21.5 MB | 16 MB | 46,916 | 2,182 |
| 32MB cache | 55.2 MB | 37.8 MB | 32 MB | 93,825 | 2,482 |
This shows there is still some overhead for the cache, since going from a cache with only 1 byte max cost (effectively, no cache) to 16 MB max cost added 21.5 MB to memory used by the cache. But going from 16 MB to 32 MB added 16.3 MB, off by ~2%.
Compared to the test which used a similar cost function but stored bytes instead, this also shows that storing bytes is still more efficient, although less so than in the original test. This makes sense: now that the key is included in the cost function, the space saved on the items themselves is a smaller proportion of the total cost per entry.
Misc Learnings
- Are there memory leaks?
- I don't think so. Once the cache reaches capacity and begins to evict items, memory use is stable.
- Are protocol buffers increasing the memory footprint?
- The items stored in the cache are protobuf generated types and have some protobuf-specific fields (`protoimpl.MessageState`, `protoimpl.SizeCache`, `protoimpl.UnknownFields`). It is possible these fields are getting populated after the cost function runs, increasing the memory footprint beyond what the cost function calculates. Running SpiceDB locally, I did see that this was the case: sending a message from the cache caused its size to increase significantly. However, subsequent sends shared the memory added by the first send. To further test whether the protobuf fields were increasing cost, I ran tests where the cached object was never returned to callers, only deep copies (see the sketch after this list). Memory use was similar enough that I don't think the protobuf fields have a significant impact.
- 32 MB cache (main)
- 32 MB cache (clone on return)
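For reference, the 'clone on return' variant is roughly this shape (illustrative names, not SpiceDB's actual code):

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// cache stands in for the real ristretto cache.
var cache = map[string]proto.Message{}

// getCloned hands callers a deep copy, so nothing the send path populates
// (protoimpl state, size cache) can ever grow the cached object itself.
func getCloned(key string) proto.Message {
	cached, ok := cache[key]
	if !ok {
		return nil
	}
	return proto.Clone(cached)
}

func main() {
	cache["k"] = wrapperspb.String("cached")
	fmt.Println(getCloned("k"))
}
```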
area/perf area/observability area/dispatch