Hi,
This pull request greatly improves performance on the write benchmarks.
I made these performance improvements in 5 steps, described below.
All benchmarks were run on my MacBook Pro (13-inch, Late 2011) with an Intel Core i5 (2.4 GHz), 8 GB of memory and a 120 GB SSD.
Filtering by measurement
I added a measurements map (from string to bool) to the fork struct and check it in forkPoint, so that points whose measurement name no fork is interested in are skipped.
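As a rough illustration of the idea (not the actual code in this branch), the check can be sketched as below. The dbrp and fork types and the wants method are simplified, hypothetical stand-ins; only the field names dbrps and measurements come from this description.

```go
package main

import "fmt"

// dbrp is a simplified stand-in for a database/retention policy pair.
// The field names are assumptions made for this sketch.
type dbrp struct {
	Database        string
	RetentionPolicy string
}

// fork is a trimmed-down, hypothetical version of the fork struct.
type fork struct {
	dbrps        map[dbrp]bool
	measurements map[string]bool // the map added in this step
}

// wants reports whether a point with the given dbrp and measurement
// name should be forwarded to this fork.
func (f fork) wants(key dbrp, measurement string) bool {
	return f.dbrps[key] && f.measurements[measurement]
}

func main() {
	key := dbrp{Database: "telegraf", RetentionPolicy: "autogen"}
	f := fork{
		dbrps:        map[dbrp]bool{key: true},
		measurements: map[string]bool{"cpu": true},
	}
	fmt.Println(f.wants(key, "cpu")) // measurement matches
	fmt.Println(f.wants(key, "mem")) // measurement does not match, point is skipped
}
```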
This gave the following improvement:
benchmark old ns/op new ns/op delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 8633314476 57042 -100.00%
Benchmark_Write_MeasurementNameMatches_1000-4 7915678886 8229547562 +3.97%
Benchmark_Write_MeasurementNameNotMatches_100-4 37434 22472 -39.97%
Benchmark_Write_MeasurementNameMatches_100-4 38474 41502 +7.87%
Benchmark_Write_MeasurementNameNotMatches_10-4 22950 23601 +2.84%
Benchmark_Write_MeasurementNameMatches_10-4 23109 24814 +7.38%
benchmark old allocs new allocs delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 57450 50 -99.91%
Benchmark_Write_MeasurementNameMatches_1000-4 57426 57424 -0.00%
Benchmark_Write_MeasurementNameNotMatches_100-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_100-4 49 49 +0.00%
Benchmark_Write_MeasurementNameNotMatches_10-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 49 49 +0.00%
benchmark old bytes new bytes delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 4264608 3950 -99.91%
Benchmark_Write_MeasurementNameMatches_1000-4 4261568 4261440 -0.00%
Benchmark_Write_MeasurementNameNotMatches_100-4 3889 3837 -1.34%
Benchmark_Write_MeasurementNameMatches_100-4 3889 3889 +0.00%
Benchmark_Write_MeasurementNameNotMatches_10-4 3838 3838 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 3838 3839 +0.03%
These numbers are compared against the baseline, which is the benchmarks run on master.
Changing equality order
I tried changing the order of the check in:
if fork.dbrps[dbrp] && fork.measurements[p.Name] {
    // ...
}
so that it tests the measurement first and then the dbrp. That gave the following results:
benchmark old ns/op new ns/op delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 57042 29203 -48.80%
Benchmark_Write_MeasurementNameMatches_1000-4 8229547562 8787711023 +6.78%
Benchmark_Write_MeasurementNameNotMatches_100-4 22472 36940 +64.38%
Benchmark_Write_MeasurementNameMatches_100-4 41502 55299 +33.24%
Benchmark_Write_MeasurementNameNotMatches_10-4 23601 36820 +56.01%
Benchmark_Write_MeasurementNameMatches_10-4 24814 44957 +81.18%
benchmark old allocs new allocs delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 50 49 -2.00%
Benchmark_Write_MeasurementNameMatches_1000-4 57424 57438 +0.02%
Benchmark_Write_MeasurementNameNotMatches_100-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_100-4 49 50 +2.04%
Benchmark_Write_MeasurementNameNotMatches_10-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 49 49 +0.00%
benchmark old bytes new bytes delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 3950 3837 -2.86%
Benchmark_Write_MeasurementNameMatches_1000-4 4261440 4262336 +0.02%
Benchmark_Write_MeasurementNameNotMatches_100-4 3837 3888 +1.33%
Benchmark_Write_MeasurementNameMatches_100-4 3889 3953 +1.65%
Benchmark_Write_MeasurementNameNotMatches_10-4 3838 3838 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 3839 3838 -0.03%
This compares the first step (old) with the second step (new).
As you can see, the time per operation got better for Benchmark_Write_MeasurementNameNotMatches_1000-4 (-48.80%) but worse for all the other benchmarks (+6.78% to +81.18%).
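For reference, the two orderings compared in this step can be sketched as follows. Go's && operator short-circuits, so whichever map is consulted first decides how often the second lookup runs. The types and method names are simplified, hypothetical stand-ins and not the code in this branch.

```go
package main

import "fmt"

// dbrp is a simplified stand-in for a database/retention policy pair.
type dbrp struct {
	Database        string
	RetentionPolicy string
}

// fork is a trimmed-down, hypothetical version of the fork struct.
type fork struct {
	dbrps        map[dbrp]bool
	measurements map[string]bool
}

// matchDBRPFirst is the order used in the first step.
func (f fork) matchDBRPFirst(key dbrp, name string) bool {
	return f.dbrps[key] && f.measurements[name]
}

// matchMeasurementFirst is the order tried in this step. When the
// measurement does not match, the dbrps map is never consulted.
func (f fork) matchMeasurementFirst(key dbrp, name string) bool {
	return f.measurements[name] && f.dbrps[key]
}

func main() {
	key := dbrp{Database: "telegraf", RetentionPolicy: "autogen"}
	f := fork{
		dbrps:        map[dbrp]bool{key: true},
		measurements: map[string]bool{"cpu": true},
	}
	// Both orderings always agree on the result; only the work differs.
	fmt.Println(f.matchDBRPFirst(key, "mem") == f.matchMeasurementFirst(key, "mem"))
}
```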
Change the fork structure - map from dbrp&measurement to edges
I am skipping the fourth step, which moves the "dbrp" struct assignment in forkPoint out of the loop, and going straight to the biggest perf improvement.
Instead of checking every fork to see whether it matches the criteria, I pivoted the structure into a map from the criteria (db, rp, measurement) to edges.
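A minimal sketch of that pivot, under the assumption that the criteria are combined into one struct used as the map key. The names forkKey, forkIndex, edge, add and lookup are hypothetical and chosen for this illustration only.

```go
package main

import "fmt"

// forkKey is a hypothetical composite key. Structs whose fields are all
// comparable can be used directly as Go map keys.
type forkKey struct {
	Database        string
	RetentionPolicy string
	Measurement     string
}

// edge stands in for the real edge type.
type edge struct {
	task string
}

// forkIndex maps the criteria of a point to the edges that want it.
type forkIndex map[forkKey][]*edge

// add registers an edge under one (db, rp, measurement) combination.
func (idx forkIndex) add(k forkKey, e *edge) {
	idx[k] = append(idx[k], e)
}

// lookup finds the edges for a point with a single map access instead
// of a scan over every fork.
func (idx forkIndex) lookup(db, rp, measurement string) []*edge {
	return idx[forkKey{Database: db, RetentionPolicy: rp, Measurement: measurement}]
}

func main() {
	idx := forkIndex{}
	idx.add(forkKey{"telegraf", "autogen", "cpu"}, &edge{task: "cpu_alert"})
	idx.add(forkKey{"telegraf", "autogen", "cpu"}, &edge{task: "cpu_downsample"})

	fmt.Println(len(idx.lookup("telegraf", "autogen", "cpu"))) // two edges registered
	fmt.Println(len(idx.lookup("telegraf", "autogen", "mem"))) // nothing registered
}
```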
And we get this huge improvement:
benchmark old ns/op new ns/op delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 29203 20774 -28.86%
Benchmark_Write_MeasurementNameMatches_1000-4 8787711023 5675405636 -35.42%
Benchmark_Write_MeasurementNameNotMatches_100-4 36940 21771 -41.06%
Benchmark_Write_MeasurementNameMatches_100-4 55299 36193 -34.55%
Benchmark_Write_MeasurementNameNotMatches_10-4 36820 23315 -36.68%
Benchmark_Write_MeasurementNameMatches_10-4 44957 24562 -45.37%
benchmark old allocs new allocs delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 49 48 -2.04%
Benchmark_Write_MeasurementNameMatches_1000-4 57438 57436 -0.00%
Benchmark_Write_MeasurementNameNotMatches_100-4 49 48 -2.04%
Benchmark_Write_MeasurementNameMatches_100-4 50 49 -2.00%
Benchmark_Write_MeasurementNameNotMatches_10-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 49 49 +0.00%
benchmark old bytes new bytes delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 3837 3800 -0.96%
Benchmark_Write_MeasurementNameMatches_1000-4 4262336 4262208 -0.00%
Benchmark_Write_MeasurementNameNotMatches_100-4 3888 3800 -2.26%
Benchmark_Write_MeasurementNameMatches_100-4 3953 3888 -1.64%
Benchmark_Write_MeasurementNameNotMatches_10-4 3838 3838 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 3838 3839 +0.03%
The baseline is 'Changing equality order'
Another sign of the improvement: while running "Benchmark_Write_MeasurementNameNotMatches_1000-4" on master, all 4 of my cores sit steadily at 99%. After this change, only 2 cores are at around 59% and the other 2 are at 9%.
Final Results
Here are the overall benchmark results, where the baseline is master and the new numbers are from the current state of this branch:
benchmark old ns/op new ns/op delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 8633314476 23139 -100.00%
Benchmark_Write_MeasurementNameMatches_1000-4 7915678886 6381307112 -19.38%
Benchmark_Write_MeasurementNameNotMatches_100-4 37434 23787 -36.46%
Benchmark_Write_MeasurementNameMatches_100-4 38474 34923 -9.23%
Benchmark_Write_MeasurementNameNotMatches_10-4 22950 24076 +4.91%
Benchmark_Write_MeasurementNameMatches_10-4 23109 25433 +10.06%
benchmark old allocs new allocs delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 57450 48 -99.92%
Benchmark_Write_MeasurementNameMatches_1000-4 57426 57442 +0.03%
Benchmark_Write_MeasurementNameNotMatches_100-4 49 48 -2.04%
Benchmark_Write_MeasurementNameMatches_100-4 49 49 +0.00%
Benchmark_Write_MeasurementNameNotMatches_10-4 49 49 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 49 49 +0.00%
benchmark old bytes new bytes delta
Benchmark_Write_MeasurementNameNotMatches_1000-4 4264608 3799 -99.91%
Benchmark_Write_MeasurementNameMatches_1000-4 4261568 4262592 +0.02%
Benchmark_Write_MeasurementNameNotMatches_100-4 3889 3800 -2.29%
Benchmark_Write_MeasurementNameMatches_100-4 3889 3889 +0.00%
Benchmark_Write_MeasurementNameNotMatches_10-4 3838 3838 +0.00%
Benchmark_Write_MeasurementNameMatches_10-4 3838 3838 +0.00%
Drawbacks
This improvement comes with one drawback: creating and deleting a task will be slower (I have no benchmarks, but we are doing more work and no longer have O(1) complexity), and the deletion code is harder to read because of "Change the fork structure - map from dbrp&measurement to edges".
I am open to suggestions on how to improve the delFork method for better readability.
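One possible direction, sketched here only as a starting point for discussion and not what this branch does: keep a reverse index from task name to the keys that task registered, so that deletion only touches those keys. Every name below (forks, forkKey, byKey, byTask, addFork) is hypothetical, and delFork here is an illustration and not the existing method.

```go
package main

import "fmt"

// forkKey is a hypothetical composite key of the fork criteria.
type forkKey struct {
	Database, RetentionPolicy, Measurement string
}

// edge stands in for the real edge type.
type edge struct {
	task string
}

// forks keeps the pivoted map plus a reverse index for deletion.
type forks struct {
	byKey  map[forkKey]map[string]*edge // criteria -> task name -> edge
	byTask map[string][]forkKey         // task name -> keys it registered
}

func newForks() *forks {
	return &forks{
		byKey:  make(map[forkKey]map[string]*edge),
		byTask: make(map[string][]forkKey),
	}
}

// addFork registers one edge under every key the task is interested in.
func (f *forks) addFork(task string, e *edge, keys ...forkKey) {
	for _, k := range keys {
		if f.byKey[k] == nil {
			f.byKey[k] = make(map[string]*edge)
		}
		f.byKey[k][task] = e
	}
	f.byTask[task] = append(f.byTask[task], keys...)
}

// delFork removes the task from each key it registered and drops keys
// that end up with no edges.
func (f *forks) delFork(task string) {
	for _, k := range f.byTask[task] {
		delete(f.byKey[k], task)
		if len(f.byKey[k]) == 0 {
			delete(f.byKey, k)
		}
	}
	delete(f.byTask, task)
}

func main() {
	f := newForks()
	cpu := forkKey{"telegraf", "autogen", "cpu"}
	mem := forkKey{"telegraf", "autogen", "mem"}
	f.addFork("alert", &edge{task: "alert"}, cpu, mem)
	f.addFork("downsample", &edge{task: "downsample"}, cpu)

	f.delFork("alert")
	fmt.Println(len(f.byKey[cpu]), len(f.byKey[mem])) // edges left per key
}
```

The trade-off in this sketch is extra memory for the reverse index and an extra map level on the write path, so it would need the same benchmarks as above before being considered.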