The Salis Lab has previously built a ribosomal binding site (RBS) calculator, which predicts translation initiation rates from mRNA sequences.
However, it is slow (requiring a queue on a website) and closed source. In order to incorporate RBS calculation data in more complex applications, we need better performance and faster development velocity. The best advancements in technology should be incorporated in an open-source manner.
Basic idea of RBS calculator
The basic idea behind the RBS calculator (this is a simplification) is that you take the binding energy of the ribosomal 16S RNA to the mRNA's RBS and subtract from it the folding energy of the mRNA binding to itself. The more favorable the result, the more readily the ribosome can unfold the mRNA and bind. There are a few other variables, but these are the basic ones (please check Table 2 of https://pubs.acs.org/doi/suppl/10.1021/acssynbio.0c00394/suppl_file/sb0c00394_si_001.pdf for the equations).
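As a rough illustration of that simplification, here is a minimal Go sketch. The function names are hypothetical and are not part of Poly. It assumes the usual thermodynamic sign convention (free energies in kcal/mol, more negative means more stable), in which the total is the 16S binding energy minus the mRNA self-folding energy. The exponential conversion to a relative rate and the value of beta are assumptions taken from the general form of such models, not from this proposal, and would need to be checked against the supplement linked above.

```go
package main

import (
	"fmt"
	"math"
)

// totalFreeEnergy returns the simplified total free energy change in kcal/mol:
// the 16S RNA-to-RBS binding energy minus the mRNA self-folding energy.
// All other terms of the full model are deliberately left out of this sketch.
func totalFreeEnergy(dgRibosomeBinding, dgMRNAFolding float64) float64 {
	return dgRibosomeBinding - dgMRNAFolding
}

// relativeInitiationRate maps a total free energy to a unitless relative rate
// using exp(-beta * dgTotal). The functional form and beta are assumptions.
func relativeInitiationRate(dgTotal, beta float64) float64 {
	return math.Exp(-beta * dgTotal)
}

func main() {
	const beta = 0.45 // assumed scaling factor in mol/kcal; tune against data

	// Placeholder energies for illustration only, not measured values.
	dgBinding := -9.0 // 16S RNA binding to the RBS
	dgFolding := -6.0 // mRNA folding on itself

	dgTotal := totalFreeEnergy(dgBinding, dgFolding)
	fmt.Printf("dG_total = %.2f kcal/mol, relative rate = %.3f\n",
		dgTotal, relativeInitiationRate(dgTotal, beta))
}
```

A strongly folded mRNA (a very negative folding energy) raises the total and lowers the predicted rate, which matches the intuition that the ribosome has to pay to unfold the mRNA before it can bind.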
The mRNA varies widely from design to design, so its folding energy must be calculated each time the calculator is run. The 16S RNA, on the other hand, varies little. The organisms people use follow an approximately power-law distribution (a handful of organisms account for most of the usage), so we can cache most of the 16S-RNA to RBS (which I will now call 16S-RBS) data in a lookup table.
Software and numbers we need
It is important to keep in mind that we want this software to be fast. In order to get performance, there are two primary optimizations: (1) using a faster algorithm for calculating RNA secondary structure (we use LinearFold, which folds RNA in linear time) and (2) using a lookup table for slow RNA-to-RNA binding calculations.
In order to calculate mRNA folding, @vivekr has ported LinearFold to Golang. This package needs to be incorporated into Poly before we build the RBS calculator.
In order to calculate the 16S-RBS lookup table, we will likely need to operate outside of Golang (probably in Python). LinearFold does not support (at this time) multiple separate RNAs binding to each other, so we'll have to do this work with a different algorithm. It will be a challenge to relate the energies produced by two different software packages. Since the 16S RNA binding sequence is only 9 nucleotides long, we theoretically only have to calculate its binding energy to 4^9 = 262,144 other RNAs.
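To make the size of that lookup table concrete, here is a minimal Go sketch of how a 9-nucleotide site could be mapped to an index in a flat table of 4^9 entries, using 2 bits per base. The names `siteIndex` and `indexToSite` are hypothetical, and the energy stored in the example is a placeholder, since the real values would come from whatever external tool ends up computing the 16S-RBS binding energies.

```go
package main

import "fmt"

const (
	siteLength = 9                     // length of the 16S RNA binding sequence
	tableSize  = 1 << (2 * siteLength) // 4^9 = 262,144 possible sites
)

// baseCode maps a nucleotide to a 2-bit code. T is accepted as an alias for U.
func baseCode(b byte) (int, error) {
	switch b {
	case 'A':
		return 0, nil
	case 'C':
		return 1, nil
	case 'G':
		return 2, nil
	case 'U', 'T':
		return 3, nil
	}
	return 0, fmt.Errorf("unsupported base %q", b)
}

// siteIndex packs a 9-nucleotide site into an integer in [0, tableSize).
func siteIndex(site string) (int, error) {
	if len(site) != siteLength {
		return 0, fmt.Errorf("site must be %d nucleotides, got %d", siteLength, len(site))
	}
	index := 0
	for i := 0; i < len(site); i++ {
		code, err := baseCode(site[i])
		if err != nil {
			return 0, err
		}
		index = index<<2 | code
	}
	return index, nil
}

// indexToSite is the inverse of siteIndex and is handy when enumerating
// every possible site to fill the table.
func indexToSite(index int) string {
	const bases = "ACGU"
	site := make([]byte, siteLength)
	for i := siteLength - 1; i >= 0; i-- {
		site[i] = bases[index&3]
		index >>= 2
	}
	return string(site)
}

func main() {
	// One float64 per possible site, to be filled by the external calculation.
	table := make([]float64, tableSize)

	idx, err := siteIndex("UAAGGAGGU") // illustrative site only
	if err != nil {
		panic(err)
	}
	table[idx] = -8.5 // placeholder energy in kcal/mol, not a computed value
	fmt.Println(len(table), indexToSite(idx), table[idx])
}
```

A flat slice of 262,144 float64 values is about 2 MB per organism, so keeping tables for the most commonly used organisms in memory should be cheap.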
There are other parameters that assist in doing RBS calculations (such as ΔGstandby, from https://pubs.acs.org/doi/suppl/10.1021/acssynbio.0c00394/suppl_file/sb0c00394_si_001.pdf). We'll likely need to build those into the calculator at some point, but perhaps not in version 1.
After we get a functioning prototype of the RBS calculator, we can tune our model. One dataset from the Salis Lab has 9862 sequences, and we can directly compare our calculator's outputs with the ones published by the Salis Lab. We can also use empirical measurements of ~300,000 RBSs from "Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping". Using a couple of these datasets, we should be able to massage our RBS calculator to get to "good enough".
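One simple way to put a number on "good enough" is a correlation between our predictions and the published or measured values. The sketch below computes a Pearson correlation coefficient in Go. The choice of metric is an assumption on my part (a rank correlation or a comparison on log-transformed rates may turn out to fit better), and the numbers in `main` are made-up placeholders, not values from either dataset.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient between x and y.
// It returns an error if the slices differ in length, hold fewer than
// two points, or if either slice has zero variance.
func pearson(x, y []float64) (float64, error) {
	if len(x) != len(y) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(x), len(y))
	}
	if len(x) < 2 {
		return 0, fmt.Errorf("need at least 2 points, got %d", len(x))
	}

	n := float64(len(x))
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	var cov, varX, varY float64
	for i := range x {
		dx, dy := x[i]-meanX, y[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return 0, fmt.Errorf("zero variance in input")
	}
	return cov / math.Sqrt(varX*varY), nil
}

func main() {
	// Placeholder values for illustration only.
	ours := []float64{1.2, 2.9, 3.1, 4.8, 5.0}
	published := []float64{1.0, 3.0, 3.5, 4.5, 5.5}

	r, err := pearson(ours, published)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Pearson r = %.3f\n", r)
}
```

Tracking a single score like this across model changes would let us tell whether a tuning step actually moved us closer to the reference data.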
While our calculator probably won't be as accurate as machine learning models in organisms that have large datasets, we can feed our calculator's output to those models as an additional input parameter, and hopefully improve their predictions.
The goal is to make something that is useful to scientists and engineers. Our calculator can still be mildly wrong, so long as it is fundamentally useful to practitioners.
After we build the Poly RBS calculator, Sporenet Labs (aka Keoni Gandall, aka me) plans to test its predictions in a real laboratory environment. As the group that builds the thing, we'll all decide together what experiments we should run. Ideally, we'll be using Bxb1-GFP in E. coli with an oligo pool or a degenerate primer library + some Nanopore sequencing.