Solving the Testing Crisis with Math and Algorithms
Testing, testing, testing.
You've heard it countless times.
It's clear that testing is the foundation of an effective pandemic response.
And yet we have so far to go before we can catch up to where we should be with testing.
The necessary reagents (chemicals) and swabs are in short supply, there are bottlenecks on human labor, and there's only so much our processes and institutions can handle.
But can we change this?
Is testing just a domain for biologists?
Can we use a little math and computer science to get to #TestingForAll?
Let's dive in and see if we can.
The standard way to perform a test for SARS-CoV-2 (the specific species of coronavirus that causes the disease known as COVID-19) is to use a process called Reverse Transcription Polymerase Chain Reaction (RT-PCR):
- Collect a sample from the patient using a respiratory swab
- Break open the cells on the swab
- Extract all nucleic acids from the sample (including both human DNA and viral RNA)
- Convert all RNA into DNA using a "reverse transcription" process
- Take small DNA fragments (primers) that only bind to the SARS2 genome (like puzzle pieces) and use them to exponentially copy any potentially present SARS2 sequences through a DNA-doubling process known as polymerase chain reaction
- Measure the amount of DNA in the sample using a machine called a qPCR (quantitative polyermace chain reaction) machine
- If the amount of DNA is high, then the primers attached to a matching SARS2 genome, and one can conclude with a high probability that the patient is infected
Pretty clever process right?
It actually is remarkable how we can manipulate tiny little machines (molecules) that we cannot see and use them to take measurements and produce other machines.
The problem here, though, is that this process is very resource intensive.
The number one bottleneck is that the qPCR machines can only handle a certain number of samples per run, which takes about an hour. The LightCycler 96, for example, is about $34,000 and can process 96 samples per run. So if we have one of these machines, we can only test 96 patients per hour.
Can we do better? Perhaps math can help?
Efficiency Through Pooling
The testing process we just described is clever, but it is still heavily constrained.
Perhaps it is constrained because we are operating under assumptions that are meant to be challenged.
The first assumption we'll challenge is that each sample has to be tested one at a time.
What if we tested multiple samples at once? Could we save on testing time?
It turns out we can.
This process is known as pooled testing.
The simple version of this is to take a certain number of people, say 10, and then combine their samples together and test them all at once.
If the test comes up positive, then one of the samples is positive and we can proceed with individual testing.
If the test comes up negative, then all of the samples are negative and we don't have to do any further work.
This method makes a lot of sense if the prevalence in the population is very low, because the "all negative" result would be seen fairly often and save a lot of testing capacity, while the "at least one positive, test all again" case would be seen fairly infrequently.
For example, let's say only 0.1% of the population is positive.
If we test the normal way, then we'd need to test 1609 people to get an 80% chance of someone testing positive.
(1 - 0.001)^x = (1 - 0.8) x = 1609
If we test 10,000 people, then for the first round, we'd have 10,000 tests with the "individual tests" method and 1000 tests for the "pool 10 samples" method.
The "individual tests" method does not require a second round, but the "pool 10 samples" method does, so that we can handle the postive pools and hone in the individuals who are the real positives.
On average, with 0.1% of the population positive, the "pool 10 samples" method would yield 990 "all negative" results and 10 "at least one positive" results.
Thus the second round for the "pool 10 samples" method would require 10 additional tests.
This would mean a total of 1010 tests for the "pool 10 samples method.
Now let's imagine that 25% of the population is positive.
The "individual tests" method would still require 10,000 tests.
The "pool 10 samples" method, however, would yield vastly different results.
There would be an 94% chance, for example, of each batch of 10 having at least one positive caes.
Therefore, the average number of tests required for the "pool 10 samples" method in this case would be:
1000 + 0.94*1000*10 = 10,400
This is even higher than the "individual tests method," and would double the testing time required on average.
Thus, we wouldn't want to use the "pool 10 samples" method if we expect 25% of the population to test positive.
Can we do better?
It turns out that there are even more advanced methods of pooling that we can employ to further increase our testing efficiency.
Next, we will turn to a form of pooling that allows us to avoid some of efficiency destroying traps that standard pooling falls into with higher infection rates.
We'll call this method "two-dimensional pooling."
Here, we'll imagine a chess board, and place each of the samples on the chess board.
The chess board has 8 columns and 8 rows, so we'll create 8 "column pools" by pulling from samples in the same column, and 8 "row pools" by pulling from samples in the same row.
That means the "Column A" pool will have samples from A1, A2, A3, A4, A5, A6, A7 and A8, while the "Row 1" pool will have samples from A1, B1, C1, D1, E2, F1, G1, and H1.
With this method, we should be able to narrow in on the true positive cases when our pools come up positive.
Let's say B, E and F come up positive, while A, C, D, G and H come up negative.
And let's say 2, 4, 6, and 7 come up positive, while 1, 3, 5 and 8 come up negative.
We can visualize this as follows:
There are a few things going on here.
First, each column that comes up negative allows us to eliminate all of the cases in that column.
Second, each row that comes up negative allows us to eliminate all of the cases in that row.
Third, for each column that comes up positive and for each row that comes up positive, we know that *at least one* of the samples is positive. Each of the remaining squares was in one positive column pool and one positive row pool, but we don't know if that square was the sample that contributed in either case. Thus, we do not make any assumptions about these square and we put question marks over them.
Next, we perform a second round of testing on all of the squares that were in a positive row and positive column, and we determine precisely which samples are positive of the ones that remained.
From this we can conclude that B4, E2, F6, and F7 are positive, while the rest are negative.
That gives us a 6.25% positive rate.
Higher Order Pooling
Incredibly, pooling can get even more efficient still.
In the previous section, we explored how we can perform two-dimensional pooling.
Why can't we add even more dimensions?
It turns out we can.
Let's take that group of 64 samples and instead of using a two-dimensional grid of 8x8, we'll use a three-dimensional cube of 4x4x4 and can perform only 12 tests.
Taking this further, if we have 1024 samples, we can use an 10-dimensional hypercube of 2x2x2x2x2x2x2x2x2x2 and perform just 20 tests (or equivalently, we can use a 5 dimensional hypercube of 4x4x4x4x4, which also comes out to 20).
In this way, the number of tests we must perform is 2 times the log of the number of samples we need to test.
number_of_tests = 2 * log2(number_of_samples)
It's also important to note that we can perform all of the tests at once, saving an enormous amount of time.
Once we have the results, we can treat these bits as the data input to a data structure known in computer science as a bloom filter.
For example, let's say we use the 2x2x2x2 method for 16 samples.
We can number the samples 0 through 15 (#0 for the 1st and #15 for the last or 16th), and perform 2+2+2+2=8 tests on all the samples and label them A through H.
Test A will contain the first half of the samples (#0 through #7) while B will contain the second half of the samples (#8 through #15).
Tests C and D will contain another half of the samples, split up in a different way. C will contain the first quarter of the samples and the third quarter of the samples (#4 through #7), while D will contain the second quarter of the samples and the fourth quarter of the samples (#12 through #15). We will repeat this progressive splitting until we get to H.
Note that a quick way of doing this sample batching computationally is to convert each number to bits. For example, the first sample (#0) will be expressed as the 4-digit binary number 0000. The last sample (#15) will be expressed as the 4-digit binary number 1111.
Tests will be peformed on all samples that match a given bit:
Test A = all samples where the leftmost bit is 0
Test B = all samples where the leftmost bit is 1
Test C = all samples where the second highest bit is 0
Test D = all samples where the second highest bit is 1
Test E = all samples where the third highest bit is 0
Test F = all samples where the third highest bit is 1
Test G = all samples where the rightmost bit is 0
Test H = all samples where the rightmost bit is 1
Now imagine the tests come up as follows:
A = ➕
B = ➕
C = ➖
D = ➕
E = ➖
F = ➕
G = ➖
H = ➕
From here, we take each sample's binary representation and check it against the results.
The first sample is 0000, so it was included in tests A, C, E, G.
Since C is negative, 0000 is also negative, but more generally, all samples with 0 in the second digit (0000, 0001, 0010, 0011, 1000, 1001, 1010, and 1011).
Since E is negative, all remaining samples with 0 in the third digit are negative (0100, 0101, 1100, and 1101).
And since G is negative, all remaining samples with 0 in the fourth digit are negative (0110, 1110).
That leaves two samples left: 1111 (#15 AKA the 16th sample) and 0111 (#7 AKA the 8th sample).
These two samples are the likely positives.
It's important to note that this method does in fact increase the false rate, particularly as true positives exceed 1%.
Therefore, we may want to do another round of testing on the likely positives, which will only require us to do 2 more tests.
Millions at a Time with Barcoding
As if that's not good enough, we're going to take testing to an even higher level.
Imagine we could test a million samples all at the same time, in a single reaction.
There's actually a way to do testing that allows us to attach a barcode to each sample, then test the sample, and get a readout of all of the barcodes that are a match.
This may sound too good to be true, but it isn't.
This methodology has not yet proven to be as reliable as RT-PCR, which is the current industry standard, but it is a lot more promising in the long term because if its potential to massively parallelize testing.
Here's how the method works.
First, you take each DNA sample and attach a unique barcode to the end of it. The barcode is very specific sequence of DNA that can be distinguished from all other DNA, so when it is read, it is clear which sample is being read alongside it.
Second, you mix all of the samples together in one batch.
Third, you take the batched sample and you run it through a sequencing machine.
The sequencer will read all of the DNA segments in the batch and will output all of the sequences contained within.
Fourth, you analyze the DNA sequences and whenever you see a barcode, you mark that sequence as a particular patient sample. If the DNA attached to the barcode looks like the virus you are looking for, then that patient is positive.
Well there you have it.
We successfully demonstrated the massive parallelization of testing with mathematics and computer science.
Keep in mind that a lot of this is theoretical, and there are studies still being done to prove the accuracy of some of the more advanced methods, including sequencing-based testing.
And this would be a great time to shout out all of the incredible biologists and engineers working on next generation testing for coronavirus and other diseases.
Hopefully you got a glimpse of what can be accomplished when you combine powerful concepts from biology, mathematics, and computer science.