And the winner of Rootclaim’s COVID-19 origins debate and the $100,000 prize is…
Unfortunately, not us :). We would like to explain this result, but first, we would like to congratulate our opponent and Rootclaim’s first challenger, Peter Miller. Miller showcased an impressive understanding of the details during the debate, which was hard to match.
While Peter’s victory was well earned within the parameters of the debate, we believe it was also due to our failure to structure an effective debate.
One could, of course, simply conclude that the correct decision was reached and zoonosis is the likelier hypothesis. We don’t want to be sore losers, but given the importance of this issue beyond the debate itself, we would like to explain why we still believe the lab leak hypothesis is the most likely explanation for the origin of COVID-19 and why, as our new and updated analysis shows, its likelihood only increased following the deeper analysis we did for the debate.
First, we’d like to clarify that the judges did an amazing job, putting immense effort, thought, and talent into their decisions:
- Will Van Treuren is a microbiologist and immunologist with a PhD from Stanford. He works as Chief Science Officer at a biotech company developing new drugs to treat inflammatory diseases. Will’s written decision can be found here and here and a video summary is available here.
- Eric Stansifer is an applied mathematician with a PhD in the Earth sciences from MIT. He has previously done research in a mathematical virology research group, doing simulations of MS2 capsid assembly. Eric’s written decision can be found here and a video summary is available here (you can also read his blog here).
What went wrong?
So, if the judges did their job well and our opponent played by the rules, what went wrong? We believe two things tilted the debate in favor of our opponent and we will correct them in future debates:
First, the debate structure provided a major advantage to the debater with more memorized knowledge of the issue. The debate was live (via video) and Miller demonstrated extensive knowledge and a superb memory for many details, which we could not compete with in real time. This was not an issue in the second session about genetics, where we were represented by Yuri Deigin, but our second mistake (below) made his good efforts irrelevant. While such superiority is worthy of victory in normal debates, Rootclaim strives to create a model for reasoning and inference that minimizes the problems with human reasoning. Unfortunately, we structured a debate that rewards real-time recall instead. To fix this, future debates will be held in an offline text format, with only a short video presentation at the end.
The second issue we identified was that we failed to incorporate a process of ongoing feedback from the judges, and so spent most of our time on issues that had little impact on the final decision. In their rulings, we found major mistakes in their understanding of our analysis, which could have been easily corrected had we built the debate with more direct ongoing feedback from the judges.
For example, we know from years of dealing with probabilistic inference that it is highly unintuitive, and it is a challenge to translate to human language. We therefore focused more on an intuitive understanding of the evidence, with probabilistic inference used only as a background framework.
In practice, we were surprised to see both judges found probabilistic inference to be the best way to reach a decision. We of course agree, but had we known this to be the case, we would’ve focused our efforts on explaining how to do probabilistic inference correctly, describing the major pitfalls we discovered over the years, and how to avoid them. As we failed to do so, errors in the judges’ probabilistic inference resulted in unrealistic numbers assigned to the evidence.
The mistakes were heavily skewed toward zoonosis, since our methodology involves steelmanning and maximizing the likelihoods of both hypotheses, while Miller used figures heavily biased toward zoonosis, in some cases using extreme estimates that are impossible to reach in a robust probabilistic analysis, as we explain below.
The Risks of Strawmanning
This mistake of assigning extreme numbers is similar to strawmanning in human debate, and can demolish an otherwise valid probabilistic analysis. Following is a semi-formal definition of the problem and how to avoid it:
- Our goal in a probabilistic analysis is to estimate Bayes factors.
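- A Bayes factor is the ratio of two conditional probabilities: the probability of the evidence under one hypothesis divided by its probability under the competing hypothesis, p(E|H1)/p(E|H2).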
- A conditional probability p(E|H) is the probability the evidence E will occur, assuming H is true.
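- In real-world situations, there are many ways E can occur, so p(E|H) should integrate over all those ways (using “1 − ∏(1 − p_i)”, where p_i is the probability that E occurs in the i-th way and the ways are treated as independent).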
- In practice, focusing only on the most common way is usually accurate enough, and dramatically reduces the required work, as real world data tends to have extreme distributions, such as a power law distribution.
- This is the “best explanation” – the explanation that maximizes the likelihood of the hypothesis – and making a serious effort to find it is steelmanning.
- A mistake in this step, even just choosing the 2nd best explanation, could easily result in orders-of-magnitude errors.
- To reduce such mistakes, it is crucial to seriously meet the requirement above of “assuming H is true”. That is a very unintuitive process, as humans tend to feel only one hypothesis is true at any time. Rational thinkers are open to replacing their hypothesis in the face of evidence, but constantly switching between hypotheses is difficult.
- The example we like to give for choosing a best explanation is in DNA evidence. A prosecutor shows the court a statistical analysis of which DNA markers matched the defendant and their prevalence, arriving at a 1E-9 probability they would all match a random person, implying a Bayes factor near 1E9 for guilty.
But if we try to estimate p(DNA|~guilty) by truly assuming innocence, it is immediately evident how ridiculous it is to claim only 1 out of a billion innocent suspects will have a DNA match to the crime scene. There are obviously far better explanations like a lab mistake, framing, an object of the suspect being brought by someone to the scene, etc. The goal is to truly seek which explanation is most likely for each hypothesis, using the specifics of each case.
- Furthermore, it’s important to not only find the best explanation but honestly think about how well we understand the issue and estimate how likely it is there is some best explanation that still evades us (i.e. that we are currently estimating the 2nd best explanation or worse). This too is obvious to researchers who know not to go publish immediately upon finding something, but rather go through rigorous verification that their finding doesn’t have some other mundane explanation.
- So, the more complex the issue is, and the weaker our understanding of it, the less justified we are in claiming a low conditional probability. In frequentist terms, the question we should ask ourselves is: How often did I face a similar issue only to later find there was a much more mundane explanation? Suppose it’s 1 in 10; then the lower bound on our p is 0.1 times however frequently that mundane explanation happens (say 0.2, for a total of 0.02).
Claiming something like p=0.0001 in a situation where we don’t have a perfect understanding of the situation is a catastrophic mistake.
- For well-designed, replicated physics experiments, p can reach very low values (allowing for the five sigma standard), but when dealing with noisy complex systems involving biology, human behavior, exponential growth, etc., it is extremely hard to confidently claim that all confounders (i.e. better explanations for the finding) were eliminated, so claiming a very low p is an obvious mistake.
- The last guideline is to also examine our confidence in our process. As we examine best explanations, we also need to account for the possibility that we made mistakes in that process itself.
Suppose the only explanations for the DNA match are “by chance” and “lab mix-up”, and suppose we examined the lab procedures, talked to staff, and determined a mix-up was very unlikely. That still doesn’t make “by chance” the most likely explanation, since it is still possible our analysis was wrong, and the combined probability of our mistake and a mix-up (say 0.01*0.01) is still much higher than that of a chance match (1E-9).
To summarize: Estimating the Bayes factor requires estimating conditional probabilities, which requires finding the best explanation under each hypothesis, which can easily succumb to several pitfalls that cause catastrophic errors. To avoid those: a) Seek and honestly evaluate best explanations under the assumption the hypothesis is true, b) Estimate the likelihood that there is some better explanation that is yet to be found – the more complex the issue is, the higher the likelihood, and c) Estimate the likelihood of mistakes in the estimates themselves.
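The guidelines above can be sketched in a few lines of Python. This is only an illustration using the numbers quoted in the text (1E-9, 0.01*0.01, 0.1 and 0.2); the function names are hypothetical, the explanations are treated as independent, and p(DNA|guilty) is assumed to be roughly 1 so that the Bayes factor is simply 1/p(DNA|~guilty).

```python
from math import prod


def combine_explanations(ps):
    """p(E|H) when E can occur in several independent ways: 1 - prod(1 - p_i)."""
    return 1 - prod(1 - p for p in ps)


def mundane_floor(p_missed_before, p_mundane):
    """Rough lower bound on p(E|H): how often a mundane explanation was missed
    in similar past cases, times how frequently that explanation happens."""
    return p_missed_before * p_mundane


# Frequentist floor from the text: 1 in 10 times 0.2.
floor = mundane_floor(0.1, 0.2)  # about 0.02

# DNA example: explanations for a match, assuming the suspect is innocent.
p_chance = 1e-9           # random match, as claimed by the prosecutor
p_analysis_wrong = 0.01   # our review of the lab procedures was mistaken
p_mixup_if_wrong = 0.01   # a mix-up happened, given the review was mistaken
p_mixup = p_analysis_wrong * p_mixup_if_wrong

p_dna_given_innocent = combine_explanations([p_chance, p_mixup])

# Assumption: p(DNA | guilty) is about 1.
naive_bayes_factor = 1 / p_chance
steelmanned_bayes_factor = 1 / p_dna_given_innocent

print(f"floor: {floor:.3f}")
print(f"naive Bayes factor: {naive_bayes_factor:.3g}")
print(f"steelmanned Bayes factor: {steelmanned_bayes_factor:.3g}")
```

Under these assumptions, the mix-up explanation dominates the chance match, so the Bayes factor for guilt drops from about a billion to about ten thousand.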
The Main Mistakes
We therefore never provided extremely low conditional probabilities under zoonosis, and as a result didn’t have any extreme factors in our analysis. Unfortunately, the result of our steelmanning was that when our hypothesis’ explanation was favored, the effect on the final likelihood was much smaller than when Miller’s was. When the judges did not have the tools to decide between the sides, their result was some average of the two, which of course, given the extreme, strawmanned numbers offered by Miller, favored zoonosis.
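The averaging effect described above can be illustrated with a small sketch. The two Bayes factors are hypothetical and do not come from the debate, and the use of a geometric mean (averaging in log space) is our assumption about what “some average” might look like, not a description of how the judges actually combined numbers.

```python
from math import sqrt


def geometric_mean(a, b):
    """Average two Bayes factors in log space."""
    return sqrt(a * b)


# Bayes factors expressed as lab leak : zoonosis (hypothetical values).
steelmanned_factor = 10.0    # a moderate factor favoring lab leak 10:1
extreme_factor = 1 / 10_000  # an extreme factor favoring zoonosis 10,000:1

averaged = geometric_mean(steelmanned_factor, extreme_factor)
print(f"averaged factor: {averaged:.4f}")  # below 1 means zoonosis is favored
```

With these illustrative values the average lands at roughly 30:1 in favor of zoonosis, even though one side argued 10:1 the other way: the more extreme number dominates.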
Again, to clarify, this is no fault of the judges and is fully our responsibility for structuring the debate incorrectly. We found many such mistakes throughout both judges’ decisions, but in the interest of time would like to focus on the three most important ones that are enough to make lab-leak far more likely, once corrected.
Mistake #1: p=0.0001 for an HSM early cluster
The first mistake in the judges’ decision was accepting an extremely low likelihood for the Huanan Seafood Market (HSM) to form an early cluster of infected patients if Covid originated in a lab. Now that we’ve demonstrated the importance of steelmanning, it’s obvious that it is a mistake to consider HSM to be a random location in Wuhan (i.e. one that would form an early cluster only once in every 10,000 hypothetical SARS2 lab leaks in Wuhan).
Even though we were not able to provide a perfect model for why HSM is a likely early cluster location, the complexity of a virus spreading in an urban area, and especially the huge difference that a small exponential advantage at HSM will have on the final numbers, means there is no way to reach anywhere close to the level of confidence required to claim a number as extreme as p=0.0001.
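To give a feel for why a small exponential advantage matters, here is a minimal sketch. It assumes simple exponential growth at two locations that differ only in their daily growth rate; the rates and the time span are hypothetical and are not estimates for HSM or Wuhan.

```python
from math import exp


def relative_cases(rate_advantage_per_day, days):
    """Ratio of case counts between two locations that start equal and grow
    exponentially, where one has a slightly higher daily growth rate."""
    return exp(rate_advantage_per_day * days)


# Hypothetical: a 0.05/day growth-rate advantage sustained for 60 days.
ratio = relative_cases(0.05, 60)
print(f"case ratio after 60 days: {ratio:.1f}")
```

Under these assumed numbers, a seemingly minor advantage compounds into about a twentyfold difference in cases, which is why the location of an early cluster is so sensitive to details that are hard to model.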
Mistake #2: p(Lab leak)<0.01 in priors
The second major mistake in the judges’ decision again involves using an extremely low likelihood instead of steelmanning, this time in the prior likelihood for a lab leak. Each judge made different mistakes, but both reached numbers that, unbeknownst to them, imply gain-of-function research is extremely safe and that all the expert warnings and government moratoriums on it were wrong. That level of confidence is of course impossible to reach without some outstanding breakthrough in the understanding of the field. See more details here:
- Severe underestimate (0.02) of the probability that at least one researcher in WIV will undertake a project that WIV clearly expressed interest in. The mistake here seems to come from wrongly thinking SARS2 has features that are not covered by DEFUSE. Interestingly, after Stansifer reached his decision, it was discovered that WIV had planned to do a lot more than was officially written in DEFUSE.
- Severe underestimate (0.02) of the probability that a researcher working on a SARS2-like virus for weeks or months under BSL-2 would get infected. There is good reason to claim this could be an over 50% probability, and we gave it a conservative 15%, but 2% is highly overconfident.
- These two mistakes imply that any work at the Wuhan Institute of Virology (WIV) would cause a leak only once in 17,000 years. Given that WIV was planning to do coronavirus GoF experiments under BSL-2, meaning they would be handling a respiratory virus without even a face mask, this could easily be a 100x mistake.
- A redundant 0.01 factor was added for requiring WIV to have an unpublished backbone with 98% nucleotide similarity to SARS2. There is no such need. Since our prior was defined as a novel coronavirus pandemic, then all we need to estimate is the probability that a virus capable of that existed in WIV. Specifically, since DEFUSE describes searching for hACE2 matches and adding FCS, then the only question is whether WIV held a virus with a good hACE2 match.
We know BANAL-52 is identical in the RBD to SARS2, so if a relative of it was collected then they have a backbone and we’re done. But we should expand that to any virus with an hACE2 match, even one with 80% similarity to SARS2, so it’s very reasonable that at least one will be found. We gave this 50%.
Another way to look at this mistake: If we arbitrarily limit the engineered backbone to have 98% similarity to SARS2, we should apply the same limitation to the zoonotic progenitor, meaning we should discard from the prior any pandemic caused by a virus that doesn’t use hACE2, or by one with a good hACE2 match but a different genetic sequence.
If we place this requirement on both hypotheses, the effect cancels out.
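The cancellation can be checked with a short sketch. The prior probabilities below are hypothetical placeholders; only the 0.01 restriction factor comes from the text.

```python
def odds(p_lab, p_zoonosis):
    """Prior odds of lab leak versus zoonosis."""
    return p_lab / p_zoonosis


# Hypothetical priors before any backbone-similarity requirement.
p_lab, p_zoonosis = 0.2, 0.8
restriction = 0.01  # the factor for requiring 98% similarity to SARS2

unrestricted = odds(p_lab, p_zoonosis)
one_sided = odds(p_lab * restriction, p_zoonosis)
two_sided = odds(p_lab * restriction, p_zoonosis * restriction)

print(f"unrestricted: {unrestricted:.4f}")
print(f"applied to lab leak only: {one_sided:.4f}")
print(f"applied to both: {two_sided:.4f}")
```

Applying the factor to one hypothesis only shifts the odds by 100x, while applying it to both leaves the odds where they started.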
Mistake #3: Missing that the FCS estimate is heavily steelmanned
The third major mistake in the judges’ decision was using a low estimate for the likelihood of the Furin Cleavage Site (FCS) occurring naturally. A naive analysis of the combination of the rare occurrences behind the FCS insertion (which you can read about in our thread here) places us comfortably in a Bayes factor of millions. Ironically, had we just submitted this strawmanned calculation, we could have won the debate. However, since our goal was to actually determine which hypothesis is most likely, we steelmanned this estimate as well, thinking of the most likely way this could happen, truly assuming zoonosis is true.
As explained, we have updated our debate structure to avoid these problems in the future. Rootclaim’s $100,000 challenge is still open to anyone, including on the COVID-19 origins issue, as we’re still standing behind our analysis and willing to put our money where our mouth is.
We have invited Peter to reapply, using the updated textual debate format with ongoing judge feedback, allowing the sides to fully convey their hypothesis in exactly the problematic areas. Miller has declined a rematch; we respect his decision to move on and invite others to take his place.
The idea behind our challenge and risking money is to provide a strong incentive for deep research and analysis. This was successful beyond our expectations, with Miller now probably being one of the people with the deepest and most encompassing knowledge about the origins of COVID-19.
In ‘A Journey to the Center of the Earth’, Jules Verne wrote that “Science is made up of mistakes, but they are mistakes which it is useful to make because they lead little by little to the truth”. You don’t go into the probabilistic inference business expecting certainty, and in this spirit, we appreciate this loss as our compass to future success.