Tag: Rootclaim

Rootclaim’s COVID-19 Origins debate results

And the winner of Rootclaim’s COVID-19 origins debate and the $100,000 prize is…

Unfortunately, not us :). We would like to explain this result, but first, we would like to congratulate our opponent and Rootclaim’s first challenger, Peter Miller. Miller showcased an impressive understanding of the details during the debate, which was hard to match.

While Peter’s victory was well earned within the parameters of the debate, we believe it was also due to our failure to structure an effective debate. 

Obviously, one could simply conclude that the correct decision was reached and that zoonosis is the likelier hypothesis. We do not wish to be sore losers, but given the importance of this issue, regardless of the debate, we would like to explain why we still believe the lab leak hypothesis is the most likely explanation for the origin of COVID-19 and why, as our new and updated analysis shows, its likelihood only increased following the deeper analysis we did for the debate.

First, we’d like to clarify that the judges did an amazing job, putting immense effort, thought, and talent into their decisions:

  • Will Van Treuren is a microbiologist and immunologist with a PhD from Stanford. He works as Chief Science Officer at a biotech company developing new drugs to treat inflammatory diseases. Will’s written decision can be found here and here and a video summary is available here.

  • Eric Stansifer is an applied mathematician with a PhD in the Earth sciences from MIT. He has previously done research in a mathematical virology research group, doing simulations of MS2 capsid assembly. Eric’s written decision can be found here and a video summary is available here (you can also read his blog here).

What went wrong?

So, if the judges did their job well and our opponent played by the rules, what went wrong? We believe two things tilted the debate in favor of our opponent and we will correct them in future debates: 

First, the debate structure provided a major advantage to the debater with more memorized knowledge of the issue. The debate was live (via video), and Miller demonstrated extensive knowledge and a superb memory for many details, which we could not compete with in real time. This was not an issue in the second session about genetics, where we were represented by Yuri Deigin, but our second mistake (below) made his good efforts irrelevant. While such superiority is worthy of victory in normal debates, Rootclaim strives to create a model for reasoning and inference that minimizes the problems with human reasoning. Unfortunately, we structured a debate that rewards memory and real-time recall instead. To fix this, future debates will be held in an offline text format, with only a short video presentation at the end.

The second issue we identified was that we failed to incorporate a process of ongoing feedback from the judges, and so spent most of our time on issues that had little impact on the final decision. In their rulings, we found major mistakes in their understanding of our analysis, which could have been easily corrected had we built the debate with more direct ongoing feedback from the judges.

For example, we know from years of dealing with probabilistic inference that it is highly unintuitive, and it is a challenge to translate to human language. We therefore focused more on an intuitive understanding of the evidence, with probabilistic inference used only as a background framework.

In practice, we were surprised to see both judges found probabilistic inference to be the best way to reach a decision. We of course agree, but had we known this to be the case, we would’ve focused our efforts on explaining how to do probabilistic inference correctly, describing the major pitfalls we discovered over the years, and how to avoid them. As we failed to do so, errors in the judges’ probabilistic inference resulted in unrealistic numbers assigned to the evidence. 

The mistakes were heavily skewed toward zoonosis, since our methodology involves steelmanning and maximizing the likelihoods of both hypotheses, while Miller used figures heavily biased toward zoonosis, in some cases using extreme estimates that are impossible to reach in a robust probabilistic analysis, as we explain below.

The Risks of Strawmanning

This mistake of assigning extreme numbers is similar to strawmanning in human debate, and can demolish an otherwise valid probabilistic analysis. Following is a semi-formal definition of the problem and how to avoid it:

  1. Our goal in a probabilistic analysis is to estimate Bayes factors.
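  2. A Bayes factor is the ratio of two conditional probabilities: the probability of the evidence under one hypothesis divided by its probability under the competing hypothesis, p(E|H1)/p(E|H2).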
  3. A conditional probability p(E|H) is the probability the evidence E will occur, assuming H is true.
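  4. In real-world situations, there are many ways E can occur, so p(E|H) should integrate over all those ways (using “1−∏(1−p_i)”, where p_i is the probability of E occurring in the i-th way and the ways are treated as independent).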
  5. In practice, focusing only on the most common way is usually accurate enough, and dramatically reduces the required work, as real world data tends to have extreme distributions, such as a power law distribution. 
  6. This is the “best explanation” – the explanation that maximizes the likelihood of the hypothesis – and making a serious effort to find it is steelmanning. 
  7. A mistake in this step, even just choosing the 2nd best explanation, could easily result in orders-of-magnitude errors.
  8. To reduce such mistakes, it is crucial to seriously meet the requirement above of “assuming H is true”. That is a very unintuitive process, as humans tend to feel only one hypothesis is true at any time. Rational thinkers are open to replacing their hypothesis in the face of evidence, but constantly switching between hypotheses is difficult.
  9. The example we like to give for choosing a best explanation is in DNA evidence. A prosecutor shows the court a statistical analysis of which DNA markers matched the defendant and their prevalence, arriving at a 1E-9 probability they would all match a random person, implying a Bayes factor near 1E9 for guilty.
    But if we try to estimate p(DNA|~guilty) by truly assuming innocence, it is immediately evident how ridiculous it is to claim only 1 out of a billion innocent suspects will have a DNA match to the crime scene. There are obviously far better explanations like a lab mistake, framing, an object of the suspect being brought by someone to the scene, etc. The goal is to truly seek which explanation is most likely for each hypothesis, using the specifics of each case.
  10. Furthermore, it’s important to not only find the best explanation but honestly think about how well we understand the issue and estimate how likely it is there is some best explanation that still evades us (i.e. that we are currently estimating the 2nd best explanation or worse). This too is obvious to researchers who know not to go publish immediately upon finding something, but rather go through rigorous verification that their finding doesn’t have some other mundane explanation.
  11. So, the more complex the issue is, and the weaker our understanding of it, the less justified we are in claiming a low conditional probability. In frequentist terms, the question we should ask ourselves is: How often did I face a similar issue only to later find there was a much more mundane explanation? Suppose it’s 1 in 10; then the lower bound on our p is 0.1 times however frequently that mundane explanation occurs (say 0.2, for a total of 0.02).
    Claiming something like p=0.0001 in a situation where we don’t have a perfect understanding of the situation is a catastrophic mistake.
  12. For well-designed, replicated physics experiments p can reach very low values (allowing for the five sigma standard), but when dealing with noisy complex systems involving biology, human behavior, exponential growth, etc., it is extremely hard to confidently claim that all confounders (i.e. better explanations for the finding) were eliminated, so claiming a very low p is an obvious mistake.
  13. The last guideline is to also examine our confidence in our process. As we examine best explanations, we also need to account for the possibility that we made mistakes in that process itself.
    Suppose the explanations for the DNA match are only “by chance” and “lab mix-up”, and suppose we examined the lab procedures, talked to staff, and determined a mix-up was very unlikely. That still doesn’t make “by chance” the most likely explanation, since it is still possible our analysis was wrong, and the combined probability of our mistake and a mix-up (say 0.01*0.01) is still much higher than that of a chance match (1E-9).
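
The DNA example in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, p(DNA|guilty) is assumed to be 1, and the different ways of producing the evidence are treated as independent so that the “1−∏(1−p_i)” formula applies. The numbers are the illustrative ones from the list, not measured values.

```python
from math import prod


def combine_explanations(probs):
    """p(E|H) when E can arise in several independent ways: 1 - prod(1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probs)


def bayes_factor(p_e_given_h1, p_e_given_h2):
    """Ratio of the conditional probabilities of the evidence under two hypotheses."""
    return p_e_given_h1 / p_e_given_h2


# Illustrative numbers taken from the list above.
p_chance_match = 1e-9                 # random DNA match with an innocent person
p_mixup_despite_review = 0.01 * 0.01  # our review was wrong AND a lab mix-up occurred

# Assumption for this sketch: a guilty suspect always matches.
p_dna_given_guilty = 1.0

# Strawmanned: "by chance" is treated as the only explanation under innocence.
naive_bf = bayes_factor(p_dna_given_guilty, p_chance_match)

# Steelmanned: integrate over the explanations available under innocence.
p_dna_given_innocent = combine_explanations([p_chance_match, p_mixup_despite_review])
steelmanned_bf = bayes_factor(p_dna_given_guilty, p_dna_given_innocent)

# Item 11: a floor on p from "how often was there a mundane explanation?"
p_floor = 0.1 * 0.2

print(f"naive Bayes factor:       {naive_bf:.3g}")
print(f"steelmanned Bayes factor: {steelmanned_bf:.3g}")
print(f"lower bound on p:         {p_floor:.3g}")
```

With these inputs the strawmanned factor is on the order of a billion, while the steelmanned one is on the order of ten thousand, because the mix-up term dominates the chance match by five orders of magnitude.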

To summarize: Estimating the Bayes factor requires estimating conditional probabilities, which requires finding the best explanation under each hypothesis, which can easily succumb to several pitfalls that cause catastrophic errors. To avoid those: a) Seek and honestly evaluate best explanations under the assumption the hypothesis is true, b) Estimate the likelihood that there is some better explanation that is yet to be found – the more complex the issue is, the higher the likelihood, and c) Estimate the likelihood of mistakes in the estimates themselves.

The Main Mistakes

We therefore never provided extremely low conditional probabilities under zoonosis, and as a result didn’t have any extreme factors in our analysis. Unfortunately, the result of our steelmanning was that when our hypothesis’ explanation was favored, the effect on the final likelihood was much smaller than when Miller’s was. When the judges did not have the tools to decide between the sides, their result was some average of the two, which of course, given the extreme, strawmanned numbers offered by Peter, favored zoonosis.
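
To see why averaging a moderate estimate with an extreme one is lopsided, here is a minimal sketch. Everything in it is hypothetical: the two Bayes factors are made-up numbers, and modeling “some average” as a geometric mean (the midpoint in log space) is our assumption, not a description of how the judges actually combined the figures.

```python
import math


def geometric_midpoint(bf_a, bf_b):
    """Midpoint of two Bayes factors in log space (their geometric mean)."""
    return math.sqrt(bf_a * bf_b)


# Hypothetical Bayes factors for one piece of evidence, expressed as
# p(E|lab leak) / p(E|zoonosis). Values above 1 favor lab leak.
steelmanned_side = 10.0      # moderate factor favoring lab leak
extreme_side = 1.0 / 10_000  # extreme factor favoring zoonosis

midpoint = geometric_midpoint(steelmanned_side, extreme_side)
print(f"midpoint Bayes factor: {midpoint:.3g}")

# For comparison: two equally moderate, opposing estimates cancel out.
balanced = geometric_midpoint(10.0, 0.1)
print(f"balanced midpoint:     {balanced:.3g}")
```

In this toy case the midpoint lands well below 1, i.e. on the side of the more extreme estimate, whereas two equally moderate estimates would cancel.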

Again, to clarify, this is no fault of the judges and is fully our responsibility for structuring the debate incorrectly. We found many such mistakes throughout both judges’ decisions, but in the interest of time would like to focus on the three most important ones that are enough to make lab-leak far more likely, once corrected.

Mistake #1: p=0.0001 for an HSM early cluster

The first mistake in the judges’ decision was accepting an extremely low likelihood for the Huanan Seafood Market (HSM) to form an early cluster of infected patients if Covid originated in a lab. Now that we’ve demonstrated the importance of steelmanning, it’s obvious that it is a mistake to consider HSM to be a random location in Wuhan (i.e. one that will form an early cluster only once every 10,000 hypothetical SARS2 lab leaks in Wuhan).

Even though we were not able to provide a perfect model for why HSM is a likely early cluster location, the complexity of a virus spreading in an urban area, and especially the huge difference that a small exponential advantage at HSM will have on the final numbers, mean there is no way to reach anywhere close to the level of confidence required to claim a number as extreme as p=0.0001.

Mistake #2: p(Lab leak)<0.01 in priors

The second major mistake in the judges’ decision again involves using an extremely low likelihood instead of steelmanning, this time in the prior likelihood for a lab leak. Each judge made different mistakes, but both reached numbers that, unbeknownst to them, imply gain-of-function research is extremely safe, and that all the expert warnings and government moratoriums on it were wrong – a level of confidence that is of course impossible to reach without making some outstanding breakthrough in the understanding of the field. See more details here:

Stansifer’s mistakes:

  • Severe underestimate (0.02) of the probability that at least one researcher in WIV will undertake a project that WIV clearly expressed interest in. The mistake here seems to come from wrongly thinking SARS2 has features that are not covered by DEFUSE. Interestingly, after Stansifer reached his decision, it was discovered that WIV was planning to do a lot more than was officially written in DEFUSE.
  • Severe underestimate (0.02) of the probability that a researcher working on a SARS2-like virus for weeks or months under BSL-2 would get infected. There is good reason to claim this could be an over 50% probability, and we gave it a conservative 15%, but 2% is highly overconfident.
  • These two mistakes imply that the probability of any work in the Wuhan Institute of Virology (WIV) causing a leak is once in 17,000 years. Given that WIV was planning to do coronavirus GoF experiments under BSL-2, meaning they would be dealing with a respiratory virus without even a face mask, this could easily be a 100x mistake.

Van Treuren’s mistake:

  • A redundant 0.01 factor was added for requiring WIV to have an unpublished backbone with 98% nucleotide similarity to SARS2. There is no such need. Since our prior was defined as a novel coronavirus pandemic, all we need to estimate is the probability that a virus capable of that existed in WIV. Specifically, since DEFUSE describes searching for hACE2 matches and adding an FCS, the only question is whether WIV held a virus with a good hACE2 match.

    We know BANAL-52 is identical in the RBD to SARS2, so if a relative of it was collected then they have a backbone and we’re done. But we should expand that to any virus with an hACE2 match, even one with 80% similarity to SARS2, so it’s very reasonable that at least one will be found. We gave this 50%.

    Another way to look at this mistake: If we arbitrarily limit the engineered backbone to have 98% similarity to SARS2, we should apply the same limitation to the zoonotic progenitor, meaning we should discard from the prior any pandemic that is caused by viruses that don’t use hACE2, or by those with a good hACE2 match but a different genetic sequence.
    If we place this requirement on both hypotheses, the effect cancels out.

Mistake #3: Missing that the FCS estimate is heavily steelmanned

The third major mistake in the judges’ decision was using a low estimate for the likelihood of the Furin Cleavage Site (FCS) occurring naturally. A naive analysis of the combination of the rare occurrences behind the FCS insertion (which you can read about in our thread here) places us comfortably in a Bayes factor of millions. Ironically, had we just submitted this strawmanned calculation, we could have won the debate. However, since our goal was to actually determine which hypothesis is most likely, we steelmanned this estimate as well, thinking of the most likely way this could happen, truly assuming zoonosis is true.

Conclusion

As explained, we have updated our debate structure to avoid these problems in the future. Rootclaim’s $100,000 challenge is still open to anyone, including on the COVID-19 origins issue, as we’re still standing behind our analysis and willing to put our money where our mouth is. 

We have invited Peter to reapply, using the updated textual debate format with ongoing judge feedback, which allows the sides to fully convey their hypotheses in exactly the problematic areas. Miller has declined a rematch; we respect his decision to move on and invite others to take his place.

The idea behind our challenge and risking money is to provide a strong incentive for deep research and analysis. This was successful beyond our expectations with Miller now probably one of the people with the deepest and most encompassing knowledge about the origins of COVID-19.

In ‘A Journey to the Center of the Earth’, Jules Verne wrote that “Science is made up of mistakes, but they are mistakes which it is useful to make because they lead little by little to the truth”. You don’t go into the probabilistic inference business expecting certainty. In this spirit, we appreciate this loss as our compass to future success.

Rootclaim accepts $500,000 challenge on COVID vaccine safety & efficacy

Have mRNA vaccines killed more people than they have saved?

That’s what American entrepreneur Steve Kirsch claims in his list of Covid-19 challenges. Today, Rootclaim has officially accepted his challenge in the amount of $500,000.

After reviewing all challenges, we decided to accept challenge no. 6: “The Pfizer and Moderna mRNA vaccines have killed more people than they have saved from dying from COVID”. This addresses two of the most pressing and hotly debated issues of the pandemic: vaccine efficacy and vaccine safety. Advancing public discourse on these issues will likely save lives and improve preparedness for future pandemics.

After analyzing the available evidence, we conclude that despite several shortcomings, mRNA vaccines have saved many more lives than they cost.

While we challenge Kirsch on this specific item, we actually agree with a number of his other claims, including some that run counter to mainstream opinion. As Kirsch pointed out in his post, we agree with item 9 (“Lab origin is more likely”) and even offer our own challenge on the subject. Before examining vaccines, we studied the benefit of masks (items 7 and 10 in Kirsch’s list) and were surprised to find it is far from clear they are indeed effective, given the many factors involved in their practical use, such as most people wearing them poorly, virus transmission through the eyes, virus adaptation, and considerations of herd immunity. We are also generally in agreement on the importance of drug repurposing in COVID (related to challenge no. 8).

We have great admiration for Kirsch’s willingness to take a personal risk on his public claims. This is in sharp contrast to the many public figures constantly making overconfident statements on matters of great importance, without taking any risk. This is something we repeatedly encounter in our work. Some examples:

These examples demonstrate the low value of claims made when nothing is at risk: public discourse is awash with baseless, overconfident claims that carry no repercussions for their claimants if they turn out to be false. We believe that adding ‘skin in the game’ can dramatically reduce this problem, and therefore offer our own public debate challenge, which coincidentally happens to be very similar to Kirsch’s. So far no one has applied.

We therefore greatly appreciate Kirsch’s courage and leadership here. We see it as our responsibility to accept a challenge when we think the claim is wrong, and of course, take the loss if we fail.

It should be emphasized that regardless of who wins in this particular case, this is a victory for public discourse. First, by offering a reliable resolution to the important question of vaccine efficacy and safety, and more importantly, by setting a standard for settling controversies: an impartial, judged debate where both sides take a significant risk on the outcome. Hopefully, in the future, people making confident assertions on issues of importance without taking a risk will be ignored as background noise.

Update: As we were applying, we noticed Kirsch has recently added a note to his challenge page, terminating the bets due to no one applying. Since we were already in private discussions with Kirsch on the terms before this update, we would be very surprised to find this would apply to us. 

Update #2: We and Kirsch are making good progress on setting the parameters of the $500,000 challenge and we’re in the process of finalizing our agreed picks for two judges. Our preference will be for the most experienced, well-respected, and unbiased experts. 

MH 17: Weak Evidence Matters

Discounting Weak Evidence

One pitfall to avoid is prematurely discounting seemingly weak evidence. Weak evidence can take many forms. It could be evidence that seems very unlikely under all hypotheses. Or it could be evidence that is non-intuitive and doesn’t seem to fit what we consider “conclusive” evidence.

When evaluating evidence, it’s easy to get distracted looking for “irrefutable” evidence (more on that in an upcoming blog post). However, that’s a mistake. What’s really important is the ratio between how likely evidence is under the hypotheses.
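
A small sketch of why the ratio matters more than how “conclusive” the evidence feels. The hypotheses H1 and H2, the probabilities, and the function names are all hypothetical, and multiplying Bayes factors assumes the pieces of evidence are independent given each hypothesis.

```python
def bayes_factor(p_e_given_h1, p_e_given_h2):
    """How much more likely the evidence is under H1 than under H2."""
    return p_e_given_h1 / p_e_given_h2


def update_odds(prior_odds, bayes_factors):
    """Posterior odds of H1 vs H2 after independent pieces of evidence."""
    odds = prior_odds
    for bf in bayes_factors:
        odds *= bf
    return odds


# Evidence that is very unlikely under BOTH hypotheses can still be informative:
# 1 in 1,000 under H1 versus 1 in 10,000 under H2 gives a ratio of about 10.
rare_evidence_bf = bayes_factor(0.001, 0.0001)

# Three individually weak pieces of evidence, each only twice as likely under H1.
weak_bfs = [bayes_factor(0.2, 0.1)] * 3

posterior_odds = update_odds(1.0, weak_bfs)  # starting from even odds
print(f"rare evidence Bayes factor: {rare_evidence_bf:.3g}")
print(f"odds after three weak items: {posterior_odds:.3g}")
```

Neither kind of evidence looks “irrefutable” on its own, yet each shifts the odds, and several weak items together shift them noticeably.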


Unraveling the Mystery of Trump’s Hair

What’s the Deal with Trump’s Hair?

Given Donald Trump’s flair for controversy, it’s not surprising that even his hair would be the subject of debate. In “What is the story behind Donald Trump’s hair?”, Rootclaim analyzes the most popular claims about the Donald’s trademark hair-covering to cut through the uncertainty.


False Dilemma Fallacies–Finding the Gray

The False Dilemma

The Fallacy of Presumption can take many forms. One common example is the false dilemma. This fallacy is also known as the false binary or false dichotomy. As its name indicates, the false dilemma divides a scenario into only two alternatives. When the situation calls for a yes/no answer, that works fine. But in many situations, things are more complex. Thus the false dilemma tricks you into choosing between two imprecise, inaccurate or otherwise flawed options.


© 2025 Rootclaim Blog
