Rootclaim’s COVID-19 Origins debate results

And the winner of Rootclaim’s COVID-19 origins debate and the $100,000 prize is…

Unfortunately, not us :). We will explain this result below, but first we would like to congratulate our opponent and Rootclaim’s first challenger, Peter Miller. Miller showcased an impressive understanding of the details during the debate, which was hard to match.

While Peter’s victory was well earned within the parameters of the debate, we believe it was also due to our failure to structure an effective debate. 

Obviously, one can simply conclude that the correct decision was reached and that zoonosis is the likelier hypothesis. But without resorting to sore losing, and given the importance of this issue regardless of the debate, we would like to explain why we still believe the lab leak hypothesis is the most likely explanation for the origin of COVID-19, and why, as our new and updated analysis shows, its likelihood only increased following the deeper analysis we did for the debate.

First, we’d like to clarify that the judges did an amazing job, putting immense effort, thought, and talent into their decisions:

  • Will Van Treuren is a microbiologist and immunologist with a PhD from Stanford. He works as Chief Science Officer at a biotech company developing new drugs to treat inflammatory diseases. Will’s written decision can be found here and here and a video summary is available here.

  • Eric Stansifer is an applied mathematician with a PhD in the Earth sciences from MIT. He has previously done research in a mathematical virology research group, doing simulations of MS2 capsid assembly. Eric’s written decision can be found here and a video summary is available here (you can also read his blog here).

What went wrong?

So, if the judges did their job well and our opponent played by the rules, what went wrong? We believe two things tilted the debate in favor of our opponent, and we will correct them in future debates:

First, the debate structure provided a major advantage to the debater with more memorized knowledge of the issue. The debate was live (via video), and Miller displayed extensive knowledge and a superb memory for many details, which we could not compete with in real time. This was not an issue in the second session, about genetics, where we were represented by Yuri Deigin, but our second mistake (below) made his good efforts irrelevant. While such superiority is worthy of victory in normal debates, Rootclaim strives to create a model for reasoning and inference that minimizes the problems of human reasoning; unfortunately, we structured a debate that rewards exactly the kind of recall advantage it should neutralize. To fix this, future debates will be held in an offline text format, with only a short video presentation at the end.

The second issue we identified was that we failed to incorporate a process of ongoing feedback from the judges, and so spent most of our time on issues that had little impact on the final decision. In their rulings we found major misunderstandings of our analysis, which could easily have been corrected had we built the debate around more direct, ongoing feedback from the judges.

For example, we know from years of dealing with probabilistic inference that it is highly unintuitive and hard to translate into human language. We therefore focused more on an intuitive understanding of the evidence, using probabilistic inference only as a background framework.

In practice, we were surprised to see that both judges found probabilistic inference to be the best way to reach a decision. We of course agree, but had we known this would be the case, we would have focused our efforts on explaining how to do probabilistic inference correctly, describing the major pitfalls we discovered over the years and how to avoid them. Because we failed to do so, errors in the judges’ probabilistic inference resulted in unrealistic numbers being assigned to the evidence.

The mistakes were heavily skewed toward zoonosis: our methodology involves steelmanning, maximizing the likelihood of the evidence under both hypotheses, while Miller used figures heavily biased toward zoonosis, in some cases extreme estimates that are impossible to justify in a robust probabilistic analysis, as we explain below.

The Risks of Strawmanning

This mistake of assigning extreme numbers is the probabilistic equivalent of strawmanning in a human debate, and it can demolish an otherwise valid probabilistic analysis. Following is a semi-formal description of the problem and how to avoid it:

  1. Our goal in a probabilistic analysis is to estimate Bayes factors.
  2. A Bayes factor is a ratio of conditional probabilities, p(E|H1)/p(E|H2).
  3. A conditional probability p(E|H) is the probability that the evidence E will occur, assuming H is true.
  4. In real-world situations, there are many ways E can occur, so p(E|H) should integrate over all of them (for roughly independent ways, p(E|H) = 1 − ∏ᵢ(1 − pᵢ)).
  5. In practice, focusing only on the most common way is usually accurate enough and dramatically reduces the required work, as real-world data tends to have extreme distributions, such as power laws.
  6. This is the “best explanation” – the explanation that maximizes the conditional probability of the evidence under the hypothesis – and making a serious effort to find it is steelmanning.
  7. A mistake in this step, even just choosing the 2nd best explanation, could easily result in orders-of-magnitude errors.
  8. To reduce such mistakes, it is crucial to take seriously the requirement above of “assuming H is true”. This is a very unintuitive process, as humans tend to feel that only one hypothesis is true at any given time. Rational thinkers are open to replacing their hypothesis in the face of evidence, but constantly switching between hypotheses is difficult.
  9. The example we like to give for choosing a best explanation is DNA evidence. A prosecutor shows the court a statistical analysis of which DNA markers matched the defendant and their prevalence in the population, arriving at a 1E-9 probability that they would all match a random person, implying a Bayes factor near 1E9 for guilt.
    But if we try to estimate p(DNA|~guilty) while truly assuming innocence, it is immediately evident how ridiculous it is to claim that only 1 in a billion innocent suspects would have a DNA match to the crime scene. There are obviously far better explanations: a lab mistake, framing, an object belonging to the suspect brought to the scene by someone else, etc. The goal is to truly seek the most likely explanation under each hypothesis, using the specifics of each case.
  10. Furthermore, it’s important not only to find the best explanation but to honestly assess how well we understand the issue and estimate how likely it is that some better explanation still evades us (i.e., that we are currently holding the 2nd-best explanation or worse). This too is obvious to researchers, who know not to publish immediately upon finding something, but rather to rigorously verify that their finding doesn’t have some other mundane explanation.
  11. So, the more complex the issue and the weaker our understanding of it, the less justified we are in claiming a low conditional probability. In frequentist terms, the question we should ask ourselves is: how often did I face a similar issue, only to later find there was a much more mundane explanation? Suppose it’s 1 in 10; then the lower bound on our p is 0.1 times however frequent that mundane explanation is (say 0.2, for a lower bound of 0.02).
    Claiming something like p=0.0001 in a situation where we don’t have a near-perfect understanding is a catastrophic mistake.
  12. For well-designed, replicated physics experiments p can be very low (hence the five-sigma standard), but when dealing with noisy complex systems involving biology, human behavior, exponential growth, etc., it is extremely hard to confidently claim that all confounders (i.e., better explanations for the finding) have been eliminated, so claiming a very low p is an obvious mistake.
  13. The last guideline is to also examine our confidence in the process itself. As we examine best explanations, we need to account for the possibility that we made mistakes in that examination.
    Suppose the only explanations for the DNA match are “by chance” and “lab mix-up”, and suppose we examined the lab procedures, talked to staff, and determined a mix-up was very unlikely. That still doesn’t make “by chance” the most likely explanation, since it is still possible our analysis was wrong, and the combined probability of our mistake and a mix-up (say 0.01*0.01) is still much higher than a chance match (1E-9). A minimal numeric sketch of these guidelines follows this list.
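
To make these guidelines concrete, here is a minimal Python sketch of the DNA example above. Every probability in it is an illustrative assumption, not an estimate from any real case:

```python
# Minimal sketch of guidelines 4, 9, and 13 above, using the DNA example.
# All numbers are illustrative assumptions.

def combine(probs):
    """p(E|H) when E can occur via several roughly independent ways:
    1 - prod(1 - p_i)."""
    result = 1.0
    for p in probs:
        result *= 1.0 - p
    return 1.0 - result

# The prosecutor's strawmanned estimate considers only a random match.
p_chance = 1e-9

# Steelmanning: other ways the evidence can occur if the suspect is innocent.
p_lab_mixup = 0.01 * 0.01  # mix-up happened AND our audit of the lab was wrong
p_planted = 1e-4           # framing, or an object carried to the scene
p_match_if_innocent = combine([p_chance, p_lab_mixup, p_planted])  # ~2e-4

p_match_if_guilty = 0.95   # a match is near-certain if guilty

print(f"naive Bayes factor:       {p_match_if_guilty / p_chance:.1e}")            # ~9.5e+08
print(f"steelmanned Bayes factor: {p_match_if_guilty / p_match_if_innocent:.0f}") # ~4750
```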

To summarize: estimating a Bayes factor requires estimating conditional probabilities, which requires finding the best explanation under each hypothesis, a step that can easily succumb to several pitfalls that cause catastrophic errors. To avoid them: a) seek and honestly evaluate best explanations under the assumption that the hypothesis is true, b) estimate the likelihood that some better explanation is yet to be found (the more complex the issue, the higher that likelihood), and c) estimate the likelihood of mistakes in the estimates themselves.

The Main Mistakes

We therefore never assigned extremely low conditional probabilities under zoonosis, and as a result didn’t have any extreme factors in our analysis. Unfortunately, the result of our steelmanning was that when our hypothesis’s explanation was favored, the effect on the final likelihood was much smaller than when Miller’s was. And when the judges did not have the tools to decide between the sides, their result was some average of the two, which, given the extreme, strawmanned numbers offered by Peter, naturally favored zoonosis.
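
A toy illustration (with invented numbers) of how this kind of averaging rewards extreme claims:

```python
import math

# Suppose one side submits a hedged, steelmanned factor for a piece of
# evidence, while the other submits an extreme, strawmanned one.
steelmanned = 5.0       # hedged estimate (invented for illustration)
strawmanned = 10_000.0  # extreme estimate (invented for illustration)

# A judge with no tools to decide between the sides may settle near the
# geometric mean of the two proposals:
compromise = math.sqrt(steelmanned * strawmanned)
print(f"compromise factor: ~{compromise:.0f}")  # ~224, i.e. ~45x the hedged value
```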

Again, to clarify, this is no fault of the judges and is fully our responsibility for structuring the debate incorrectly. We found many such mistakes throughout both judges’ decisions, but in the interest of time we will focus on the three most important ones, which are enough to make lab leak far more likely once corrected.

Mistake #1: p=0.0001 for an HSM early cluster

The first mistake in the judges’ decision was accepting an extremely low likelihood for the Huanan Seafood Market (HSM) to form an early cluster of infected patients if COVID originated in a lab. Now that we’ve demonstrated the importance of steelmanning, it’s obvious that it is a mistake to treat HSM as a random location in Wuhan (i.e., one that would form an early cluster only once every 10,000 hypothetical SARS2 lab leaks in Wuhan).

Even though we were not able to provide a perfect model of why HSM is a likely early-cluster location, the complexity of a virus spreading in an urban area, and especially the huge difference that a small exponential advantage at HSM makes to the final numbers, mean there is no way to get anywhere close to the level of confidence required to claim a number as extreme as p=0.0001.
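
As a toy back-of-the-envelope sketch (every number here is a made-up assumption, chosen only to show how fast a small advantage compounds):

```python
# Under a uniform "random location" model, if HSM hosts a fraction f of
# Wuhan's person-contacts, the chance the first detected cluster forms
# there is roughly f.
f = 1e-4  # assumed share of city-wide exposure happening at HSM

# But crowded venues with repeat visitors amplify transmission chains,
# and a modest per-generation growth edge compounds before detection:
edge_per_generation = 2.0  # assumed growth advantage at HSM
generations_unnoticed = 5  # assumed generations before a cluster is noticed

amplification = edge_per_generation ** generations_unnoticed  # 32x
print(f"uniform model:  p = {f:.0e}")                  # 1e-04
print(f"with advantage: p = {f * amplification:.1e}")  # 3.2e-03
```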

Mistake #2: p(Lab leak)<0.01 in priors

The second major mistake in the judges’ decision again involves using an extremely low likelihood instead of steelmanning, this time in the prior likelihood of a lab leak. Each judge made different mistakes, but both reached numbers that, unbeknownst to them, imply that gain-of-function research is extremely safe and that all the expert warnings and government moratoriums on it were wrong – a level of confidence that is of course impossible to reach without some outstanding breakthrough in the understanding of the field. See more details here:

Stansifer’s mistakes:

  • Severe underestimate (0.02) of the probability that at least one researcher at WIV would undertake a project that WIV clearly expressed interest in. The mistake here seems to come from wrongly thinking SARS2 has features that are not covered by DEFUSE. Interestingly, after Stansifer reached his decision, it was discovered that WIV planned to do much more than was officially written in DEFUSE.
  • Severe underestimate (0.02) of the probability that a researcher working on a SARS2-like virus for weeks or months under BSL-2 would get infected. There is good reason to believe this probability could exceed 50%; we gave it a conservative 15%, but 2% is highly overconfident.
  • Together, these two mistakes imply that the probability of any work at the Wuhan Institute of Virology (WIV) causing a leak is about 1 in 17,000 years. Given that WIV was planning to do coronavirus GoF experiments under BSL-2 – meaning researchers would handle a respiratory virus without even a face mask – this could easily be a 100x mistake, as the sketch below illustrates.
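
A short arithmetic sketch of the magnitude involved (illustrative only; the judge’s full calculation included additional factors, and the corrected values below are the ones we argue for above):

```python
judged_project = 0.02  # judged p(a WIV researcher undertakes such a project)
judged_infect = 0.02   # judged p(infection during weeks of BSL-2 work)

steel_project = 0.50   # assumed plausible value given WIV's expressed interest
steel_infect = 0.15    # the conservative figure from our analysis

correction = (steel_project / judged_project) * (steel_infect / judged_infect)
print(f"combined correction: ~{correction:.0f}x")                     # ~188x
print(f"corrected leak rate: ~1 in {17_000 / correction:.0f} years")  # ~1 in 91
```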

Van Treuren’s mistake:

  • A redundant 0.01 factor was added for requiring WIV to have an unpublished backbone with 98% nucleotide similarity to SARS2. There is no such need. Since our prior was defined over a novel coronavirus pandemic, all we need to estimate is the probability that a virus capable of causing one existed at WIV. Specifically, since DEFUSE describes searching for hACE2 matches and adding an FCS, the only question is whether WIV held a virus with a good hACE2 match.

    We know BANAL-52 is identical to SARS2 in the RBD, so if a relative of it was collected, then they had a backbone and we’re done. But we should expand that to any virus with a good hACE2 match, even one with only 80% similarity to SARS2, so it’s very reasonable that at least one would be found. We gave this 50%.

    Another way to look at this mistake: if we arbitrarily require the engineered backbone to have 98% similarity to SARS2, we should apply the same requirement to the zoonotic progenitor, meaning we should discard from the prior any pandemic caused by a virus that doesn’t use hACE2, or by one with a good hACE2 match but a different genetic sequence.
    If we place this requirement on both hypotheses, the effect cancels out, as the short derivation below shows.
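
A minimal way to write down the cancellation (our notation: R is the restriction to the specific genetic features of SARS2, e.g. a ~98%-similar hACE2-matching backbone):

```latex
\frac{p(\text{pandemic} \wedge R \mid \text{lab leak})}
     {p(\text{pandemic} \wedge R \mid \text{zoonosis})}
=
\frac{p(\text{pandemic} \mid \text{lab leak})}
     {p(\text{pandemic} \mid \text{zoonosis})}
\cdot
\frac{p(R \mid \text{pandemic}, \text{lab leak})}
     {p(R \mid \text{pandemic}, \text{zoonosis})}
```

If the restriction is roughly as likely under one hypothesis as the other, the second ratio is ~1 and the prior odds are unchanged; applying the restriction to only one side silently multiplies in a factor that shouldn’t be there.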

Mistake #3: Missing that the FCS estimate is heavily steelmanned

The third major mistake in the judges’ decision was using a low estimate for the likelihood of the Furin Cleavage Site (FCS) occurring naturally. A naive analysis of the combination of rare occurrences behind the FCS insertion (which you can read about in our thread here) comfortably yields a Bayes factor in the millions. Ironically, had we just submitted this strawmanned calculation, we could have won the debate. However, since our goal was to actually determine which hypothesis is most likely, we steelmanned this estimate as well, seeking the most likely way this could happen while truly assuming zoonosis is true.
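
A sketch of the difference (all numbers are invented placeholders, not our actual FCS estimates, which are in the linked thread):

```python
# Naive approach: multiply several independently rare events behind the
# FCS insertion to get p(FCS | zoonosis).
rare_events = [1e-2, 1e-2, 1e-2]  # placeholder probabilities
p_fcs_zoonosis_naive = 1.0
for p in rare_events:
    p_fcs_zoonosis_naive *= p     # 1e-06

# Steelmanning (guidelines 10-11 above): in a noisy biological system there
# is a real chance some unconsidered natural pathway explains the FCS,
# which floors the conditional probability far above the naive product.
p_explanation_missed = 0.1  # assumed chance a better explanation evades us
p_that_explanation = 0.1    # assumed frequency of such a pathway
p_fcs_zoonosis = max(p_fcs_zoonosis_naive,
                     p_explanation_missed * p_that_explanation)

p_fcs_lableak = 0.5  # assumed: an FCS insertion was explicitly planned

print(f"naive Bayes factor:       {p_fcs_lableak / p_fcs_zoonosis_naive:.0e}")  # 5e+05
print(f"steelmanned Bayes factor: {p_fcs_lableak / p_fcs_zoonosis:.0f}")        # 50
```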

Conclusion

As explained, we have updated our debate structure to avoid these problems in the future. Rootclaim’s $100,000 challenge is still open to anyone, including on the COVID-19 origins issue, as we still stand behind our analysis and are willing to put our money where our mouth is.

We have invited Peter to reapply using the updated textual debate format with ongoing judge feedback, which allows both sides to fully convey their hypotheses in exactly the areas that proved problematic. Miller has declined a rematch, but we respect his decision to move on and invite others to take his place.

The idea behind our challenge and risking money is to provide a strong incentive for deep research and analysis. This was successful beyond our expectations, with Miller now probably one of the people with the deepest and most encompassing knowledge of the origins of COVID-19.

In ‘A Journey to the Center of the Earth’, Jules Verne wrote that “Science is made up of mistakes, but they are mistakes which it is useful to make because they lead little by little to the truth”. You don’t go into the probabilistic inference business expecting certainty, and in this spirit we take this loss as our compass to future success.

8 Comments

  1. This displays a stunning lack of understanding of what went wrong here. You frame Peter as “the debater with more memorized knowledge of the issue,” saying “The debate was live (via video), and Miller displayed extensive knowledge and a superb memory for many details, which we could not compete with in real time.” The problem is not how good his memory is, but the attitude that Rootclaim takes to the importance of knowing the facts in the first place. Saar states clearly in the debate multiple times that facts are of secondary importance when compared with having a good inference method. The focus is always on the inference and never on the facts from which the inferences should be made. Many, many times in the debate Rootclaim will get a basic fact very wrong, and their response is always the same: any particular fact being wrong is unimportant because the probabilities are so overwhelming that any particular change cannot matter. However, it becomes clear over the course of the debate that the underlying problem isn’t with any one particular fact; it is the pervasive attitude that getting the facts right in the first place is not that important, and this infects the whole inference process. When confronted with someone who did take that step seriously, Rootclaim flounders. It’s not about how good his memory is.

    • You seem to misrepresent our stance. We fully recognize the importance of facts and evidence. It is just that, in the real world, having a robust inference method is far more important. We often find there is an abundance of evidence, and the problem is that each side cherry picks and misinterprets it to support their own hypothesis. A good inference method that reduces bias, and accounts for the high uncertainty in the interpretation of evidence will outperform other methods even if those use more evidence. The lab leak hypothesis we presented was built in a way that is highly robust to evidential mistakes, and indeed the few that were corrected during the debate had little impact. You, however, seem to claim there were mistakes that should have flipped our conclusion – please point those out.

      Also worth noting that the zoonosis side has made at least as many factual mistakes during the debate (e.g. miscounting CGG arginines, misdating the early family cluster, calling Connor Reed a tabloid story, saying Reed shopped at HSM and claimed his cat died of Covid); we just didn’t make a big deal of it because we know that’s normal and doesn’t in itself mean a hypothesis is false.

      Another point you missed is that, regardless of any factual mistakes or which hypothesis is actually true, the judges very clearly made obvious mistakes in probabilistic inference (which, to be clear, was our fault for not discussing this enough). It is clearly wrong to claim that an HSM worker is as likely as any Wuhan resident to be the index case under lab leak, or that the prior for a gain-of-function pandemic is only once in thousands of years (i.e. all the experts were wrong to warn against it). When correcting these two mistakes in the judges’ calculations, both change to lab leak being the more likely hypothesis.

      • Nick Millington

        March 22, 2024 at 3:58 pm

        “It is just that, in the real world, having a robust inference method is far more important.”

        This is utterly incorrect. Inference is what causes people like yourself to run into mental mind holes, and it’s very, very indicative of a closed mindset.

        Your ‘inference process’ is just a superposition of a bunch of other people’s ‘inference processes’ and relies on statistical guesswork from the outset. You claim you’re being structured by relying on a statistical formalism (which, if I were to guess, comes from some kind of hyper-rationalist mindset to begin with) and then inventing reasons why your back-worked inference conclusion is correct.

        Take this for example:

        “It is clearly wrong to claim that an HSM worker is as likely as any Wuhan resident to be the index case under lab leak,”

        No, it isn’t. It is clearly wrong to indicate the opposite. A wet market is a literal breeding ground for cross contamination, a highly packed mess of animals all perspiring and dying over each other, with lax safety standards in a highly complex environment.

        Under the lab leak scenario, by definition the most likely common vector is the individual importing it into the wet market, followed by the individuals that they came into contact with during that period.

        Since we do not know who patient zero actually was, we must assume a number of things before even considering your silly Bayesianism:

        – COVID has, and always has had, a high transmissibility, low lethality profile. We therefore cannot prove definitively or scientifically if there was a given patient zero, though there surely was
        – COVID transmits aerobically, meaning anyone within several metres of patient zero has a non zero chance of being infected. Following which, they may or may not have returned to the Wuhan market after that
        – The Wuhan market was identified not because it was a cluster of cases, but because it is apparent that this was a common cause of the virus. This does not indicate a single day’s exposure. The profile for a rapidly spreading virus with a high R value in a vector rich environment will appear broadly similar over the period of time of observation, which is necessarily flattened because of insufficient tracking of viral presence over that period

        Since it cannot be assumed a priori that an HSM worker would have more likelihood of having contact with patient zero than do daily visitors to the market, and since there is no evidence of a persistence in that region beyond that experienced in other regions, the only reasonable statistical assumptions are:

        – you do not know who patient zero was
        – you do not know where patient zero went
        – you do not know who patient zero infected is likely to have infected
        – you cannot possibly know if patient zero was in fact the index case, or not

        Asymptomatic COVID accounts for between 0.1 and 0.3 of all cases, with a good 0.4 of cases beyond that appearing initially similar to the kind of mild respiratory illnesses that spread in human-dense environments, with a latency period at this point of between 4 and 7 days, with no direct environmental controls or restrictions, and as a novel virus in a new environment.

        In the context of what you’re saying, for the purposes of such a messy, highly virile environment, the likelihood of the index case having any identifiable properties beyond “they were in the wet market at some point” is vanishingly small. The two cases effectively reduce to each other. A zoonotic infection looks exactly like a lab leak in that case.

        Which you admit to, here:

        “The first mistake in the judges’ decision was accepting an extremely low likelihood for the Huanan Seafood Market (HSM) to form an early cluster of infected patients if COVID originated in a lab. Now that we’ve demonstrated the importance of steelmanning, it’s obvious that it is a mistake to treat HSM as a random location in Wuhan (i.e., one that would form an early cluster only once every 10,000 hypothetical SARS2 lab leaks in Wuhan).”

        A lab leak that spreads to that location requires a series of events to occur in approximate order:

        – an individual working at the institute to be working on a specific line of research
        – for that individual to suffer a breach of professional protocol that they then do not amend
        – for an individual to proceed, without getting sick, to the wet market, in the period during or after their containment protocol breach, but not before COVID itself becomes infectious
        – for the individual to specifically infect only individuals who are frequently at the wet market by repeated exposure
        – for the individual to, during the course of their travels, NOT infect other individuals with whom they have come into contact
        – for the individual at the Wuhan institute to have walked around the wet market during the period in which they were infectious
        – for that individual, the one working on the virus, to have not infected a single other person beyond those working at the wet market
        – for that individual to have avoided infecting any other large group of people, despite not being aware of their infectious status

        Even a cursory examination of Wuhan will reveal that there are hundreds of highly populated high rise buildings, university campuses and dense infrastructure surrounding WIM. The probability of a lab leak directly infecting a market given randomised pedestrian and transit behaviour is approximately the same as it is for the lay population, and reduces to the basal level for that location.

        Simply put, you worked backwards trying to prove, using some distorted probabilistic nonsense, without considering or meaningfully modelling the paths of a likely patient zero, how likely it would be that a given individual would directly infect the wet market.

        You lost the debate because you thought you could bamboozle people with numbers, and you forgot that numbers pertain to real things which are examinable. Given how difficult it actually is to model behaviour of large groups of people, you should know better.

        • Thanks for your comment. It contains several mistakes and misunderstandings of our analysis.

          Instead of going through them one by one, we would like to first agree that the following statement is indeed correct: “It is clearly wrong to claim that an HSM worker is as likely as any Wuhan resident to be the index case under lab leak,”

          We will leave aside for now the complex internal dynamics of the virus spread within HSM after the first infection, and focus only on the probability of that first infection. There shouldn’t be any contention that people who interact daily with customers are much more likely to be infected early compared to a random Wuhan resident, who may normally just be interacting with a few people in their office and home.

          Assuming that’s obvious, then it’s clear HSM workers are more likely to be an index case (the first case noticed). We’re now just left with the question of how much more likely. Our analysis provides several ways to estimate that. Let us know if you see any problems with that estimate and can provide a better one, and if it’s better we will update accordingly.

      • Nick Millington

        March 22, 2024 at 4:12 pm

        For those that CBA reading and want a meaningful description of what’s happening here:

        The core of the argument is that it is possible by statistical reasoning to determine exactly where the virus came from by modelling its behaviour in the aftermath.

        The main argument against this, which is the correct argument, is that it’s virtually impossible to make any meaningful statements about the semi-randomised behaviour of an unknown patient zero in a large city that would inevitably lead to that particular meat market being infected beyond any other location, following a series of increasingly unlikely events occurring to a single individual and/or individuals.

        The evidence indicates that said meat market was the source of the pandemic.

        The judges correctly identified that, in a lab leak scenario, the likely initial clusters would have been distributed as they normally are around urban centres – where people are. The meat market is one such centre, but no more likely than any other of a similar degree of cleanliness or concentration of people.

        The argument our friend here is making relies on it being possible to prove that it was more likely for said meat market to become infected than the average person in the city. This is not borne out by much more complete statistical evidence from later infections.

        Since you can’t prove that a lab leak would actually be more likely to cause infections in that one specific location as opposed to other, randomised locations, our mutual friends cannot deny that the fact it very clearly emerged from that wet market is itself evidence that it is zoonotic in origin.

        That does not discount the lab leak hypothesis, by any means. It is possible that the series of events I described above occurred. However, it is much more likely that it is zoonotic.

        • We answered your feedback on the market in your other comment.
          Note that the term ‘prove’ is irrelevant when dealing with uncertainty. We need to estimate the probability of a WIV leak forming an early cluster at HSM, and that requires examining all the factors involved, as we describe in our post.

      • “You seem to misrepresent our stance. We fully recognize the importance of facts and evidence. It is just that, in the real world, having a robust inference method is far more important.”

        It’s not misrepresentation, I’m calling you out. You claim to recognize the importance of facts and evidence, but over the course of the multiple hours this claim is exposed as wrong. When every incorrect fact that is pointed out is met with “well this doesn’t matter because of the big picture and the awesomeness of our inference method”, the acknowledgement of the importance of evidence is just lip service. You’re doing it again here right in this statement, calling the importance of the evidence secondary to being able to multiply probabilities together.

        “A good inference method that reduces bias, and accounts for the high uncertainty in the interpretation of evidence will outperform other methods even if those use more evidence.”

        Here it’s not about more or less evidence, but bad evidence. Evidence that is actually wrong. Your entire methodology fails here due to the most fundamental principle of any inference method: garbage in, garbage out.

      • One more thing:

        “Also worth noting that the zoonosis side has made at least as many factual mistakes during the debate…”

        Here you are admitting to many factual mistakes pointed out in the debate, and yet…

        “When correcting these two mistakes [that hurt lab leak] in the judges’ calculations, both change to lab leak being the more likely hypothesis.”

        …you want to only update two probabilities which would help you and leave the rest as is. And you claim to “reduce” bias…
