Continuing on from #1993 and #2020, covid-19 viral activation and invasion of lung pneumocytes has unusual and undesirable features that reflect rapid recent evolution of its genome.
The research action centers on the spike protein because it seems to have acquired aggressive new properties from a specific upstream 12-base insertion (creating a 4 amino acid furin-like cleavage site motif). This insertion greatly facilitates adhesion to the ACE2 receptor, which in turn facilitates fusion (mediated by a downstream spike domain) with the host cytoplasmic membrane, the entry point of viral RNA into the cell interior where it reproduces.
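To make the "4 amino acid insertion creates a cleavage motif" idea concrete, here is a minimal sketch that scans a protein fragment for the minimal furin-like pattern R-X-X-R (arginine, any two residues, arginine). The two fragments are toy strings made up for illustration, not taken from any database record, and the function name is hypothetical. Real furin motifs are longer than this minimal pattern, so a hit here is only a hint.

```python
import re

# Minimal furin-like pattern R-X-X-R; the lookahead lets overlapping hits be reported.
MINIMAL_FURIN = re.compile(r"(?=(R..R))")

def find_furin_like_sites(protein):
    """Return (0-based start, matched residues) for every R-X-X-R hit."""
    return [(m.start(), m.group(1)) for m in MINIMAL_FURIN.finditer(protein)]

# Toy fragments (assumptions for illustration only).
# The second is the first with the four residues "PRRA" removed.
with_insert = "TNSPRRARSVA"
without_insert = "TNSRSVA"

print(find_furin_like_sites(with_insert))
print(find_furin_like_sites(without_insert))
```

Twelve bases are four codons, so a 12-base insertion adds exactly four residues without shifting the reading frame; in the toy case those four residues are what complete the motif.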
There are 182 complete covid-19 genomes as of today being studied with both wet lab and dry lab (bioinformatic) approaches. NextStrain collects all these and presents them as a branching phylogenetic tree that grows every day and sometimes gets rearranged.
This tree clusters closely related covid-19 genomes the same way that your desktop organizes related files into a nested folder hierarchy, but using advanced statistical methods such as maximum likelihood models that have been under intense algorithmic development for half a century. However, these trees can be made under many different assumptions and parameter sets. A tree that aligns amino acids (rather than nucleotides), e.g. those from the upstream half of the spike protein, might give a rather different topology from a whole genome nucleotide tree.
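One reason an amino acid tree can differ from a nucleotide tree: synonymous mutations count as differences at the nucleotide level but vanish after translation. A minimal sketch, using the standard genetic code and three made-up 12-base sequences (the genome names and sequences are hypothetical, chosen only to show the effect):

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate a DNA-alphabet coding sequence, ignoring any trailing partial codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))

def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

# Toy sequences (assumptions for illustration only).
genome_x = "CTTAGTCCTCGG"  # reference
genome_y = "CTCAGTCCTCGG"  # CTT -> CTC, still leucine (synonymous)
genome_z = "CTTAGTTCTCGG"  # CCT -> TCT, proline -> serine (non-synonymous)

for name, seq in [("y", genome_y), ("z", genome_z)]:
    nt_diff = hamming(genome_x, seq)
    aa_diff = hamming(translate(genome_x), translate(seq))
    print(name, "nucleotide differences:", nt_diff, "amino acid differences:", aa_diff)
```

Both y and z sit one step from x in nucleotide space, but only z is distinguishable from x in amino acid space, so the two kinds of alignment feed different distances into the tree builder.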
On the data side, the 182 genomes are mostly not the ones we want: the early ones. Many are just chains of descendants: A in Wuhan gave it to B in Milan and C in Vatican City, B gave it to D in Austria and E in Spain, C gave it to F, G and H in Dubai with 0-2 mutations at each step along the way. The real information lies in more covid-19 genomes from Wuhan but not descended from A.
Following these transmission chains is useful early on in a pandemic for the tracebacks and self-quarantining that buy some (mostly squandered) preparedness time but as Sam documents above, that train left the station a month ago.
Molecular biologists want the genomes from the very earliest stages of viral spread in late Nov 2019 for five principal reasons:
-1- to work out the ancestral genome that first crossed the species barrier.
-2- to determine the carrier species because it may harbor many other coronavirus strains.
-3- to determine what adaptive changes took place that caused covid-19 to spread so virulently.
-4- to better understand mutational processes in covid-19 and what future properties may evolve.
-5- to resolve whether mutational gain/loss of nucleotides represents an insertion or deletion.
However the epicenter of spread, which is not necessarily the epicenter of origin, has been bulldozed to the ground, its entire stock of wildlife incinerated and its infected denizens cremated without any genetic sampling. Under the circumstances, the focus was eradication; public health mandarins would hardly be bowing to requests for viral agent preservation from scientists.
Prior to the outbreak, Wuhan had two institutes (not one) collecting coronavirus genomes from wild bat populations and requesting isolates from other virology labs around the world, for example the Manitoba, Canada BSL-4 facility.
Assembling such a resource makes research sense in a country like China with strong science and a costly history of viral outbreaks in both livestock and humans. For its part, the US maintained a massive collection of anthrax strains until the FBI autoclaved the entire set after a rogue worker mailed a weaponized one around.
In summary, only a few of the 182 genomes originated early on in Wuhan but because of privacy considerations neither preprints, GenBank annotations nor GISAID metadata make clear if any of the people were affiliated with the two coronavirus laboratories. There is very little specific clinical information about the eight original ICU patients that triggered the ophthalmologist's alert. We don't know if any of the covid-19 genomes represents the transmitting patient with acute angle glaucoma.
Regardless, the genomes at NextStrain fall into two clades (strains) that split early on and never later hybridized (through RNA recombination). These were noticed back in February and denoted L and S clades (for distinguishing mutations that affected leucine and serine codons). The topology of that branch of the tree has been stable ever since.
The original authors were careful to say of the two strains, the L type “MIGHT be more aggressive and spread more quickly”. However nobody since has honored that cautionary statement. Because of transmission chains, subsequent internal mutational divergences in both clades, and lack of healthy human volunteers, this idea is very difficult to pursue. Note that every node on the tree defines, through its descendants, its own clade or strain.
The NextStrain tree is unrooted, meaning that deep ancestry is not indicated by outgroups (closely related corona and other viruses). This is so bizarre that other researchers immediately added a variety of outgroups and recomputed the tree to see which of L and S is closer in genomic sequence to the first covid-19 to escape its initial animal host. The 'more ancestral' sequence is said to be the smaller clade, S. That needs to be revisited now that the data set is so much larger.
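The logic of using outgroups can be sketched in a few lines: at each site where the two clades differ, the state shared with the outgroups is taken as ancestral. This is only a toy version of character polarization, not how the published trees were rooted. The sequences below are invented, the function name is hypothetical, and the outcome is built into the toy data rather than being evidence about the real L and S clades.

```python
from collections import Counter

def ancestral_votes(clade_consensus, outgroups):
    """For each site where the clades differ, credit the clade that shares
    the majority outgroup state. All sequences must be pre-aligned."""
    votes = Counter()
    length = len(next(iter(clade_consensus.values())))
    for i in range(length):
        states = {name: seq[i] for name, seq in clade_consensus.items()}
        if len(set(states.values())) < 2:
            continue  # site does not distinguish the clades
        majority_state, _ = Counter(o[i] for o in outgroups).most_common(1)[0]
        for name, state in states.items():
            if state == majority_state:
                votes[name] += 1
    return votes

# Toy aligned consensus sequences (assumptions for illustration only).
clade_consensus = {"L": "ATGTCA", "S": "ACGCCA"}
outgroups = ["ACGCCA", "ACGCCG", "ACGCTA"]

print(ancestral_votes(clade_consensus, outgroups))
```

With only a couple of distinguishing sites, a single back-mutation or a poorly chosen outgroup can flip the answer, which is one reason the question deserves a fresh look with the larger data set.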
The phylogenetic tree unambiguously resolves the upstream spike protein mutation as an insertion. This was correctly inferred in the ‘uncanny’ preprint where it is called the 4th ‘HIV’ region. That’s not entirely off the mark but it’s better called the putative gain-of-function furin-like cleavage site resulting from the new four basic amino acid motif.
This preprint was withdrawn by author request; it was not retracted (shame on you FAS) and could conceivably resurface after massive revisions. It never mentions weaponization. The pdf is still offered at biorxiv; there’s a good discussion of its myriad problems too by others in the field:
https://www.biorxiv.org/content/10.1101/2020.01.30.927871v1.full.pdf
https://www.tandfonline.com/doi/full/10.1080/22221751.2020.1727299
To date, there’s still no good explanation for how the furin-friendly insertion arose in the spike protein. Some of the better spike protein analysis is provided in the links and images below.
https://tinyurl.com/th8zpq3 21 Jan 2020 discovery of furin site (in Chinese)
https://www.biorxiv.org/content/10.1101/2020.02.19.956581v1 images and structural analysis AC Walls et al
https://www.sciencedirect.com/science/article/pii/S0166354220300528?via%3Dihub French paper on furin site
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281273/ real furin motifs are longer
http://virological.org/t/evolutionary-epidemiological-analysis-of-93-genomes/405 GISAID metadata for 93 genomes
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3324781/ RNA recombination
https://tinyurl.com/r9fm3zw remdesivir
https://tinyurl.com/uopplv2 L and S clades
https://www.sciencedirect.com/science/article/pii/S193131282030072X?via%3Dihub early paper