<![CDATA[A Curious Biochemist - Blog]]>Thu, 16 May 2024 06:12:14 -0700Weebly<![CDATA[Tracking the Number of Samples Containing the SARS-CoV-2 B117 Variant - An Update]]>Fri, 22 Jan 2021 17:23:31 GMThttp://acuriousbiochemist.com/blog/tracking-the-number-of-samples-containing-the-sars-cov-2-b117-variant-an-updateRecent news reports indicate that the B117 variant will become the dominant variant in the US by the middle of March 2021.  I have continued my analysis of the samples in the NCBI database.  The data is presented below:
We can see that the number of samples carrying the B117 variant is starting to increase rapidly (graph on the left).  A better measure of the spread of the B117 variant may be the percentage of new samples deposited to the database since 11/17/2020 that carry the variant (graph on right).  As of 1/21/2021 the B117 variant represents 1.2% of the samples deposited in the NCBI database since 12/20/2020.
]]>
<![CDATA[Tracking the Number of Samples Containing the SARS-CoV-2 B117 Variant]]>Sat, 09 Jan 2021 21:52:31 GMThttp://acuriousbiochemist.com/blog/tracking-the-number-of-sars-cov-2-b117-strainsOver the past several weeks, news reports have described the appearance of two variants of SARS-CoV-2 that are thought to be more transmissible than the variants currently circulating in the community.  One of these variants (known as 20B/501Y.V1, VOC 202012/01, or B.1.1.7) was first identified in the United Kingdom in September of 2020.  SARS-CoV-2-B117 has multiple mutations, including a deletion of amino acids 69 and 70 and the substitution of Y for N at position 501 (N501Y) in the spike protein.  The second variant was identified in South Africa in October of 2020 and is known as 20C/501Y.V2 or B.1.351.  SARS-CoV-2-B1351 also contains multiple mutations in the spike protein, including the N501Y substitution, but unlike SARS-CoV-2-B117 it does not contain the 69/70 deletion.

I wanted to extend the analysis of SARS-CoV-2 spike protein mutations we started in the CHEM 440 class to identify SARS-CoV-2-B117 variants in the NCBI database.  During our first pass through the spike protein sequences in the NCBI database we limited our analysis to sequences that were the same length as the reference spike protein sequence.  This made it easy to identify amino acid mutations, because each residue could be compared directly with the reference residue at the same position.  Since SARS-CoV-2-B117 has two amino acid deletions in the spike protein, and is therefore shorter than the reference protein sequence, the Python code we had written could not find its mutations.  To analyze sequences that are shorter than the reference sequence, I updated the code to carry out pairwise alignments of each database entry with the reference sequence using the Biopython pairwise2 module.  Each alignment was checked to see if the query sequence (i.e. the spike protein of the database entry) contained a deletion at positions 69 and 70 and an N to Y substitution at position 501.  Database entries with these characteristics were identified as samples that contained the SARS-CoV-2-B117 variant.
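
The alignment check described above can be sketched roughly as follows. This is not the script used for the analysis: to keep the example self-contained it swaps in a small hand-written Needleman-Wunsch global alignment in place of the Biopython pairwise2 module, and the function names (global_align, residues_by_ref_position, has_signature) and the scoring values are assumptions made for illustration. The toy sequences are made up, with the deletion and substitution placed at positions 3, 4, and 11 instead of 69, 70, and 501.

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Stand-in for a library aligner; the scoring values are illustrative.
    Returns the two aligned strings, with '-' marking gaps.
    """
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = None
        if i > 0 and j > 0:
            pair = match if ref[i - 1] == query[j - 1] else mismatch
        if pair is not None and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_ref.append(ref[i - 1])
            aligned_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])   # residue deleted in the query
            aligned_query.append("-")
            i -= 1
        else:
            aligned_ref.append("-")          # residue inserted in the query
            aligned_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_query))


def residues_by_ref_position(aligned_ref, aligned_query):
    """Map each 1-based reference position to the aligned query character."""
    residues = {}
    position = 0
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char == "-":
            continue  # insertion in the query; no reference position consumed
        position += 1
        residues[position] = query_char
    return residues


def has_signature(ref, query, deleted_positions=(69, 70),
                  substitution=(501, "N", "Y")):
    """True if the query shows the given deletions and substitution."""
    aligned_ref, aligned_query = global_align(ref, query)
    residues = residues_by_ref_position(aligned_ref, aligned_query)
    position, ref_aa, new_aa = substitution
    has_deletions = all(residues.get(p) == "-" for p in deleted_positions)
    has_substitution = (len(ref) >= position
                        and ref[position - 1] == ref_aa
                        and residues.get(position) == new_aa)
    return has_deletions and has_substitution


# Toy example: made-up 12-residue "reference" and a query lacking residues 3-4.
toy_ref = "ACDEFGHIKLNM"
toy_query = "ACFGHIKLYM"
print(global_align(toy_ref, toy_query))
print(has_signature(toy_ref, toy_query,
                    deleted_positions=(3, 4), substitution=(11, "N", "Y")))
```

A pure-Python alignment like this is slow for tens of thousands of full-length spike sequences, so a library aligner is the more practical choice for the real data. Gap placement can also be ambiguous when neighboring residues repeat, so the reported deletion positions should be checked against the alignment itself.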

Analysis of the SARS-CoV-2 spike protein sequences in the NCBI database resulted in the following findings:
Neither the NCBI database downloaded on 11/17/2020 nor the one downloaded on 12/20/2020 contained any entries that were identified as carrying the SARS-CoV-2-B117 variant.  Two samples were identified as carrying the SARS-CoV-2-B117 variant on 1/03/2021.  Of these two samples, one was collected in Colorado on 12/24/2020 and deposited in the NCBI database on 12/30/2020, while the other was collected in San Diego, CA on 12/29/2020 and deposited in the NCBI database on 12/30/2020.  By 1/6/2021 an additional five samples containing the SARS-CoV-2-B117 variant were deposited into the database.  Of these samples, one was collected in Florida, three were collected in California, and one was collected in Saratoga County, New York.  All five samples were collected between 12/19/2020 and 12/24/2020 and deposited into the NCBI database on 1/1/2021.  Two additional samples containing the B117 variant, both collected in Italy, were deposited into the database after 1/6/2021.

Given the nature of sample collection and sequencing it is highly unlikely that the number of samples identified here is any indication of the community prevalence of the B117 variant.  As public health officials increase testing and sequencing efforts the number of samples containing the B117 variant is expected to increase rapidly.  I will keep analyzing the NCBI database every three to four days to see if we can capture this increase.

Interestingly, in an article published in April of 2020 (Wan Y, Shang J, Graham R, Baric RS, Li F. 2020. Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus. J Virol 94:e00127-20. https://doi.org/10.1128/JVI.00127-20.) Wan et al. suggested that mutations at residue 501 of the spike protein could "significantly enhance the binding affinity between 2019-nCoV RBD and human ACE2. Thus, 2019-nCoV evolution in patients should be closely monitored for the emergence of novel mutations at the 501 position (to a lesser extent, also the 494 position)".  Early in the pandemic SARS-CoV-2 was known as 2019-nCoV.
]]>
<![CDATA[Using Circos to Visualize Mutational Data of the SARS-CoV-2 Spike Protein]]>Mon, 14 Dec 2020 19:21:01 GMThttp://acuriousbiochemist.com/blog/using-circos-to-visualize-mutational-data-of-the-sars-cov-2-spike-proteinI have been meaning to make the first entry in this blog for a long time but never quite got around to it.  I am finally taking the plunge with an entry involving the creation of a Circos plot to display mutations that have been identified in the spike protein of the SARS-CoV-2 virus.  
As part of my fall 2020 course in 'protein structure and function for the life sciences' (CHEM 440) at CSUSM we looked at the mutations (amino acid changes) found in the spike protein of the SARS-CoV-2 virus.  From the NCBI virus database (https://www.ncbi.nlm.nih.gov/sars-cov-2/) we obtained the amino acid sequences for the SARS-CoV-2 spike protein from approximately 31,000 patient samples isolated from around the globe.  These protein sequences were compared to the reference sequence of the spike protein (YP_009724390), from the virus isolated in Wuhan, China in December of 2019, to identify amino acids that have changed in the spike protein since the original identification of the virus.  We did the analysis using a Python script we developed as part of the class discussion.
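
For readers who want a feel for this kind of comparison, here is a minimal sketch. It is not the class script: the function names (read_fasta, find_substitutions, count_substitutions) and the toy sequences are assumptions made for illustration. It reads FASTA records, skips any sequence whose length differs from the reference, and tallies substitutions position by position.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_substitutions(reference, sequence):
    """List substitutions (e.g. 'N4Y') for a sequence of reference length."""
    if len(sequence) != len(reference):
        raise ValueError("sequence and reference must be the same length")
    return [f"{ref_aa}{position}{new_aa}"
            for position, (ref_aa, new_aa)
            in enumerate(zip(reference, sequence), start=1)
            if ref_aa != new_aa]


def count_substitutions(reference, records):
    """Count how many same-length sequences carry each substitution."""
    counts = Counter()
    for _header, sequence in records:
        if len(sequence) != len(reference):
            continue  # a direct comparison only works for equal lengths
        counts.update(find_substitutions(reference, sequence))
    return counts


# Toy data standing in for a downloaded FASTA file (sequences are made up).
toy_fasta = io.StringIO(
    ">reference\nMKTN\nLV\n"
    ">sample_1\nMKTYLV\n"
    ">sample_2\nMKTLV\n"
    ">sample_3\nMRTYLV\n"
)
records = list(read_fasta(toy_fasta))
reference = records[0][1]
for mutation, n_samples in count_substitutions(reference, records[1:]).items():
    print(mutation, n_samples)
```

The resulting counter covers both kinds of question we ended up asking of the data: how many distinct mutations occur at a position, and in how many samples a given mutation appears.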
Once the sequences were analyzed we discussed how best to display the data.  There are several different ways to look at the data.  For example, how many sequence positions in the protein have changed from the original strain?  How many positions have never changed?  Is a particular mutation found in a significant number of samples?  We started to generate plots to answer these questions.  A student mentioned that we may be able to use Circos to display all of the data in a visually appealing way.
Circos (http://circos.ca) is a tool for visualizing data in a circular layout.  Circos was originally developed to visualize genomic data but has been used to visualize data in a number of other fields.  Using the data we obtained from our sequence analysis we developed a visualization to display the domains of the protein, the sites of glycosylation, the sites involved in receptor binding, the number of unique mutations at a given position, and the number of samples in which a given mutation is found.  The Circos website offers an extensive collection of tutorials with code examples that helped me develop the visualization shown below.


Picture
Circos plot of information for the SARS-CoV-2 spike protein. Moving from the outer ring to the innermost ring: (1) the domains of the SARS-CoV-2 spike protein (NTD - N-terminal domain, ID - inter-domain, RBD - receptor binding domain, FP - fusion peptide, HR - heptad repeat, TM - transmembrane, and CTR - C-terminal), (2) triangles - sites of glycosylation, (3) circles - receptor binding residues, (4) histogram plot of the number of unique mutations observed for each position of the spike protein, (5) scatter plot of the sequence positions with mutations that appear in 500 - 31,000 samples, (6) scatter plot of the sequence positions with mutations that appear in 100 - 500 samples, and (7) scatter plot of the sequence positions with mutations that appear in 0 - 100 samples.
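
Circos reads its input from plain-text files: a karyotype file that defines the axis, and data files that hold one value per line for tracks such as the histogram and scatter plots. The sketch below shows one way such lines could be generated from per-position counts. The function names, the "spike" identifier, the numbers, and the choice to represent residue p as the interval p-1 to p are all assumptions made for illustration, not the files used for the plot above.

```python
def circos_karyotype_line(chrom_id, label, length, color="grey"):
    """Karyotype entry describing a single axis (here, one protein)."""
    return f"chr - {chrom_id} {label} 0 {length} {color}"


def circos_track_lines(chrom_id, values_by_position):
    """Data lines ('id start end value') for a histogram or scatter track.

    Residue p is written as the interval p-1..p; this convention is an
    assumption of the sketch.
    """
    return [f"{chrom_id} {position - 1} {position} {value}"
            for position, value in sorted(values_by_position.items())]


# Made-up example: counts of unique mutations at three positions.
unique_mutations = {501: 3, 69: 1, 70: 2}
print(circos_karyotype_line("spike", "S", 600))
for line in circos_track_lines("spike", unique_mutations):
    print(line)
```

Each list of lines would be written to its own text file and referenced from a plot block in the Circos configuration, as described in the Circos tutorials.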
The discussion that led to this analysis was not a main focus of the CHEM 440 class.  As part of the core content of the class, students were introduced to the main concepts of protein structure, techniques used to investigate protein structure, bioinformatics tools used to retrieve, analyze and visualize protein sequences (PIR database, pairwise and multiple sequence analysis, JalView, PyMol), and concepts of protein function.  Throughout the semester we discussed how the concepts introduced in class can be applied to investigate proteins found in the SARS-CoV-2 virus.  Based on these discussions I started a voluntary project to analyze SARS-CoV-2 spike protein sequences to identify mutations that have occurred since the virus was first identified. The focus of this mini-project was writing a short Python script to read a FASTA file, compare each spike protein sequence to the reference sequence, and identify the amino acid positions that have mutated.  I made this project voluntary because not all students in the class were interested in learning how to code using Python.  Once the Python script was developed and we had analyzed the spike protein sequences, we discussed the major findings and how best to visualize the data.  A couple of students presented the data as a poster at the campus Student Research Poster Showcase.  These students are interested in continuing the analysis of this data outside of the classroom, and the project has now become part of my online bioinformatics research efforts.
Overall, this mini-project was a nice way to introduce students to Python and to make it relevant by looking at a current topic.
]]>