Skip to main content

Modelling the association between COVID-19 transmissibility and D614G substitution in SARS-CoV-2 spike protein: using the surveillance data in California as an example



The COVID-19 pandemic poses a serious threat to global health, and pathogenic mutations are a major challenge to disease control. We developed a statistical framework to explore the association between molecular-level mutation activity of SARS-CoV-2 and population-level disease transmissibility of COVID-19.


We estimated the instantaneous transmissibility of COVID-19 by using the time-varying reproduction number (Rt). The mutation activity in SARS-CoV-2 is quantified empirically depending on (i) the prevalence of emerged amino acid substitutions and (ii) the frequency of these substitutions in the whole sequence. Using the likelihood-based approach, a statistical framework is developed to examine the association between mutation activity and Rt. We adopted the COVID-19 surveillance data in California as an example for demonstration.


We found a significant positive association between population-level COVID-19 transmissibility and the D614G substitution on the SARS-CoV-2 spike protein. We estimate that a per 0.01 increase in the prevalence of glycine (G) on codon 614 is positively associated with a 0.49% (95% CI: 0.39 to 0.59) increase in Rt, which explains 61% of the Rt variation after accounting for the control measures. We remark that the modeling framework can be extended to study other infectious pathogens.


Our findings show a link between the molecular-level mutation activity of SARS-CoV-2 and population-level transmission of COVID-19 to provide further evidence for a positive association between the D614G substitution and Rt. Future studies exploring the mechanism between SARS-CoV-2 mutations and COVID-19 infectivity are warranted.


Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first reported in 2019 [1,2,3,4,5]. The COVID-19 pandemic poses a serious threat to global health and has spread to over 200 countries globally in a short period of time [6, 7]. In response to the ongoing COVID-19 pandemic, the World Health Organization (WHO) declared a public health emergency of international concern on January 30, 2020 [8]. As of September 6, 2020, over 27 million COVID-19 cases have been confirmed worldwide, with over 0.8 million deaths associated with COVID-19 [9].

The dynamics of the transmission of an infectious disease are largely determined by the pathogen’s infectiousness and the course of the transmission [10,11,12]. As a contagious disease with high transmissibility, the control of COVID-19 requires knowledge of the driving factors that may affect disease transmission [13,14,15,16]. Pathogenic mutations in SARS-CoV-2 are a major challenge for controlling COVID-19 [17, 18]. Early in February 2020, genetic variants with the D614G substitution on the SARS-CoV-2 spike (S) protein began to spread first in Europe [19] and globally and were suspected to potentially affect viral transmission [20]. Here, ‘D614G’ denotes the amino acid substitution that changes aspartic acid (D) to glycine (G) on codon 614 of the S protein of SARS-CoV-2. However, the evident relationship between the molecular-level mutation activity of SARS-CoV-2 and the population-level transmissibility of COVID-19 remains unrevealed.

It is biologically reasonable that mutations in viral genomes may alter the pathogenic profile in terms of viral fitness and functionality [21, 22] and consequently change its transmissibility. Previous literature about seasonal influenza epidemics [23] suggested that a few key amino acid substitutions may lead to remarkable changing dynamics of epidemiological outcomes at the population scale. In this study, we adopted a statistical framework to explore and examine the association between COVID-19 transmissibility and key mutation activities in the S protein of SARS-CoV-2.

Data and methods

SARS-CoV-2 sequencing data and COVID-19 surveillance data

The full-length human SARS-CoV-2 strains in California were collected via the Global Initiative on Sharing All Influenza Data (GISAID) [24] on May 24, 2020. A total of 524 strains were searched with collection dates ranging from January 22, 2020, to May 8, 2020. Table 1 summarizes the total number of strains in GISAID and the sample size included in this study for different periods. Since the number of sample stains varied by period, we set 9 successive periods and downloaded a stable number of strains for each period. In the period when more than 30 strains were available, we randomly sampled 30 strains. This sampling scheme is purposely designed to balance the weights due to different sample sizes that may affect the sliding window framework applied in quantifying the mutation activity (details in the next section). Sequences of all SARS-CoV-2 strains acquired are provided in the Additional file 1.

Table 1 Number of human SARS-CoV-2 strains in GISAID and the sample sizes included in this study for different periods

Multiple sequence alignment was performed using Clustal Omega (accessed via, and the SARS-CoV-2 strain ‘China/Wuhan-Hu-1/2019|EPI_ISL_402125’ was considered as the reference sequence. The surveillance data of the daily number of COVID-19 cases in California were collected from the R package “nCov2019” [25] and The New York Times, accessed via and, respectively. Figure 1a shows the daily number of COVID-19 cases in California in a time series.

Fig. 1

The number of COVID-19 cases (panel A), prevalence of the amino acids (AA) on the 614-th codon of the S protein (panel B), and time-varying reproduction number (Rt, panel C) in California from March to April 2020. Panel A shows the daily number of COVID-19 time series. Panel B shows the prevalence of the AAs, including Aspartic Acid (D, in cyan) and Glycine (G, in purple), on the 614-th codon of the S protein. Panel C shows the Rt estimated (black) from the number of cases data using renewable equation and fitted (red) by using the mutation activity on the 614-th codon of the S protein in panel B. In each panel, the vertical dashed blue line represents the date, March 19, when the ‘stay-at-home’ order was officially implemented in California

Instantaneous reproduction number and study period

We adopted the time-varying reproduction number (Rt) to quantify the instantaneous COVID-19 transmissibility in California. Using the framework in [26], we estimated the time-varying reproduction number (Rt) to quantify the instantaneous transmissibility of COVID-19 in California. Following the estimation framework developed in previous studies [26, 27], the epidemic growth of COVID-19 was modeled as a branching process, and thus, Rt can be expressed by using the renewable equation as follows:

$$ R(t)=\frac{C(t)}{\int_0^{\infty }w(k)C\left(t-k\right)\mathrm{d}k}, $$

where C(t) is the number of new COVID-19 cases reported at the t-th date. The function w(∙) is the distribution of the generation time (GT) of COVID-19. By averaging the GT estimates from the existing literature [28,29,30,31,32,33,34,35], we considered w as a Gamma distribution with a mean (±SD) value of 5.3 (±2.1) days. Slight variations in the settings of the GT did not affect our main findings.

For the selection of the study period, we considered both the quality of datasets and the increasing intensity (or effects) of local control measures. The selected study period for the COVID-19 surveillance data in California was from March 1, 2020, to April 30, 2020. During this study period, local COVID-19 surveillance was already following the governmental protocol, and the composition of disease control measures was relatively simple and adjustable in further multivariate analyses. In particular, an official ‘stay-at-home’ order was issued on (t0 =) March 19, 2020, in California (see, which may affect the patterns of Rt. Hence, we accounted for the effect of this local control measure in further multivariate analyses.

Our analyses depended on both (i) the quality of the data and (ii) the effects of the covariates, especially public health control measures that may decrease Rt. Thus, one of the other reasons, which limited us to consider time outside the study period from March 1, 2020, to April 30, 2020, is related to the prevalence of mutation activities in SARS-CoV-2. During this study period, D614G appears to be the only major amino acid (AA) substitution in the S protein. Thus, complex interactive effects of multiple mutations on infectivity are less likely. As such, our analysis is simplified and is restricted in examining the effect of a single AA substitution.

Quantifying the time-varying molecular-level mutation activity

In previous studies [36,37,38], a statistical framework was proposed to quantify genetic mutation activities associated with population-level outbreak situations by a metric, namely, the g-measure, on a real-time basis. The g-measure is an empirical time-varying metric calculated from the sequencing data of the pathogen and is determined by a predefined dominance prevalence threshold, θ, ranging from 0 to 1. The θ is the mutation prevalence threshold above which a molecular-level mutation (or substitution) is considered to affect the changing dynamics of the outbreak situation at the population level. The g-measure quantifies the level of key substitutions on a real-time basis, which allows one to explore its linkage to other time-varying variables [39].

We calculated the daily prevalence of amino acids (AA) on each codon in the S protein of SARS-CoV-2. We use pij(t) to denote the prevalence of the i-th type of amino acid (AA) on the j-th codon of the S protein at time (or date) t, for i = 1, 2, …, 20, j = 1, 2, …, 1273, and t ranging from January 22 to May 8, 2020. Then, for each AA (20 in total) on each codon (1273 in total) of the S protein, we empirically calculated the prevalence time series. A sliding window was applied to the whole study period, from January 22, 2020, to May 8, 2020, to address the problem of the insufficient daily sample size. Let W denote the window size that represents a constant period (e.g., one week or one month). Hence, for pij(t) on date t, we accounted for the proportion of the i-th AA out of all 20 types of AAs on the j-th codon within the time period of t ± W/2. In this study, we set W at 7 days for convenience, and we concluded that a variation in W did not affect our main results.

This sliding window scheme requires that the daily sample sizes of sequencing data are close in scale [38]. This guarantees that the prevalence series can reveal the real-world changing patterns of the mutation activity rather than bias towards a particular period with a large number of sequencing samples. Otherwise, as a simple example, during the periods before or after date t, i.e., from t − W/2 to t and from t to t + W/2, the prevalence may approach one period with a larger sample size.

Following the calculations from previous studies [36, 37, 39], the g-measure counts the segments of the prevalence series that start from 0 and increase and eventually hits the level of θ. Prevalence series that never excess θ are excluded from the g-measure calculation. For other prevalence series that excess θ at some time point, only those parts start from 0 and increase and hit the level of θ are included in the g-measure calculation. An illustration diagram of the g-measure calculation is presented in Fig. 2. Technically, the algorithm in Table 2 is used to find the indicator function, I(t), to identify the segments of the prevalence series for the g-measure calculation. Therefore, given θ, the g-measure on date t, denoted by g-measuret(θ), can be calculated as follows:

$$ {\mathrm{gmeasure}}_t\left(\theta \right)=\sum \limits_j\sum \limits_i{p}_{ijt}\bullet {I}_{ijt}\left(\theta \right) $$
Fig. 2

Illustration diagram of the analytical procedure of g-measure calculation. The prevalence time series of 4 different AA substitutions are denoted by p1(t), p2(t), p3(t) and p4(t), and indicated in red, green, orange and blue, respectively. Three scenarios of dominant prevalence threshold parameter, θ, are demonstrated with θ = 0.5, 0.7, and 0.9, respectively, which is indicated by the horizontal dashed line in each panel. The g-measure counts the segments of the prevalence series that start from 0 and increase over θ. In each panel, the prevalence series that accounts for g-measure is not shaded in grey region. In other words, the shaded regions are those part of prevalence series excluded from the g-measure calculation. For those prevalence series that never excess θ, they are excluded from g-measure calculation as labeled by ‘excluded’; otherwise, ‘included’ label is indicated

Table 2 Algorithm of g-measure indicator function, I(t)

The g-measure quantifies genetic mutation activities and is used to explore the association with Rt. The parameter θ is estimated with the likelihood framework that will be introduced in the remaining parts.

Figure 3 shows the g-measure time series of the S protein with different values of θ. Note that only the g-measure time series from March 1, 2020, to April 30, 2020, were used in further regression analyses.

Fig. 3

The g-measure time series of the SARS-CoV-2 spike (S) protein with different values of dominance threshold (θ) using Eqn (1)

Regression model and estimation of dominance prevalence threshold

We intended to explore the association between Rt and the mutation activity (measured by the g-measure) on the S protein. A multivariable regression model was fitted to examine the association between Rt and the g-measure considering the effect of local control measures in California.

Since Rt may be affected by disease control measures, we included a dummy variable with a discontinuity design to govern the effect of local control measures. In particular, the official ‘stay-at-home’ order was issued in California on March 19, 2020 (see Hence, in the generalized linear regression model with discontinuity design, we set the structural break in the trends of Rt on March 19, 2020, which was denoted as t0. In previous studies, Rt is commonly modeled as a Gamma process [26, 40, 41], and thus, the regression is formulated in Eqn (2).

$$ \mathbf{E}\left[\ln \left({R}_t\right)\right]=c+a\;{\mathrm{gmeasure}}_t+b\;\mathbf{I}\left(t>{t}_0\right)\;\left(t-{t}_0\right) $$

Here, E[∙] is the function of the expectation. I(∙) is an indicator function that uses the binary variable (0 or 1);if variable t is larger than the threshold value t0, then 1; otherwise, it is 0. c is the constant parameter, and a and b are the slope parameters. Again, we fixed the term t0 to be March 19, 2020. The percentage change rate (η) of Rt associated with a 0.01 increase in the g-measure can be calculated directly from the slope parameter a. Thus, the term η is the effect size to be estimated of mutation activity on COVID-19 transmission, and we have η = [exp(a × 0.01) – 1] × 100%.

Following previous studies [40, 41], we considered Rt to follow a Gamma process with both means Rt and SDs vt determined by the renewable equation. For a given time t, the Gamma distribution is denoted by h(∙|Rt, vt), and we model exp. [c + a∙gmeasuret + bI(t > t0)∙(t − t0)], which is the exponential of the right-hand side of Eqn (2), following the distribution h(∙|Rt, vt). Thus, h(∙|Rt, vt) is a function of parameters a and θ in Eqns (1) and (2), respectively, i.e., h(a, θ | Rt, vt). In other words, both Rt and vt were reconstructed directly from the number of cases in a time series (i.e., the raw data) and then served as the known parameters in the likelihood function L, which is given as follows:

$$ L\left(a,\theta \right)={\prod}_th\left(a,\theta |{R}_t,{v}_t\right). $$

The dominance prevalence threshold parameters θ and a, and equivalently η, can be estimated based on this likelihood framework and the regression model. Then, we calculated the maximum likelihood estimation (MLE) of θ to determine the g-measure for regression analysis. Using the likelihood framework, we estimated the MLE of the dominance prevalence threshold parameter θ, which was adopted to determine the g-measure and to examine the association with Rt. The 95% CIs of the regression parameters were estimated by their point estimates plus or minus Student’s t distributed quantile multiplied by their standard errors. Since η and a are one-to-one mappings, the 95% CI of η can also be directly calculated from the 95% CI of a.

We employed Efron’s pseudo R-squared and likelihood-based partial R-squared to evaluate the goodness-of-fit of the regression model. A likelihood-ratio (LR) test on the scenarios with (as the full model) and without (as the baseline model) the g-measure was used to examine the reasonability of the model structure.

Sensitivity analysis

Sensitivity analysis was carried out on the robustness and significance of the association between Rt and mutation activity. We conducted a sensitivity check on the effect size of mutation activity on the S protein in association with the changing dynamics of COVID-19 transmissibility in terms of the reconstructed Rt. We considered three alternative regression formulas, which are similar to the main model Eqn (2), as follows:

$$ \mathbf{E}\left[\ln \left({R}_t\right)\right]=c+a\;{\mathrm{gmeasure}}_t $$
$$ \mathbf{E}\left[\ln \left({R}_t\right)\right]=c+a\;{\mathrm{gmeasure}}_t+b\;\mathbf{I}\left(t>{t}_0\right) $$
$$ \mathbf{E}\left[\ln \left({R}_t\right)\right]=c+a\;{\mathrm{gmeasure}}_t+b\;\mathbf{I}\left(t>{t}_0\right)\;\left(t-{t}_0\right)+d\;\mathbf{I}\left(t\le {t}_0\right)\;\left(t-{t}_0\right) $$

To check the robustness and significance of the estimates, we examined the consistency of both the sign (i.e., + or −) and the statistical significance (in terms of p-value < 0.05) of the regression coefficient a in the four models in Eq. (2)–(5).

Results and discussion

We reconstructed the daily instantaneous reproduction number (Rt) from the epidemic curve, as shown in Fig. 1c (black dots). We observed that the overall trends of Rt were relatively steady in the first half of March but gradually decrease thereafter since the local ‘stay-at-home’ order was issued in California on March 19, 2020, which was adjusted in Eqn (2). During the first half of March, which was regarded as the early phase of the outbreak, the reproduction number ranged from 1.5 to 3, and this range is generally consistent with previous estimates [2, 3, 6, 12, 29, 33, 42,43,44,45,46,47,48].

We estimated the dominance prevalence threshold (θ) at 0.8, as shown in Fig. 4, which was adopted to examine the association between the g-measure and Rt. When θ = 0.8, we found that the g-measure of the S protein appeared to be solely contributed to by the D614G substitution (see Fig. 1b), which also holds for all θ values > 0.75. In other words, the D614G substitution is considered a key mutation and is likely dominant in accounting for the changes in COVID-19 transmissibility due to a mutation at the molecular level.

Fig. 4

The likelihood profile of the dominant prevalence threshold parameter, θ, using the likelihood framework associated with regression model in Eqn (2)

Using the regression model in Eqn (2), we found a significant positive association between the g-measure and Rt when θ = 0.8 (as estimated). Hence, the changing dynamics of Rt are likely associated with the key mutations that are solely contributed to by the D614G substitution. We estimated that each 0.01 increase in the prevalence of glycine (G) on codon 614 is positively associated with a 0.49% (95% CI: 0.39 to 0.59) increase in Rt, which, in terms of the partial R-squared, explains 61% of the Rt variation after accounting for the control measures. Figure 1c shows the fitting results by using the regression model in Eqn (2). By examining the patterns in Fig. 1, we found that the prevalence of the D614G substitution matches the trends of Rt in March 2020. However, we noticed that since (roughly) April 15, 2020, the prevalence of the D614G substitution increased, but Rt remained constant. The reasons may include that the increase in transmissibility was counteracted by the effects of local nonpharmaceutical interventions that reduced the transmission of COVID-19. Sensitivity analysis with alternative model structures in Eqns (3)–(5) indicates that the positive association between the D614G substitution and Rt holds robustly and significantly (data not shown).

The significant positive association between the D614G substitution and Rt is biologically reasonable and consistent with findings in previous studies. The few (but key) AA substitutions may vary the three-dimensional structure of the protein as well as influence the receptor binding process in which a pathogen invades host cells. Previous analysis implied that the D614G substitution may alter the conformation of the S protein and thus may theoretically functionally enhance receptor binding capacity [19, 20, 49, 50], leading to an increase in SARS-CoV-2 transmissibility and pathogenicity [51]. Similarly, we learn from the influenza virus that major antigenic changes can be caused by a single AA substitution related to the receptor binding domain (RBD) [52]. Our analytical framework is data-driven and can be extended to study other infectious diseases.

For the limitations of this study, we have the following remarks. First, the reconstruction of Rt relies on the setting of the generation time (GT). We modeled the distribution of COVID-19 GT as a fixed Gamma distribution, which follows previous findings [28,29,30,31,32]. In a real-world situation, the time interval between the transmission generations could be variable [42, 53], which may affect the estimation of Rt. However, the changes in Rt estimates due to slight variations in GT are negligible [42]. We remark that this issue is unlikely to affect our main conclusion, and our model can be extended to a more complex context with the available time-varying GT data. Second, for the Rt estimation parts, C(t) should be the number of COVID-19 cases onset at time t. However, because the data by onset are unavailable, we adopted the current dataset by reporting data as a proxy for the COVID-19 incidence time series. If one considers a constant reporting lag, the Rt estimates will be in exactly the same trends but shifted for the reporting lag. Considering that a similar reporting delay also occurred for the SARS-CoV-2 sequencing data, the effects of the two reporting lags may be counteracted. We note that this approximation in our analysis is unlikely to affect the conclusion in this study. In addition, with detailed reporting lag information of each individual case, adjustment for the reporting delay can surely be carried out based on our current analytical framework. Third, as a data-driven study, the estimated association should be interpreted with caution. With ecological settings, our analysis provides statistical evidence about the likelihood of causality, but the findings in this study cannot guarantee causality, which needs further biomedical experiments with a more sophisticated context.


Our findings show a link between the molecular-level mutation activity of SARS-CoV-2 and population-level COVID-19 transmission to provide further evidence for a positive association between the D614G substitution and Rt. Future studies exploring the mechanism between SARS-CoV-2 mutations and COVID-19 infectivity are warranted.

Availability of data and materials

All data used in this work are publicly available.



Amino acid


Coronavirus disease 2019


amino acid substitution of aspartic acid (D) to glycine (G) on codon 614 (of the S protein of SARS-CoV-2)


Generation time


Likelihood ratio


Global Initiative on Sharing All Influenza Data


Maximum likelihood estimation


Receptor binding domain


Severe acute respiratory syndrome coronavirus 2


Standard deviation

95% CI:

95% confidence interval


  1. 1.

    Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet (London, England). 2020;395(10223):497–506.

    CAS  Article  Google Scholar 

  2. 2.

    Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N Engl J Med. 2020;382(13):1199–207.

    CAS  Article  Google Scholar 

  3. 3.

    Zhao S, Musa SS, Lin Q, Ran J, Yang G, Wang W, et al. Estimating the unreported number of novel coronavirus (2019-nCoV) cases in China in the first half of January 2020: a data-driven Modelling analysis of the early outbreak. J Clin Med. 2020;9(2):388.

    Article  Google Scholar 

  4. 4.

    Leung K, Wu JT, Liu D, Leung GM. First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment. Lancet (London, England). 2020;395(10233):1382–93.

    CAS  Article  Google Scholar 

  5. 5.

    Parry J. China coronavirus: cases surge as official admits human to human transmission. BMJ (Clin Res ed). 2020;368:m236.

    Google Scholar 

  6. 6.

    Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet (London, England). 2020;395(10225):689–97.

    CAS  Article  Google Scholar 

  7. 7.

    Zhao S, Zhuang Z, Cao P, Ran J, Gao D, Lou Y, et al. Quantifying the association between domestic travel and the exportation of novel coronavirus (2019-nCoV) cases from Wuhan, China in 2020: a correlational analysis. J Travel Med. 2020;27(2):taaa022.

  8. 8.

    World Health Organization. Statement on the second meeting of the International Health Regulations Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV), World Health Organization (WHO). 2020 [Available from:

  9. 9.

    World Health Organization. Novel Coronavirus (2019-nCoV) situation reports, released by the World Health Organization (WHO). 2020 [Available from:

  10. 10.

    Tuite AR, Fisman DN. Reporting, epidemic growth, and reproduction numbers for the 2019 novel coronavirus (2019-nCoV) epidemic. Ann Intern Med. 2020;172(8):567–8.

    Article  Google Scholar 

  11. 11.

    Zhao S, Cao P, Gao D, Zhuang Z, Cai Y, Ran J, et al. Serial interval in determining the estimation of reproduction number of the novel coronavirus disease (COVID-19) during the early outbreak. J Travel Med. 2020;27(3):taaa033.

  12. 12.

    Riou J, Althaus CL. Pattern of early human-to-human transmission of Wuhan 2019 novel coronavirus (2019-nCoV), December 2019 to January 2020. Euro Surv. 2020;25(4):2000058.

  13. 13.

    Zhao S. To avoid the noncausal association between environmental factor and COVID-19 when using aggregated data: simulation-based counterexamples for demonstration. Sci Total Environ. 2020:141590.

  14. 14.

    Kutter JS, Spronken MI, Fraaij PL, Fouchier RA, Herfst S. Transmission routes of respiratory viruses among humans. Curr Opin Virol. 2018;28:142–51.

    CAS  Article  Google Scholar 

  15. 15.

    Lau H, Khosrawipour V, Kocbach P, Mikolajczyk A, Schubert J, Bania J, et al. The positive impact of lockdown in Wuhan on containing the COVID-19 outbreak in China. J Travel Med. 2020;27(3):taaa037.

    Article  Google Scholar 

  16. 16.

    Zhao S, Musa SS, Hebert JT, Cao P, Ran J, Meng J, et al. Modelling the effective reproduction number of vector-borne diseases: the yellow fever outbreak in Luanda, Angola 2015-2016 as an example. PeerJ. 2020;8:e8601.

    Article  Google Scholar 

  17. 17.

    Baum A, Fulton BO, Wloga E, Copin R, Pascal KE, Russo V, et al. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science. 2020;369(6506):1014–8.

    CAS  Article  Google Scholar 

  18. 18.

    van Dorp L, Acman M, Richard D, Shaw LP, Ford CE, Ormond L, et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect Genet Evol. 2020;83:104351.

    Article  Google Scholar 

  19. 19.

    Wan Y, Shang J, Graham R, Baric RS, Li F. Receptor Recognition by the Novel Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS Coronavirus. J Virol. 2020;94(7).

  20. 20.

    Benvenuto D, Demir AB, Giovanetti M, Bianchi M, Angeletti S, Pascarella S, et al. Evidence for mutations in SARS-CoV-2 Italian isolates potentially affecting virus transmission. J Med Virol. 2020.

  21. 21.

    Rimmelzwaan GF, Berkhoff EGM, Nieuwkoop NJ, Fouchier RAM, Osterhaus A. Functional compensation of a detrimental amino acid substitution in a cytotoxic-T-lymphocyte epitope of influenza a viruses by comutations. J Virol. 2004;78(16):8946–9.

    CAS  Article  Google Scholar 

  22. 22.

    Rimmelzwaan GF, Berkhoff EGM, Nieuwkoop NJ, Smith DJ, Fouchier RAM, Osterhaus A. Full restoration of viral fitness by multiple compensatory co-mutations in the nucleoprotein of influenza a virus cytotoxic T-lymphocyte escape mutants. J Gen Virol. 2005;86(6):1801–5.

    CAS  Article  Google Scholar 

  23. 23.

    Gog JR, Rimmelzwaan GF, Osterhaus ADME, Grenfell BT. Population dynamics of rapid fixation in cytotoxic T lymphocyte escape mutants of influenza a. Proc Natl Acad Sci. 2003;100(19):11143–7.

    CAS  Article  Google Scholar 

  24. 24.

    Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data–from vision to reality. Euro Surv. 2017;22(13):30494.

  25. 25.

    Wu T, Ge X, Yu G, Hu E. Open-source analytics tools for studying the COVID-19 coronavirus outbreak. medRxiv. 2020:2020.02.25.20027433.

  26. 26.

    Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am J Epidemiol. 2013;178(9):1505–12.

    Article  Google Scholar 

  27. 27.

    Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. Am J Epidemiol. 2004;160(6):509–16.

    Article  Google Scholar 

  28. 28.

    Ferretti L, Wymant C, Kendall M, Zhao L, Nurtay A, Abeler-Dorner L, et al. Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. Science. 2020;368(6491):eabb6936.

    CAS  Article  Google Scholar 

  29. 29.

    He X, Lau EHY, Wu P, Deng X, Wang J, Hao X, et al. Temporal dynamics in viral shedding and transmissibility of COVID-19. Nat Med. 2020;26(5):672-5.

  30. 30.

    Zhao S. Estimating the time interval between transmission generations when negative values occur in the serial interval data: using COVID-19 as an example. Math Biosci Eng. 2020;17(4):3512–9.

    Article  Google Scholar 

  31. 31.

    Ganyani T, Kremer C, Chen D, Torneri A, Faes C, Wallinga J, et al. Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020. Euro Surv. 2020;25(17):2000257.

    Google Scholar 

  32. 32.

    Tindale LC, Stockdale JE, Coombe M, Garlock ES, Lau WYV, Saraswat M, et al. Evidence for transmission of COVID-19 prior to symptom onset. Elife. 2020;9:e57149.

    CAS  Article  Google Scholar 

  33. 33.

    Zhao S, Gao D, Zhuang Z, Chong MKC, Cai Y, Ran J, et al. Estimating the serial interval of the novel coronavirus disease (COVID-19): a statistical analysis using the public data in Hong Kong from January 16 to February 15, 2020. Front Phys. 2020;8:347.

  34. 34.

    Wang K, Zhao S, Liao Y, Zhao T, Wang X, Zhang X, et al. Estimating the serial interval of the novel coronavirus disease (COVID-19) based on the public surveillance data in Shenzhen, China, from 19 January to 2020. Transbound Emerg Dis. 2020;67(6):2818-22

  35. 35.

    Ma S, Zhang J, Zeng M, Yun Q, Guo W, Zheng Y, et al. Epidemiological parameters of coronavirus disease 2019: a case series study. J Med Internet Res. 2020;22(10):e19994.

  36. 36.

    Wang MH, Lou J, Cao L, Zhao S, Chan PKS, Chan MC-W, et al. Characterization of the evolutionary dynamics of influenza A H3N2 hemagglutinin. bioRxiv. 2020:2020.06.16.155994.

  37. 37.

    Wang MH, Lou J, Zee BCY, Chong KC. US Provisional Patent No. 62/687645, 2019 PCT/CN2019/091652. Measurement and Prediction on Influenza Virus Genetic Mutation Patterns. . US Patent. 2018.

  38. 38.

    Zhao S, Lou J, Cao L, Chen Z, Chan RW, Chong MK, et al. Quantifying the importance of the key sites on haemagglutinin in determining the selection advantage of influenza virus: using a/H3N2 as an example. J Inf Secur. 2020;81(3):452–82.

    CAS  Google Scholar 

  39. 39.

    Lou J, Zhao S, Cao L, Chong MK, Chan RW, Chan PK, et al. Predicting the dominant influenza A serotype by quantifying mutation activities. Int J Infect Dis. 2020;100:255-7.

  40. 40.

    Ali ST, Kadi A, Ferguson NM. Transmission dynamics of the 2009 influenza a (H1N1) pandemic in India: the impact of holiday-related school closure. Epidemics. 2013;5(4):157–63.

    Article  Google Scholar 

  41. 41.

    Lloyd-Smith JO, Schreiber SJ, Kopp PE, Getz WM. Superspreading and the effect of individual variation on disease emergence. Nature. 2005;438(7066):355–9.

    CAS  Article  Google Scholar 

  42. 42.

    Ali ST, Wang L, Lau EHY, Xu XK, Du ZW, Wu Y, et al. Serial interval of SARS-CoV-2 was shortened over time by nonpharmaceutical interventions. Science. 2020;369(6507):1106–9.

    CAS  Article  Google Scholar 

  43. 43.

    Chinazzi M, Davis JT, Ajelli M, Gioannini C, Litvinova M, Merler S, et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science. 2020;368(6489):395–400.

    CAS  Article  Google Scholar 

  44. 44.

    Gatto M, Bertuzzo E, Mari L, Miccoli S, Carraro L, Casagrandi R, et al. Spread and dynamics of the COVID-19 epidemic in Italy: effects of emergency containment measures. Proc Natl Acad Sci U S A. 2020;117(19):10484–91.

    CAS  Article  Google Scholar 

  45. 45.

    Jung SM, Akhmetzhanov AR, Hayashi K, Linton NM, Yang Y, Yuan B, et al. Real-Time Estimation of the Risk of Death from Novel Coronavirus (COVID-19) Infection: Inference Using Exported Cases. J Clin Med. 2020;9(2):523.

  46. 46.

    Musa SS, Zhao S, Wang MH, Habib AG, Mustapha UT, He D. Estimation of exponential growth rate and basic reproduction number of the coronavirus disease 2019 (COVID-19) in Africa. Infect Dis Poverty. 2020;9(1):96.

    Article  Google Scholar 

  47. 47.

    Ran J, Zhao S, Han L, Liao G, Wang K, Wang MH, et al. A re-analysis in exploring the association between temperature and COVID-19 transmissibility: an ecological study with 154 Chinese cities. Eur Respir J. 2020;56(2):2001253.

    CAS  Article  Google Scholar 

  48. 48.

    Xu XK, Liu XF, Wu Y, Ali ST, Du Z, Bosetti P, et al. Reconstruction of transmission pairs for novel coronavirus disease 2019 (COVID-19) in mainland China: estimation of super-spreading events, serial interval, and hazard of infection. Clin Infect Dis. 2020;71(12):3163–7.

  49. 49.

    Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell. 2020;182(4):812–27.

  50. 50.

    Yurkovetskiy L, Wang X, Pascal KE, Tomkins-Tinch C, Nyalile TP, Wang Y, et al. Structural and functional analysis of the D614G SARS-CoV-2 spike protein variant. Cell. 2020;183(3):739–51.

    CAS  Article  Google Scholar 

  51. 51.

    Volz E, Hill V, McCrone JT, Price A, Jorgensen D, O’Toole Á, et al. Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity. Cell. 2020;184(1):64-75.

  52. 52.

    Koel BF, Burke DF, Bestebroer TM, van der Vliet S, Zondag GC, Vervaet G, et al. Substitutions near the receptor binding site determine major antigenic change during influenza virus evolution. Science. 2013;342(6161):976–9.

    CAS  Article  Google Scholar 

  53. 53.

    Zhao S, Cao P, Chong MKC, Gao D, Lou Y, Ran J, et al. COVID-19 and gender-specific difference: analysis of public surveillance data in Hong Kong and Shenzhen, China, from January 10 to February 15, 2020. Infect Control Hosp Epidemiol. 2020;41(6):750–1.

    Article  Google Scholar 

Download references


This study is conducted using the resources of Alibaba Cloud Intelligence High Performance Cluster computing facilities, which is made free for COVID-19 research.


The funding agencies had no role in the design and implementation of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.


This work is supported by CUHK grants [PIEF/Ph2/COVID/06, 4054456] and the Health and Medical Research Fund (HMRF) Commissioned Research on COVID-19 [INF-CUHK-1] of Hong Kong, China and is partially supported by the National Natural Science Foundation of China (NSFC) [31871340,71974165].

Author information




SZ and MHW conceived the study. JZ collected the data. SZ and JL carried out the analysis and drafted the first manuscript. SZ, JL and MHW discussed the results. All authors critically read and revised the manuscript and gave final approval for publication.

Corresponding author

Correspondence to Maggie H. Wang.

Ethics declarations

Ethics approval and consent to participate

The number of COVID-19 cases and sequencing data are collected via public domains, and thus, neither ethical approval nor individual consent is applicable.

Consent for publication

Not applicable.

Competing interests

MHW is a shareholder of Beth Bioinformatics Co., Ltd. BCYZ is a shareholder of Beth Bioinformatics Co., Ltd. and Health View Bioanalytics, Ltd. The other authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zhao, S., Lou, J., Cao, L. et al. Modelling the association between COVID-19 transmissibility and D614G substitution in SARS-CoV-2 spike protein: using the surveillance data in California as an example. Theor Biol Med Model 18, 10 (2021).

Download citation


  • COVID-19
  • Spike protein
  • Mutation
  • Transmission
  • Statistical modeling