Modeling methods for estimating HIV incidence: a mathematical review

Estimating HIV incidence is crucial for monitoring the epidemiology of this infection, planning screening and intervention campaigns, and evaluating the effectiveness of control measures. However, owing to the long and variable period from HIV infection to the development of AIDS and the introduction of highly active antiretroviral therapy, accurate incidence estimation remains a major challenge. Numerous estimation methods have been proposed in epidemiological modeling studies, and here we review commonly-used methods for estimation of HIV incidence. We review the essential data required for estimation along with the advantages and disadvantages, mathematical structures and likelihood derivations of these methods. The methods include the classical back-calculation method, the method based on CD4+ T-cell depletion, the use of HIV case reporting data, the use of cohort study data, the use of serial or cross-sectional prevalence data, and biomarker approach. By outlining the mechanistic features of each method, we provide guidance for planning incidence estimation efforts, which may depend on national or regional factors as well as the availability of epidemiological or laboratory datasets.


Background
Since the first patient with acquired immunodeficiency syndrome (AIDS) was reported in 1981 [1], its causative agent, human immunodeficiency virus (HIV), has led to 77 million HIV infections globally and remains a major public health issue [2]. To strategically assess the impact of interventions and to guide policy makers in achieving improved control of HIV/AIDS, it is critical to quantify the dynamics of HIV epidemics accurately and reliably. HIV incidence (i.e., the number of new infections per unit time) and prevalence (i.e., the fraction of infected individuals at a given point in time) are two major indicators that are used to assess and interpret the transmission dynamics of HIV. HIV incidence and prevalence have been estimated using mathematical and statistical modeling approaches by many academic and governmental research groups. For instance, the Joint United Nations Program on HIV/AIDS (UNAIDS) regularly provides updates of national and global estimates, indicating that 1.8 million people were newly infected with HIV and 940,000 deaths occurred in the year 2017 [2].
*Correspondence: nishiurah@med.hokudai.ac.jp. 2 Graduate School of Medicine, Hokkaido University, Kita 15 Jo Nishi 7 Chome, Kitaku, 0608638 Sapporo, Japan. Full list of author information is available at the end of the article.
Unlike many acute infectious diseases, HIV infection progresses slowly in vivo and has a complex natural history. During the first 2-4 weeks following infection, the virus replicates rapidly and this period is referred to as the acute stage [3,4]. Thereafter, viral loads are greatly reduced and reach a quasi-steady state, which is called the asymptomatic stage. During the asymptomatic stage, the viral load reflects the steady state achieved between high rates of viral replication and virus clearance, and is maintained at a remarkably stable level (i.e., the viral load set point) over a number of years. If untreated, the median length of asymptomatic stage may range from 8-11 years. Infected individuals in the asymptomatic stage do not show overt symptoms but can transmit HIV infection through high-risk behaviors. Subsequently, the viral load increases slowly, resulting in the onset of AIDS [5][6][7]. Because their immune systems are severely damaged, individuals with AIDS experience a number of opportunistic infections and are at high risk of death without treatment.
Owing to the lengthy asymptomatic stage, many individuals do not realize that they are infected for a number of years. Moreover, infections acquired through sexual contact and intravenous drug use often remain undetected because detection relies on voluntary testing following those high-risk exposures [8,9]. This issue both leads to increased HIV transmission and complicates modeling exercises, increasing the difficulty of explicitly quantifying the epidemiological dynamics of HIV/AIDS. Furthermore, owing to the widespread use of antiretroviral therapy (ART), prevalence estimation is controversial: even where prevalence can be estimated, this estimate may not reflect the current dynamics of HIV epidemics and may reflect only the degree of spread from many years in the past [10]. It is generally recognized that estimation of HIV incidence can provide greater insights into the real-time evaluation of HIV epidemics. Nevertheless, the long asymptomatic stage also causes challenges in estimating HIV incidence.
Since the 1980s, a large number of modeling studies have aimed to estimate HIV incidence, and a variety of useful methods have been proposed for this purpose. These diverse methods have played important roles in HIV incidence estimation in different parts of the world. However, only brief comparative notes, aimed at improving practical estimation, have been published elsewhere [10][11][12]. In this review, we comprehensively describe the major methods that have been used for HIV incidence estimation, including (i) the classical back-calculation method, (ii) the method based on CD4+ T-cell depletion, (iii) the use of HIV case reporting data, (iv) the use of cohort study data, (v) the use of serial or cross-sectional prevalence data, and (vi) the biomarker approach. We focus on the structural mechanisms of modeling as well as the mathematical derivation of likelihood functions, and compare the advantages and disadvantages of existing methods. Our review is targeted to a general audience in theoretical biology. Finally, we summarize important implications for future development of estimation methods for HIV incidence.

Back-calculation
Back-calculation, one of the most widely-used statistical modeling approaches, exploits the distribution of incubation periods of AIDS. The back-calculation method uses epidemiological surveillance data to reconstruct HIV infections over time. The basic idea of the method can be described as follows. If the rate of incident HIV infections at time s is $I(s)$, and the probability density function of the incubation period $f(s)$ is known, then AIDS incidence at time t, denoted by $A(t)$, can be described by

$$A(t) = \int_0^t I(s)\, f(t-s)\, ds. \tag{1}$$

Conversely, if the dataset for $A(t)$ is available from surveillance data and $f(s)$ can be determined from the literature, HIV incidence can be "back-calculated" by rearranging (1) to

$$A(t) = \int_0^t I(t-s)\, f(s)\, ds. \tag{2}$$

If $F(t)$ denotes the cumulative distribution function of the incubation period, one can describe the expected number of AIDS diagnoses over the time interval $[T_{i-1}, T_i)$ as

$$\int_{T_{i-1}}^{T_i} A(t)\, dt = \int_{T_{i-1}}^{T_i} \int_0^t I(s)\, f(t-s)\, ds\, dt = \int_0^{T_i} I(s)\, \left[ F(T_i - s) - F(T_{i-1} - s) \right] ds. \tag{3}$$

Here, the last equality holds because $F(t) = 0$ for $t \le 0$.

The classic method using AIDS data
The back-calculation method was first proposed by Brookmeyer et al. [13][14][15] who used AIDS incidence data to estimate discrete HIV incidence using the maximum likelihood estimation method. Let $T_0, T_1, \cdots, T_L$ denote discrete times, $N$ denote the total number of infections before $T_L$, and $X_i$ denote the number of diagnosed AIDS cases in the ith time interval $[T_{i-1}, T_i)$. Then, $N$ is the sum of all infected cases that have been diagnosed, $X_{\cdot} = \sum_{i=1}^{L} X_i$, and those infected before $T_L$ but not yet diagnosed, $X_{L+1} = N - X_{\cdot}$. Suppose that $X = (X_1, X_2, \cdots, X_L, X_{L+1})$ follows a multinomial distribution with sample size $N$, where the probabilities $(p_1, p_2, \cdots, p_L, 1 - p_{\cdot})$ can be calculated according to Eq. (3),

$$p_i = \int_0^{T_i} \frac{I(s)}{N} \left[ F(T_i - s) - F(T_{i-1} - s) \right] ds,$$

and $p_{\cdot} = \sum_{j=1}^{L} p_j$. In fact, $I(s)/N$ is the probability density function for these $N$ individuals being infected at time s. Denoting the observed AIDS incidence in each time interval as $x_1, x_2, \cdots, x_L$, with $x_{\cdot} = \sum_{i=1}^{L} x_i$, the likelihood function can be described as follows:

$$L = \frac{N!}{x_1! \cdots x_L!\, (N - x_{\cdot})!} \left( \prod_{i=1}^{L} p_i^{x_i} \right) (1 - p_{\cdot})^{N - x_{\cdot}}.$$
The back-calculation method can estimate the historical incidence of infection that was already diagnosed and also the number of infections that have yet to be diagnosed.
Becker et al. [16] proposed a non-parametric approach to this method using the discrete version of Eq. (2). Let the number of HIV infections in the ith time interval be $I_i$ and the probability mass function of the incubation period be $f_d$. Then, the expected number of AIDS diagnoses in interval i can be described as

$$E(X_i \mid I_1, I_2, \cdots, I_i) = \sum_{j=1}^{i} I_j\, f_{i-j}. \tag{5}$$

Let $\mu_i = E(X_i)$ and $\lambda_j = E(I_j)$. Then,

$$\mu_i = \sum_{j=1}^{i} \lambda_j\, f_{i-j}.$$

Assuming that HIV infections are generated by a nonhomogeneous Poisson process, $X_i$ $(i = 1, \cdots, L)$ would follow Poisson distributions with means $\mu_i$. Then, the likelihood function is

$$L = \prod_{i=1}^{L} \frac{\mu_i^{x_i}\, e^{-\mu_i}}{x_i!},$$

where $x_i$ is the observed frequency of AIDS cases.
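The discrete model above, in which the expected AIDS count in interval i is the sum over j ≤ i of λ_j times the probability of an incubation period of i − j intervals, can be sketched numerically. The code below is a minimal illustration and not the procedure of Becker et al.: the EM-type update is a standard scheme for Poisson deconvolution that is assumed here for fitting, and the incubation probabilities `f` and the AIDS counts `x` are hypothetical values chosen only to make the example run.

```python
import math

def expected_aids(lam, f):
    """mu_i = sum_{j<=i} lam_j * f[i-j]: expected AIDS diagnoses per interval."""
    n = len(lam)
    return [sum(lam[j] * f[i - j] for j in range(i + 1) if i - j < len(f))
            for i in range(n)]

def poisson_loglik(x, mu):
    """Log of prod_i mu_i^x_i * exp(-mu_i) / x_i! (requires every mu_i > 0)."""
    return sum(xi * math.log(mi) - mi - math.lgamma(xi + 1)
               for xi, mi in zip(x, mu))

def em_step(lam, x, f):
    """One EM update for Poisson deconvolution (an assumed fitting scheme)."""
    n = len(lam)
    mu = expected_aids(lam, f)
    updated = []
    for j in range(n):
        idx = [i for i in range(j, n) if i - j < len(f)]
        # Probability that an infection in interval j is diagnosed within the data window
        reach = sum(f[i - j] for i in idx)
        ratio = sum(x[i] * f[i - j] / mu[i] for i in idx)
        updated.append(lam[j] * ratio / reach)
    return updated

# Hypothetical incubation pmf (f[0] > 0 keeps every mu_i positive) and AIDS counts
f = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]
x = [1, 3, 8, 15, 24, 30]
lam = [20.0] * len(x)  # flat starting guess for the expected infections
logliks = [poisson_loglik(x, expected_aids(lam, f))]
for _ in range(200):
    lam = em_step(lam, x, f)
    logliks.append(poisson_loglik(x, expected_aids(lam, f)))
```

Each EM update cannot decrease the Poisson log-likelihood, so `logliks` gives a simple convergence check. In practice the unsmoothed estimate tends to be unstable in the most recent intervals, which is the high-dimensionality problem discussed in the text.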
It should be noted that this method assumes that the distribution of the incubation period does not vary over time. In fact, it is easy to modify Eq. (1) to

$$A(t) = \int_0^t I(s)\, f(t - s \mid s)\, ds,$$

where $f(t - s \mid s)$ is the probability density function for an individual who was infected at time s and diagnosed at time t. Thus, $f(t - s \mid s)$ describes the time-dependent distribution of the incubation period. Similarly, the discrete version of Eq. (5) becomes

$$E(X_i \mid I_1, I_2, \cdots, I_i) = \sum_{j=1}^{i} I_j\, f_{i-j,j},$$

with $f_{i-j,j}$ representing the probability for an individual infected during time interval $[t_{j-1}, t_j)$ and diagnosed during time interval $[t_{i-1}, t_i)$.
In this way, the mathematical expression of the back-calculation method is straightforward, but the estimation of $I_i$ using this method is challenging [17] because of the high dimension of $I_i$, which leads to instability. To estimate the incidence of HIV $I_i$, several published studies [18,19] have modeled either $A(t)$ or $I(t)$ as flexible parametric functions. Rosenberg et al. [20] estimated the infection curve $I(s)$ directly, assuming that $I(s)$ is a member of the general family generated by $G = \{g_1(s), \cdots, g_I(s)\}$, where $g_i(s)$ are integrable real functions. That is,

$$I(s) = \sum_{i} \beta_i\, g_i(s),$$

where the sum runs over the members of $G$ and the $\beta_i$ are coefficients to be estimated. Specifically, this family includes splines and step functions.
It should be noted that models involving spline and step functions have another weakness: the potential for overfitting and the ill-posed nature of the inverse problem. Overfitting arises when too many knots are used in the spline. Ill-posedness arises when the step function is too finely discretized, so that the estimated HIV incidence becomes overly sensitive to temporal fluctuations in the data points. Moreover, when the HIV epidemic has just started and the trend has not yet stabilized, the back-calculated incidence in the most recent years is more uncertain than that based on long-lasting epidemic dynamics. This is caused by the small number of diagnosed infections in the most recently observed periods, which yields substantial uncertainty. However, in many developed countries the HIV epidemic has continued for a substantial number of years, and in such settings the uncertainty in the estimated recent infections is not as large as that estimated in the early epidemic phase with dramatic peaks and troughs, as shown by Yan and Zhang [21]. In addition, this method has been criticized because the estimation strongly depends on the distribution of the incubation period, which needs to be determined from other cohorts. Estimation of the incubation period encountered critical challenges in the 1990s, as the introduction of ART extended the length of the incubation period, inevitably changing this distribution. To account for the effect of ART, extended methods were proposed [22][23][24][25][26].

Using both HIV and AIDS diagnoses
In addition to AIDS incidence, the frequency of diagnosed HIV infections has become available as part of epidemiological surveillance data, greatly assisting researchers in extending the back-calculation method [27][28][29][30][31][32][33][34][35][36][37][38][39][40]. Early studies used only HIV diagnoses of individuals who later progressed to AIDS [27][28][29][30][31][32]. Subsequently, several other methods were proposed to incorporate all HIV diagnoses, including infected individuals who have not yet developed AIDS [33][34][35][36][37][38][39][40]. Yan et al. [41] proposed an approach that uses the number of new HIV diagnoses to back-calculate historical HIV incidence, partially aided by supplementary data from the old AIDS case surveillance system in populations where such a system existed in the 1980s. The estimate is also calibrated with supplementary data on "recent infections", that is, the proportion of newly diagnosed HIV infections that are recent according to enhanced surveillance or laboratory assays. This method was used to estimate HIV incidence among men who have sex with men in Australia [42,43]. Adding information on HIV diagnoses to the back-calculation method enables estimation of HIV incidence in recent years and reduces the uncertainty associated with this estimate to some degree. Moreover, the method enables joint estimation of the HIV diagnosis rate [44]. However, challenges associated with estimating or assuming the time from infection to HIV diagnosis remain.

Including CD4+ T-cell counts at diagnosis
Upon HIV diagnosis, CD4+ T-cell count data have now become widely available. Various studies have defined HIV/AIDS progression based on CD4+ T-cell counts and employed Markov process models [33,38] to estimate HIV incidence. Birrell et al. [45] formulated a CD4-stage structured model to use CD4+ T-cell counts at diagnosis. In this model, the state vector $e_j$ for time interval j satisfies

$$e_j = P_j^{T} e_{j-1} + (\lambda_j, 0, 0, 0)^{T},$$

where $\lambda_j$ is the expected number of new HIV infections in time interval j, and $P_j$ is the transition matrix describing the proportion of individuals moving between different stages during time interval j. The entries of $P_j$ involve $\rho_{k,k+1}$, the transition probability from stage k to k + 1. Let $X_j$ and $Y_j$ $(j = 1, \cdots, L)$ denote AIDS and HIV diagnoses during the time interval j, respectively, which are assumed to follow independent Poisson distributions with means $\mu_j^{AIDS}$ and $\mu_j^{HIV}$, respectively. Then, the likelihood function for HIV and AIDS diagnoses can be calculated as

$$L_1 = \prod_{j=1}^{L} \frac{(\mu_j^{AIDS})^{x_j}\, e^{-\mu_j^{AIDS}}}{x_j!} \cdot \frac{(\mu_j^{HIV})^{y_j}\, e^{-\mu_j^{HIV}}}{y_j!},$$

where $x_j$ and $y_j$ are the observed numbers of AIDS and HIV diagnoses. CD4+ T-cell count data at diagnosis are also available for a subset of the diagnosed HIV-positive individuals. The CD4+ T-cell count data at diagnosis are divided into four sets: [500, ∞), [350, 500), [200, 350), and [0, 200). Let $C_j = (C_{1,j}, C_{2,j}, C_{3,j}, C_{4,j})$, where $C_{k,j}$ $(k = 1, 2, 3, 4)$ is the number of HIV-positive individuals whose CD4 counts fall into the kth CD4 stage during the jth time interval, and let $N_j = \sum_{k=1}^{4} C_{k,j}$. That is, $N_j$ individuals diagnosed with HIV during the time interval j have CD4-at-diagnosis data. We assume that these $N_j$ HIV-positive individuals with CD4 data are multinomially distributed as

$$C_j \sim \text{Multinomial}(N_j;\ q_{1,j}, q_{2,j}, q_{3,j}, q_{4,j}),$$

where $q_{k,j}$ denotes the probability that an individual diagnosed during interval j is in CD4 stage k. Then, the likelihood of observing CD4-at-diagnosis data can be given as

$$L_2 = \prod_{j=1}^{L} \frac{N_j!}{C_{1,j}!\, C_{2,j}!\, C_{3,j}!\, C_{4,j}!} \prod_{k=1}^{4} q_{k,j}^{C_{k,j}}.$$

The full likelihood is the product of $L_1$ and $L_2$:

$$L = L_1 \times L_2.$$

This method can make full use of all the available data, including HIV and AIDS diagnoses as well as CD4+ T-cell counts at diagnosis. Using this method, one can estimate not only the incidence of HIV infections but also the diagnosis rates at different CD4 stages and during different time intervals, providing insightful information to comprehensively evaluate the epidemiology of HIV/AIDS. Using this model, Birrell et al. estimated HIV incidence in England and Wales [46] and found that the mean time to diagnosis had shortened from 2001 to 2010 owing to expansion of HIV testing.
However, this method is also highly dependent on the progression rate between different stages. Moreover, the quantities requiring estimation have much higher dimensions yielding additional difficulties. Birrell et al. [45] employed the Bayesian estimation technique, ensuring the stability of estimates.
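To make the stage-structured recursion concrete, the following sketch propagates the expected numbers of individuals in four CD4 stages over one time interval using e_j = P_j^T e_{j-1} + (λ_j, 0, 0, 0)^T. The structure of the transition matrix is an assumption made for illustration only: within an interval, an individual in stage k is diagnosed with probability `d[k]` and, if not diagnosed, progresses to the next stage (or to AIDS from the last stage) with probability `rho[k]`. The function names and all numerical values are hypothetical and are not taken from Birrell et al.

```python
def transition_matrix(rho, d):
    """Assumed P_j: stay in stage k if undiagnosed and not progressing,
    move to stage k+1 if undiagnosed and progressing."""
    n = len(rho)
    P = [[0.0] * n for _ in range(n)]
    for k in range(n):
        P[k][k] = (1 - d[k]) * (1 - rho[k])
        if k + 1 < n:
            P[k][k + 1] = (1 - d[k]) * rho[k]
    return P

def step(e_prev, lam_j, rho, d):
    """One interval of e_j = P_j^T e_{j-1} + (lam_j, 0, 0, 0)^T.
    Also returns the expected HIV and AIDS diagnoses under the assumed structure."""
    n = len(e_prev)
    P = transition_matrix(rho, d)
    # (P^T e)_m = sum_k P[k][m] * e_k
    e = [sum(P[k][m] * e_prev[k] for k in range(n)) for m in range(n)]
    e[0] += lam_j  # new infections enter the first CD4 stage
    mu_hiv = sum(d[k] * e_prev[k] for k in range(n))
    mu_aids = (1 - d[-1]) * rho[-1] * e_prev[-1]
    return e, mu_hiv, mu_aids

# Hypothetical inputs for a single interval
e_prev = [100.0, 50.0, 20.0, 10.0]
rho = [0.10, 0.10, 0.10, 0.20]   # progression probabilities (illustrative)
d = [0.05, 0.10, 0.20, 0.50]     # diagnosis probabilities (illustrative)
e_next, mu_hiv, mu_aids = step(e_prev, 30.0, rho, d)
```

Under these assumptions every individual present at the start of the interval either remains in a stage, is diagnosed with HIV, or is diagnosed with AIDS, so the totals balance. The quantities `mu_hiv` and `mu_aids` play the role of the Poisson means in the diagnosis likelihood.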

Using CD4+ T-cell count data at diagnosis based on a CD4+ T-cell depletion model
In addition to the back-calculation method, another major HIV incidence estimation method is to jointly use HIV diagnosis data and the first CD4 count data while employing the CD4+ T-cell depletion model [47][48][49]. This method first estimates the distribution of diagnosis delays (i.e., the time from infection to diagnosis), and then estimates the incidence of HIV from the depletion of CD4+ T-cells [49]. Here, HIV incidence refers to the number of new infections during each time interval, including both diagnosed and undiagnosed infections by the end of the study period. The CD4+ T-cell depletion model that was adopted by Lodi et al. and Touloumi et al. [50,51] can be expressed as

$$\sqrt{\text{CD4}(t)} = a + b\,t,$$

where t denotes the time from infection to the date of the first CD4+ T-cell count determination. Then, the time from the date of infection to CD4 testing for an individual i can be estimated by

$$t_i = \frac{\sqrt{\text{CD4}_i} - a_i}{b_i},$$

where the intercept $a_i$ and the slope $b_i$ are variable from person to person. Using standard survival analysis techniques, the diagnosis delay probability $P(x)$ was estimated, which is the probability that an infected person would be diagnosed within x time units after infection. To statistically estimate undiagnosed infections, the authors further defined the diagnosis delay weight as $W(x) = 1/P(x)$. Let $t_0$ and $t_N$ be the start and end times of the study period. The estimated infection time for each diagnosed individual may be either before or after $t_0$. Suppose the estimated number of infections in the ith year after $t_0$ is $n_i$ (using the CD4+ T-cell depletion model), where $i = 1, 2, \cdots, N$, and the time of infection for each case is $DI_j$, $j = 1, 2, \cdots, n_i$. Then, the number of new infections in the ith year after $t_0$ can be estimated as

$$\hat{I}_i = \sum_{j=1}^{n_i} W(t_N - DI_j).$$

A certain number of individuals were infected but not diagnosed before $t_0$. Let $U$ denote the number of such individuals. These individuals may either be diagnosed between $t_0$ and $t_N$, or remain undiagnosed until the end of the study period $t_N$.
Let $u_i$, $i = 1, 2, \cdots$, be the number of newly diagnosed cases among these persons in the ith year after $t_0$. Then, $U = \sum_{i \ge 1} u_i$, and $\sum_{i=1}^{N} u_i$ is the number of these diagnoses observed during the study period. In addition, $u_i$ for $i > N$ are cases who are not diagnosed until the end of the study period. $H_i$ is further defined as the total number of cases diagnosed during the ith year after $t_0$ (including those infected before and after $t_0$), where $i = 1, \cdots, N$. Thus, $r_i = u_i / H_i$ is the proportion of newly diagnosed cases in the ith year after $t_0$ who were infected before $t_0$. Both $H_i$ and $r_i$ are treated as linear regression functions of time t, so $H_i$ and $r_i$ for $i > N$ can then be predicted, and $u_i$ for $i > N$ can finally be calculated as $u_i = H_i \times r_i$. For persons who are infected but not diagnosed before $t_0$, another diagnosis delay weight is defined as $W = U / \sum_{i=1}^{N} u_i$. Suppose the estimated number of infections in the ith year before $t_0$ is $m_i$ (using the CD4 depletion model). Then, the number of new infections in the ith year before $t_0$ can be estimated as $W \times m_i$.
In fact, the method based on a CD4+ T-cell depletion model is also a kind of back-calculation method (it is sometimes referred to as the extended back-calculation method) because it also uses HIV/AIDS or CD4+ T-cell counts at diagnosis to 'back-calculate' the time of infection among infected individuals. In the classical back-calculation method, only the total number of HIV/AIDS cases is required. In the extended back-calculation method, CD4+ T-cell counts at diagnosis are required at the individual level. For non-experts, the extended back-calculation method is easier to carry out owing to its low computational complexity compared with the classical back-calculation method. Nevertheless, similar to the classical back-calculation method, the validity of the extended back-calculation method is highly dependent on the CD4+ T-cell depletion model.
In many countries and geographic areas, the empirical data required to estimate parameters of the CD4+ T-cell depletion model are extremely scarce. In China for example, after the test-and-treat policy became widespread, it became much more difficult to empirically observe CD4+ T-cell count data during natural infection in the absence of ART.
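The two steps of the CD4-based approach, dating each infection from the first CD4 count and then inflating diagnosed cases by the diagnosis delay weight W(x) = 1/P(x), can be sketched as follows. Several ingredients are assumptions made for illustration: the depletion curve is taken to be linear on the square-root scale, sqrt(CD4) = a + b·t, the weight is evaluated at the time elapsed between infection and the end of the study period, and the exponential form of P(x) as well as all parameter values are hypothetical.

```python
import math

def time_since_infection(cd4, a, b):
    """Invert the assumed curve sqrt(CD4(t)) = a + b*t for t (b < 0).
    Counts above the baseline a**2 are clamped to t = 0."""
    return max(0.0, (math.sqrt(cd4) - a) / b)

def weighted_infections(infection_times, t_end, P):
    """Sum of W(x) = 1/P(x), with x = t_end - infection time, over diagnosed cases."""
    total = 0.0
    for di in infection_times:
        p = P(t_end - di)
        if p <= 0:
            raise ValueError("diagnosis probability must be positive")
        total += 1.0 / p
    return total

# Hypothetical diagnosis-delay distribution: exponential with rate 0.25 per year
def P(x):
    return 1.0 - math.exp(-0.25 * x)

# Hypothetical individual: first CD4 count of 400 cells, intercept 24, slope -1 per year
years_infected = time_since_infection(400.0, 24.0, -1.0)

# Two diagnosed cases with estimated infection times 9 and 8 (study period ends at 10)
estimated_infections = weighted_infections([9.0, 8.0], 10.0, P)
```

Cases infected shortly before the end of the study period receive large weights because few of their contemporaries have been diagnosed yet, which is why estimates for the most recent years are the least stable.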

Simple method using HIV case reporting data
In 2017, Xia et al. [52] proposed a simple method by which even non-experts can estimate HIV incidence using HIV case reporting data. The method assumes that HIV incidence and case finding are stable within each 3-year period, and the timeframe of interest is broken down into overlapping 3-year periods. Within each period, the model is written in terms of the case finding rate in a year, R; the number of new diagnoses in year i, D_i; the number of undiagnosed cases at the beginning of year i, U_i; and the HIV incidence in year i, I_i (i = 1, 2, 3), together with a term ε_1 that is assumed to be small. The method is very easy to carry out, requiring only HIV case reporting data. However, it is applicable only if both incidence and diagnosis rates are stable over three years.

Cohort studies
Another strategy for estimation of HIV incidence is to use cohort studies of uninfected individuals [53]. Since it is difficult to follow sufficient individuals at the national level, a cohort study design is employed for estimating incidence among subpopulations [11]. This method enables researchers to directly measure HIV incidence in the sample population, but biases are introduced when estimating incidence by cohort. These biases are mainly caused by two sources of error [11]. First, individuals who receive follow-up visits may not be representative of the population. Second, individuals who adhere to the follow-up visits may obtain counseling repeatedly, and thus their knowledge of HIV may improve over time, which could affect their risk of acquiring HIV.

Prevalence data
Incidence and prevalence are two important metrics for evaluating HIV epidemics. In fact, these two measures are related to one another. Two different types of prevalence data have been used to estimate HIV incidence: serial prevalence and cross-sectional prevalence [54][55][56][57][58][59]. In this section, we review two different incidence estimation methods using serial and cross-sectional prevalence data.

UNAIDS has developed an Estimation and Projection Package (EPP), which can be used to obtain HIV prevalence estimates and projections [57]. Another software program, SPECTRUM [58], internally linked with EPP, can be employed to calculate HIV incidence using the AIDS Impact Model (AIM) module. Here, we summarize the simplified methodology implemented in SPECTRUM. Let H a,t , A a,t and P a,t denote the number of HIV infections, the total number of adults in the population and the HIV prevalence among individuals aged a at time t, respectively. SPECTRUM has been updated several times since its initial 2004 release [60][61][62][63], and the last update took place in 2017 [64]. Other studies using a similar modeling approach have been conducted to estimate HIV epidemics [65,66]. Hallett et al. [10] indicated that this method can estimate HIV incidence from the earliest stages of the epidemic, which is helpful for evaluating HIV epidemics over time. However, unless large amounts of data are available, the estimate will involve large uncertainty, as the range of variation of the incidence curve is very large. Since SPECTRUM needs EPP to generate the prevalence estimates and projections before it estimates the incidence of new HIV infections, any change in incidence can only be detected through prevalence changes that may be observed several years later. An additional disadvantage of this method is the difficulty in choosing an appropriate dataset from which prevalence is estimated. The estimation of HIV incidence could be significantly biased if the prevalence for the entire population is not estimated properly. For a long time, epidemiologists have used data from antenatal clinics to estimate the prevalence in the entire population in sub-Saharan Africa [57]. As HIV prevalence among antenatal clinic attendees appeared to be greater than that in the general population, data from national population-based household HIV surveys have additionally been used to calibrate overall population prevalence [67][68][69].
In fact, EPP began to include such household survey data in the estimation [70,71]. However, household surveys could miss a large part of the population affected by the HIV epidemic, and may therefore yield substantially lower estimates of prevalence. Combining different datasets over time could itself introduce bias into the estimates.

Calculating incidence from cross-sectional prevalence
Hallett et al. [59] proposed a method to estimate the age-specific incidence of HIV from cross-sectional prevalence data. They first estimated cohort incidence based on cohort mortality rates of infected individuals as well as survival distributions following HIV infection, then calculated age-specific incidence according to the relationship between these two measures.
In the following, we first summarize the relationship between age-specific incidence and cohort incidence. The age group i is defined as individuals aged from $a_i - r/2$ to $a_i + r/2$. Thus, the age group i is centered at $a_i$ with a width of r years. The total number of individuals and HIV-infected individuals in age group i at time j are denoted by $N_{i,j}$ and $H_{i,j}$, respectively. Then, the prevalence is

$$P_{i,j} = \frac{H_{i,j}}{N_{i,j}}.$$

We assume that cross-sectional prevalence is measured with an interval of T years in such age groups. Thus, age cohorts can be constructed as aged $a_i - r/2$ to $a_i + r/2$ at the start and $a_i - r/2 + T$ to $a_i + r/2 + T$ at the end of each interval. Now the cohort incidence, which is denoted by $\tilde{\lambda}_i$, can be illustrated by the diagonal parallelogram (regions A and B in Fig. 1). The conventional age-specific incidence rate for age-group i, which is denoted by $\lambda_i$, is illustrated by regions B and C in Fig. 1.

Fig. 1 Diagram of age cohort experience of incidence and conventional age-specific incidence

As Fig. 1 shows, region C can be seen as part of the incidence of cohort i − 1. Denote the areas of regions A and B as $S_A$ and $S_B$. The total area of the diagonal parallelogram is $Tr$ (that is, $S_A + S_B$). The fractions contributed by cohort i and cohort i − 1 are $S_B/(S_A + S_B) = 1 - T/(2r)$ and $S_A/(S_A + S_B) = T/(2r)$, respectively. Then, the conventional age-specific incidence rate can be calculated using the following equation:

$$\lambda_i = \left(1 - \frac{T}{2r}\right) \tilde{\lambda}_i + \frac{T}{2r}\, \tilde{\lambda}_{i-1}.$$

The derivation of the above formula assumes that T ≤ r. When T > r, a similar method can be used for deriving a different formula, which is omitted here. In the following, the methodological background of the estimation of the cohort incidence $\tilde{\lambda}_i$ is described. Let $\tilde{\pi}_i$ be the fraction of infected individuals in the ith age-group who survive from the start to the end of the interval, and $\tilde{\mu}_i$ be the mortality rate during this interval for individuals in the ith age-group who are uninfected.
Then, the number of seroconverting individuals in age group i during the interval T can be approximated as and the number of person-years spent by age-group i during the interval T is approximated as Then, the cohort incidence can be derived as where Q j denotes the change in the size of the cohort over the time interval T.
The authors further defined age cohort 0, calculating the prevalence at the start and end of the interval, and subsequently, the cohort incidence for this age cohort. The survival fraction of infected individuals can either be estimated based on the age-specific cohort mortality rates of infected individuals, or estimated using the distribution of survival time after infection, although we omit the details in this review. To use this method, age-specific cross-sectional prevalence data are required. Moreover, the duration between two cross-sectional measurements of prevalence should be short to ensure that the incidence and prevalence do not change significantly during this time interval. Because people with long survival times are preferentially included in cohorts, length-time bias is inevitable with this method. Both methods using serial and cross-sectional prevalence could be affected by the increasing coverage of ART [10,59].
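The conversion from cohort incidence to conventional age-specific incidence described above, in which cohort i contributes a fraction 1 − T/2r and cohort i − 1 contributes T/2r, is simple enough to sketch directly. The function name and the input values below are hypothetical, and the sketch covers only the case T ≤ r treated in the text; entry 0 of the input list stands for age cohort 0.

```python
def age_specific_incidence(cohort_incidence, T, r):
    """Weighted average of adjacent cohort incidences, valid for 0 < T <= r.
    cohort_incidence[0] is age cohort 0; the result covers groups 1, 2, ..."""
    if not 0 < T <= r:
        raise ValueError("the weighting assumes 0 < T <= r")
    w = T / (2 * r)  # fraction contributed by the younger cohort i-1
    return [(1 - w) * cohort_incidence[i] + w * cohort_incidence[i - 1]
            for i in range(1, len(cohort_incidence))]

# Hypothetical cohort incidences per person-year, surveys T = 2 years apart,
# age groups r = 5 years wide
rates = age_specific_incidence([0.00, 0.02, 0.04], T=2, r=5)
```

With these inputs the weight on the younger cohort is 0.2, so each age-specific rate is pulled slightly toward the incidence of the cohort below it.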

Biomarker approach for cross-sectional incidence estimation
It is widely recognized that recent infection rates are difficult to estimate using the back-calculation method owing to the long incubation period of AIDS, while cohort studies have difficulty following a sufficient number of high-risk uninfected persons. To complicate these issues, ART can considerably extend the incubation period, adding complexity to the majority of estimation methods mentioned above. As a possible alternative, a biomarker-based approach using cross-sectional incidence estimation was proposed and has clear advantages in estimation of recent infections [72][73][74][75][76][77][78][79][80][81][82][83][84][85][86]. This approach uses biomarkers from biological samples collected in cross-sectional studies to identify recent HIV infections.
Using diagnostic tests for the p24 antigen during the pre-seroconversion period
In 1995, Brookmeyer et al. [72] proposed a simple modeling approach that uses diagnostic tests for HIV-1 p24 antigen to determine the prevalence of individuals who are p24 antigen-positive among HIV-seronegative individuals. Let $\mu$ be the mean duration of the p24 antigen-positive period before seroconversion, $I$ be the infection risk per unit time for each uninfected individual (that is, the current incidence rate), and $p$ be the expected proportion of individuals who are p24 antigen-positive among individuals whose HIV-antibody test results are negative or indeterminate. Then, $p$ can be approximated as $I\mu$, and $I$ can be estimated as

$$I = \frac{p}{\mu}.$$

Here, $\mu$ is referred to as the window period during which infected individuals have not yet seroconverted, but are still identifiable using biomarker(s). Suppose that $\tilde{p}$ is the number of individuals who are p24 antigen-positive during the window period, and $n$ is the total number of individuals in the cross-sectional survey whose HIV antibody test results are negative or indeterminate (i.e., $p = \tilde{p}/n$). Then, we have

$$\hat{I} = \frac{\tilde{p}}{n\mu}.$$

The confidence interval for the incidence rate can be further estimated by assuming that $\tilde{p}$ follows a Poisson distribution with expectation $nI\mu$.
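As a numerical illustration of this estimator, the sketch below divides the number of p24 antigen-positive individuals by the product of the number tested and the window period, and attaches an approximate confidence interval based on the Poisson assumption. The interval uses a normal approximation to the Poisson count, which is a simplification chosen here and is crude when the count is small; the survey numbers are hypothetical, and the 22.5-day window is the mean duration reported in [72].

```python
import math

def cross_sectional_incidence(n_positive, n_tested, window_years, z=1.96):
    """Incidence estimate n_positive / (n_tested * window) with an approximate
    interval from a normal approximation to the Poisson count (an assumption)."""
    person_time = n_tested * window_years
    estimate = n_positive / person_time
    half_width = z * math.sqrt(n_positive)  # Poisson variance equals its mean
    lower = max(0.0, n_positive - half_width) / person_time
    upper = (n_positive + half_width) / person_time
    return estimate, lower, upper

# Hypothetical survey: 4 p24-positive among 10,000 antibody-negative individuals
window = 22.5 / 365  # window period in years
estimate, lower, upper = cross_sectional_incidence(4, 10_000, window)
```

The returned values are rates per person-year. Because the window is short, even a large survey yields few positive individuals, so the interval is wide relative to the point estimate.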

Using HIV enzyme immunoassay (EIA), antibody avidity index or genetic diversity
For the method proposed by Brookmeyer et al. [72], all individuals whose HIV antibody tests are negative need to undertake diagnostic testing for p24 antigen. Since the duration of the p24 antigen-positive pre-seroconversion period (window period) $\mu$ is very short (mean duration 22.5 days [72]), few individuals are p24 antigen-positive at any given time, and the method is practical only in situations where $I$ (the population incidence rate) is high or $n$ (the number of individuals that can be tested) is large. Janssen et al. [73] developed a new method that employs a testing algorithm combining a sensitive assay (3A11) and a less-sensitive assay (3A11-LS). For a given study, let $T$ be the mean duration between seroconversion for the two assays (i.e., the window period), $n$ be the number of individuals who are 3A11 reactive and 3A11-LS non-reactive, and $N$ be the number of individuals who are HIV-negative or 3A11 reactive/3A11-LS non-reactive. Then, the incidence rate is

$$I = \frac{n}{N\,T}.$$

The window period of the sensitive/less-sensitive assay testing algorithm is longer (i.e., 129 days) [73]. However, the algorithm does not perform well in populations infected with non-B HIV-1 subtypes [74]. Parekh et al. [75] proposed a subtype-independent assay called BED capture EIA (BED-CEIA; named after HIV subtypes B, E, and D), which can be used for detecting recent infections in populations infected by multiple HIV-1 subtypes. The mean BED window period is 156 days. Using the BED assay, Karon et al. [76] further proposed a method that can take into account information on the history of HIV testing. Here, testing history refers both to whether an individual has undertaken HIV testing prior to HIV infection and to the testing frequency. Since the antibody avidity index is always low during early infection, another method for estimation of recent infections based on the avidity index was proposed [77].
Genetic diversity of HIV has also been used as a biomarker to estimate HIV incidence [78][79][80][81], since it changes as the disease progresses. Some published studies [78,79] identified recent HIV-1 infections based on data from traditional or next-generation DNA sequencing. Another research team [80,81] developed a method based on a high-resolution melting (HRM) diversity assay to determine HIV diversity without sequencing.

Multiassay algorithms (MAAs)
The above serological assays have limitations because of their low accuracy in distinguishing recent from chronic infections. Some chronic infections may be misclassified as recent infections, and thus these methods may overestimate HIV incidence [82,83]. Laeyendecker et al. [82,83] demonstrated that factors such as low viral loads, low CD4+ T-cell counts, and > 2 years of ART were associated with misclassification by the BED-CEIA. Avidity assays, which identify recent infections by studying the maturity of the antibody response against HIV, also have difficulties in distinguishing recent infections for HIV-1 incidence estimation [87,88]. Laeyendecker et al. [84] and Brookmeyer [85] each developed an MAA to estimate HIV incidence. Each MAA integrates data from BED-CEIAs, antibody avidity assays, HIV viral loads and CD4+ T-cell counts. These algorithms are described in Fig. 2a and b, respectively.
All biomarker approaches estimate incidence at a time prior to sample collection, and the concept of the shadow describes the lag-time [85,[89][90][91]. Shadow and mean window period are two distinct but important concepts for evaluating the statistical accuracy of current HIV incidence estimates. Estimation approaches with large mean window periods will have smaller standard errors, and those with small shadows can better estimate more recent incidence [85,[89][90][91]. Thus, estimation approaches involving a larger mean window period and a smaller shadow are desirable.
The difference between the MAAs proposed by Laeyendecker et al. [84] and Brookmeyer [85] is that they use different cut-offs for CD4+ T-cell counts, BED-CEIAs, avidity and viral loads. Thus, the two algorithms have different mean window periods (141 days, 95% confidence interval (CI) 94, 150 days vs. 159 days, 95% CI 134, 186 days, respectively) and shadows (128 days vs. 184 days, respectively). As Fig. 2a and b show, both of these algorithms require CD4+ T-cell counts, which are difficult to obtain in some settings. Thus, Laeyendecker et al. [85] developed another MAA using only three biomarkers (BED, avidity, and viral load), as shown in Fig. 2c. This three-biomarker assay does not require CD4+ T-cell count data, and thus is less expensive. However, the mean window period for the three-biomarker assay is 58 days shorter than that of the four-biomarker assay. Hence, to achieve the same standard error of the incidence estimate, the three-biomarker assay requires a sample size about 57% larger.
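The 57% figure can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions that are not stated in the text: that the number of recent cases is approximately Poisson with mean proportional to N × T, so that equal standard errors require equal N × T, and that the four-biomarker window period being compared is the 159-day one (which is the value consistent with the quoted 57%). The function name is hypothetical.

```python
def relative_sample_size(window_ref_days, window_new_days):
    """Sample size needed with the new assay, relative to the reference assay.

    Assumes the count of recent infections is roughly Poisson with mean
    proportional to N * T, so holding the standard error fixed means
    holding N * T fixed (an assumption, ignoring false recency and
    uncertainty in T itself).
    """
    if window_ref_days <= 0 or window_new_days <= 0:
        raise ValueError("window periods must be positive")
    return window_ref_days / window_new_days


# Four-biomarker window assumed to be 159 days; three-biomarker is 58 days shorter.
ratio = relative_sample_size(159, 159 - 58)
print(f"{(ratio - 1) * 100:.0f}% larger")  # prints "57% larger"
```

Under the same assumption, starting from the 141-day window instead would give a noticeably larger ratio, which is why the 159-day value is assumed here.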
Cousins et al. [86] proposed a new MAA in which an HRM diversity assay is used in place of CD4+ T-cell count data, as shown in Fig. 2d. The mean window period and shadow for the HRM-based MAA are 154 days (95% CI 128, 180 days) and 179 days (95% CI 135, 243 days), respectively. The performance of the HRM-based MAA was shown to be nearly identical to that of the MAA including CD4+ T-cell count data.
For all MAAs, HIV incidence is calculated using the following equation: $I = n/(NT)$, where n is the number of MAA-positive subjects, N is the total number of individuals who are HIV seronegative, and T is the mean window period. Several narrative reviews have been published describing incidence estimation approaches that use biomarker data [10,11,74,88,[91][92][93]. Technical challenges of the biomarker approach include misclassification of chronic infections as recent infections and a large variation in testing results between individuals [10], although the accuracy of identifying recent infections has been markedly improved by using MAAs. As reviewed by Murphy et al. [93], biomarker-based incidence estimates can achieve high precision if the false recency ratio is sufficiently close to zero. Moreover, as biomarker approaches have come to combine a variety of biomarkers, identifying recent infections has become increasingly complex over time and may sometimes require specialized equipment. Early treatment and the use of pre- and post-exposure prophylaxis also bring new challenges to biomarker approaches. In recent years, the incidence in some populations or sub-populations has been estimated using biomarker approaches [94][95][96], and the biomarker method has sometimes been combined with other existing modelling approaches [97,98]. Financial constraints and insufficient coordinated action among funding bodies, governments and developers could also hinder the wider adoption of this approach [93], frequently involving problems with purchasing agreements and limited financial support for quality control and training.
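As a minimal sketch of how the MAA estimator might be evaluated in practice, the code below assumes the simple form I = n/(N T), with n the number of MAA-positive subjects, N the number of HIV-seronegative individuals, and T the mean window period. The function names, the 365.25-day year, the survey counts, and the log-scale Poisson interval are all illustrative assumptions and are not taken from the cited papers; in particular, the interval ignores false recency and the uncertainty in T.

```python
import math

DAYS_PER_YEAR = 365.25  # assumption: used only to express T in years


def incidence_rate(n_recent, n_seronegative, window_days):
    """Annual incidence rate I = n / (N * T), with T converted from days to years."""
    if n_seronegative <= 0 or window_days <= 0:
        raise ValueError("N and T must be positive")
    return n_recent / (n_seronegative * window_days / DAYS_PER_YEAR)


def approx_interval(n_recent, n_seronegative, window_days, z=1.96):
    """Rough interval on the log scale, treating n as Poisson and N, T as fixed."""
    if n_recent <= 0:
        raise ValueError("the log-scale interval needs at least one recent case")
    rate = incidence_rate(n_recent, n_seronegative, window_days)
    half_width = z / math.sqrt(n_recent)
    return rate * math.exp(-half_width), rate * math.exp(half_width)


# Hypothetical survey: 20 MAA-positive subjects, 2000 seronegative, T = 159 days.
rate = incidence_rate(20, 2000, 159)
low, high = approx_interval(20, 2000, 159)
print(f"{100 * rate:.2f} per 100 person-years ({100 * low:.2f}, {100 * high:.2f})")
```

Because the interval width depends on n alone under these assumptions, an algorithm with a longer mean window period yields more recent cases from the same survey and hence a narrower interval.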

Discussion
In this review, six major methods for estimating HIV incidence were briefly described. These included the back-calculation method, methods using CD4+ T-cell depletion models, methods using HIV case reporting data, methods based on cohort studies, methods using prevalence data, and biomarker-based approaches. Back-calculation methods can be divided into three subgroups according to the data used: (i) AIDS diagnosis data only, (ii) both HIV and AIDS diagnosis data, and (iii) HIV/AIDS diagnosis data as well as CD4+ T-cell counts at diagnosis. Similarly, methods using prevalence data can be further divided into methods based on serial and cross-sectional data. Our primary foci were the background mechanism of estimation, the required data types, the scope of application, the model formulation, the derivation of the likelihood function, and the advantages and disadvantages of applying each method in practice.
Back-calculation methods are widely used to estimate the incidence and prevalence of HIV in various parts of the world [43,46]. These methods were initially developed using AIDS diagnosis data only, but were later extended to use both HIV and AIDS diagnoses, and then to further account for CD4+ T-cell counts at diagnosis. The back-calculation method has also been modified to include the effect of ART on the distribution of the incubation period. Back-calculation methods have clear advantages and disadvantages compared with other methods [99]. On the one hand, the back-calculation method requires only data from case reporting systems, and does not necessarily require laboratory testing or individual-level data. On the other hand, incidence estimates for the most recent years tend to be unstable, especially where the HIV epidemic has just started, and the accuracy of the estimate is influenced by the distribution of the incubation period (or the progression rate) as well as the testing rate.
Compared with the back-calculation method, statistical estimation using CD4+ T-cell depletion would be easier for non-experts to implement. However, it assumes that the distribution of delays in diagnosis does not change over time. Thus, it may overestimate HIV incidence if HIV testing rates increase over time. As mentioned above, cohort studies face many difficulties and may introduce biases when incidence is directly estimated among high-risk populations with close follow-up. Among methods using prevalence data, those using serial data and those using cross-sectional data are both subject to uncertainties in evaluating HIV prevalence and AIDS deaths. Moreover, both are strongly influenced by the use of ART. Methods using cross-sectional prevalence data further assume that HIV incidence during the time interval between two prevalence surveys is constant, which is only true for very short time intervals. Biomarker-based approaches, which use biomarkers in biological samples collected in cross-sectional studies to identify recent HIV infections, can avoid the difficulties associated with follow-up of high-risk uninfected persons in cohort studies as well as the difficulties in estimating the distribution of long incubation periods. Biomarker-based methods can better estimate more recent HIV incidence compared with the back-calculation method. As laboratory testing techniques progress, MAAs have become available at low cost, which could minimize the effort and cost involved in incidence estimation in the future. Nevertheless, keeping the 'false recency ratio' (FRR) sufficiently low remains a challenge. Biomarker approaches also involve other technical difficulties in quality control, training and evaluation of assays.
The required data are, at the moment, divided into four different categories: (i) epidemiological data including AIDS diagnoses and HIV diagnoses, (ii) CD4+ T-cell counts at diagnosis, (iii) prevalence data, and (iv) biomarker testing data. Prevalence data may be further divided into serial prevalence and cross-sectional prevalence data. It must be noted that definitions of HIV incidence are not uniform across different methods. For the back-calculation method, methods using CD4+ T-cell depletion models, methods using cohort studies and methods using serial prevalence data, HIV incidence is defined as the number of new HIV infections per unit time (year) or the instantaneous incident infections occurring at time t. However, for methods using cross-sectional prevalence data, HIV incidence is defined as the average hazard of new infections occurring during the interval. For the biomarker approach, an HIV incidence rate is estimated, which is defined as the infection risk per unit time for each uninfected individual (except for the method using BED-CEIA [76], which estimates conventional incidence instead). Conventional incidence and incidence rates can be converted into one another as long as the total number of uninfected individuals is known. In addition to the different incidence definitions, there is another difference among these methods. The back-calculation method, methods using CD4+ T-cell depletion models, methods using cohort studies and methods using serial prevalence data can estimate serial incidence (i.e., the incidence year-over-year). However, the method using cross-sectional prevalence data and the biomarker approach estimate the cross-sectional incidence or the HIV incidence at a time prior to collection of samples. Thus, different methods estimate HIV incidence over different time frames.
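The conversion between conventional incidence and the incidence rate noted in this paragraph can be written explicitly. As a sketch, with notation introduced here purely for illustration (it does not appear in the methods reviewed), let i(t) denote the conventional incidence (new infections per unit time), λ(t) the incidence rate per uninfected individual per unit time, and S(t) the number of uninfected individuals at time t:

```latex
i(t) = \lambda(t)\, S(t), \qquad \lambda(t) = \frac{i(t)}{S(t)} .
```

For example, with purely hypothetical numbers, an incidence rate of 0.02 per person-year among 50,000 uninfected individuals corresponds to about 1,000 new infections per year.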

Conclusion
A variety of methods exist to estimate HIV incidence from different data types and with different scopes, and it is difficult to conclude which method performs best. Rather, it should be remembered that HIV incidence estimation targets a quantity that cannot be directly validated, as it is not directly observable in natural settings. Thus, a new method should be regarded as a way to mitigate uncertainty in the estimates of another method, and analyzing HIV data from multiple standpoints and sources is one way to overcome such uncertainty. As the methods for HIV incidence estimation have different scopes and different advantages and disadvantages, we hope that this review will be useful for determining which datasets need to be collected to estimate HIV incidence in a comprehensive manner. Should a surveillance system be improved to collect multiple types of datasets as described above, it would be feasible to cross-validate different methodologies and to see how different methods can complement each other, so that an objective assessment of the HIV/AIDS epidemic can eventually be achieved.