Liquid biopsy for the diagnosis of EBV-positive Burkitt's lymphoma in endemic areas

Burkitt's lymphoma (BL) is common in sub-Saharan Africa, yet diagnosis is often delayed due to limited pathology capacity. Here we evaluated blood-based liquid biopsies from 377 children and young adults with clinically suspected lymphoma at four hospitals in Tanzania and Uganda, assessing diagnostic accuracy and turnaround time (TAT). After extensive pathology capacity building, a gold-standard diagnosis was established using tissue morphology, a limited validated immunohistochemistry panel and independent dual histopathologist review. Using clinical features and circulating tumor DNA markers (MYC mutations, MYC-immunoglobulin translocations and Epstein-Barr virus fragmentomics), we trained six penalized logistic regression models with tenfold crossvalidation (n = 212). The best-performing model was externally validated in a prospective real-world cohort (n = 56). Diagnostic accuracy, yield and TAT were compared head to head between liquid biopsy and the gold standard in 58 participants. The comprehensive model achieved the highest performance (area under the curve (AUC) 0.95, 95% confidence interval (95% CI) 0.901-0.981, sensitivity 0.86, specificity 0.95), confirmed by external validation (AUC 0.98, 95% CI 0.942-1.000). Liquid biopsy was the only diagnostic result available at the multidisciplinary review in 42% of participants and reduced median diagnostic TAT from 46.8 d to 6.5 d (P = 4.42 × 10⁻¹⁰). These findings demonstrate that liquid biopsy enables fast, highly accurate molecular diagnosis of EBV+ BL and may substantially reduce treatment delays in resource-limited settings.
Burkitt’s lymphoma (BL) is a common cancer in children, with an aggressive course, but a high cure rate when treatment is initiated early and correctly1,2. It is associated with Epstein–Barr virus (EBV) infection in 95% of cases in sub-Saharan Africa (SSA) and shows a strong correlation with malaria endemic areas3–5. To differentiate BL from other non-Hodgkin’s lymphomas, standard-of-care (SoC) diagnosis requires the integration of morphology, immunophenotype and demonstration of a MYC rearrangement in the absence of BCL2 and/or BCL6 rearrangements6. However, these methods are not readily available in many low-income countries, where the burden of BL is highest and the availability of timely, reliable pathology services remains an important challenge7–9. In addition to a shortage of histopathologists, access to immunohistochemistry (IHC) is limited and, even when present, often inconsistently performed due to difficulties maintaining a stable supply of reagents and antibodies9–12. As a result, diagnosis typically relies solely on hematoxylin and eosin (H&E) morphology, making precise differential diagnosis impossible. Moreover, delays in diagnosis and treatment initiation remain a critical barrier to care, with a median delay of 91 d for BL from first onset of symptoms and a 51% probability of treatment not starting until 90 d after presentation9.
A three-phase diagnostic scoring system, based on a limited IHC panel, has been proposed to support diagnosis in environments where IHC is either unavailable or inconsistently implemented. This scoring system relies on a minimal set of markers that are more accessible in low-resource laboratories12. This scoring system, designed for settings with limited pathology infrastructure, demonstrated 81% concordance with SoC BL pathology diagnosis, supporting its use as a practical and reliable reference standard in such contexts. New methodologies are continually emerging in the field of liquid biopsies to better define and expand their clinical utility13–16. In principle, targeted sequencing of circulating cell-free DNA (cfDNA) extracted from plasma enables identification of genetic aberrations associated with different tumor types17,18. However, for SSA, the application of cfDNA analysis in cancer care is an under-researched topic. Despite a plethora of literature from high-income countries on the utility of this approach for the detection of minimal residual disease and accelerated approvals by the US Food and Drug Administration of some early cancer detection tests13,19, there is still a paucity of data on its potential applications in SSA.
We previously proposed a liquid biopsy diagnostic model as a minimally invasive yet accurate modality for the detection of BL in children and young adults with suspected lymphoma20. In this preliminary analysis involving 20 children, key features, including the hallmark MYC–immunoglobulin (MYC–Ig) translocation, acquired MYC intron 1 single nucleotide variants (SNVs), tumor fraction of circulating cfDNA (ctDNA) and autosomal DNA entropy, were significantly associated with BL diagnosis20. When integrated into a penalized logistic regression model, these variables demonstrated strong discriminative performance, achieving an area under the curve (AUC) of 0.95 for the detection of BL20. However, the limited targeted panel used detected the MYC–Ig translocation in only 50% of cases of BL, highlighting the need for additional molecular markers to further improve diagnostic accuracy.
Genetic alterations in BL include ID3 mutations, present in up to 76% of cases21,22, which disrupt cell cycle regulation, and TP53 mutations, found in up to 58% of cases22–24, which impair key tumor-suppressor functions. Together SNVs and insertions and/or deletions in MYC, ID3 and TP53 are among the most frequently reported mutated genes in BL and are recognized as key contributors to BL lymphomagenesis22,25. Assessing the median variant allele frequency (VAF) of mutations in MYC, ID3 and TP53 could offer complementary molecular insights and improve the identification of cases of BL beyond reliance on MYC translocation status. In addition, given that >95% of cases of BL are EBV+, it is important to consider whether EBV DNA characteristics, such as fragment size and relative abundance, could meaningfully enhance the performance of diagnostic models. For example, in nasopharyngeal carcinoma, a blood-based screening test combining EBV DNA quantity expressed as the EBV proportion (EBVP) with EBV fragment size ratios, achieved superior diagnostic performance compared to using EBV quantity alone26, underscoring the value of integrating multiple molecular parameters to improve the accuracy and predictive power of diagnostic assays.
Here we present the diagnostic accuracy study of this liquid biopsy approach in a larger cohort of children and young adults with suspected lymphoma incorporating multiple molecular attributes, followed by an independent, prospective, real-world evaluation, to provide further evidence of its clinical utility. Overall diagnostic performance was assessed comprehensively based on accuracy (sensitivity, specificity), diagnostic yield and TAT, compared to the best local gold-standard pathology (GSP) as previously described12.
In phase I of the study, we enrolled a total of 313 children and young adults clinically suspected of having a diagnosis of lymphoma (Extended Data Fig. 1). A tissue biopsy was collected and processed for a histopathology diagnosis with H&E staining in 89.5% (280 out of 313) of these patients. For the remaining 33 participants, 8 died before undergoing a tissue biopsy, 5 declined the procedure or were lost to follow-up and 20 had inadequate biopsy samples for histopathological assessment. Among the 280 samples, only 212 (67.7% of the total; 212 out of 313) achieved a GSP tissue biopsy diagnosis consisting of IHC with a limited panel12 and review by at least 2 pathologists. The remaining 68 samples were not of adequate quality or sufficient quantity to allow IHC. The failure rate of H&E diagnosis was 10.5% (33 out of 313), whereas the failure rate of GSP tissue biopsy was 24.3% (68 out of 280). Liquid biopsy was collected from all the 313 participants. However, sequencing was prioritized for samples with a GSP tissue biopsy because this was the gold-standard comparator in the clinical validation. In total, samples from 247 participants were sequenced. Although a total of 4 runs (covering 24 samples) initially failed during the first attempt, all were successfully repeated and sequencing results were subsequently generated for all samples. For clinical validation of the liquid biopsy test, we analyzed data from the 212 samples in this cohort with a GSP tissue biopsy diagnosis (Extended Data Fig. 1).
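The percentages in the phase I flow above follow directly from the reported counts. The short sketch below recomputes them as a consistency check; the variable names are illustrative and not taken from the study.

```python
# Counts taken from the phase I description above; names are illustrative.
enrolled = 313
he_biopsy = 280        # tissue biopsy processed for an H&E diagnosis
gsp_diagnosis = 212    # reached a GSP diagnosis (limited IHC panel + dual review)

no_he = enrolled - he_biopsy          # 8 died + 5 declined/lost + 20 inadequate
no_ihc = he_biopsy - gsp_diagnosis    # not adequate or sufficient for IHC


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


rates = {
    "H&E obtained": pct(he_biopsy, enrolled),
    "GSP obtained (of enrolled)": pct(gsp_diagnosis, enrolled),
    "H&E failure": pct(no_he, enrolled),
    "GSP failure (of H&E biopsies)": pct(no_ihc, he_biopsy),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```

Note that the two failure rates use different denominators (313 enrolled participants versus 280 processed biopsies), which is why they cannot simply be added.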
For the phase I enrollment, participants had a median age of 13 years (interquartile range (IQR) 9–17 years), with a male-to-female ratio of 2:1. Eleven (3.5%) participants had a serologically confirmed HIV diagnosis and were on antiretroviral treatment, whereas nine (2.9%) had a history of having been diagnosed and treated for tuberculosis. Only 15 (4.8%) participants reported a family history of cancer. A jaw mass was the presenting feature in 25% (78 out of 313) of participants, peripheral lymph node enlargement was present in 68% (213 out of 313) and just over half (53%, 167 out of 313) of the participants had abdominal organomegaly. Participants had a median hemoglobin of 9.81 g dl⁻¹ (IQR 8.15–11.30 g dl⁻¹) and median lactate dehydrogenase (LDH) levels of 679 IU l⁻¹ (IQR 393–1,233 IU l⁻¹). The most common GSP tissue biopsy diagnosis was classic Hodgkin’s lymphoma (HL) (26%, 80 out of 313), followed by BL (25%, 77 out of 313) and diffuse large B cell lymphoma (DLBCL) (14%, 45 out of 313). Other childhood cancers were present in 9.6% (30 out of 313), whereas 9.3% (29 out of 313) of participants had benign conditions. The demographic, clinical and laboratory characteristics of participants for both phases of the study are summarized in Extended Data Tables 1 and 2.
Among the 212 participants for whom a GSP tissue biopsy diagnosis was achieved, 38.2% (81 out of 212) had a diagnosis of BL and 61.8% (131 out of 212) had a non-BL diagnosis. For the non-BL participants, 36.7% (48 out of 131) had HL, 16.8% (22 out of 131) had DLBCL, 18.3% (24 out of 131) had a benign diagnosis (benign tumor, tuberculous adenitis, reactive or EBV+ lymphadenitis) and the remaining 28.2% (37 out of 131) had a mixture of other types of cancer. A younger age, presence of an abdominal or jaw mass and elevated LDH were significantly associated with a diagnosis of BL (P < 0.05 for all; Table 1). For the liquid biopsy variables, diagnosis of BL was associated with higher ctDNA levels, greater median number of mutations in MYC intron 1, increased EBER1, EBER2 and EBNA2 copies per cell, higher EBV DNA fragment size ratio and EBVP, increased EBV DNA fragment size entropy and autosomal fragment size entropy and the presence of a MYC–Ig translocation (P < 0.01 for all; Table 1 and Extended Data Fig. 2).
Additional clinical and molecular characteristics are reported in Supplementary Table 1.

Table 1. Clinical and liquid biopsy characteristics of phase I participants stratified as BL and non-BL (n = 212 children and young adults with clinically suspected lymphoma)

| Characteristic | Overall (a) | BL (a) | Non-BL (a) | P value (b) | Test |
|---|---|---|---|---|---|
| Age, years | 10.3 (6.8, 14.1) | 9.9 (6.8, 12.5) | 10.7 (6.9, 15.7) | 0.0429 | Wilcoxon’s |
| Tumor site | | | | 5.43 × 10⁻¹⁶ | χ² |
| Jaw or abdominal, n (%) | 103 (49) | 68 (84) | 35 (27) | | |
| Other, n (%) | 109 (51) | 13 (16) | 96 (73) | | |
| LDH, IU l⁻¹ | 656 (385, 1,312) | 945 (553, 1,633) | 516 (343, 824) | 7.50 × 10⁻⁶ | Wilcoxon’s |
| CtDNA, hGE l⁻¹ | 153 (0, 664) | 530 (166, 1,579) | 28 (0, 241) | 9.93 × 10⁻¹² | Wilcoxon’s |
| MYC intron 1 mutation count | 1 (0, 13) | 15 (6, 28) | 0 (0, 1) | 2.03 × 10⁻¹⁹ | Wilcoxon’s |
| EBER1 (copies per cell) | 0.1 (0, 3.1) | 3.1 (0.1, 6.6) | 0 (0, 0.3) | 2.00 × 10⁻¹² | Wilcoxon’s |
| EBER2 (copies per cell) | 0.07 (0, 2.47) | 3.06 (0.15, 5.51) | 0 (0, 0.22) | 2.10 × 10⁻¹² | Wilcoxon’s |
| EBNA2 (copies per cell) | 0 (0, 2.6) | 1.2 (0.0, 10.1) | 0 (0, 0.2) | 2.41 × 10⁻⁶ | Wilcoxon’s |
| EBVmax (copies per cell) | 0.1 (0, 5.2) | 5.3 (0.4, 11.3) | 0 (0, 0.4) | 3.07 × 10⁻¹² | Wilcoxon’s |
| EBV fragment size ratio | 0.43 (0, 0.67) | 0.63 (0.40, 0.78) | 0.15 (0, 0.55) | 7.97 × 10⁻⁸ | Wilcoxon’s |
| EBVP | 0 (0, 0.008) | 0.008 (0, 0.019) | 0 (0, 0.001) | 4.71 × 10⁻¹³ | Wilcoxon’s |
| EBV fragment entropy, bits | 4.64 (1.10, 5.22) | 5.25 (4.88, 5.49) | 2.43 (0, 4.90) | 1.054.71 × 10⁻¹³ | Wilcoxon’s |
| MYC–Ig translocation, n (%) | 39 (18) | 39 (48) | 0 (0) | 1.474.71 × 10⁻¹⁸ | χ² |

(a) Median (quartile (Q)1, Q3) for continuous, n (%) for categorical.
(b) Exact P values from two-sided Wilcoxon’s rank-sum or χ² tests.
hGE, haploid genome equivalents.
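Table 1 reports EBV fragment entropy in bits. This section does not spell out how the entropy was computed, so the following is only a minimal sketch under the assumption that it is the Shannon entropy of the empirical fragment-length distribution. The function name and the convention of returning 0 when no fragments are present are illustrative choices.

```python
import math
from collections import Counter


def fragment_size_entropy(fragment_lengths):
    """Shannon entropy (bits) of the empirical fragment-length distribution.

    Assumption: entropy is taken over distinct fragment lengths in base pairs.
    Returns 0.0 when no fragments were observed (illustrative convention).
    """
    total = len(fragment_lengths)
    if total == 0:
        return 0.0
    counts = Counter(fragment_lengths)
    # H = sum_i p_i * log2(1 / p_i), where p_i is the share of fragments of length i.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Four equally frequent lengths carry log2(4) = 2 bits; a single length carries none.
print(fragment_size_entropy([150, 160, 170, 180]))
print(fragment_size_entropy([166] * 10))
```

Under this reading, a broader spread of fragment lengths yields higher entropy, which matches the direction of the BL versus non-BL difference reported in Table 1.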
MYC–Ig translocations were detectable in 48% of confirmed cases of BL using our limited targeted sequencing panel. Most MYC–IGH breakpoints on the MYC locus occurred within the class II (46%, 18 out of 39) and class III (28%, 11 out of 39) regions. On the IGH locus, 51% (20 out of 39) of breakpoints were located within the VDJ junction region (Extended Data Fig. 3). As mutations in MYC have previously been reported as surrogate markers for MYC rearrangement, we also examined their association with BL in our dataset. Although MYC mutations were present in both BL and non-BL samples, the distribution differed markedly between the two groups (Extended Data Figs. 4d,e and 5). BL samples showed a significantly higher mutation burden, reflected by a higher median number of mutations in MYC intron 1 and exon 2 compared to non-BL samples (Supplementary Table 1). A fourfold difference was also observed in the median VAF between the BL and non-BL samples. The VAF distribution by diagnosis for each sample is shown in Extended Data Fig. 4c.
Six models were constructed using different combinations of variables (Table 2): a clinical model comprising clinical parameters only; an EBV quantitative model based on quantitative (q)PCR-derived metrics; an EBV quantitative plus clinical model; an EBV model incorporating additional qualitative EBV features (fragment size ratio and fragment entropy); a liquid biopsy model including cfDNA-derived variables only; and a comprehensive model combining liquid biopsy and clinical variables. Tenfold crossvalidation showed that all models demonstrated good discriminative ability for BL (AUC ≥ 0.8), with the liquid biopsy model and the comprehensive model showing excellent discriminative ability27 with AUC values of 0.92 and 0.94, respectively. The sensitivity of the models ranged from 0.57 (EBV quantitative model) to 0.86 (comprehensive model), whereas specificity ranged from 0.78 (clinical model) to 0.95 (liquid biopsy and comprehensive model). Quantification of EBV in copies per cell, which was previously shown to correlate strongly with EBV DNA qPCR20, performed worse than the liquid biopsy model both when clinical parameters were included (AUC 0.91 versus 0.95) and when they were not (AUC 0.80 versus 0.92). Overall, the comprehensive model demonstrated the best performance, with an AUC of 0.95, sensitivity of 0.86 and specificity of 0.95 (Fig. 1a,b). Among 81 patients with a confirmed local gold-standard diagnosis of BL, 11 (13.6%) were not detected by the liquid biopsy model (false negatives), whereas 70 (86.4%) were correctly identified (true positives). 
False-negative cases were comparable to true positives with respect to age, sex distribution and LDH levels.
However, they were significantly less likely to present with jaw or abdominal tumors (P < 0.05) (Supplementary Table 2).

Table 2. BL diagnostic models

| Model name | Variables |
|---|---|
| Clinical model | Age, sex, duration of presenting illness, site of tumor and LDH |
| EBV quantitative model | EBER1, EBER2, EBNA2, EBVmax (maximum EBV copies per cell) |
| EBV quantitative + clinical model | Age, sex, duration of presenting illness, site of tumor, LDH, EBER1, EBER2, EBNA2, EBVmax |
| EBV model | EBER1, EBER2, EBNA2, EBVmax, EBV size ratio, EBVP, EBV entropy |
| Liquid biopsy model | CtDNA, median VAF, MYC intron 1 mutations, MYC exon 2 mutations, EBER1, EBER2, EBNA2, EBVmax, EBV size ratio, EBVP, EBV entropy, autosomal entropy and MYC–Ig translocation |
| Comprehensive model | Age, sex, duration of presenting illness, site of tumor, LDH, ctDNA, median VAF, MYC intron 1 mutations, MYC exon 2 mutations, EBER1, EBER2, EBNA2, EBVmax, EBV size ratio, EBVP, EBV entropy, autosomal entropy and MYC–Ig translocation |

Fig. 1. Performance and feature importance of diagnostic models for BL. a, ROC curves comparing clinical, EBV-based, liquid biopsy and comprehensive models. AUCs are shown in the figure. b, Sensitivity, specificity and AUC (95% CIs) for each model. P values were derived from two-sided DeLong’s tests comparing each model against the clinical model; no adjustment for multiple comparisons was applied. c, Most influential predictors from the LASSO comprehensive model ranked by absolute coefficient magnitude; coefficients reflect standardized variables (n = 212). autoent, autosomal fragment entropy; e2mut, MYC exon 2 mutations; EBVent, EBV fragment entropy; i1mut, MYC intron 1 mutations; SympMon, duration of presenting symptoms in months.
The relative contribution of each predictor to the comprehensive model is shown in Fig. 1c. Among the included variables, MYC–Ig translocation, EBVP, tumor site and autosomal entropy emerged as the strongest predictors, reflected by their large absolute coefficients in the least absolute shrinkage and selection operator (LASSO) regression analysis. The most important predictors for the other models are shown in Extended Data Fig. 6.
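To illustrate why LASSO coefficients on standardized variables can be read as a ranking of predictors, the sketch below fits an L1-penalized logistic regression by proximal gradient descent on a small synthetic dataset. This is not the authors' pipeline: the data, the penalty strength `lam`, the step size and the feature names are all assumptions made for illustration, and the study's software and tuning are not described in this section.

```python
import math


def standardize(column):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, lam, lr=0.5, n_iter=2000):
    """Minimise mean logistic loss + lam * sum(|w|); the intercept is unpenalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        residual = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        grad_w = [sum(r * row[j] for r, row in zip(residual, X)) / n for j in range(d)]
        grad_b = sum(residual) / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Synthetic data (hypothetical): one informative feature, one uninformative feature.
indices = list(range(40))
signal = standardize([float(i) for i in indices])
noise = standardize([float((i * 7) % 5) for i in indices])
labels = [1 if i >= 20 else 0 for i in indices]
labels[18], labels[22] = 1, 0  # two flipped labels so the classes overlap

X = [list(pair) for pair in zip(signal, noise)]
weights, intercept = fit_l1_logistic(X, labels, lam=0.15)

# Rank predictors by absolute standardized coefficient, as in a Fig. 1c-style plot.
ranking = sorted(zip(["signal", "noise"], weights), key=lambda kv: abs(kv[1]), reverse=True)
print(ranking)
```

Because both features are on the same scale, the penalty shrinks the uninformative coefficient to exactly zero while the informative one survives, which is the behaviour that makes absolute coefficient size a usable importance measure.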
In the phase II arm, we enrolled 64 participants, of whom 1 patient died and another was lost to follow-up before the collection of a tissue or liquid biopsy, leaving 62 patients eligible for analysis. With respect to tissue biopsy, GSP tissue biopsy diagnosis was not obtained for six patients: one patient sample was inadequate for processing and five patient samples were insufficient for IHC. In addition, the liquid biopsy sample for one patient was not sequenced because it was hemolyzed, disqualifying it for downstream analysis. Consequently, a GSP tissue biopsy diagnosis was reached for 56 patient tissue biopsy samples, whereas a diagnostic report was generated for 61 patient liquid biopsy samples (Extended Data Fig. 1). We conducted an external validation of our best-performing model (comprehensive model) on these 56 patients from the phase II cohort after addressing missing values by appropriate imputation methods. The comprehensive model demonstrated excellent discriminative ability, with an AUC of 0.97 (95% CI 0.924–1.000), a sensitivity of 0.94 (95% CI 0.698–0.998) and a specificity of 0.85 (95% CI 0.702–0.943) (Fig. 2a). When the analysis was focused on the 44 patients with a complete dataset, the AUC was 0.97 with a sensitivity of 0.93 (95% CI 0.661–0.998) and a specificity of 0.90 (95% CI 0.735–0.979) (Fig. 2b).

Fig. 2. External validation of the diagnostic model. a, Confusion matrix (left) and ROC curve (right) for external validation using imputed data (n = 56). The heatmap shows agreement between predicted and actual diagnoses (BL versus non-BL) and the ROC curve summarizes overall discriminative performance (AUC with 95% CI indicated). b, Confusion matrix (left) and ROC curve (right) for external validation restricted to complete cases (n = 44). Performance metrics are shown as in a. Sensitivity and specificity are derived from the confusion matrices and the diagonal line in ROC plots indicates no discrimination. Freq., frequency.
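The width of the confidence intervals above reflects the small number of BL cases in the validation cohort. As a rough illustration, the sketch below computes an exact (Clopper–Pearson) binomial interval. Two things are assumptions: the interval method, which is not named in this section, and the counts of 15 correctly classified out of 16 BL cases, which are not stated in the text and are only inferred because they reproduce the reported sensitivity of 0.94.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing, tol=1e-10):
    """Solve func(p) = target on (0, 1) for a monotone func by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid  # the root lies above mid
        else:
            hi = mid  # the root lies below mid
    return (lo + hi) / 2


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    if successes == 0:
        lower = 0.0
    else:
        # P(X >= successes | p) increases with p; the lower bound makes it alpha/2.
        lower = _bisect(lambda p: 1 - binom_cdf(successes - 1, n, p), alpha / 2, True)
    if successes == n:
        upper = 1.0
    else:
        # P(X <= successes | p) decreases with p; the upper bound makes it alpha/2.
        upper = _bisect(lambda p: binom_cdf(successes, n, p), alpha / 2, False)
    return lower, upper


# Assumed counts (not stated in the text): 15 of 16 BL cases detected.
sens_lower, sens_upper = clopper_pearson(15, 16)
print(f"sensitivity {15 / 16:.2f}, 95% CI {sens_lower:.3f} to {sens_upper:.3f}")
```

With these assumed counts the interval comes out close to the reported 0.698–0.998, showing how a single missed case in a group of this size moves the lower bound a long way from the point estimate.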
Diagnosis of the phase II samples was made following the multidisciplinary team (MDT) decision tree (Fig. 3a). Liquid biopsy was the only test available for making a diagnosis in 42.6% (26 out of 61) of cases (Fig. 3b) at the first MDT meeting. Among all the cases that were finally diagnosed as BL (n = 15), in the first MDT meeting, 1 (6.7%) was diagnosed using tissue biopsy alone, 6 (40%) were diagnosed with both liquid and tissue biopsy available, whereas 8 (53.3%) were diagnosed using liquid biopsy alone. One child with BL was not diagnosed in the first MDT meeting, but was confirmed in subsequent MDT meetings (Fig. 3b). Six samples were diagnosed with tissue biopsy alone at the first MDT meeting, out of which five were non-BL. This demonstrates that liquid biopsy enabled diagnosis in an additional 53.3% (8 out of 15) of cases of BL at the first MDT meeting, increasing the overall diagnostic yield of BL to 93.3% (14 out of 15), underscoring its potential as a complementary and timely diagnostic tool, especially where tissue biopsy access is limited or delayed. The percentage of decisions taken via each pathway is shown in Fig. 3a. In 16.4% (10 out of 61) of cases, neither the liquid biopsy nor the tissue biopsy was available at the first MDT meeting and diagnosis was therefore made in the subsequent MDT meeting (Fig. 3a).

Fig. 3. Diagnostic pathways and test availability at the first MDT meeting. a, Diagnostic and treatment decisions for children and young adults with clinically suspected lymphoma based on tissue biopsy (green) or liquid biopsy (pink) at the first MDT meeting. In 16% of cases, diagnosis was established at a subsequent MDT meeting, not shown. b, Sankey diagram illustrating the availability of liquid biopsy (LB) and tissue biopsy (TB) results at the first MDT meeting and the corresponding diagnostic outcomes (n = 61). Panel a created in BioRender; Claudius, C. https://biorender.com/rmkrptd (2026).
TAT was assessed in a head-to-head comparison for 58 samples from children and young adults with suspected lymphoma who had complete data for both the liquid and the tissue biopsies. The median time from patient presentation at the cancer center to tissue biopsy sample collection was 7.3 d (IQR 2.6–13.6 d). The median time from sample collection to its receipt at the histopathology lab was 1 d (IQR 0.5–2.2 d), whereas sample processing to the generation of an H&E slide ready for interpretation took 3.5 d (IQR 1.5–5.6 d). The median time for H&E slide reporting was 3.7 d (IQR 1.7–8.6 d), whereas tissue processing for GSP tissue biopsy diagnosis had a median TAT of 43 d (IQR 6.6–207.4 d) (Fig. 4a).

Fig. 4. Time from sample collection to diagnosis for tissue and liquid biopsies. a, TAT for tissue biopsy from the sample collection to diagnostic reporting, shown per patient and stratified by a processing step. b, TAT for liquid biopsy from sample receipt to diagnostic reporting, shown per patient and stratified by a processing step. c, Comparison of time to diagnosis between liquid biopsy and tissue biopsy. The center lines denote medians, the boxes indicate IQRs, the whiskers extend to values within 1.5× the IQR and the points represent individual samples. Groups were compared using a two-sided, paired Wilcoxon’s signed-rank test; no adjustment for multiple comparisons was applied.
TAT for liquid biopsy processing was assessed from the time that the sample arrived at the sequencing laboratory and included all samples that arrived in the laboratory within 72 h of collection28. The median time from sample receipt to library preparation was 2.6 d (IQR 1.8–3.6 d). Sequencing had a median duration of 1 d (IQR 1.0–1.1 d), whereas the time from sequencing completion to bioinformatics analysis and diagnostic report generation had a median of 2 d (IQR 1.0–3.9 d) (Fig. 4b).
We compared the time from sample receipt in the laboratory to generation of a diagnostic report between liquid biopsy and tissue biopsy in a paired analysis. The median TAT for liquid biopsy (6.5 d, IQR 5.1–10.5 d) was 40.3 d shorter than the median TAT for tissue biopsy (46.8 d, IQR 20.1–192.4 d) (P = 4.42 × 10^−10) (Fig. 4c).
For the phase I enrollment, participants had a median age of 13 years (interquartile range (IQR) 9–17 years), with a male-to-female ratio of 2:1. Eleven (3.5%) participants had a serologically confirmed HIV diagnosis and were on antiretroviral treatment, whereas nine (2.9%) had a history of having been diagnosed and treated for tuberculosis. Only 15 (4.8%) participants reported a positive family history of cancer. A jaw mass was the presenting feature in 25% (78 out of 313) of participants, peripheral lymph node enlargement was present in 68% (213 out of 313) and just over half (53%, 167 out of 313) of the participants had abdominal organomegaly. Participants had a median hemoglobin of 9.81 g dl−1 (IQR 8.15–11.30 g dl−1) and median lactate dehydrogenase (LDH) levels of 679 IU l−1 (IQR 393–1,233 IU l−1). The most common GSP tissue biopsy diagnosis was classic Hodgkin’s lymphoma (HL) (26%, 80 out of 313), followed by BL (25%, 77 out of 313) and diffuse large B cell lymphoma (DLBCL) (14%, 45 out of 313). Other childhood cancers were present in 9.6% (30 out of 313), whereas 9.3% (29 out of 313) of participants had benign conditions. The demographic, clinical and laboratory characteristics of participants for both phases of the study are summarized in Extended Data Tables 1 and 2.
Among the 212 participants for whom a GSP tissue biopsy diagnosis was achieved, 38.2% (81 out of 212) had a diagnosis of BL and 61.8% (131 out of 212) had a non-BL diagnosis. For the non-BL participants, 36.7% (48 out of 131) had HL, 16.8% (22 out of 131) had DLBCL, 18.3% (24 out of 131) had a benign diagnosis (benign tumor, tuberculous adenitis, reactive or EBV+ lymphadenitis) and the remaining 28.2% (37 out of 131) had a mixture of other types of cancer. A younger age, presence of an abdominal or jaw mass and elevated LDH were significantly associated with a diagnosis of BL (P < 0.05 for all; Table 1). For the liquid biopsy variables, diagnosis of BL was associated with higher ctDNA levels, a greater median number of mutations in MYC intron 1, increased EBER1, EBER2 and EBNA2 copies per cell, higher EBV DNA fragment size ratio and EBVP, increased EBV DNA fragment size entropy and autosomal fragment size entropy and the presence of a MYC–Ig translocation (P < 0.01 for all; Table 1 and Extended Data Fig. 2).
Six models were constructed using different combinations of variables (Table 2): a clinical model comprising clinical parameters only; an EBV quantitative model based on quantitative (q)PCR-derived metrics; an EBV quantitative plus clinical model; an EBV model incorporating additional qualitative EBV features (fragment size ratio and fragment entropy); a liquid biopsy model including cfDNA-derived variables only; and a comprehensive model combining liquid biopsy and clinical variables. Tenfold crossvalidation showed that all models demonstrated good discriminative ability for BL (AUC ≥ 0.8), with the liquid biopsy model and the comprehensive model showing excellent discriminative ability27 with AUC values of 0.92 and 0.94, respectively. The sensitivity of the models ranged from 0.57 (EBV quantitative model) to 0.86 (comprehensive model), whereas specificity ranged from 0.78 (clinical model) to 0.95 (liquid biopsy and comprehensive models). Quantification of EBV in copies per cell, which was previously shown to correlate strongly with EBV DNA qPCR20, performed worse than the liquid biopsy model both when clinical parameters were included (AUC 0.91 versus 0.95) and when they were not (AUC 0.80 versus 0.92). Overall, the comprehensive model demonstrated the best performance, with an AUC of 0.95, sensitivity of 0.86 and specificity of 0.95 (Fig. 1a,b). Among 81 patients with a confirmed local gold-standard diagnosis of BL, 11 (13.6%) were not detected by the liquid biopsy model (false negatives), whereas 70 (86.4%) were correctly identified (true positives). False-negative cases were comparable to true positives with respect to age, sex distribution and LDH levels.
In the phase II arm, we enrolled 64 participants, of whom 1 patient died and another was lost to follow-up before the collection of a tissue or liquid biopsy, leaving 62 patients eligible for analysis. With respect to tissue biopsy, GSP tissue biopsy diagnosis was not obtained for six patients: one patient sample was inadequate for processing and five patient samples were insufficient for IHC. In addition, the liquid biopsy sample for one patient was not sequenced because it hemolyzed, disqualifying it for downstream analysis. Consequently, a GSP tissue biopsy diagnosis was reached for 56 patient tissue biopsy samples, whereas a diagnostic report was generated for 61 patient liquid biopsy samples (Extended Data Fig. 1). We conducted an external validation of our best-performing model (comprehensive model) on these 56 patients from the phase II cohort after addressing missing values by appropriate imputation methods. The comprehensive model demonstrated excellent discriminative ability, with an AUC of 0.97 (95% CI 0.924–1.000), a sensitivity of 0.94 (95% CI 0.698–0.998) and a specificity of 0.85 (95% CI 0.702–0.943) (Fig. 2a). When the analysis was focused on the 44 patients with a complete dataset, the AUC was 0.97 with a sensitivity of 0.93 (95% CI 0.661–0.998) and a specificity of 0.90 (95% CI 0.735–0.979) (Fig. 2b).
Fig. 2. External validation of the diagnostic model. a, Confusion matrix (left) and ROC curve (right) for external validation using imputed data (n = 56). The heatmap shows agreement between predicted and actual diagnoses (BL versus non-BL) and the ROC curve summarizes overall discriminative performance (AUC with 95% CI indicated). b, Confusion matrix (left) and ROC curve (right) for external validation restricted to complete cases (n = 44). Performance metrics are shown as in a. Sensitivity and specificity are derived from the confusion matrices and the diagonal line in ROC plots indicates no discrimination. Freq., frequency.
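The text does not name the method used for the confidence intervals around sensitivity and specificity, but the reported values are consistent with exact (Clopper-Pearson) binomial intervals. The sketch below uses only the Python standard library and recomputes such an interval under an assumption: the underlying counts (15 of 16 BL cases detected, 34 of 40 non-BL cases correctly excluded) are inferred from the reported proportions and n = 56, not stated in the text. The function names are illustrative.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo: float = 0.0, hi: float = 1.0, iters: int = 80) -> float:
    """Root of a function f that increases on [lo, hi], with f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for the proportion x / n."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# ASSUMED counts, inferred from the reported proportions (not given in the text).
sens_lo, sens_hi = clopper_pearson(15, 16)   # sensitivity 15/16
spec_lo, spec_hi = clopper_pearson(34, 40)   # specificity 34/40

print(f"sensitivity {15 / 16:.2f} (95% CI {sens_lo:.3f}-{sens_hi:.3f})")
print(f"specificity {34 / 40:.2f} (95% CI {spec_lo:.3f}-{spec_hi:.3f})")
```

Under these assumed counts the sensitivity interval comes out close to the reported 0.698–0.998. The width of that interval reflects how few BL cases the validation cohort contained.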
Diagnosis of the phase II samples was made following the multidisciplinary team (MDT) decision tree (Fig. 3a). Liquid biopsy was the only test available for making a diagnosis in 42.6% (26 out of 61) of cases (Fig. 3b) at the first MDT meeting. Among all the cases that were finally diagnosed as BL (n = 15), in the first MDT meeting, 1 (6.7%) was diagnosed using tissue biopsy alone, 6 (40%) were diagnosed with both liquid and tissue biopsy available, whereas 8 (53.3%) were diagnosed using liquid biopsy alone. One child with BL was not diagnosed in the first MDT meeting, but was confirmed in subsequent MDT meetings (Fig. 3b). Six samples were diagnosed with tissue biopsy alone at the first MDT meeting, out of which five were non-BL. This demonstrates that liquid biopsy enabled diagnosis in an additional 53.3% (8 out of 15) of cases of BL at the first MDT meeting, increasing the overall diagnostic yield of BL to 93.3% (14 out of 15), underscoring its potential as a complementary and timely diagnostic tool, especially where tissue biopsy access is limited or delayed. The percentages of decisions taken via each pathway are shown in Fig. 3a. In 16.4% (10 out of 61) of cases, neither the liquid biopsy nor the tissue biopsy was available at the first MDT meeting and diagnosis was therefore made in the subsequent MDT meeting (Fig. 3a).
Fig. 3. Diagnostic pathways and test availability at the first MDT meeting. a, Diagnostic and treatment decisions for children and young adults with clinically suspected lymphoma based on tissue biopsy (green) or liquid biopsy (pink) at the first MDT meeting. In 16% of cases, diagnosis was established at a subsequent MDT meeting, not shown. b, Sankey diagram illustrating the availability of liquid biopsy (LB) and tissue biopsy (TB) results at the first MDT meeting and the corresponding diagnostic outcomes (n = 61). Panel a created in BioRender; Claudius, C. https://biorender.com/rmkrptd (2026).
TAT was assessed in a head-to-head comparison for 58 samples from children and young adults with suspected lymphoma who had complete data for both the liquid and the tissue biopsies. The median time from patient presentation at the cancer center to tissue biopsy sample collection was 7.3 d (IQR 2.6–13.6 d). The median time from sample collection to its receipt at the histopathology lab was 1 d (IQR 0.5–2.2 d), whereas sample processing to the generation of an H&E slide ready for interpretation took 3.5 d (IQR 1.5–5.6 d). The median time for H&E slide reporting was 3.7 d (IQR 1.7–8.6 d), whereas tissue processing for GSP tissue biopsy diagnosis had a median TAT of 43 d (IQR 6.6–207.4 d) (Fig. 4a).
Fig. 4. Time from sample collection to diagnosis for tissue and liquid biopsies. a, TAT for tissue biopsy from the sample collection to diagnostic reporting, shown per patient and stratified by a processing step. b, TAT for liquid biopsy from sample receipt to diagnostic reporting, shown per patient and stratified by a processing step. c, Comparison of time to diagnosis between liquid biopsy and tissue biopsy. The center lines denote medians, the boxes indicate IQRs, the whiskers extend to values within 1.5× the IQR and the points represent individual samples. Groups were compared using a two-sided, paired Wilcoxon’s signed-rank test; no adjustment for multiple comparisons was applied.
Liquid biopsies have gained notable attention for their clinical potential across various cancer types29, prompting efforts to harness this technology to improve existing early and precise cancer detection15. However, integration into routine diagnostic practice remains limited and further evidence is needed to support clinical implementation, particularly in low-resource settings29–31.
The aggressive nature of BL necessitates early and accurate diagnosis to enable timely treatment and maximize clinical benefit32. This study evaluates the performance of a minimally invasive blood-based liquid biopsy approach designed for rapid BL diagnosis in children and young people presenting with clinical features of lymphoma, classifying cases as either BL or non-BL. Diagnosis in this study relied on a limited IHC panel optimized for resource-limited settings, previously shown to have 81% concordance with SoC histopathology12. Although the use of a simplified panel introduces diagnostic limitations, it reflects the realities of regions where BL is most prevalent and where limited access to comprehensive histopathology compels treatment initiation based on clinical features alone33. Quantification of plasma EBV DNA has been shown to distinguish those with BL from healthy individuals, with sensitivity of 88% and specificity of 100%34. In the present study, however, the EBV quantitative model achieved lower sensitivity (57%) and specificity (95%), likely reflecting the inclusion of other EBV-associated malignancies such as HL and DLBCL rather than healthy controls. Similarly, a recent, large, multi-country study reported high diagnostic accuracy for EBV DNA quantification using digital droplet PCR, but again relied on comparisons with healthy population controls, limiting applicability to real-world clinical settings where multiple EBV-associated malignancies coexist35. Together, these findings suggest that, although EBV quantification is a useful screening tool, it lacks sufficient specificity for diagnostic confirmation.
Incorporating additional EBV attributes alongside mutation-based and translocation-based markers therefore enhances diagnostic specificity and biological interpretability consistent with observations in EBV-associated nasopharyngeal carcinoma26.
Although acute EBV infection can be associated with elevated EBV copy numbers, integration of both quantitative and qualitative EBV parameters enabled reliable distinction between BL and reactive cases (Supplementary Fig. 1). Improved performance was achieved by incorporating additional liquid biopsy parameters, including MYC–Ig translocation, MYC intron 1 and exon 2 mutation counts and autosomal fragment entropy, resulting in 73% sensitivity and 95% specificity. Combining liquid biopsy features with clinical variables yielded the highest diagnostic performance (AUC of 0.98, sensitivity 94% and specificity 93%). Notably, model performance was primarily driven by molecular features, rather than clinical parameters (Extended Data Fig. 6), underscoring the central role of liquid biopsy markers in BL discrimination.
Due to the rational panel design, MYC–IGH translocations were detectable in 48% of cases of BL. Although translocation-positive cases showed marginally higher sequencing depth, median coverage across both groups exceeded 1,000× (Supplementary Table 3), making insufficient depth an unlikely explanation for undetected events. Instead, this likely reflects the heterogeneous distribution of MYC breakpoints in BL, many of which lie outside the targeted region. Expanding coverage could improve sensitivity but would increase cost and complexity, conflicting with the goal of a scalable, cost-efficient assay for use in low-resource settings. Distinguishing true somatic mutations from germline SNPs remains challenging in tumor-only sequencing, particularly in African populations that are underrepresented in reference genomes and variant databases. Although matched germline DNA was unavailable, stringent filtering was applied to mitigate this limitation. Nevertheless, some intronic SNPs may have escaped filtering, highlighting the need for population-specific genomic reference datasets to support accurate interpretation of molecular findings in underrepresented African populations.
For non-BL cases, standard pathology remains essential for accurately diagnosing other lymphoma subtypes, identifying nonlymphoid malignancies or confirming benign lesions. Although none of the acute EBV infection cases was misclassified as BL, underscoring the model’s ability to distinguish malignant from reactive EBV-driven states, the number of cases of acute infection was limited. Evaluation in a larger and more diverse cohort, incorporating additional acute EBV infection samples, would further substantiate the model’s specificity and clinical applicability. Our study demonstrates the feasibility of integrating liquid biopsy into routine clinical workflow in underresourced African hospitals using an MDT approach. Incorporation of liquid biopsy increased diagnostic yield for BL and shortened the time to diagnosis, with 93.3% of cases of BL diagnosed within 1 week, compared with 40% using tissue biopsy alone.
Although liquid biopsy achieved significantly shorter turnaround times than GSP tissue biopsy, consistent with studies from high-income countries36,37, several limitations merit consideration. First, future service configuration (centralized or decentralized), automation and health system integration remain uncertain and may affect turnaround times at scale. Second, study conditions likely benefitted from enhanced oversight and quality control. The 10% (6 out of 62) tissue biopsy failure rate observed in phase II may underestimate rates in routine care, particularly in resource-limited settings, potentially limiting generalizability38. The shorter TAT of liquid biopsy compared with IHC reflects differences in workflow, staffing and infrastructure. In Tanzania, existing molecular skills among laboratory scientists enabled rapid training for cfDNA workflows, supported by automated bioinformatics and clinician-led interpretation using a predefined diagnostic algorithm.
In contrast, the longer TAT observed for tissue-based diagnosis reflects constrained histopathology capacity, with few pathologists and prolonged training requirements. Diagnostic delays are largely procedure related, driven by surgical biopsy processing time and delayed IHC interpretation due to heavy workload and occasional repeat biopsies9. Implementing cfDNA diagnostics in SSA will require alignment with existing clinical pathways. CfDNA testing could complement limited histopathology by providing rapid, minimally invasive diagnosis deployable from peripheral hospitals with centralized sequencing. Leveraging established molecular networks for HIV or tuberculosis testing may facilitate integration. Implementation research, including pragmatic pilots and health-economic analyses, will be essential to address scalability, cost-effectiveness and sustainable adoption within national cancer control programs. Ongoing studies are extending this cfDNA approach to the quantitative detection of minimal residual disease in children receiving therapy for EBV+ BL, to determine its utility in predicting treatment response and clinical outcomes.
Finally, there are considerable costs associated with next-generation sequencing (NGS) technology. In a previously published health-economic microcosting analysis conducted by our group, the average per-patient cost of histopathology was estimated at US$185.01, driven primarily by staining (US$87.20, largely IHC consumables) and the biopsy procedure (US$72.29). In the same analysis, liquid biopsy cost US$710.15 per patient at current throughput, with sequencing reagents representing the largest component (US$175.48 per sample)39,40. Although liquid biopsy is more expensive, improved clinical outcomes could justify the higher upfront cost. Ongoing modeling work by our group is evaluating the cost-effectiveness of earlier diagnosis using liquid biopsy, with preliminary findings supporting further evaluation in implementation studies. Adoption of additional sequencing tests for other indications across infectious and noncommunicable diseases using high-throughput sequencing platforms is likely to lead to substantial cost reduction, similar to that observed with HIV viral load or human papillomavirus testing41,42.
Ultimately this might even allow the expansion of liquid biopsy testing for BL into primary care settings in endemic regions.
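To put the per-patient costs quoted above side by side, the short Python sketch below derives the cost ratio, the absolute difference and the share of each named component. The dollar figures are those given in the text; the variable names are illustrative, and the calculation is simple arithmetic rather than part of the cited microcosting analysis.

```python
# Per-patient costs in US$, as quoted in the text.
histopathology_total = 185.01
staining = 87.20              # largely IHC consumables
biopsy_procedure = 72.29
liquid_biopsy_total = 710.15
sequencing_reagents = 175.48  # per sample

# How many times more expensive liquid biopsy is at current throughput.
cost_ratio = liquid_biopsy_total / histopathology_total
# Additional cost per patient of choosing liquid biopsy.
extra_cost = liquid_biopsy_total - histopathology_total

# Share of each quoted component within its own pathway.
staining_share = staining / histopathology_total
procedure_share = biopsy_procedure / histopathology_total
reagent_share = sequencing_reagents / liquid_biopsy_total

print(f"ratio: {cost_ratio:.2f}x, difference: US${extra_cost:.2f}")
print(f"staining {staining_share:.1%}, procedure {procedure_share:.1%} of histopathology")
print(f"sequencing reagents {reagent_share:.1%} of liquid biopsy")
```

At current throughput, liquid biopsy therefore costs just under four times as much per patient as histopathology. Sequencing reagents make up roughly a quarter of the liquid biopsy total.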
This study was approved by the Oxford Tropical Research Ethics Committee (OxTREC: no. 15-19), the National Institute of Medical Research in Tanzania (no. NIMR/HQ/R.8a/Vol.IX/3408) and the Uganda National Council of Science and Technology (UNCST: no. HS529ES), and was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from parents or legal guardians of all participating children, with age-appropriate assent obtained where applicable. Participants did not receive financial compensation for study participation. This study used data from the prospective, multicenter, observational, Aggressive Infection-Related East Africa Lymphoma (AI-REAL) study43, which was conducted in two phases. In phase I, we trained different liquid biopsy models and assessed their performance for the diagnosis of EBV+ BL using histopathology with a limited IHC panel and review by a minimum of 2 histopathologists as the GSP in 212 participants with clinically suspected lymphoma. In phase II, we performed a head-to-head comparison between the liquid biopsy and gold-standard tissue biopsy to assess the TAT for the two methods (n = 58 participants with clinically suspected lymphoma) and to conduct an additional independent external validation for the best-performing model (n = 56 participants with clinically suspected lymphoma).
The study enrolled all children and young adults, from the age of 3 years to 25 years, clinically suspected of having lymphoma and consenting to participate in the study from August 2019 to July 2023 from four hospitals in Tanzania and Uganda, that is, the Muhimbili National Hospital, Kilimanjaro Christian Medical Centre, Bugando Medical Centre and St. Mary’s Hospital Lacor. Participants with suspected lymphoma who had previously received chemotherapy, immunotherapy or any investigational agent for lymphoma, or participated in a clinical trial on lymphoma, were excluded. Phase I of the study was conducted from August 2019 to July 2022 and enrolled a total of 313 participants, whereas phase II was conducted from August 2022 to July 2023 and enrolled a total of 64 participants (Extended Data Fig. 1).
Tissue biopsies were obtained by a qualified surgeon through excisional biopsy and immediately fixed in 10% neutral buffered formalin. The formalin-fixed specimens were then transported to the local pathology laboratory for further processing and diagnostic evaluation. In parallel, liquid biopsy samples were collected via venipuncture into Roche Cell-Free DNA Collection Tubes, following standard phlebotomy procedures. The tubes were transported to the molecular laboratory, where plasma was separated by centrifugation and stored at −80 °C until further analysis. Extended Data Fig. 7 provides a summarized overview of the laboratory procedures performed for both the liquid and the tissue biopsies. All measurements were taken from distinct, independent samples.
Before study initiation, we performed a comprehensive training and infrastructure needs assessment at the three local study pathology laboratories. To ensure optimal histopathology assessment for the study and local capacity building, technical staff and the four clinical study pathologists underwent training, followed by competency assessments conducted by expert hematopathologists in tissue fixation, embedding, sectioning and IHC staining techniques. The training included hands-on sessions and on-site mentoring over a period of 1 year. An automated IHC staining platform (Ventana Benchmark GX) was installed and maintained at all participating laboratories for the duration of the study and 3-monthly internal quality control procedures were implemented, including the use of known positive controls. Tissue samples were fixed in 10% neutral buffered formalin and embedded in paraffin according to standard protocols. H&E staining was performed on all samples to evaluate morphology and IHC was done using the previously described limited IHC antibody panel for BL.
This diagnostic approach utilizes a three-phase scoring system. The first phase combines typical BL morphology with immunostains BCL2, CD10 and CD20 to establish a diagnosis. Cases that remain inconclusive proceed to a second phase with CD38, CD44 and Ki67, whereas unresolved cases undergo a third phase with FISH analysis for MYC–Ig translocation. In our study, we used the first-phase immunostains (BCL2, CD10 and CD20) to support the diagnosis of BL, because this phase has previously been shown to achieve 81% concordance with the World Health Organization SoC pathology and is well suited for limited-resource settings12,44. During the study, we procured all relevant antibodies and ensured that these remained available. Finally, we introduced digital whole-slide imaging to enable slide review by a minimum of two local study histopathologists and to seek external third opinions as required45. This approach was adopted as the gold standard for evaluating the performance of the liquid biopsy methods.
CfDNA extraction, library preparation and targeted sequencing were performed at two dedicated study laboratories: the Haematology Clinical Research Laboratory at the Muhimbili University of Health and Allied Sciences in Tanzania and the Central Public Health Laboratory. Approximately 8 ml of whole blood was collected in Roche circulating cfDNA (ccfDNA) tubes or PAXgene blood ccfDNA tubes. Plasma was separated by centrifugation at 1,600g for 15 min, followed by a second centrifugation of the separated plasma at 4,500g for 15 min. Separated plasma was stored in 1.5-ml Eppendorf tubes at −80 °C.
CfDNA was extracted from plasma using the QIAamp Circulating Nucleic Acid Kit (QIAGEN, cat. nos. 51304 and 51306) according to the manufacturer’s instructions. Briefly, the procedure involves four main steps. First, samples are lysed to inactivate DNases and RNases and allow complete release of nucleic acids from bound proteins, lipids and vesicles. Second, the lysates are transferred onto a QIAamp Mini column and circulating nucleic acids are adsorbed from a large volume onto the small silica membrane as the lysate is drawn through by vacuum pressure. Third, while the nucleic acids remain bound to the membrane, contaminants are washed away in a three-step wash process. The final step involves elution of highly purified nucleic acid. The quantification of cfDNA was done with a Qubit fluorimeter 3.0 using the Qubit high-sensitivity assay (Thermo Fisher Scientific).
Libraries were constructed using 50 ng (in 30 μl) of extracted cfDNA with the ThruPLEX Tag-Seq HV kit according to the manufacturer’s protocol. The process involved addition of ThruPLEX HV unique dual-indexed PCR primers to aid with sample tracking. Seven amplification cycles were performed to ensure a yield of >500 ng depending on the concentration of the input DNA. This was followed by a purification step through magnetic separation using AMPure XP beads (Beckman Coulter). The final library was quantified and validated using the Qubit HS kit and the Bioanalyzer High Sensitivity DNA kit, according to the manufacturer’s instructions. Library hybridization and capture were done using the xGen hybridization capture kit according to the manufacturer’s instructions.
The procedure used a custom-made NGS panel targeting mutational hotspots in 17 EBL-related and HL-related genes (MYC, IGH, IGK, IGL, ID3, TP53, TNFAIP3, NFKBIE, SOCS1, EP300, BTK, STAT6, CSF2RB, ITPKB, XPO1 and B2M), selected based on results from previous studies, together with EBV genes expressed in latently infected cells: EBER1, EBER2 and EBNA2. The final panel manufactured by Integrated DNA Technologies (IDT) consisted of 731 probes at a size of 148 kb to permit compatibility with either the iSeq100 or the MiSeq sequencing platforms. The final panel design with the genomic coordinates of our genes of interest is listed in Supplementary Table 4. Target enrichment was performed using a hybrid capture approach and sequencing was conducted twice weekly on the MiSeq platform using the MiSeq reagent kit v2 (300 cycles) at a loading concentration of 10 pM, with 6 samples per run.
Structural variants were detected by split-read analysis using GRIDSS and SNVs were called using VarScan 2 (ref. 46). The following filtering strategies were used to call somatic variants:
Panel of normal plasmas: we excluded variants that occurred in ≥2 samples in a separate cohort of 12 healthy controls.
Population databases: all variants were annotated against gnomAD and those with a population minor allele frequency (MAF) > 1% were excluded.
VAF: to minimize the possibility of including germline variants, we excluded all variants with VAF ≥ 40%, in addition to the initial exclusion of low-level artefacts with VAF < 1%.
Sequencing depth and read support: only variants with a minimum sequencing depth of 500× and at least 5 mutant reads were retained for downstream analysis.
Copy number alterations (CNAs): we excluded variants that showed evidence of being affected by CNAs. This was done through an internal normalization approach in which the expected relationship between VAF and log₁₀-transformed ctDNA was modeled using ID3 mutations (rarely affected by CNAs), and the resulting tolerance limits (±2× median absolute deviation of residuals) were applied to MYC and TP53 variants to flag likely CNA-affected outliers (Extended Data Fig. 4a,b and Supplementary Table 6).
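The threshold-based filters above can be sketched as a single predicate. This is a minimal illustration, not the study pipeline: the `Variant` record and its field names are hypothetical, the thresholds are taken from the text, and the CNA-based exclusion is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one called variant (field names are assumptions)."""
    gene: str
    depth: int         # total reads covering the site
    alt_reads: int     # reads supporting the variant allele
    gnomad_af: float   # population allele frequency (0.0 if absent from gnomAD)
    pon_count: int     # healthy-control plasmas (out of 12) carrying the variant

    @property
    def vaf(self) -> float:
        # Variant allele fraction; 0.0 for uncovered sites
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_filters(v: Variant) -> bool:
    """Apply the panel-of-normals, population, VAF and depth filters in turn."""
    if v.pon_count >= 2:                  # seen in >=2 healthy controls
        return False
    if v.gnomad_af > 0.01:                # population frequency > 1%
        return False
    if v.vaf < 0.01 or v.vaf >= 0.40:     # low-level artefact or likely germline
        return False
    if v.depth < 500 or v.alt_reads < 5:  # insufficient depth or read support
        return False
    return True


kept = [v for v in [
    Variant("MYC", 1000, 100, 0.0, 0),    # VAF 10%: retained
    Variant("TP53", 1000, 450, 0.0, 0),   # VAF 45%: excluded as likely germline
] if passes_filters(v)]
```

The filters are independent, so their order does not affect which variants are retained; it only affects which reason is hit first.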
Samples that had GSP tissue biopsy results were prioritized for sequencing to allow for clinical validation. For phase II samples, batching was not applied; instead, samples were transported to the sequencing laboratory immediately after collection and sequenced in a head-to-head comparison against GSP tissue biopsy, allowing us to measure TAT prospectively. The same twice-weekly sequencing schedule was maintained, with each MiSeq run limited to a maximum of six samples. Details on the target coordinates (Supplementary Table 4) and sequencing parameters (Supplementary Table 5 and Supplementary Fig. 2) are described in Supplementary Information.
The sequence data were analyzed and stored by a bespoke pipeline (courtesy of the Oxford Molecular Diagnostic Centre) in a Health Insurance Portability and Accountability Act of 1996 (HIPAA)-compliant Amazon Web Services cloud. Paired reads were combined and statistics on the total number of reads and invalid reads were generated using a customized tool (udini). The raw sequence data in FASTQ format were aligned to GRCh37 (hs37d5) with BWA-MEM2 and sorted using samtools. De-duplication of reads and generation of error rates and family sizes were done with a custom-made tool (elduderino). The de-duplicated FASTQ was re-aligned to GRCh37 with BWA-MEM2 and fragment sizes were calculated from read pairs using a customized script (available on request). Another customized script (available on request) was used to map the reads on to the targeted genomic regions of interest and to calculate copies per cell for the EBV genes (mean depth of the EBV gene divided by the mean depth for all other targeted genes). The last base at the end of each read was trimmed using a customized tool (trim) and the trimmed SAM file was converted into a BAM file and indexed. A customized script (available on request) was used to generate summary statistics for coverage of all targets at different coverage depths and for different genes.
Customized scripts (available on request) were developed for variant calling using VarScan and annotation of the variants using Ensembl Variant Effect Predictor. IgCaller (v1.2, using the hg19 reference genome) was used to comprehensively analyze the rearrangements of the Ig genes and identify oncogenic translocations. The Genomic Rearrangement Identification Software Suite (GRIDSS, v2.13.2), with additional Picard options and Samtools as aligner, was also included as a structural variation caller. To visualize variants and distinguish true from false calls, the Integrative Genomics Viewer (IGV, v2.13.0) was used. The EBV DNA size ratio was calculated as the proportion of EBV DNA fragments with size between 180 bp and 200 bp, expressed relative to autosomal DNA in the same window: (total EBV DNA with size 180–200 bp)/(total autosomal DNA with size 180–200 bp). Calculation of the size ratio and selection of the fragment size window were based on the methodology of a previous study looking at EBV DNA in nasopharyngeal carcinoma26. The lower the EBV size ratio, the lower the proportion of EBV DNA molecules of size 180–200 bp. The fragment-size distribution was calculated separately for reads mapping to regions of EBV and for reads mapping to the autosomes and recorded as the EBV entropy and autosome entropy, respectively. This gives a measure of how wide or clustered the distribution of fragment sizes is. The EBV DNA size ratio, EBV entropy and autosome entropy were calculated using customized scripts (Fig. 3 and Supplementary Information).
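As an illustration of the size ratio, the sketch below takes the parenthetical formula literally, as a ratio of fragment counts in the 180–200 bp window. The function name, the inclusive window bounds and the use of raw counts (rather than within-genome proportions) are assumptions; the study's customized scripts may differ.

```python
def ebv_size_ratio(ebv_sizes, autosomal_sizes, lo=180, hi=200):
    """Ratio of EBV to autosomal fragments whose length lies in [lo, hi] bp.

    Assumes inclusive window bounds and raw fragment counts.
    """
    ebv_in_window = sum(lo <= s <= hi for s in ebv_sizes)
    autosomal_in_window = sum(lo <= s <= hi for s in autosomal_sizes)
    if autosomal_in_window == 0:
        raise ValueError("no autosomal fragments in the size window")
    return ebv_in_window / autosomal_in_window


# Toy fragment lengths (bp), for illustration only
ratio = ebv_size_ratio([150, 185, 190, 210], [166, 181, 199, 200, 200])
```

In this toy input, two EBV fragments and four autosomal fragments fall inside the window, giving a ratio of 0.5.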
The following covariates were tested as predictors in the liquid biopsy diagnostic model:
Median VAF: calculated as the median VAF of the likely somatic mutations in MYC, TP53 and ID3.
ctDNA: calculated as the product of the cfDNA concentration (hGE ml−1) and the median VAF of likely somatic mutations.
MYC intron 1 mutations: absolute number of mutations in MYC intron 1.
MYC exon 2 mutations: absolute number of mutations in MYC exon 2.
EBER1, EBER2 and EBNA2 (EBV genes) quantity (copies per cell): calculated as the mean depth of the respective EBV gene divided by the mean depth for all other targeted genes and expressed as copies per cell.
Maximum value of EBV copies per cell (EBVmax): defined as the highest copies-per-cell value among the three EBV genes measured (EBER1, EBER2 and EBNA2) for each sample.
EBV fragment size ratio (EBVSR): EBV DNA size ratio was calculated as the proportion of EBV DNA fragments with size between 180 bp and 200 bp ((Total EBV DNA with size 180–200 bp)/(Total autosomal DNA with size between 180 bp and 200 bp)). Calculation of EBVSR and selection of the fragment size window were based on the methodology of a previous study looking at EBV DNA in nasopharyngeal carcinoma26.
EBVP: represents the number of EBV DNA reads divided by the total number of sequenced reads (after removal of PCR duplicates), expressed as a proportion26.
EBV DNA and autosomal entropy: calculated as a measure of the diversity of fragment length distributions, using Shannon entropy computed across the full distribution of paired-end sequencing fragment sizes mapping to the EBV genome (for EBV entropy) and the autosomal regions (for autosomal entropy), respectively. Fragment sizes were binned into 20-bp intervals and the proportional frequency of fragments in each bin was used to derive entropy, reflecting the overall variability and dispersion of the fragment size distribution.
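Several of the covariates listed above are simple derived quantities, sketched here under stated assumptions. All function names are hypothetical, the logarithm base for Shannon entropy (base 2) is an assumption because the text does not state it, and bins are taken as [0, 20), [20, 40) and so on.

```python
import math
from collections import Counter
from statistics import median


def ctdna(cfdna_hge_per_ml, median_vaf):
    """ctDNA as cfDNA concentration (hGE per ml) times median VAF."""
    return cfdna_hge_per_ml * median_vaf


def copies_per_cell(ebv_gene_mean_depth, other_genes_mean_depth):
    """Mean depth of one EBV gene over mean depth of all other targeted genes."""
    return ebv_gene_mean_depth / other_genes_mean_depth


def ebv_max(copies_by_gene):
    """Highest copies-per-cell value among EBER1, EBER2 and EBNA2."""
    return max(copies_by_gene.values())


def ebvp(ebv_reads, total_reads):
    """EBV reads as a proportion of all de-duplicated sequenced reads."""
    return ebv_reads / total_reads


def shannon_entropy(fragment_sizes, bin_width=20):
    """Shannon entropy (bits) of fragment sizes binned into 20-bp intervals."""
    if not fragment_sizes:
        return 0.0
    counts = Counter(size // bin_width for size in fragment_sizes)
    n = len(fragment_sizes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy values, for illustration only
med_vaf = median([0.10, 0.20, 0.30])
narrow = shannon_entropy([100, 105, 110, 119])   # one bin: entropy 0
spread = shannon_entropy([100, 120, 140, 160])   # four equal bins: 2 bits
```

A distribution concentrated in a single bin has zero entropy, whereas fragments spread evenly over more bins give higher values, which is the sense in which entropy measures how wide or clustered the size distribution is.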
To further evaluate the potential clinical utility of the liquid biopsy diagnostic test, we established a weekly virtual MDT meeting consisting of pediatric hematologists and oncologists, hematologists, pathologists and laboratory scientists from all study sites, as well as a senior study bioinformatician. When tissue biopsy was reported based on H&E stain only (minimum pathology report) and the liquid biopsy result indicated BL, treatment was initiated as BL (Fig. 3). When liquid biopsy indicated non-BL, treatment was deferred until GSP tissue biopsy results were available. If the liquid biopsy report was unavailable at the first MDT meeting, treatment decisions were deferred until GSP tissue biopsy results were available.
In cases where BL diagnosis was definitive by liquid biopsy, treatment was initiated (Fig. 3). When liquid biopsy indicated non-BL, treatment was deferred until the GSP tissue biopsy result was available. If neither tissue biopsy nor liquid biopsy results were available at the first MDT meeting, treatment decisions were deferred until the second MDT meeting in the following week. Our analysis focused on the initial MDT review to capture the immediate outcome of the two tests in the early decision-making process for cases of BL, although enhanced follow-up protocols in phase II ensured that results from both liquid biopsy and tissue biopsy were obtained for all cases and integrated into patient management plans.
For the tissue biopsy TAT analysis, we assessed the time from patient presentation at the hospital to the issuance of the GSP diagnostic report, divided into key intervals: time from presentation to sample collection; collection to laboratory receipt; receipt to H&E slide processing; time to issuing the first diagnostic report (based on H&E review); and time to completing the final GSP diagnostic report. For the liquid biopsy TAT, we measured the time from sample receipt in the laboratory to issuance of the diagnostic report, broken down into the intervals of cfDNA extraction and library preparation, targeted sequencing, bioinformatic analysis and final report generation. For the direct TAT comparison between GSP tissue biopsy and liquid biopsy, we specifically analyzed the time from laboratory sample receipt at the respective laboratories to the issuance of the diagnostic report for each method. We assessed the diagnostic yield for BL at the first MDT meeting, expressed as a percentage and calculated from the number of BL cases diagnosed at the first MDT meeting by either the GSP tissue biopsy or the liquid biopsy method divided by the total number of confirmed cases of BL.
All statistical analyses were conducted in R v4.2.3. Normality of continuous variables was assessed using the Shapiro–Wilk test and visual inspection of histograms and Q–Q plots. As the data were not normally distributed, nonparametric summaries and tests were used where appropriate. Descriptive statistics were summarized as proportions for categorical variables or medians with IQRs for continuous variables. Pearson’s χ² test and Wilcoxon’s rank-sum test were used to assess associations between variables. We applied the Benjamini–Hochberg procedure to adjust for multiple comparisons across 19 variables. False discovery rate-adjusted P values are presented and values <0.05 were considered statistically significant.
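For readers unfamiliar with the Benjamini–Hochberg adjustment, a minimal sketch follows. It is written in Python for illustration only (the study used R), the function name is hypothetical and the P values are toy numbers, not study results.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), then a running
    minimum from the largest rank downwards enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0                                   # also caps values at 1
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P values, for illustration only
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [a < 0.05 for a in adj]
```

With these four toy values the adjusted P values are 0.02, 0.04, 0.04 and 0.02, so all four remain below the 0.05 threshold.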
All variables showing a significant association with the diagnosis of BL were considered for inclusion in a penalized logistic regression model that we previously described and validated using a smaller cohort20. Before model fitting, candidate predictors were examined for multicollinearity using pairwise correlation and variance inflation factor analysis and highly collinear variables were excluded to improve model stability and interpretability. Among collinear variables, priority was given to those that were most biologically or clinically relevant. Six different models were created based on a combination of different variables, as shown in Table 2.
For each diagnostic model, the optimal penalty parameter (λ) was selected via tenfold crossvalidation to minimize classification error and ROC curves were generated. The best-performing model was identified based on the highest AUC and pairwise comparisons of AUCs were conducted using DeLong’s test to assess the statistical significance of performance differences between models. The relative importance of variables contributing to the performance of the comprehensive model was derived from the absolute values of the coefficients obtained through LASSO regression. In this penalized logistic regression framework, variables with larger absolute coefficients exert a greater influence on the model’s predictions, reflecting their relative contribution to distinguishing cases of BL from non-BL cases. External validation was done by predicting the diagnosis on a different cohort (phase II) based on the best-performing model. In this analysis, missing data for LDH in 12 samples were imputed using multiple imputation by chained equations, implemented in the mice R package (v3.16.0). Predictive mean matching (method = ‘pmm’) was applied. Comparison of the median TAT between the liquid biopsy and the tissue biopsy was done using Wilcoxon’s signed-rank test (for paired samples).
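Two of the quantities described above can be illustrated compactly: the AUC, computed here through its rank interpretation (the probability that a randomly chosen case of BL scores higher than a randomly chosen non-BL case), and variable importance as ranked absolute coefficients. This is a sketch with hypothetical function names, toy scores and made-up coefficient values; it does not fit a LASSO model or reproduce DeLong’s test.

```python
def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for BL, 0 for non-BL. Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_by_abs_coefficient(coefficients):
    """Order predictors by absolute coefficient, dropping those shrunk to zero."""
    nonzero = {k: v for k, v in coefficients.items() if v != 0.0}
    return sorted(nonzero, key=lambda k: abs(nonzero[k]), reverse=True)


# Toy predicted probabilities and labels, for illustration only
toy_auc = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])

# Made-up coefficients, not values from the study
importance = rank_by_abs_coefficient({"A": -1.2, "B": 0.4, "C": 0.0, "D": 2.0})
```

Absolute coefficients are comparable as importance measures only when predictors are on a common scale; whether and how predictors were standardized is not stated in the text, so that is left outside this sketch.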
A formal sample size calculation was conducted to ensure sufficient power for assessing diagnostic accuracy. Using a binomial proportion approach (z = 1.96, expected sensitivity = 0.8, margin of error = 0.10), a minimum of 62 cases with BL was estimated, corresponding to approximately 124 participants, assuming a 1:1 case–control ratio. For multivariable model development, an events-per-variable threshold of 20, with up to 18 candidate predictors, indicated a target sample size of 720 participants. Owing to the prospective design and disruptions related to the COVID-19 pandemic, this target was not reached. The final cohort comprised 212 participants, including 81 cases of BL. To mitigate overfitting in the context of a constrained sample size and multiple candidate predictors, LASSO regression was used for variable selection and regularization. Model performance was internally validated using tenfold crossvalidation to assess robustness and generalizability, in accordance with recommended best practices for predictive model development.
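The sample size figures above can be reproduced with a few lines of arithmetic. The sketch assumes the standard binomial formula n = z²·p(1 − p)/d², rounded up, and assumes that the 1:1 case–control ratio also converts the events-per-variable requirement into total participants; the latter assumption is ours, although it reproduces the stated 720.

```python
import math


def cases_for_sensitivity(z, sensitivity, margin):
    """Minimum number of cases to estimate sensitivity within +/- margin."""
    return math.ceil(z ** 2 * sensitivity * (1 - sensitivity) / margin ** 2)


def participants_for_epv(epv, n_predictors, case_fraction=0.5):
    """Total participants so that events (cases) reach epv * n_predictors."""
    events_needed = epv * n_predictors
    return math.ceil(events_needed / case_fraction)


cases = cases_for_sensitivity(z=1.96, sensitivity=0.8, margin=0.10)  # 62
total_accuracy = cases * 2                                            # 124 at 1:1
total_model = participants_for_epv(epv=20, n_predictors=18)           # 720

# Events per variable actually achieved with 81 cases and 18 candidates
achieved_epv = 81 / 18                                                # 4.5
```

The achieved ratio of 4.5 events per candidate predictor, well below the threshold of 20, is the arithmetic behind the stated reliance on LASSO regularization and crossvalidation to limit overfitting.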
Participants were enrolled consecutively as part of routine clinical care. No randomization was performed and investigators were not blinded to allocation during experiments or outcome assessment. Participants with missing outcome data were excluded from the analysis; this exclusion was predefined to preserve the integrity of outcome-based comparisons and was limited to individuals lacking key clinical endpoints or follow-up information required for the primary analyses. For other missing data, including covariates, appropriate imputation methods were applied to minimize bias and retain statistical power. Reproducibility was supported through the use of standardized laboratory protocols, predefined diagnostic algorithms established a priori and automated bioinformatics pipelines applied consistently across all samples. Source data and analysis code are made available as described in ‘Data availability’.
This research was conducted through equitable collaboration between institutions in Tanzania, Uganda and the UK, with a deliberate focus on capacity strengthening, technology transfer and shared scientific leadership. NGS platforms were transferred and implemented locally, enabling all cfDNA sequencing to be performed in-country. Laboratory scientists, clinicians and bioinformaticians participated in structured, crosscountry training and reciprocal site visits, supporting hands-on skills development in library preparation, sequencing, bioinformatics analysis and clinical interpretation. Bioinformatics pipelines and reporting frameworks were deployed with local oversight, ensuring full access to data and analytical workflows for all collaborating investigators. The program supported formal academic training, including PhD, DPhil and masters training for local scientists, contributing to sustainable research capacity beyond the duration of the study. Study design, implementation, data interpretation and authorship reflected shared leadership and local ownership.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
This study used data from the prospective, multicenter, observational, Aggressive Infection-Related East Africa Lymphoma (AI-REAL) study43, which was conducted in two phases. In phase I, we trained different liquid biopsy models and assessed their performance for the diagnosis of EBV+ BL using histopathology with a limited IHC panel and review by a minimum of 2 histopathologists as the GSP in 212 participants with clinically suspected lymphoma. In phase II, we performed a head-to-head comparison between the liquid biopsy and gold-standard tissue biopsy to assess the TAT for the two methods (n = 58 participants with clinically suspected lymphoma) and to conduct an additional independent external validation for the best-performing model (n = 56 participants with clinically suspected lymphoma).
The study enrolled all children and young adults, from the age of 3 years to 25 years, clinically suspected of having lymphoma and consenting to participate in the study from August 2019 to July 2023 from four hospitals in Tanzania and Uganda, that is, the Muhimbili National Hospital, Kilimanjaro Christian Medical Centre, Bugando Medical Centre and St. Mary’s Hospital Lacor. Participants with suspected lymphoma who had previously received chemotherapy, immunotherapy or any investigational agent for lymphoma, or participated in a clinical trial on lymphoma, were excluded. Phase I of the study was conducted from August 2019 to July 2022 and enrolled a total of 313 participants, whereas phase II was conducted from August 2022 to July 2023 and enrolled a total of 64 participants (Extended Data Fig. 1). Tissue biopsies were obtained by a qualified surgeon through excisional biopsy and immediately fixed in 10% neutral buffered formalin. The formalin-fixed specimens were then transported to the local pathology laboratory for further processing and diagnostic evaluation. In parallel, liquid biopsy samples were collected via venipuncture into Roche Cell-Free DNA Collection Tubes, following standard phlebotomy procedures. The tubes were transported to the molecular laboratory, where plasma was separated by centrifugation and stored at −80 °C until further analysis.
Extended Data Fig. 7 provides a summarized overview of the laboratory procedures performed for both the liquid and the tissue biopsies. All measurements were taken from distinct, independent samples. Before study initiation, we performed a comprehensive training and infrastructure needs assessment at the three local study pathology laboratories. To ensure optimal histopathology assessment for the study and local capacity building, technical staff and the four clinical study pathologists underwent training, followed by competency assessments conducted by expert hematopathologists in tissue fixation, embedding, sectioning and IHC staining techniques. The training included hands-on sessions and on-site mentoring over a period of 1 year. An automated IHC staining platform (Ventana Benchmark GX) was installed and maintained at all participating laboratories for the duration of the study and 3-monthly internal quality control procedures were implemented, including the use of known positive controls. Tissue samples were fixed in 10% neutral buffered formalin and embedded in paraffin according to standard protocols. H&E staining was performed on all samples to evaluate morphology and IHC was done using the previously described limited IHC antibody panel for BL.
CfDNA extraction, library preparation and targeted sequencing were performed at two dedicated study laboratories: the Haematology Clinical Research Laboratory at the Muhimbili University of Health and Allied Sciences in Tanzania and the Central Public Health Laboratory. Approximately 8 ml of whole blood was collected in Roche circulating cfDNA (ccfDNA) tubes or PAXgene blood ccfDNA tubes. Plasma was separated by centrifugation at 1,600g for 15 min, followed by a second centrifugation of the separated plasma at 4,500g for 15 min. Separated plasma was stored in 1.5-ml Eppendorf tubes at −80 °C. CfDNA was extracted from plasma using the QIAamp Circulating Nucleic Acid Kit (QIAGEN, cat. nos. 51304 and 51306) according to the manufacturer’s instructions. Briefly, extraction involved four main steps. First, samples are lysed to inactivate DNases and RNases and allow complete release of nucleic acids from bound proteins, lipids and vesicles. Second, the lysates are transferred onto a QIAamp Mini column and circulating nucleic acids are adsorbed from a large volume onto the small silica membrane as the lysate is drawn through by vacuum pressure. Third, while the nucleic acids remain bound to the membrane, contaminants are washed away in a three-step wash process. Finally, highly purified nucleic acid is eluted. CfDNA was quantified with a Qubit 3.0 fluorometer using the Qubit high-sensitivity assay (Thermo Fisher Scientific).
Libraries were constructed using 50 ng (in 30 μl) of extracted cfDNA with the ThruPLEX Tag-Seq HV kit according to the manufacturer’s protocol. The process involved the addition of ThruPLEX HV unique dual-indexed PCR primers to aid sample tracking. Seven amplification cycles were performed to ensure a yield of >500 ng, depending on the concentration of the input DNA. This was followed by a purification step through magnetic separation using AMPure XP beads (Beckman Coulter). The final library was quantified and validated using the Qubit HS kit and the Bioanalyzer High Sensitivity DNA kit, according to the manufacturer’s instructions. Library hybridization and capture were done using the xGen hybridization capture kit according to the manufacturer’s instructions.
The procedure used a custom-made NGS panel targeting mutational hotspots in 17 EBL-related and HL-related genes (MYC, IGH, IGK, IGL, ID3, TP53, TNFAIP3, NFKBIE, SOCS1, EP300, BTK, STAT6, CSF2RB, ITPKB, XPO1 and B2M), selected based on results from previous studies, together with EBV genes expressed in latently infected cells (EBER1, EBER2 and EBNA2). The final panel, manufactured by Integrated DNA Technologies (IDT), consisted of 731 probes covering 148 kb, a size chosen to permit compatibility with either the iSeq100 or the MiSeq sequencing platform. The final panel design with the genomic coordinates of our genes of interest is listed in Supplementary Table 4. Target enrichment was performed using a hybrid capture approach and sequencing was conducted twice weekly on the MiSeq platform using the MiSeq reagent kit v2 (300 cycles) at a loading concentration of 10 pM, with 6 samples per run.
Structural variants were detected by split-read analysis using GRIDSS and SNVs were called using VarScan 2 (ref. 46). The following filtering strategies were used to call somatic variants:
Panel of normal plasmas: we excluded variants that occurred in ≥2 samples in a separate cohort of 12 healthy controls.
Population databases: all variants were annotated against gnomAD and those with a population allele frequency (minor allele frequency (MAF)) > 1% were excluded.
VAF: to minimize the possibility of including germline variants, we excluded all variants with VAF ≥ 40%, in addition to the initial exclusion of low-level artefacts with VAF < 1%.
Sequencing depth and read support: only variants with a minimum sequencing depth of 500× and at least 5 mutant reads were retained for downstream analysis.
Copy number alterations (CNAs): we excluded variants that showed evidence of being affected by CNAs. This was done through an internal normalization approach in which the expected relationship between VAF and log₁₀-transformed ctDNA was modeled using ID3 mutations (rarely affected by CNAs), and the resulting tolerance limits (±2× median absolute deviation of residuals) were applied to MYC and TP53 variants to flag likely CNA-affected outliers (Extended Data Fig. 4a,b and Supplementary Table 6).
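The threshold-based filters above can be expressed as a single predicate. The following is a minimal Python sketch, not the study pipeline: the `Variant` record, its field names and the function `passes_somatic_filters` are hypothetical, and the CNA-based exclusion is deliberately left out because the form of the fitted VAF–ctDNA relationship is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant record; field names are illustrative only."""
    gene: str
    vaf: float        # variant allele fraction, on a 0-1 scale
    depth: int        # total sequencing depth at the site
    alt_reads: int    # reads supporting the mutant allele
    gnomad_af: float  # gnomAD population allele frequency (0.0 if absent)
    normal_hits: int  # healthy-control plasmas (out of 12) carrying the variant


def passes_somatic_filters(v: Variant) -> bool:
    """Apply the panel-of-normals, gnomAD, VAF, depth and read-support filters."""
    if v.normal_hits >= 2:              # seen in >=2 healthy-control plasmas
        return False
    if v.gnomad_af > 0.01:              # population allele frequency > 1%
        return False
    if v.vaf < 0.01 or v.vaf >= 0.40:   # low-level artefact or likely germline
        return False
    if v.depth < 500 or v.alt_reads < 5:
        return False
    return True


# Made-up example records.
kept = Variant("MYC", vaf=0.12, depth=1800, alt_reads=216, gnomad_af=0.0, normal_hits=0)
germline_like = Variant("TP53", vaf=0.48, depth=2000, alt_reads=960, gnomad_af=0.0, normal_hits=0)
print(passes_somatic_filters(kept), passes_somatic_filters(germline_like))
```

Ordering the checks from cheapest annotation lookups to read-level evidence is a design choice of this sketch; the outcome does not depend on the order because every filter is an independent exclusion.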
The sequence data were analyzed and stored by a bespoke pipeline (courtesy of the Oxford Molecular Diagnostic Centre) in a Health Insurance Portability and Accountability Act of 1996 (HIPAA)-compliant Amazon Web Services cloud. Paired reads were combined and statistics generated on the total number of reads and invalid reads, using a customized tool (udini). The raw sequence data in FASTQ format were aligned to GRCh37 (hs37d5) with BWA-MEM2. Sorting was done using samtools. De-duplication of reads and generation of error rates and family sizes were done by a custom-made tool (elduderino). The de-duplicated FASTQ was re-aligned to GRCh37 with BWA-MEM2 and fragment sizes were calculated from read pairs using a customized script (available on request). Another customized script (available on request) was used to map the reads onto the genomic regions (targeted genes) of interest and to calculate copies per cell for the EBV genes (mean depth of the EBV gene divided by the mean depth for all other targeted genes). The last base at the end of each read was trimmed using a customized tool (trim) and the trimmed SAM file was converted into a BAM file and indexed. A customized script (available on request) was used to generate summary statistics for coverage of all targets at different coverage depths and for different genes.
The following covariates were tested as predictors in the liquid biopsy diagnostic model:
Median VAF: calculated as the median VAF of the likely somatic mutations in MYC, TP53 and ID3.
CtDNA: calculated as a product of the cfDNA in hGE ml−1 and the median VAF of likely somatic mutations.
MYC intron 1 mutations: absolute number of mutations in MYC intron 1.
MYC exon 2 mutations: absolute number of mutations in MYC exon 2.
EBER1, EBER2 and EBNA2 (EBV genes) quantity (copies per cell): calculated as the mean depth of the respective EBV gene divided by the mean depth for all other targeted genes and expressed as copies per cell.
Maximum value of EBV copies per cell (EBVmax): defined as the highest copies-per-cell value among the three EBV genes measured (EBER1, EBER2 and EBNA2) for each sample.
EBV fragment size ratio (EBVSR): EBV DNA size ratio was calculated as the proportion of EBV DNA fragments with size between 180 bp and 200 bp ((Total EBV DNA with size 180–200 bp)/(Total autosomal DNA with size between 180 bp and 200 bp)). Calculation of EBVSR and selection of the fragment size window was determined based on the methodology of a previous study looking at EBV DNA in nasopharyngeal carcinoma26.
EBVP: represents the number of EBV DNA reads divided by the total number of sequenced reads (after removal of PCR duplicates), expressed as a proportion26.
EBV DNA and autosomal entropy: calculated as a measure of the diversity of fragment length distributions, using Shannon entropy computed across the full distribution of paired-end sequencing fragment sizes mapping to the EBV genome (for EBV entropy) and the autosomal regions (for autosomal entropy), respectively. Fragment sizes were binned into 20-bp intervals and the proportional frequency of fragments in each bin was used to derive entropy, reflecting the overall variability and dispersion of the fragment size distribution.
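The covariate definitions above reduce to a handful of short calculations. The Python sketch below illustrates them under stated assumptions and is not the authors' scripts: all function names are hypothetical, the 180–200 bp window is treated as inclusive at both ends, the 20-bp bins are assumed to start at 0 bp, and Shannon entropy is computed in bits (base 2), none of which is specified in the text.

```python
import math
from collections import Counter
from statistics import mean, median


def ctdna(cfdna_hge_per_ml: float, median_vaf: float) -> float:
    """ctDNA as the product of cfDNA (hGE per ml) and the median somatic VAF."""
    return cfdna_hge_per_ml * median_vaf


def copies_per_cell(ebv_gene_depth: float, other_target_depths: list) -> float:
    """Mean depth of one EBV gene divided by the mean depth of all other targets."""
    return ebv_gene_depth / mean(other_target_depths)


def ebv_max(per_gene_copies: dict) -> float:
    """Highest copies-per-cell value among EBER1, EBER2 and EBNA2."""
    return max(per_gene_copies.values())


def size_ratio(ebv_sizes, autosomal_sizes, lo=180, hi=200) -> float:
    """EBVSR: EBV fragments in the window over autosomal fragments in the window."""
    ebv_in = sum(lo <= s <= hi for s in ebv_sizes)
    auto_in = sum(lo <= s <= hi for s in autosomal_sizes)
    return ebv_in / auto_in if auto_in else float("nan")


def ebv_proportion(ebv_reads: int, total_dedup_reads: int) -> float:
    """EBVP: EBV reads as a proportion of all de-duplicated reads."""
    return ebv_reads / total_dedup_reads


def fragment_entropy(sizes, bin_width=20) -> float:
    """Shannon entropy (bits) of the fragment-size distribution in 20-bp bins."""
    counts = Counter(s // bin_width for s in sizes)
    n = len(sizes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up values for illustration only.
vafs = [0.10, 0.20, 0.30]
print(ctdna(1000.0, median(vafs)))
print(ebv_max({"EBER1": 1.5, "EBER2": 2.0, "EBNA2": 0.5}))
print(fragment_entropy([150, 165, 170, 185, 330]))
```

The same `fragment_entropy` function serves for both EBV and autosomal entropy; only the set of fragment sizes passed in differs.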
To further evaluate the potential clinical utility of the liquid biopsy diagnostic test, we established a weekly virtual MDT meeting consisting of pediatric hematologists and oncologists, hematologists, pathologists and laboratory scientists from all study sites, as well as a senior study bioinformatician. When tissue biopsy was reported based on H&E stain only (minimum pathology report) and the liquid biopsy result indicated BL, treatment was initiated as BL (Fig. 3). When liquid biopsy indicated non-BL, treatment was deferred until GSP tissue biopsy results were available. If the liquid biopsy report was unavailable at the first MDT meeting, treatment decisions were deferred until GSP tissue biopsy results were available. In cases where BL diagnosis was definitive by liquid biopsy, treatment was initiated (Fig. 3). If neither tissue biopsy nor liquid biopsy results were available at the first MDT meeting, treatment decisions were deferred until the second MDT meeting in the following week. Our analysis focused on the initial MDT review to capture the immediate outcome of the two tests in the early decision-making process for cases of BL, although enhanced follow-up protocols in phase II ensured that results from both liquid biopsy and tissue biopsy were obtained for all cases and integrated into patient management plans.
For the tissue biopsy TAT analysis, we assessed the time from patient presentation at the hospital to the issuance of the GSP diagnostic report, divided into key intervals: time from presentation to sample collection; collection to laboratory receipt; receipt to H&E slide processing; time to issuing the first diagnostic report (based on H&E review); and time to completing the final GSP diagnostic report. For the liquid biopsy TAT, we measured the time from sample receipt in the laboratory to issuance of the diagnostic report, broken down into the intervals of cfDNA extraction and library preparation, targeted sequencing, bioinformatic analysis and final report generation. For the direct TAT comparison between GSP tissue biopsy and liquid biopsy, we specifically analyzed the time from laboratory sample receipt at the respective laboratories to the issuance of the diagnostic report for each method.
We assessed the diagnostic yield for BL at the first MDT meeting, expressed as a percentage and calculated as the number of BL cases diagnosed at the first MDT meeting by either the GSP tissue biopsy or the liquid biopsy method, divided by the total number of confirmed cases of BL.
All statistical analyses were conducted in R v4.2.3. Normality of continuous variables was assessed using the Shapiro–Wilk test and visual inspection of histograms and Q–Q plots. As the data were not normally distributed, nonparametric summaries and tests were used where appropriate. Descriptive statistics were summarized as proportions for categorical variables or medians with IQRs for continuous variables. The Pearson’s χ2 test and Wilcoxon’s rank-sum test were used to assess associations between variables. We applied the Benjamini–Hochberg procedure to adjust for multiple comparisons across 19 variables. False discovery rate-adjusted P values are presented and values <0.05 were considered statistically significant. All variables showing a significant association with the diagnosis of BL were considered for inclusion in a penalized logistic regression model that we previously described and validated using a smaller cohort20. Before model fitting, candidate predictors were examined for multicollinearity using pairwise correlation and variance inflation factor analysis, and highly collinear variables were excluded to improve model stability and interpretability. Among collinear variables, priority was given to those that were most biologically or clinically relevant. Six different models were created based on a combination of different variables, as shown in Table 2.
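For readers unfamiliar with the Benjamini–Hochberg adjustment, the following Python sketch shows how adjusted P values are derived from raw ones. It is an illustration only: the analyses were run in R and the exact call used is not stated, the function name `benjamini_hochberg` is hypothetical and the P values are made up.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg (false discovery rate) adjusted P values.

    Each P value is scaled by m / rank (rank 1 = smallest), then a running
    minimum is taken from the largest rank downwards so that the adjusted
    values stay monotone; results are capped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P values for four hypothetical variables.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

With 19 variables, as in this study, the smallest raw P value would be multiplied by 19 before the monotonicity step, so only associations with small raw P values survive the <0.05 threshold.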
A formal sample size calculation was conducted to ensure sufficient power for assessing diagnostic accuracy. Using a binomial proportion approach (z = 1.96, expected sensitivity = 0.8, margin of error = 0.10), a minimum of 62 cases with BL was estimated, corresponding to approximately 124 participants, assuming a 1:1 case–control ratio. For multivariable model development, an events-per-variable threshold of 20, with up to 18 candidate predictors, indicated a target sample size of 720 participants. Owing to the prospective design and disruptions related to the COVID-19 pandemic, this target was not reached. The final cohort comprised 212 participants, including 81 cases of BL. To mitigate overfitting in the context of a constrained sample size and multiple candidate predictors, LASSO regression was used for variable selection and regularization. Model performance was internally validated using tenfold crossvalidation to assess robustness and generalizability, in accordance with recommended best practices for predictive model development.
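The two sample size figures can be reproduced with simple arithmetic. The Python sketch below assumes the standard normal-approximation formula for a binomial proportion, n = z²·p(1 − p)/d², rounded up, and assumes that the events-per-variable target was converted to participants using the same 1:1 case–control ratio; the function names are hypothetical.

```python
import math


def cases_for_sensitivity(z: float, sensitivity: float, margin: float) -> int:
    """Cases needed to estimate sensitivity within +/- margin (binomial approximation)."""
    return math.ceil(z ** 2 * sensitivity * (1 - sensitivity) / margin ** 2)


def participants_for_epv(epv: int, n_predictors: int, case_fraction: float) -> int:
    """Participants needed to reach epv events per candidate predictor."""
    events = epv * n_predictors
    return math.ceil(events / case_fraction)


cases = cases_for_sensitivity(z=1.96, sensitivity=0.8, margin=0.10)
total_accuracy = cases * 2  # 1:1 case-control ratio
total_model = participants_for_epv(epv=20, n_predictors=18, case_fraction=0.5)
print(cases, total_accuracy, total_model)  # 62 124 720
```

Against these targets, the final cohort of 212 participants with 81 cases of BL gives 81/18 = 4.5 events per candidate predictor, which is the shortfall that the LASSO penalty and crossvalidation were intended to mitigate.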
This research was conducted through equitable collaboration between institutions in Tanzania, Uganda and the UK, with a deliberate focus on capacity strengthening, technology transfer and shared scientific leadership. NGS platforms were transferred and implemented locally, enabling all cfDNA sequencing to be performed in-country. Laboratory scientists, clinicians and bioinformaticians participated in structured, crosscountry training and reciprocal site visits, supporting hands-on skills development in library preparation, sequencing, bioinformatics analysis and clinical interpretation. Bioinformatics pipelines and reporting frameworks were deployed with local oversight, ensuring full access to data and analytical workflows for all collaborating investigators. The program supported formal academic training, including PhD, DPhil and masters training for local scientists, contributing to sustainable research capacity beyond the duration of the study. Study design, implementation, data interpretation and authorship reflected shared leadership and local ownership.
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41591-026-04291-z.
Source data: statistical source data are provided for Figs. 1–4, Table 2, Extended Data Figs. 2, 4a–e, 5 and 6, and Extended Data Tables 1 and 2.