Remote monitoring of heart failure exacerbations using a smartwatch

Heart failure (HF) involves cycles of remission and exacerbation, which are poorly characterized by static disease measures. Consumer wearables have an understudied potential for daily monitoring of HF symptoms. Here we report results from an observational cohort of free-living patients with HF, followed over a median of 94.5 d, in the Ted Rogers Understanding Exacerbations of HF (TRUE-HF) study. The study measured the ability of Apple Watch data to predict peak oxygen uptake (pVO2) as measured using in-clinic cardiopulmonary exercise testing (CPET). A deep learning model was trained with data from 154 patients (46 women, 108 men) and validated on a held-out set of 63 patients (24 women, 39 men) for determining wearable-derived daily pVO2, which correlated strongly with CPET-measured pVO2 (Pearson's correlation = 0.85). Each 10% drop in wearable-derived daily pVO2 was associated with a 3.62-fold increase in the hazard of unplanned healthcare events (hazard ratio (HR) 3.62; 95% confidence interval (CI), 1.37-9.55; P < 0.01), which occurred at a median of 7.4 d after the first 10% drop in wearable-derived pVO2. These findings were validated in an independent external cohort from the All of Us Research Program using a cross-platform model that accounted for the reduced sensor capabilities available in this cohort. Using this reduced-sensor variant of the model, drops in wearable-derived daily pVO2 were associated with unplanned healthcare utilization (HR 1.32, 95% CI 1.03-1.69; P = 0.03), which occurred at a median of 21 d after the first 10% drop in wearable-derived pVO2. These results indicate that wearable-derived daily pVO2 provides earlier and improved risk discrimination compared with existing wearable fitness estimates and established clinical markers and offers a scalable and generalizable approach for longitudinal HF research and monitoring.
Heart failure (HF) is a global health crisis affecting an estimated 64.3 million individuals worldwide1. The annual global cost of HF is estimated at US$346 billion, with a substantial portion attributed to hospitalizations, missed work and healthcare service utilization2,3. HF is also associated with reduced life expectancy, with a median survival of 3.2 years for women and 1.7 years for men4. Despite recent medical advances, patients with HF continue to experience a high risk of adverse outcomes, indicating a critical need for better prognostic markers to enhance risk stratification and enable more effective and timely interventions1.
HF is characterized by periods of relative stability interspersed with acute exacerbations requiring timely intervention to prevent hospitalization5. In contrast, risk evaluation in HF traditionally relies on intermittent and static clinical evaluation tools, making accurate and timely risk stratification a challenge6. Many cardiac centers use cardiopulmonary exercise testing (CPET) to estimate 1-year prognosis, enabling stratification of patients based on established exercise parameters7,8. However, although serial CPET trends are clinically informative, the high cost, limited geographic accessibility and patient burden of CPET make daily or weekly monitoring impractical9. The clinically supervised 6-min walk test (6MWT) offers an alternative measurement of exercise intolerance in patients with HF10. However, evidence supporting its prognostic value compared to CPET is limited. In addition, the accuracy of the unsupervised 6MWT compared to the supervised test is uncertain, restricting its widespread use in free-living remote monitoring10. The New York Heart Association (NYHA) functional classification is commonly used in practice, but is subjective and fails to capture a patient’s fluctuating clinical status11. Thus, although these traditional tools for HF risk stratification provide essential insights into patient health, they cannot account for the episodic course of HF. This highlights the need for better free-living prognostic markers to improve risk assessment and enable more effective and timely intervention.
Remote patient monitoring modalities shift from static and intermittent measurements of patient status to daily measurements and have shown up to a 15% reduction in total hospitalizations12. The effectiveness of remote patient monitoring has been demonstrated in various studies, which report improved patient outcomes and reduced healthcare utilization when these systems are proactive and correctly implemented. Nevertheless, the demands of manual data entry for patients and the constant oversight by clinicians hinder adherence and broader implementation of these programs13.
Advances in wearable technology and artificial intelligence may help to address these challenges. Consumer-wearable devices, like Apple Watch, can provide near-continuous capture of a wide range of physiological data, including heart rate and physical activity, enabling passive, noninvasive, real-time health monitoring14–17. Wearables have been investigated to monitor heart rate and exercise in outpatients with HF and to detect atrial fibrillation18. Concerning cardiopulmonary fitness, the Apple cardio fitness estimate on Apple Watch can accurately estimate a critical prognostic metric from CPET, the peak oxygen consumption rate (pVO2), in a healthy population19. However, it remains unclear whether wearable-derived algorithms validated in nondiseased populations can be reliably translated to patients with HF, whose distinct pathophysiological profiles may require tailored assessment approaches.
To investigate this knowledge gap, we initiated the Ted Rogers Understanding Exacerbations of Heart Failure study (TRUE-HF), a prospective 3-month observational cohort study of patients with HF using an Apple Watch in free-living conditions20. We hypothesized that: (1) consumer-wearable data can be used to estimate cardiopulmonary fitness of patients with HF in free-living conditions; and (2) daily fitness monitoring markers would predict worsening patient health as indicated by unplanned healthcare utilization.
The primary goals of this observational study were to investigate the relationships among widely available consumer-wearable data, cardiopulmonary fitness and worsening HF in a free-living environment. Between December 2019 and April 2024, 217 patients with HF were enrolled in the TRUE-HF study (NCT05008692)21. Patients were provided with Apple Watch Series 6 and/or compatible iPhones for the study duration or allowed to use their existing compatible devices. All participants were instructed to capture pedestrian activities using their Apple Watch if they were going to partake in an activity; however, they were not instructed to perform additional pedestrian activities.
Each patient completed an in-person study-entry clinic visit and 3-month end-of-study clinic visit, where they underwent formal CPET, bloodwork, clinical examination, a supervised 6MWT and a Tecumseh cube test. All demographic and clinical measurements recorded at the study-entry and 3-month end-of-study clinic visits were captured. Participants had a median observation period from study entry to end of study of 94.5 d. Apple Watch data were collected via HealthKit during clinic visits and free-living observations. During this window, daily self-administered surveys about patient health status and unplanned healthcare utilization were conducted (Fig. 1). We defined unplanned healthcare utilization as hospital admissions, unscheduled clinic visits or intravenous furosemide treatment. These unscheduled events were manually verified using electronic health records (EHRs) and physician notes, where available.

Fig. 1 Schematic representation of the TRUE-HF observational cohort study design.
a, The TRUE-HF study protocol incorporating study-entry clinic visits and 3-month end-of-study visits to assess patient status and cardiopulmonary fitness. Wearable data and daily surveys were collected during the free-living observation period. b, Distribution of key HealthKit variables, illustrating the number of HealthKit recordings collected during the study. c, The TRUE-HF model processing a 30-d sliding window of wearable data for outcome prediction (pVO2), which was tested for concordance with CPET-derived pVO2 and clinical events including declines in CPET pVO2, hospitalization, emergency visits and intravenous furosemide administration.
Of the 217 enrolled patients, 191 completed the study, of whom 187 completed an end-of-study CPET. During the study period, 25 patients experienced decreased CPET pVO2 from study entry to end of study and 32 had unplanned healthcare utilization. Among the 32 total events in our cohort, 19 were confirmed via clinical records to be directly HF related through agreement of 2 board-certified cardiologists, 4 were confirmed as noncardiac related and 9 were self-reported hospitalizations. The nine self-reported hospitalizations were patient-reported outcomes that occurred in patients without centralized linking of healthcare records and thus could not be verified against EHRs or physician notes. In total, 30 patients were excluded from the primary analysis because they died (n = 2), received a left ventricular assist device (n = 2), withdrew (n = 7), could not complete the end-of-study assessment process (n = 4) or had insufficient device usage (n = 15) (Fig. 2). Insufficient device usage was defined as wearing the Apple Watch on <10 d throughout the 90-d free-living period (with a median of 4 d of usage for the 15 participants). Patient demographics are summarized in Table 1.

Fig. 2 CONSORT diagram for the TRUE-HF study.
Reasons for patient ineligibility or nonparticipation are included. a, Patients were deemed ineligible or not approached for the following reasons: unable to give consent, had documentation requesting not to be approached for research, on dialysis, had documented noncompliance, had active psychiatric problems, new patients to the clinic or purposefully selected to ensure demographic diversity in the study cohort. In addition, the final 50 patients (held-out set) were recruited to match the CPET pVO2 distribution of the training cohort. b, Among the group of individuals who were excluded for having insufficient data, one patient had received a left ventricular assist device (LVAD) and one patient had died.
The 15 participants who did not complete the end-of-study assessments nevertheless had sufficient follow-up data to be included in the unplanned healthcare utilization analysis.

Table 1 Demographics and characteristics of the TRUE-HF study cohort

Characteristics                                 TRUE-HF (n = 217)
Age, median years (IQR)                         57.0 (47.2–64.3)
Sex, n (%)
  Female                                        70 (32.3)
  Male                                          147 (67.7)
Self-reported race (a), n (%)
  White                                         133 (60.8)
  Asian                                         43 (19.8)
  Black                                         25 (11.5)
  Other                                         16 (7.4)
Weight, median, kg (IQR)                        83.4 (70.5–96.9)
BMI
  Median (IQR)                                  27.1 (24.1–31.7)
  <35 kg m−2, n (%)                             185 (85.3)
  ≥35 kg m−2, n (%)                             32 (14.7)
Blood pressure, median, mm Hg (IQR)
  Systolic                                      110.0 (101.0–119.75)
  Diastolic                                     71.0 (65.0–78.0)
Left ventricular ejection fraction
  Median (IQR)                                  32.0 (23.0–44.0)
  <30%, n (%)                                   98 (45.2)
  ≥30% to <40%, n (%)                           50 (23.0)
  ≥40% to <50%, n (%)                           32 (14.7)
  ≥50% to <60%, n (%)                           27 (12.4)
  ≥60%, n (%)                                   10 (4.6)
6-min walk distance, median, m (IQR)            420.0 (363.0–474.0)
CPET pVO2, median, ml kg−1 min−1 (IQR)          14.2 (12.0–18.55)
Implantable cardioverter-defibrillator, n (%)   113 (52.1)
Cardiac resynchronization therapy, n (%)        36 (16.6)
Permanent pacemaker, n (%)                      3 (1.4)
NT-proBNP (ng l−1), median (IQR)                1533.2 (513.3–5670.4)
Creatinine (µmol l−1), median (IQR)             98.0 (81.0–124.0)
Heart failure type, n (%)
  HFrEF                                         179 (82.5)
  HFpEF                                         27 (12.4)
  Improved                                      6 (2.8)
Primary diagnosis, n (%)
  Dilated myopathy                              135 (62.2)
  Restrictive myopathy                          8 (3.7)
  Coronary artery disease                       28 (12.9)
  Hypertrophic cardiomyopathy                   18 (8.3)
  Valvular heart disease                        6 (2.8)
  Congenital heart defect                       1 (0.5)
  Other (b)                                     21 (9.7)
NYHA functional class, n (%)
  Class I                                       45 (20.7)
  Class II                                      114 (52.5)
  Class III                                     55 (25.3)
  Class IV                                      3 (1.4)
Comorbidities, n (%)
  Tobacco use                                   49 (22.6)
  Diabetes mellitus                             55 (25.3)
  Hypertension                                  57 (26.3)
  Lipids                                        67 (30.9)
  EtOH                                          19 (8.8)
  Obesity                                       40 (18.4)
  Family history                                38 (17.5)
  Peripheral vascular disease                   2 (0.9)
Concomitant medications, n (%)
  β-Blockers                                    201 (92.6)
  Diuretics                                     136 (62.7)
  ACE inhibitors                                41 (18.9)
  Antiarrhythmics                               82 (37.8)
  ARBs                                          31 (14.3)
  ARNIs                                         111 (51.2)
  ACE inhibitors, ARBs or ARNIs                 176 (81.1)
  MRAs                                          167 (77.0)
  SGLT2 inhibitors                              128 (59.0)
  Statin                                        90 (41.5)
  Vasodilators                                  9 (4.1)

Data are n (%) or median (IQR). NT-proBNP was derived for all participants. For 113 patients with BNP measurements only, NT-proBNP values were derived from BNP using a formula from ref. 39. ACE, angiotensin-converting enzyme; ARBs, angiotensin II receptor blockers; ARNIs, angiotensin receptor–neprilysin inhibitors; BMI, body mass index; HFpEF, HF with preserved ejection fraction; HFrEF, HF with reduced ejection fraction; MRAs, mineralocorticoid receptor antagonists; NT-proBNP, N-terminal pro-B-type natriuretic peptide; SGLT2, sodium–glucose co-transporter-2.
(a) The race or ethnicity of the participants was reported by the investigators based on the participant’s perception of their race or ethnicity. Participants who identified with more than one racial or ethnic group, or who did not identify with the listed categories, were classified as ‘Other’.
(b) Includes unknown (n = 16), cancer (n = 2) or not reported (n = 3).
We pre-allocated the last 50 successful end-of-study assessments for a held-out test to be conducted for evaluation only at study completion, reducing the risk for test-set overfitting bias from repeated model refinement22. Successful end-of-study assessments were defined as those completing an end-of-study CPET. This resulted in the first 154 patients being utilized for model development and training using cross validation (n = 137 successful assessments) and the last 63 patients for a held-out test (n = 50 successful assessments). All model performance results reported herein were calculated from that held-out test set.
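The enrollment-ordered split described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function name, the `completed_cpet` field and the rule for placing the cut (at the earliest patient among the last N successful assessments) are assumptions made for the example, and the patient records are synthetic.

```python
def chronological_split(patients, n_test_assessments):
    """Split enrollment-ordered patients into (development, held_out).

    `patients` is a list of dicts in enrollment order, each carrying a
    hypothetical boolean field 'completed_cpet'. Walking backwards from the
    most recent enrollee, the cut is placed at the patient who supplies the
    n-th successful end-of-study assessment; everyone enrolled from that
    point on is held out (assumed rule, not stated in the source).
    """
    successes = 0
    for i in range(len(patients) - 1, -1, -1):
        if patients[i]["completed_cpet"]:
            successes += 1
            if successes == n_test_assessments:
                return patients[:i], patients[i:]
    raise ValueError("not enough successful assessments for the held-out set")


# Synthetic example: 10 enrollees, every third one lacks an end-of-study CPET.
example = [{"id": i, "completed_cpet": i % 3 != 0} for i in range(10)]
development, held_out = chronological_split(example, n_test_assessments=3)
```

Fixing the held-out set by enrollment order, instead of by random draw, keeps the test patients untouched by any earlier round of model refinement, which is the motivation given in the text.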
In our held-out test cohort (n = 63), 5 patients were excluded from evaluation due to insufficient device usage. Of the remaining 58 held-out patients, 50 completed the study and an end-of-study CPET. Eight experienced unplanned healthcare utilization events: five unexpected hospital admissions, two unscheduled clinic visits and one instance requiring intravenous furosemide treatment. Among these, four events were adjudicated as HF related, three were self-reported and one was unrelated to HF. Demographic details are highlighted in Extended Data Table 1.
We developed an autoregressive transformer framework, termed the TRUE-HF model (Extended Data Fig. 1), which combines wearable data with patient-specific clinical information to generate daily predictions. Our approach processes 30 d of wearable data, initially aggregated at 90-min intervals and progressively pooled into larger time windows, allowing the model to learn relationships at multiple resolutions, as detailed in Methods. Throughout our study, we considered predicted daily TRUE-HF pVO2 as a naturally continuous measure. We prespecified a clinically meaningful drop in CPET pVO2 as a ≥10% reduction, based on prior literature observing a ≥6% to 10% decrease in CPET pVO2 being associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23–26. Therefore, hereafter, a ≥10% drop in TRUE-HF pVO2 as predicted by our model is considered test positive.
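To make the input preparation concrete, the sketch below bins a 30-d window of timestamped samples into 90-min means and then pools neighbouring bins into coarser resolutions. Only the 30-d window and the 90-min bin width come from the text; the pooling factor of 2, the number of levels, mean aggregation and the handling of empty bins are assumptions for illustration, and the actual architecture is the one described in the paper's Methods.

```python
from statistics import mean

BIN_MINUTES = 90   # from the text
WINDOW_DAYS = 30   # from the text
BINS_PER_WINDOW = WINDOW_DAYS * 24 * 60 // BIN_MINUTES  # 480 bins


def bin_samples(samples, bin_minutes=BIN_MINUTES, n_bins=BINS_PER_WINDOW):
    """Average (minute_offset, value) samples into fixed-width bins.

    minute_offset is measured from the start of the window. Bins with no
    samples are returned as None (assumed convention for missing data).
    """
    buckets = [[] for _ in range(n_bins)]
    for minute, value in samples:
        idx = int(minute // bin_minutes)
        if 0 <= idx < n_bins:
            buckets[idx].append(value)
    return [mean(b) if b else None for b in buckets]


def pool(series, factor=2):
    """Average consecutive groups of `factor` bins, skipping missing bins."""
    pooled = []
    for i in range(0, len(series), factor):
        present = [v for v in series[i:i + factor] if v is not None]
        pooled.append(mean(present) if present else None)
    return pooled


def multi_resolution(series, levels=3, factor=2):
    """Return the base series followed by `levels` progressively pooled copies."""
    resolutions = [series]
    for _ in range(levels):
        resolutions.append(pool(resolutions[-1], factor))
    return resolutions


# Synthetic heart-rate samples: two in the first bin, one in the second.
binned = bin_samples([(0, 60), (45, 80), (90, 100)])
views = multi_resolution(binned)
```

With a factor of 2 and three levels, the 480 base bins become views of 240, 120 and 60 bins, corresponding to 3-h, 6-h and 12-h windows.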
We conducted external validation on an independent patient cohort from the National Institutes of Health (NIH) All of Us Research Program, which uses Fitbit wearables27,28. We identified 193 patients who closely matched our TRUE-HF cohort inclusion criteria, including confirmed diagnosis of HF documented in EHRs, sufficient wearable data coverage (defined as wearable use on at least 40% of study days, nonconsecutively) and detailed EHR records available for event adjudication (Extended Data Fig. 2). However, regarding HF severity, the NYHA class was missing in the EHRs for 1,653 of 1,664 (99%) of all possible All of Us patients with HF and wearable data, and BNP was missing for 917 of 1,664 (55%). Therefore, to more closely reflect the clinical severity observed in the TRUE-HF population, where 77% were NYHA class ≥II, we added the inclusion criterion of at least one previously recorded unplanned healthcare event before enrollment. To establish a consistent starting point across patients in the All of Us cohort, we defined the study-entry time point as the first point at which both the index event had occurred and wearable data were available for each patient. Furthermore, due to data availability in the All of Us Research Program, we applied a more stringent definition of unplanned healthcare utilization, restricting events to inpatient hospital admission or intravenous furosemide administration. Outpatient visits were excluded because we could not reliably determine whether they were scheduled or unscheduled, and self-reported outcomes were not available in this dataset. Patient demographics are summarized in Extended Data Table 2.
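The wearable-coverage criterion (use on at least 40% of study days, not necessarily consecutive) can be expressed as a simple check. The sketch below is illustrative only: the function names are hypothetical, the dates are synthetic, and it assumes that a "day of use" means any calendar day with at least one recording and that both the first and last study days count.

```python
from datetime import date


def coverage_fraction(wear_dates, start, end):
    """Fraction of calendar days in [start, end] with at least one recording."""
    total_days = (end - start).days + 1
    if total_days <= 0:
        raise ValueError("end must not precede start")
    worn = {d for d in wear_dates if start <= d <= end}  # de-duplicate days
    return len(worn) / total_days


def has_sufficient_coverage(wear_dates, start, end, threshold=0.40):
    """True when coverage meets the 40% threshold stated in the text."""
    return coverage_fraction(wear_dates, start, end) >= threshold


# Synthetic 10-day window with four nonconsecutive wear days.
start, end = date(2024, 1, 1), date(2024, 1, 10)
wear = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 3),
        date(2024, 1, 6), date(2024, 1, 10)]
eligible = has_sufficient_coverage(wear, start, end)
```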
The All of Us external validation cohort had a median observation period of 120 d, during which 20 patients experienced at least one unplanned healthcare utilization event. Note that, to provide a calibration and adjustment period for patients initiating wearable use and to account for a refractory interval after the initial clinical encounter, we introduced a 7-d buffer after the study-entry date, which was not used in TRUE-HF. To ensure timely model predictions aligned with TRUE-HF, which requires only 30 d of wearable data before pVO2 predictions, while accommodating the necessary 7-d buffer period, we allowed predictions to begin after collecting 20 d of wearable data. To enable compatibility with data available in the All of Us Research Program, we trained and tested a reduced-sensor variant of the TRUE-HF pVO2 model, herein referred to as TRUE-HF-RS, which relies only on HeartRate, StepCount and a reduced clinical feature set (Methods). For fair comparison, this model was trained exclusively on the same training set as the TRUE-HF model (n = 154).
We trained the TRUE-HF framework to predict absolute CPET pVO2 (l min−1) measurements and performed regression analysis of this TRUE-HF pVO2 model against the end-of-study absolute CPET pVO2 measurement as a validated performance indicator in the held-out test set (n = 50 successful end-of-study assessments). Absolute pVO2 (l min−1) was chosen because it removes the need for linear indexing to body weight and alleviates the confounding effects of body mass on the model’s predicted outcomes. As a baseline, we compared TRUE-HF pVO2 performance with the Apple Watch model, the Apple cardio fitness VO2max algorithm. The TRUE-HF model delivers daily predictions by evaluating the preceding 30 d of data and provides near-continuous remote assessment of cardiopulmonary fitness fluctuations, averaging one prediction per day over the observation period. Conversely, Apple’s algorithm produced fewer predictions given design requirements for minimal required physical activity levels, averaging 4.6 predictions for patients with CPET pVO2 < 1 l min−1, 6.3 for CPET pVO2 between 1 l min−1 and 2 l min−1 and 18.0 for patients with CPET pVO2 > 3 l min−1 over the observation period.
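For readers more used to weight-indexed values, the relationship between indexed pVO2 (ml kg−1 min−1) and absolute pVO2 (l min−1) is a unit conversion: multiply by body weight and divide by 1,000. The helper names below are hypothetical. The worked number combines the cohort medians from Table 1 purely as an illustration; medians of two different variables do not describe any single patient.

```python
def indexed_to_absolute(pvo2_ml_per_kg_min, weight_kg):
    """Convert weight-indexed pVO2 (ml kg^-1 min^-1) to absolute pVO2 (l min^-1)."""
    return pvo2_ml_per_kg_min * weight_kg / 1000.0


def absolute_to_indexed(pvo2_l_per_min, weight_kg):
    """Inverse conversion: absolute pVO2 (l min^-1) to ml kg^-1 min^-1."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    return pvo2_l_per_min * 1000.0 / weight_kg


# Illustration only: 14.2 ml kg^-1 min^-1 at 83.4 kg is about 1.18 l min^-1.
example_absolute = indexed_to_absolute(14.2, 83.4)
```

Because the indexed value divides by weight, a change in body mass alone shifts it even when oxygen uptake is unchanged, which is the confounding that predicting the absolute value avoids.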
Model performance was evaluated using Spearman’s coefficient (ρ), Pearson’s r and mean absolute error (m.a.e.) on the held-out test set, comparing observed and predicted values for the end-of-study CPET test. Strong concordance was observed between the TRUE-HF pVO2 predictions from the day before the patients’ end-of-study date and the gold-standard end-of-study CPET pVO2 measurements (Fig. 3a). TRUE-HF achieved a Pearson’s r of 0.85 and an m.a.e. of 0.25 l min−1 in the held-out test set. Spearman’s coefficient measures (Fig. 3b) also suggest the enhanced capability of the TRUE-HF model to rank patients with HF accurately based on their pVO2.

Fig. 3 Biometric data from Apple Watch can be used to estimate cardiorespiratory fitness.
a, Scatterplot representation of predicted pVO2 compared to the measured ground truth of CPET pVO2 at the end-of-study clinic visit (median 94.5 d post study-entry). Points represent individual participants. The solid line indicates the fitted linear relationship between predicted and measured pVO2. Shaded bands denote the 95% confidence band around the fitted linear relationship. b, Spearman’s correlation of the TRUE-HF model (n = 50) and Apple model (n = 35) to the CPET pVO2 measured at the end-of-study clinic visits. The pVO2 according to the Apple model was not measured for 15 participants. c, AUROC curve for detecting declines in clinical pVO2 measurement from study-entry to end-of-study clinic visits. AUROCs were calculated by comparing each model’s prediction of pVO2 decline (based on comparing the last prediction, temporally closest to the end-of-study clinic visit date, to the study-entry CPET pVO2) against the actual observed declines in CPET pVO2. Solid lines represent the mean ROC curve and shaded bands indicate the 95% confidence interval (CI), estimated by bootstrap resampling. A statistically significant difference in AUROC was observed between pVO2 predicted by the TRUE-HF model and the Apple pVO2 algorithm for detecting declines in CPET pVO2 (DeLong’s two-tailed test, P = 0.006671).
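The three agreement metrics named above (Pearson's r, Spearman's ρ and m.a.e.) can be computed as in the following standard-library sketch. It is a generic implementation, not the study's analysis code, and the measured and predicted values are invented numbers used only to exercise the functions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


def mean_absolute_error(observed, predicted):
    """Mean of absolute differences, in the units of the inputs."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Invented values in l/min, for illustration only.
measured = [0.9, 1.2, 1.5, 2.1, 2.8]
predicted = [1.0, 1.1, 1.7, 2.0, 2.5]
r = pearson_r(measured, predicted)
rho = spearman_rho(measured, predicted)
mae = mean_absolute_error(measured, predicted)
```

Pearson's r and m.a.e. describe how close the predicted values are to the measured ones, whereas Spearman's ρ depends only on ordering, which is why the text uses it to describe how well each model ranks patients.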
Note, however, that a model might correlate highly with a static clinical measurement yet be poorly sensitive to changes in that measurement over time, if the patients with the largest prediction errors are also those who undergo the largest changes. Thus, we measured each model’s ability to detect clinically meaningful declines at the end-of-study clinic visit, measured against study-entry clinical measurements. Positive patients were defined as those with a prespecified ≥10% decline in CPET pVO2 from study-entry to end-of-study clinic visit. For each model and patient, we calculated the percentage difference between the first and last wearable-derived pVO2 on study for comparison. The area under the receiver operating characteristic (AUROC) metric was used to evaluate the diagnostic accuracy29,30. The TRUE-HF model pVO2 predictions detected end-of-study declines in CPET pVO2 with an AUROC of 0.82 (95% CI 0.69–0.92), compared with an AUROC of 0.52 (95% CI 0.39–0.70) for Apple’s algorithm (DeLong’s two-tailed test, P < 0.01) (Fig. 3c).
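The decline-detection evaluation described above can be illustrated with a minimal sketch. It assumes a rank-based (Mann–Whitney) AUROC and a percentile bootstrap over patients; the function names and the toy values are hypothetical and are not taken from the study, and the study's own implementation may differ.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney statistic); ties count as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def percent_decline(first, last):
    """Percentage decline from first to last value (positive means decline)."""
    first = np.asarray(first, dtype=float)
    last = np.asarray(last, dtype=float)
    return 100.0 * (first - last) / first

def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUROC, resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue  # resample lacks one class, so the AUROC is undefined
        stats.append(auroc(scores[idx], labels[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical toy data (l/min), for illustration only.
cpet_entry = np.array([1.50, 1.20, 2.00, 1.80, 1.00, 1.60])
cpet_end = np.array([1.30, 1.19, 1.70, 1.82, 0.99, 1.58])
pred_first = np.array([1.45, 1.25, 1.95, 1.75, 1.05, 1.55])
pred_last = np.array([1.28, 1.22, 1.75, 1.78, 0.93, 1.56])

positive = percent_decline(cpet_entry, cpet_end) >= 10.0  # reference standard
score = percent_decline(pred_first, pred_last)            # model-derived decline
print(auroc(score, positive), bootstrap_ci(score, positive, n_boot=500))
```

Using the continuous percentage decline as the score, rather than a thresholded flag, is what allows a full ROC curve to be traced.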
We performed a secondary exploratory analysis to investigate a previously unexamined potential relationship between wearable-derived daily pVO2 and near-term unplanned healthcare utilization (defined above) in both the TRUE-HF held-out and All of Us cohorts. The TRUE-HF pVO2 model requires 30 d of data; hence, our retrospective risk-stratification analysis for each patient in the held-out TRUE-HF test set commenced on day 31.
In the TRUE-HF held-out cohort, all held-out test patients with sufficient data, including those unable to complete an end-of-study clinic visit, were included, resulting in 58 patients being analyzed. All patients were censored at the time of unplanned healthcare utilization or end-of-study visit. For each patient, we calculated the AUROC using the maximum percentage drop in wearable-derived daily pVO2 from the TRUE-HF model, measured from the first model prediction to the prediction on each day before censoring. We compared this with canonical static clinical measures taken at the study-entry clinic visit (CPET pVO2, 6MWT distance (6MWTD), NYHA class and NT-proBNP) and with clinical HF models: Seattle Heart Failure Model (SHFM), Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) and PRognostic Evaluation During Invasive CaTheterization for Heart Failure (PREDICT-HF)31–33.
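As a rough illustration of the score used here, the sketch below tracks the maximum percentage drop in daily predicted pVO2 relative to the first model prediction and reports the first day on which a chosen threshold is reached. The function names and the example series are hypothetical; censoring and missing days are not handled.

```python
def max_percent_drop(daily_pvo2):
    """Running maximum percentage drop relative to the first prediction.

    Entry i of the result is the largest drop (in %) observed up to day i.
    """
    if not daily_pvo2:
        return []
    baseline = daily_pvo2[0]
    running_max = 0.0
    out = []
    for value in daily_pvo2:
        drop = 100.0 * (baseline - value) / baseline
        running_max = max(running_max, drop)
        out.append(running_max)
    return out

def first_drop_day(daily_pvo2, threshold=10.0):
    """Index of the first day whose drop from baseline reaches the threshold, else None."""
    for day, drop in enumerate(max_percent_drop(daily_pvo2)):
        if drop >= threshold:
            return day
    return None

# Hypothetical daily predictions (l/min), indexed from the first model prediction.
series = [1.50, 1.48, 1.52, 1.40, 1.34, 1.36]
print(max_percent_drop(series)[-1], first_drop_day(series))
```

The last element of `max_percent_drop` is the per-patient score that would feed an AUROC calculation, while `first_drop_day` corresponds to the moment a patient would enter the drop group in a time-varying analysis.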
Patients with drops in wearable-derived daily pVO2 were at significantly higher risk of unscheduled healthcare utilization on study (Fig. 4). The daily pVO2 predictions achieved an AUROC of 0.77 (95% CI 0.62–0.90) for predicting these events. The TRUE-HF model’s ability to predict unplanned healthcare utilization was significantly higher than that of baseline clinical metrics: CPET pVO2 (P = 0.04, adjusted P (Padj) = 0.05), NT-proBNP (P < 0.01, Padj = 0.01), 6MWTD (P = 0.05, Padj = 0.05) and NYHA (P = 0.01, Padj = 0.02), based on DeLong’s one-tailed test with Padj values from the Benjamini–Hochberg correction for the false discovery rate (0.05). Our risk-stratification analysis, focused on our prespecified ≥10% drop in TRUE-HF daily pVO2 (see above), is represented as a specific point on the ROC curve (indicating the sensitivity and specificity at this threshold) in Fig. 4a. At this ≥10% drop, TRUE-HF sensitivity was 88% and specificity was 62% across the complete observation period. Time-varying, extended Simon and Makuch’s Kaplan–Meier cumulative risk curves were drawn for patients with an observed ≥10% drop in their TRUE-HF predicted daily pVO2 (Fig. 4b). During free living, a TRUE-HF wearable-derived daily pVO2 drop was found in 26 of 58 patients. Of these 26 patients, 7 (26.9%) experienced unplanned healthcare utilization, compared with 1 (3.1%) of the 32 participants in whom no drop was detected. Drops detected by the TRUE-HF model preceded unplanned healthcare utilization by a median of 7.4 d (interquartile range (IQR), 4.5–8.5 d).
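The reported sensitivity and specificity can be checked against the group counts given above. The short calculation below is a sketch that treats the ≥10% drop flag as a fixed binary classification over the whole observation period, which is a simplifying assumption; the variable names are illustrative.

```python
# Counts reported in the text for the held-out TRUE-HF test set (58 analyzed patients).
drop_total, no_drop_total = 26, 32   # patients with / without a detected drop
drop_events, no_drop_events = 7, 1   # unplanned healthcare utilization in each group

tp = drop_events                     # drop detected, event occurred
fn = no_drop_events                  # no drop detected, event occurred
fp = drop_total - drop_events        # drop detected, no event
tn = no_drop_total - no_drop_events  # no drop detected, no event

sensitivity = tp / (tp + fn)         # 7 / 8
specificity = tn / (tn + fp)         # 31 / 50
event_rate_drop = drop_events / drop_total
event_rate_no_drop = no_drop_events / no_drop_total

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"event rate: drop={event_rate_drop:.3f} no-drop={event_rate_no_drop:.3f}")
```

With 8 events among 58 patients, 7 of 8 events fall in the drop group (87.5%, reported as 88%) and 31 of 50 event-free patients fall in the no-drop group (62%), consistent with the values quoted in the text.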
Longitudinal qualitative analysis of forward-filled Locally Weighted Scatterplot Smoothing (LOWESS)-smoothed, TRUE-HF-predicted trajectories demonstrated that patients experiencing an unplanned healthcare event exhibited a progressive decline in daily TRUE-HF pVO2 compared to event-free patients, who maintained stable or slightly increasing pVO2 values (Fig. 4c).
Fig. 4: The TRUE-HF model and its reduced-sensor variant (TRUE-HF-RS) predict declines in daily pVO2 before unplanned healthcare use.
a–c, Performance of the TRUE-HF model on the held-out TRUE-HF test set (n = 63), of which 58 patients had sufficient data for analysis (>10 d, 58 of 63). a, AUROC of unplanned healthcare utilization prediction of the TRUE-HF model against study-entry CPET pVO2, NT-proBNP, 6MWTD and NYHA functional class. AUROC was calculated using the maximum decrease in TRUE-HF daily pVO2 between the first model prediction and drops before the event for TRUE-HF. For NT-proBNP, 6MWT and NYHA, AUROC was calculated using the corresponding study-entry measures. The circled point on the ROC curve represents the sensitivity and specificity of a 10% drop in TRUE-HF daily pVO2. b, Time-varying, extended, Simon and Makuch’s Kaplan–Meier cumulative risk curve for the TRUE-HF model of a 10% drop detected for event-free survival during the study. The shadowed areas indicate 95% CIs. The TRUE-HF drop group is a time-dependent covariate, which more accurately reflects the impact of time-dependent predictions on survival. c, Longitudinal TRUE-HF daily pVO2 model predictions in patients with unplanned healthcare utilization events versus those without. d,e, External validation of the TRUE-HF-RS model in the independent All of Us cohort, of which 193 patients with HF had sufficient data and EHRs for analysis. d, Time-varying Simon and Makuch’s Kaplan–Meier cumulative risk curves for event-free survival in patients in the All of Us cohort, comparing individuals with and without detected >10% drops in daily predicted pVO2.
e, Longitudinal predicted daily pVO2 LOWESS trajectories in patients in the All of Us cohort, comparing those with (teal) and those without (gray) unplanned healthcare utilization events. Across all panels, the solid lines represent mean estimates and the shaded regions indicate 95% CIs. In c and e, LOWESS-smoothed trajectories were generated by forward-filling missing or censored data across the study window.
Extended Cox’s proportional hazards models and two-sided Wald’s test quantified the unadjusted HR for the association between the continuous-change TRUE-HF daily pVO2 covariate and subsequent unplanned healthcare utilization, demonstrating that each 10% drop in pVO2 was associated with a significantly elevated risk (HR 3.62, 95% CI 1.37–9.55, P < 0.01; Table 2). Sensitivity analyses adjusting for independent covariates (age, sex, ethnicity, body mass index, smoking or left ventricular ejection fraction) confirmed that the association remained significant (Extended Data Table 3). None of the common clinical risk measurements or traditional clinical HF models (measured at the study-entry clinic visit) achieved a significant HR (Table 2).

Table 2: Association between longitudinal declines in wearable-derived pVO2 and unplanned healthcare utilization in comparison to established clinical markers

| Method | Continuous | Test set | HR (95% CI) | Wald’s test P value |
| --- | --- | --- | --- | --- |
| TRUE-HF daily pVO2 | ✓ | TRUE-HF (n = 63) | 3.619 (1.371, 9.551) | 0.009 |
| Study-entry CPET pVO2 | ✗ | TRUE-HF (n = 63) | 1.372 (0.328, 5.724) | 0.664 |
| Study-entry NT-proBNP^a | ✗ | TRUE-HF (n = 63) | 0.862 (0.523, 1.420) | 0.559 |
| Study-entry 6MWT distance | ✗ | TRUE-HF (n = 63) | 1.001 (0.994, 1.006) | 0.800 |
| Study-entry NYHA | ✗ | TRUE-HF (n = 63) | 0.876 (0.323, 2.377) | 0.795 |
| Study-entry SHFM | ✗ | TRUE-HF (n = 63) | 1.117 (0.414, 3.019) | 0.830 |
| Study-entry MAGGIC | ✗ | TRUE-HF (n = 63) | 1.06 (0.959, 1.172) | 0.260 |
| Study-entry PREDICT-HF | ✗ | TRUE-HF (n = 63) | 0.937 (0.302, 2.903) | 0.910 |
| TRUE-HF-RS daily pVO2 | ✓ | All of Us Research Program (n = 193) | 1.323 (1.033, 1.694) | 0.026 |

The table shows Cox’s proportional HRs (mean, 95% CI) and two-sided Wald’s test (P value) for wearable-derived TRUE-HF declines in daily pVO2 and study-entry clinic visit comparator risk scores and clinical measures: CPET pVO2, NT-proBNP, 6MWTD and NYHA functional class. The Continuous column denotes predictors modeled as continuous time-varying covariates as opposed to one-time measurements at study entry. Declines in wearable-derived daily pVO2 using the TRUE-HF and TRUE-HF-RS models, respectively, achieved significant HRs for both TRUE-HF (3.619, P = 0.009) and All of Us (1.323, P = 0.026).
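As a consistency check on the two significant rows of Table 2, the Wald statistic can be approximately recovered from a reported HR and its 95% CI. The sketch below assumes the CI is symmetric on the log scale (estimate ± 1.96 standard errors) and that the P value comes from a standard normal reference; the helper names are illustrative. It also shows how an HR per 10% drop scales to other drop sizes under the log-linear form of the Cox model.

```python
import math

def wald_from_ci(hr, lo, hi, z_crit=1.959964):
    """Approximate log-HR standard error, z statistic and two-sided P from a 95% CI."""
    se = (math.log(hi) - math.log(lo)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

def hr_for_drop(hr_per_10, drop_pct):
    """HR implied for another drop size, assuming log-linearity in the covariate."""
    return hr_per_10 ** (drop_pct / 10.0)

rows = {
    "TRUE-HF daily pVO2": (3.619, 1.371, 9.551),
    "TRUE-HF-RS daily pVO2 (All of Us)": (1.323, 1.033, 1.694),
}
for name, (hr, lo, hi) in rows.items():
    se, z, p = wald_from_ci(hr, lo, hi)
    print(f"{name}: se={se:.3f} z={z:.2f} p={p:.4f}")
```

The recovered P values land close to the tabulated 0.009 and 0.026, with small differences attributable to rounding of the published HRs and CIs.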
Using the reduced-sensor model, TRUE-HF-RS, on the internal held-out test set (n = 63), we observed modest discrimination for unplanned healthcare utilization, with an AUROC of 0.57 (95% CI 0.35–0.79) and a positive but nonsignificant association in time-to-event analysis (HR = 1.79, 95% CI 0.77–4.16, P = 0.18; Extended Data Table 4). External validation in the All of Us cohort using the TRUE-HF-RS model (described above) revealed positive trends in risk characterization of unplanned healthcare utilization. Consistent with the trends observed in the TRUE-HF test set, each 10% drop in predicted daily pVO2 was associated with a higher risk of unplanned healthcare use in All of Us (HR 1.32, 95% CI 1.03–1.69, P = 0.03; Table 2). The TRUE-HF-RS model predicted unplanned healthcare utilization a median of 21 d (IQR 9.25–67.25) before events. Repeating sensitivity analyses for All of Us, we found that our HR remained significant against each covariate (Extended Data Table 5). Across the observational period in patients with HF in the All of Us cohort, the TRUE-HF-RS model at our prespecified threshold of ≥10% drop in daily pVO2 showed a sensitivity of 50% and a specificity of 59% for unplanned healthcare use (Fig. 4d). Qualitatively, we also observed similar trends in patients with and without events, as shown in Fig. 4e, where predicted pVO2 decreases over time in patients with events compared to those without events. Aligning with these observed trends in the All of Us Research Program, the time-dependent landmark AUROC evaluated over month-long windows was 0.51 (95% CI 0.33–0.68) at day 10 versus 0.72 (95% CI 0.39–0.92) by day 80 (Extended Data Fig. 3).
We extended our framework and methods to 6MWTD as an additional variable that could be predicted using wearable devices and further examined its prediction of unplanned healthcare utilization. A separate TRUE-HF model was trained for 6MWTD and compared to Apple’s algorithm, sixMinuteWalkTestDistance. As with the CPET pVO2 assessment, a drop in 6MWTD (10% reduction in 6MWTD from baseline to end-of-study follow-up) and healthcare utilization were evaluated. Extended Data Fig. 4 highlights the 6MWTD prediction using biometrics from Apple Watch. The sixMinuteWalkTestDistance in HealthKit performed comparably to the 6MWTD study model across all correlation metrics and effectively detected drops in 6MWTD performance (Extended Data Fig. 5). However, neither achieved statistical significance for unplanned healthcare utilization.
To evaluate the contribution of structured exercise sessions, we masked wearable data collected during the monthly unsupervised 6MWT and Tecumseh cube tests. Exclusion of these structured sessions resulted in negligible differences in predictive performance (m.a.e. change of 0.00575 in predicted pVO2; Supplementary Fig. 2). Specifically, regression performance and classification accuracy for detecting 10% drops in predicted pVO2, as well as predicting unplanned healthcare utilization, remained unchanged, with AUROCs of 0.82 (95% CI 0.69–0.92) and 0.77 (95% CI 0.62–0.90), respectively. We also examined the variation in TRUE-HF prediction performance with shorter input window durations (20 d and 10 d) during both training and evaluation. Model performance was reduced (unplanned healthcare utilization at 20-d windows: AUROC of 0.61 (95% CI 0.40–0.80); at 10-d windows: AUROC of 0.61 (95% CI 0.39–0.82)) compared to the complete 30-d input window (AUROC of 0.77 (95% CI 0.62–0.90)) (Supplementary Tables 1 and 2), indicating that longer trends may provide more reliable information.
Saliency-based analyses revealed that wearable sensor data, especially AppleExerciseTime, StepCount and OxygenSaturation, contributed to model predictions, with higher activity levels generally associated with a higher predicted pVO2 (Supplementary Fig. 3). Clinical variables (age, sex, ethnicity and furosemide dosage) also influenced predictions but exhibited context-dependent effects (Supplementary Fig. 4), with no consistent directional trends observed independently. Importantly, as wearable technologies have shown reduced performance on darker skin, subgroup analysis between white and non-white participants in both cohorts showed comparable model performance across both the TRUE-HF and the All of Us validation sets34 (Supplementary Table 4).
Finally, ablation analyses assessing the independent and combined contributions of wearable and clinical inputs (Supplementary Table 3) demonstrated that wearable data alone provided substantial predictive capacity (unplanned healthcare utilization AUROC of 0.60 (95% CI 0.42–0.78)). In contrast, predictions based solely on clinical variables were weaker (unplanned healthcare utilization AUROC of 0.52 (95% CI 0.32–0.71)). However, integration of both wearable and clinical data (that is, TRUE-HF model) yielded the strongest performance across all outcomes.
We developed an autoregressive transformer framework, termed the TRUE-HF model (Extended Data Fig. 1), which combines wearable data with patient-specific clinical information to forecast daily predictions. Our approach processes 30 d of wearable data, initially aggregated at 90-min intervals and progressively pooled into larger time windows, allowing the model to learn relationships at multiple resolutions, as detailed in Methods. Throughout our study, we considered predicted daily TRUE-HF pVO2 as a naturally continuous measure. We prespecified a clinically meaningful drop in CPET pVO2 as a ≥10% reduction, based on prior literature observing a ≥6% to 10% decrease in CPET pVO2 being associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23–26. Therefore, hereafter, a ≥10% drop in TRUE-HF pVO2 as predicted by our model is considered test positive.
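To make the multi-resolution idea concrete, the sketch below builds a 30-d window of 90-min bins (16 bins per day, 480 bins in total) and average-pools it into coarser windows. The choice of mean pooling, the pooling factors (6-h and daily) and the function names are assumptions for illustration only; the actual pooling scheme is the one described in Methods and may differ.

```python
import numpy as np

BINS_PER_DAY = 24 * 60 // 90  # 16 ninety-minute bins per day
DAYS = 30                     # input window length stated in the text

def pool(x, factor):
    """Average-pool a (time, features) array along time by an integer factor."""
    n_bins, n_feat = x.shape
    if n_bins % factor:
        raise ValueError("number of bins must be divisible by the pooling factor")
    return x.reshape(n_bins // factor, factor, n_feat).mean(axis=1)

def multi_resolution(x, factors=(1, 4, 16)):
    """Return views of the same window at several temporal resolutions.

    The factors are hypothetical: 1 keeps 90-min bins, 4 gives 6-h bins,
    16 gives daily bins.
    """
    return [pool(x, f) for f in factors]

# Hypothetical input: two wearable features (for example heart rate and step count).
rng = np.random.default_rng(0)
window = rng.normal(size=(DAYS * BINS_PER_DAY, 2))
levels = multi_resolution(window)
print([level.shape for level in levels])
```

Missing-data handling (for example non-wear periods) is omitted here; a real pipeline would need a masking or imputation step before pooling.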
We conducted external validation on an independent patient cohort from the National Institutes of Health (NIH) All of Us Research Program, which uses Fitbit wearables27,28. We identified 193 patients who closely matched our TRUE-HF cohort inclusion criteria, including confirmed diagnosis of HF documented in EHRs, sufficient wearable data coverage (defined as wearable use on at least 40% of study days, nonconsecutively) and detailed EHR records available for event adjudication (Extended Data Fig. 2). However, regarding HF severity, the NYHA class was missing in the EHRs for 1,653 of 1,664 (99%) of all possible All of Us patients with HF and wearable data, and BNP was missing in 917 of 1,664 (55%). Therefore, to more closely reflect the clinical severity observed in the TRUE-HF population (where 77% were NYHA class ≥II), we added the inclusion criterion of at least one previously recorded unplanned healthcare event before enrollment. To establish a consistent starting point across patients in the All of Us cohort, we defined the study-entry time point as the first point at which both the index event had occurred and wearable data were available for each patient. Furthermore, due to data availability in the All of Us Research Program, we applied a more stringent definition of unplanned healthcare utilization, restricting events to inpatient hospital admission or intravenous furosemide administration. Outpatient visits were excluded because we could not reliably determine whether they were scheduled or unscheduled, and self-reported outcomes were not available in this dataset. Patient demographics are summarized in Extended Data Table 2.
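The wearable-coverage criterion (use on at least 40% of study days, not necessarily consecutive) can be expressed as a simple filter. The sketch below is illustrative: the function names and the representation of wear days as integer day indices are assumptions, and the final lines only re-derive the missingness percentages quoted above.

```python
def coverage(wear_days, study_days):
    """Fraction of study days with wearable use; days need not be consecutive."""
    if study_days <= 0:
        raise ValueError("study_days must be positive")
    # Count each day once and ignore indices outside the study window.
    valid = {d for d in wear_days if 0 <= d < study_days}
    return len(valid) / study_days

def has_sufficient_coverage(wear_days, study_days, min_fraction=0.40):
    """Apply the >=40% coverage criterion stated in the text."""
    return coverage(wear_days, study_days) >= min_fraction

# Hypothetical patients: one wears the device every other day, one every third day.
print(has_sufficient_coverage(range(0, 100, 2), 100))  # 50 of 100 days
print(has_sufficient_coverage(range(0, 100, 3), 100))  # 34 of 100 days

# Missingness figures quoted in the text.
nyha_missing = 1653 / 1664
bnp_missing = 917 / 1664
print(f"NYHA missing: {nyha_missing:.1%}, BNP missing: {bnp_missing:.1%}")
```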
We trained the TRUE-HF framework to predict absolute CPET pVO2 (l min−1) measurements and performed regression analysis of this TRUE-HF pVO2 model against the end-of-study absolute CPET pVO2 measurement as a validated performance indicator in the held-out test set (n = 50 successful end-of-study assessments). Absolute pVO2 (l min−1) was chosen because it removes the need for linear indexing to body weight and alleviates the confounding effects of body mass on the model’s predicted outcomes. As a baseline, we compared TRUE-HF pVO2 performance with the Apple Watch model, that is, Apple’s cardio fitness (VO2max) algorithm. The TRUE-HF model delivers daily predictions by evaluating the preceding 30 d of data and provides near-continuous remote assessment of cardiopulmonary fitness fluctuations, averaging one prediction per day over the observation period. Conversely, Apple’s algorithm produced fewer predictions because its design requires a minimum level of physical activity, averaging 4.6 predictions for patients with CPET pVO2 < 1 l min−1, 6.3 for CPET pVO2 between 1 l min−1 and 2 l min−1 and 18.0 for patients with CPET pVO2 > 3 l min−1 over the observation period.
The results of this study provide strong evidence to support the potential of consumer-wearable data in estimating cardiorespiratory fitness and building remote monitoring methods for HF. Consumer-wearable data in a free-living environment, analyzed using a transformer model, demonstrated strong concordance with gold-standard CPET measurements. Wearable-derived daily cardiopulmonary fitness assessments enabled new insights: clinically meaningful drops in daily pVO2 forecast imminent unplanned healthcare utilization events in outpatients with HF, with external validation of a reduced-sensor model that maintains a positive risk relationship. Finally, ancillary analysis found that wearable data showed strong concordance with 6MWTD. The findings herein are an essential addition to risk evaluation in HF for several reasons. First, our findings demonstrated the promise of wearables for accurate serial pVO2 assessment in an outpatient HF population, expanding the ability to assess pVO2 from the clinic to a daily, free-living environment. The TRUE-HF model achieved high concordance with gold-standard CPET measurements of pVO2, with an m.a.e. within the previously reported CPET test–retest CI for patients with HF35. This is further exemplified by the TRUE-HF model’s high accuracy in predicting decreases in CPET pVO2 (10% drops) between study-entry and end-of-study clinic visits. The high accuracy of wearable-derived pVO2 using TRUE-HF enables a potential shift from static, infrequent clinical CPET to frequent, dynamic, daily pVO2.
Second, shifting from infrequent clinical CPET to daily pVO2 monitoring allows detection of previously unrecognized clinically meaningful declines in cardiopulmonary fitness (10% daily drops in pVO2) that preceded unplanned healthcare utilization. The TRUE-HF model enhances early detection of deterioration, acting as an early warning signal to identify patients at risk before clinical symptoms appear. At a 10% drop, the TRUE-HF model demonstrated a sensitivity of 88% and specificity of 62% in predicting unplanned healthcare utilization. The sensitivity underscores the TRUE-HF model’s effectiveness in accurately identifying low-risk patients, thereby ruling out individuals unlikely to require immediate intervention and supporting early low-risk intervention (a clinical touchpoint) for higher-risk individuals. Prioritizing sensitivity is crucial, because the early identification of at-risk patients enables proactive measures, such as expedited clinical visits, which can reduce the risk of unplanned healthcare utilization36.
Continuous wearable-derived changes in daily pVO2 were strongly associated with unplanned healthcare utilization, with a significant HR per 10% drop. Static measures inherently capture a single snapshot of physiological status, whereas our model captures evolving physiological changes closer to the time of clinical events, potentially enhancing predictive performance. The near-continuous monitoring of TRUE-HF pVO2 represents a more dynamic and sensitive risk-stratification method or a potential dynamically monitored endpoint in future trials. Thus, although practical, conventional static metrics, such as NT-proBNP levels, CPET pVO2, 6MWTD or NYHA functional class, failed to provide the same proactive early warning signal as our model, which leverages a 30-d window of wearable data to predict unplanned healthcare utilization a median of 7.4 d before it occurred36.
External validation analysis in the All of Us Research Program reaffirmed that a drop in daily wearable-derived pVO2 remains a meaningful predictor of short-term unplanned healthcare utilization. This association held despite the reduced sensor set and demographic differences. Specifically, the All of Us cohort provided fewer per-minute Fitbit measurements, limited to HeartRate and StepCount, compared with the more comprehensive Apple Watch metrics available in the original cohort. There was no AppleExerciseTime or OxygenSaturation, two measurements highlighted as necessary in our feature analysis. Although predictive performance decreased modestly with fewer wearable sensors, the consistent directionality, the statistical significance of the HR and the predictive lead time support the practical feasibility of wearable-derived daily pVO2 in real-world heterogeneous settings.
Furthermore, our findings indicate that reliance on specific exercise thresholds to trigger health predictions poses a barrier to tracking health in sicker patients, whereas use of longer-term trends yields more consistent measurements. Patients with HF often suffer from exercise intolerance due to impaired pulmonary vasodilatation, reduced cardiac reserves, skeletal muscle dysfunction and other comorbidities that may limit their ability to trigger intensity-based exercise thresholds. For instance, at the time of the study, Apple’s pVO2 algorithm required a patient to reach at least 60% of their maximum heart rate and certain activity parameters before observations were reported. Thus, we noted far fewer predictions in sicker patients with HF. In contrast, focus on trends over a longer time window, as with the TRUE-HF pVO2 model, yielded consistent daily predictions for all patients in our cohort, including those with more advanced HF.
The 6MWTD predictions were also assessed, allowing for a separate comparison of wearable-derived remote measurements. Although our findings indicate that wearables can reliably track daily physical activity and predict the 6MWTD measured in the clinic, the direct correlation between changes in 6MWTD and risk of unplanned healthcare utilization was less pronounced. It should be noted that the HR for remote 6MWTD measurements exceeds that for static 6MWTD measurements; however, this difference did not reach statistical significance. Differences in downstream unplanned healthcare utilization may stem from inherent distinctions between the 6MWT and CPET as clinical endpoints. The 6MWT reflects different physiological constructs from CPET. The CPET pVO2 captures maximal volitional capacity (validated by achieving a respiratory exchange ratio of ≥1.1 during CPET). In contrast, 6MWT may be more influenced by submaximal efforts. Thus, although 6MWTD indicates overall health status and functional ability, it may not independently predict acute clinical events in the HF population.
Despite the promise of remote monitoring with wearables, challenges remain. Remote monitoring with wearables can generate over 5 gigabytes of data per patient per week. Analysis of these data to identify biomarker signatures is best suited for modern artificial intelligence techniques, particularly deep learning (DL). We used a transformer-based backbone to interpret wearable data sequences and assessed functional changes in cardiorespiratory fitness among patients with HF. Our new TRUE-HF DL model also employs pooling-based methods, allowing it to assess 30 d of wearable data and aggregate them into increasingly larger periods for analysis while retaining temporal relationships37. TRUE-HF looks at how each hour fits into the rest of that day, how that day fits into the week and how that week fits into 30 d. In contrast, other artificial intelligence methods treat each moment independently; thus, they struggle to understand how a Tuesday workout might affect activities on Thursday38. Here we used the TRUE-HF transformer model to understand wearable data sequences and their ability to measure functional changes in cardiorespiratory fitness in patients with HF, which has not been previously explored.
The study has some limitations that merit consideration. First, the necessity of dividing our TRUE-HF cohort into a development set and a final held-out test set reduced the sample size available for testing. Although the TRUE-HF model’s predictive power for unplanned healthcare utilization is highly encouraging, and our patient enrollment was highly diverse, the low incidence of events among our patient cohort limited our ability to perform subanalyses of our results for specific subgroups of patients with HF. For example, only six patients with HF with preserved ejection fraction were present in the test set. Accordingly, the present study was not powered to support phenotype-specific inference and subgroup analyses were considered exploratory. In addition, our sample size constrained the feasibility of fully adjusted multivariate analyses without risking overfitting. Future studies with a larger validation set will be needed to fully assess the independent prognostic value of TRUE-HF pVO2 drop when comprehensively adjusted for potential confounders.
Traditional prognostic markers, including NT-proBNP, baseline pVO2 and 6MWT distance, showed limited prognostic utility in our analyses. Although point estimates were generally consistent with established prognostic relationships, the corresponding HRs did not reach statistical significance, likely reflecting limited statistical power given the small number of clinical events. Similar patterns were observed for established static risk scores (SHFM, MAGGIC and PREDICT-HF), which are typically calibrated for 1-year to 2-year risk horizons and offer only modest discrimination in many settings. Finally, Apple’s pVO2 estimates may be biased upward under conditions common in this cohort. Heart rate-limiting medications can blunt exercise heart-rate responses and may lead to overestimation by Apple’s pVO2 algorithm. In addition, not all participants engaged in sufficient outdoor activity, which may further reduce Apple’s pVO2 accuracy19.
The findings from the TRUE-HF observational cohort study provide compelling evidence for using consumer-wearable devices to remotely monitor the daily cardiopulmonary fitness of patients with HF in an outpatient, free-living environment. Specifically, drops in wearable-derived daily pVO2 are early indicators of unplanned HF-related healthcare utilization. The potential of the TRUE-HF model to revolutionize remote patient monitoring in cardiology is a cause for optimism. This observed association between detected pVO2 deteriorations and an increased risk of unplanned healthcare utilization underscores the need for further research.
The TRUE-HF study (NCT05008692) was conducted under approved protocols from the University Health Network (UHN) Research Ethics Board (Toronto, Canada; REB no. 20-5205). Written informed consent was obtained from all participants before enrollment. Details of the study protocol are available21 and a summary is provided herein. The statistical analysis plan is included in Supplementary information. Retrospective analyses of the All of Us Research Program used de-identified data accessed through its controlled-access platform. The program operates under a centralized institutional review board and all participants provided written informed consent for data collection and secondary research use. Analyses complied with program policies and applicable ethical and regulatory requirements. The TRUE-HF study (NCT05008692) enrolled outpatients with HF receiving care at the UHN (Fig. 1). Eligible participants were aged ≥18 years and could adequately comprehend English independently or with a caregiver’s help. Research coordinators provided study information at the time of informed consent and contacted patients for follow-up.
During the initial enrollment visit (study entry), we provided patients guidance on setting up their Apple Watch. During this session, we also educated patients on how to use Apple Watch. The appropriate electronic case report form (eCRF) and electrocardiogram (ECG) applications were downloaded and available on iPhone. Apple Watch ECG has only been validated for patients aged >22 years; therefore, only participants older than 22 years were asked to download the Apple ECG app. Participants interacted with an eCRF iOS mobile application. The data gathered in the application were not used to treat the patient. All patients underwent CPET, comprehensive bloodwork, clinical examination and a supervised 6MWT during study-entry and end-of-study clinic visits.
All demographic and clinical measurements recorded during the study-entry and end-of-study clinic visits were captured in Research Electronic Data Capture (REDCap), ensuring data transcription for the study cohort. In collaboration with Apple developers, an eCRF iOS mobile application was developed using Swift to communicate study information, conduct daily surveys and securely gather de-identified HealthKit wearable data from patients for the duration of the study. The wearable-derived data were not used to inform clinical decision-making.
Daily surveys captured unplanned healthcare utilization events during the free-living observation period. Patients were instructed to complete daily surveys assessing the following symptoms: increasing shortness of breath, leg swelling, palpitations, chest pain, light-headedness and fainting. In addition, the daily surveys assessed whether a patient required the following in the last 24 h: changes to medication, intravenous furosemide, unscheduled health visit, emergency room visit and/or hospital admission. We also conducted monthly fitness tests, in which patients were instructed to perform an unsupervised 6MWT and Tecumseh cube test. Instructional videos could be used asynchronously to support these tests. We excluded patients with nonadherence, defined as wearing their Apple Watch for <1 d throughout the 90-d free-living period.
In deriving prediction models, there is no hypothesis test to guide sample size calculation. Therefore, we aimed to calculate sample size requirements for various CI widths around clinically acceptable measures of model discrimination. The study’s sample size was determined using a classification-based approach to predict a >10% decline in pVO2, a threshold previously associated with worse outcomes in patients with HF23–26. With an assumed AUROC of 0.70, a sample size (n = 200 with completed CPETs) was estimated to provide a robust lower bound of the CI for the study objective, including model development (n = 150) and held-out test (n = 50). To prevent analytical bias, model evaluation on the held-out test set was prespecified and conducted only after the study was completed and all participants had exited follow-up.
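To illustrate how CI width around an assumed AUROC varies with sample size, the following is a minimal sketch using the Hanley and McNeil standard-error approximation. The text does not state which formula was used, so this choice is an assumption, as is the 30% event rate and the helper names `hanley_mcneil_se` and `auc_ci`.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley and McNeil approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def auc_ci(auc, n_total, event_rate, z=1.96):
    """Normal-approximation CI for an assumed AUROC, sample size and event rate."""
    n_pos = max(1, round(n_total * event_rate))
    n_neg = n_total - n_pos
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return auc - z * se, auc + z * se

# ASSUMPTION: a 30% rate of >10% pVO2 decline is illustrative only,
# not a figure reported by the study.
for n in (50, 150, 200):
    lo, hi = auc_ci(0.70, n, event_rate=0.30)
    print(f"n={n}: 95% CI ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the interval narrows as n grows, which is the quantity a CI-width-based sample size calculation trades off against recruitment cost.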
The external validation cohort was constructed using data from the NIH All of Us Research Program v8 to validate the unplanned healthcare utilization experiments. Initially, 1,664 participants within the All of Us dataset had a documented diagnosis of HF and were available for analysis using wearable data from Fitbit devices. Among these, 400 individuals were excluded due to incomplete or missing EHR data necessary for event adjudication and accurate cohort characterization (Extended Data Fig. 2).
To align the All of Us cohort’s clinical severity with the TRUE-HF population, we restricted the cohort to include only patients with documented prior unplanned healthcare utilization. We defined unplanned events as inpatient hospitalizations (excluding planned procedures) or intravenous furosemide administration40. For each participant, the study-entry date was set as the later of the following two dates: the discharge date of their qualifying unplanned healthcare event or the first day of available wearable sensor data collection. To ensure adequate data availability, we excluded participants with <30 d of wearable data after study entry. Participants were then filtered to ensure adequate wearable data coverage, defined as at least 40% daily measurement coverage during the observational period after study entry, resulting in the exclusion of additional participants.
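The study-entry rule and the two data-availability filters above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and coverage is assumed to mean the fraction of days in a fixed observation window that contain at least one wearable measurement.

```python
from datetime import date, timedelta

def study_entry(discharge_date, first_wearable_date):
    """Study entry is the later of discharge and the first wearable day."""
    return max(discharge_date, first_wearable_date)

def is_eligible(entry, wearable_days, observation_days=90,
                min_days=30, min_coverage=0.40):
    """Apply the >=30 d and >=40% coverage filters.

    ASSUMPTIONS: `wearable_days` is a set of dates with any measurement;
    the 90-d observation window is an illustrative default.
    """
    window_end = entry + timedelta(days=observation_days)
    in_window = {d for d in wearable_days if entry <= d < window_end}
    if len(in_window) < min_days:
        return False
    return len(in_window) / observation_days >= min_coverage

# Hypothetical participant: discharged after the Fitbit was first worn.
entry = study_entry(date(2023, 1, 10), date(2023, 1, 5))
worn = {entry + timedelta(days=i) for i in range(40)}
print(entry, is_eligible(entry, worn))
```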
Finally, we defined the observation endpoint (‘end-of-study’ visit) as either the occurrence of a second unplanned healthcare utilization event within 120 d or the completion of a follow-up period that matched the TRUE-HF cohort median duration (approximately 94.5 d), with a maximum of 120 d. Participants who did not meet either endpoint criterion were excluded, resulting in a final external validation cohort comprising 193 individuals. Patient demographics are summarized in Extended Data Table 2. The following data were collected, with informed consent from study participants, from Apple Watch through HealthKit during the approximate 90-d free-living period (that is, excluding days of baseline and end-of-study follow-up clinic visits): step count, exercise time, distance traveled, stand time, active energy burned, basal energy burned, heart rate, heart rate variability and O2 saturation. These variables were selected because they were the most frequently and consistently recorded during the interim analysis of the monitoring period and details of each are defined in Apple HealthKit41.
A standardized summarization protocol addressed the varying temporal resolutions across data types. First, abnormal data record errors were removed using an outlier approach: records with values >3 s.d. from the population mean for each data type were removed. Next, we constructed a representation of the wearable data that could integrate large-scale data for downstream usage. Specifically, we normalized the disparate data streams by synthesizing 90-min aggregated metrics (mean, median, minimum, maximum and s.d.) of HealthKit variables (defined above); the sum was used instead of the mean to more accurately capture the overall exercise quality of these HealthKit data types during the 90-min time window. To conservatively estimate variable trajectories during periods of sensor nonrecording, we employed intrapatient forward-filling imputation to address gaps in the 90-min summary data, thereby preserving the autoregressive integrity of the time series.
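The summarization steps above (outlier removal, 90-min aggregation and forward filling) can be sketched in plain Python. This is a minimal illustration under stated assumptions: records are (minute offset, value) pairs for a single data type, the population s.d. is used, and all function names are hypothetical.

```python
import statistics

BIN_MINUTES = 90  # summary window length stated in the text

def drop_outliers(records, n_sd=3.0):
    """Drop (minute, value) records more than n_sd s.d. from the mean.

    ASSUMPTION: `records` pools one data type across the population.
    """
    values = [v for _, v in records]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(t, v) for t, v in records if sd == 0 or abs(v - mean) <= n_sd * sd]

def summarize_bins(records, n_bins):
    """Aggregate one patient's records into consecutive 90-min bins."""
    bins = [[] for _ in range(n_bins)]
    for minute, value in records:
        idx = int(minute // BIN_MINUTES)
        if 0 <= idx < n_bins:
            bins[idx].append(value)
    summaries = []
    for vals in bins:
        if not vals:
            summaries.append(None)  # sensor did not record in this window
            continue
        summaries.append({
            "mean": statistics.fmean(vals),
            "median": statistics.median(vals),
            "min": min(vals),
            "max": max(vals),
            "sd": statistics.pstdev(vals),
            "sum": sum(vals),
        })
    return summaries

def forward_fill(summaries):
    """Carry the last observed summary forward within one patient."""
    filled, last = [], None
    for s in summaries:
        if s is not None:
            last = s
        filled.append(last)
    return filled

records = [(0, 70.0), (30, 72.0), (100, 75.0), (300, 80.0)]  # toy heart-rate values
filled = forward_fill(summarize_bins(drop_outliers(records), n_bins=4))
```

In this sketch, bins before the first observation stay empty (`None`), because forward filling has nothing to carry; how the study handled leading gaps is not stated.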
Using wearable data together with patient-specific clinical information such as sex, race, age, prescription dosages, weight and height, our model predicts an individual patient’s cardiopulmonary fitness and changes in their fitness over time. All nine wearable-derived features presented in Fig. 1b were included as model inputs. Our new method leveraged a contextualized DL model to retain and analyze temporal trends across 30 d of patient-wearable data, providing near-continuous daily monitoring through next-day predictions. To achieve this, our model incorporated three distinct components: (1) it contextualized temporal representations of the data using a bottom-up approach, extending from 90-min intervals to full-day aggregation; (2) it integrated patient-specific clinical information directly into the wearable data features, allowing for adaptive feature calculations; and (3) it explicitly considered the temporal constraints, recognizing that daily activities are influenced by preceding days, and used this to make ongoing predictions for each day.
The TRUE-HF model (Extended Data Fig. 1) processed 30 d of 90-min summaries of wearable data, starting with sequential 90-min summaries and assembling them into larger time windows, allowing the model to learn temporal relationships at different temporal resolutions while improving processing efficiency. This approach is embodied in the model architecture, a bottom-up variant of the transformer model, which optimizes the feature map and reduces the temporal resolution through pooling37,42,43. HealthKit data are first tokenized through one-dimensional convolutions to collapse features along the temporal domain44. The input is then processed through a TRUE-HF block consisting of a transformer layer followed by a pooling layer. The transformer layer learns relationships within the temporal resolution in each TRUE-HF block. Subsequently, the pooling layer aggregates pairs of consecutive time points (that is, from 90-min to 180-min resolution). Using four consecutive TRUE-HF blocks, we effectively analyzed time resolutions of 90 min, 180 min, 360 min, 720 min and 1,440 min (daily resolution). The final prediction layer then aggregated information across the previous 30 d, inclusive, to predict the current day’s measurement.
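The resolution schedule produced by four pairwise pooling steps can be checked with a short sketch. It tracks only sequence lengths and resolutions; the transformer layers are omitted, and mean pooling over scalar tokens is an assumption, because the text does not specify the pooling operator.

```python
MINUTES_PER_DAY = 1440
BASE_RESOLUTION = 90   # minutes per input token
N_DAYS = 30
N_BLOCKS = 4           # consecutive TRUE-HF blocks described in the text

def pool_pairs(tokens):
    """Average each pair of consecutive tokens, halving the sequence length."""
    return [(tokens[i] + tokens[i + 1]) / 2 for i in range(0, len(tokens) - 1, 2)]

def resolution_schedule(tokens):
    """Return [(resolution_minutes, n_tokens), ...] and the final daily tokens."""
    resolution = BASE_RESOLUTION
    schedule = [(resolution, len(tokens))]
    for _ in range(N_BLOCKS):
        # A transformer layer would attend over `tokens` here (omitted).
        tokens = pool_pairs(tokens)
        resolution *= 2
        schedule.append((resolution, len(tokens)))
    return schedule, tokens

# 30 d of 90-min summaries: 16 tokens per day, 480 tokens in total.
tokens = [float(i) for i in range(N_DAYS * MINUTES_PER_DAY // BASE_RESOLUTION)]
schedule, daily = resolution_schedule(tokens)
for minutes, n in schedule:
    print(f"{minutes:>5} min resolution: {n} tokens")
```

Starting from 480 tokens, four halvings leave 30 tokens, one per day, which matches the daily resolution at which the final prediction layer operates.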
To enhance the model’s understanding of wearable data, TRUE-HF incorporates patient-specific clinical information (described above), enabling it to learn different operations based on input attributes rather than treating all inputs equally. We achieved this by augmenting TRUE-HF with a feature-wise linear modulation block that modulates activations in the neural networks based on clinical details45. The model used only demographic information from the baseline clinic visit to maintain temporal causality. Finally, by incorporating a causal self-attention mechanism, we explicitly constrained temporal learning to move forward only (autocorrelation) in the TRUE-HF framework42,46,47. This mechanism introduces autoregressive properties into temporal learning. It ensures that predictions for any given day are influenced only by data from that day and preceding days. We leveraged causal attention to enable additional semi-supervised training, as described below47. All iterative model training was performed exclusively within the first 154 patients, whereas the final 63 patients were used solely for held-out testing. We excluded the days of the study-entry and end-of-study clinic visits from training.
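The two mechanisms named above, feature-wise linear modulation and causal self-attention, can be illustrated in isolation. This is a minimal sketch on plain Python lists, not the TRUE-HF implementation: the scale and shift values are passed in directly rather than learned from clinical inputs, and all names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions; masked positions get zero weight."""
    exps = [math.exp(s) if allowed else 0.0 for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    ASSUMPTION: in a real model gamma and beta would be produced by a small
    network from the clinical inputs; here they are supplied by hand.
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

mask = causal_mask(3)
# Day 1 (index 1) can attend to days 0 and 1, never to day 2.
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(weights, modulated)
```

Because the diagonal of the mask is always allowed, every row has at least one unmasked position and the softmax denominator is never zero.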
To mitigate the lack of daily clinical CPET and 6MWT outcome labels, we utilized semi-supervised learning targeting linearly approximated values of each test across the study48–50. In our study, explicit daily labels were absent, and only baseline and end-of-study clinic visits provided clinical status for our target outcomes (CPET pVO2 or clinical 6MWT). To establish a reasonable approximation of our target outcome (pVO2 or clinical 6MWT) for each of these 30-d windows, we used linear interpolation of clinical outcomes recorded at the initial study-entry assessment and follow-up visits, yielding daily outcomes (Supplementary Fig. 2).
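The interpolated daily targets described above can be sketched as follows. This is a minimal illustration; the function name and the example pVO2 values are hypothetical.

```python
def interpolate_daily_labels(entry_value, end_value, n_days):
    """Daily targets from day 0 (study entry) to day n_days (end of study)."""
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    step = (end_value - entry_value) / n_days
    return [entry_value + step * day for day in range(n_days + 1)]

# Hypothetical patient: pVO2 of 16.0 at study entry and 14.0 four days later.
labels = interpolate_daily_labels(16.0, 14.0, 4)
print(labels)  # [16.0, 15.5, 15.0, 14.5, 14.0]
```

Each 30-d input window would then be paired with the interpolated value for the day being predicted.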
The final TRUE-HF model was an ensemble of ten models, each trained using the same TRUE-HF framework but with different random seeds. The TRUE-HF model uses exclusively wearable data and clinical data to predict future states, never past states. The average prediction from these models was used to derive TRUE-HF predictions. The All of Us dataset provided per-min wearable measurements exclusively for HeartRate and StepCount, whereas critical features from the original TRUE-HF model of ActiveEnergyBurned, BasalEnergyBurned, Distance, AppleStandPlusTime, AppleExerciseTime, OxygenSaturation and HeartRateVariabilitySDNN were unavailable. To address these differences, we employed a knowledge-distillation approach, specifically a teacher–student training strategy51,52. In this approach, predictions generated by a more comprehensive teacher model (TRUE-HF model) served as training targets (pseudo-labels) for a streamlined student model that accommodated the reduced feature set. Given the substantial feature gap between the original TRUE-HF model and the All of Us-compatible variant, we introduced a ‘teacher-assistant’ model to facilitate knowledge transfer and mitigate performance degradation53.
This teacher-assistant model retained all original wearable features but used a reduced clinical feature set aligned with the All of Us cohort. For each training batch, pseudo-labels were generated from a randomly selected member of the original TRUE-HF ten-model ensemble. Subsequently, the ensemble of trained teacher-assistant models provided pseudo-labels to train the final All of Us-compatible TRUE-HF-RS model, which relies exclusively on HeartRate, StepCount and the reduced clinical feature set. All TRUE-HF and TRUE-HF-RS models were trained exclusively on the training set of the TRUE-HF cohort (n = 154). The All of Us Research Program was used exclusively for external validation.
We compared our TRUE-HF pVO2 and TRUE-HF 6MWTD against clinically measured CPET pVO2 and 6MWTD, as well as against Apple VO2Max and sixMinuteWalkTestDistance, respectively. To predict the CPET value from the end-of-study clinic visit, we used wearable data collected over the 30 d preceding it. This ensured that all model inputs reflected free-living conditions, unaffected by structured CPET or tests conducted on the visit day. Apple VO2Max and Apple sixMinuteWalkTestDistance were collected through our iOS mobile application. Note that certain conditions or medications that limit heart rate may cause overestimation by the Apple VO2Max algorithm, as communicated in Apple’s user interface19,54.
To further assess our predictions’ accuracy in detecting pVO2 changes over time, we measured the model’s ability to detect declines at the end-of-study clinic visit, measured against the study-entry clinical measurements. For CPET pVO2, an end-of-study drop was defined as a ≥10% reduction from the study-entry visit to the end-of-study clinic visit; this threshold was chosen because a ≥6% to 10% decrease in pVO2 is associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23–26. To classify patients as having a ≥10% drop in pVO2, we calculated the percentage difference between the last model prediction (that is, the TRUE-HF prediction the day before the clinic visit) and the study-entry CPET pVO2. The same assessment method was used for the TRUE-HF 6MWTD model, for which we tested correlation measures and classified a decline in 6MWTD as a 10% reduction in distance walked.
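The drop classification above reduces to a percentage change against the study-entry value. A minimal sketch, with hypothetical function names and example values:

```python
def percent_change(baseline, current):
    """Signed percentage change of `current` relative to `baseline`."""
    return 100.0 * (current - baseline) / baseline

def is_drop(baseline, predicted, threshold_pct=10.0):
    """True when the prediction is at least threshold_pct below baseline."""
    return percent_change(baseline, predicted) <= -threshold_pct

# Hypothetical values in mL/kg/min: study-entry CPET pVO2 and last model prediction.
print(is_drop(20.0, 18.0))  # True: exactly a 10% reduction
print(is_drop(20.0, 18.5))  # False: a 7.5% reduction
```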
Our secondary objective was to evaluate the association between TRUE-HF-predicted declines in daily pVO2 and unplanned healthcare utilization during the 90-d follow-up period. This association was compared to associations of traditional static risk factors measured at baseline and clinical models (MAGGIC, SHFM and PREDICT-HF). We defined unplanned healthcare utilization as hospitalization, unscheduled clinical visits or urgent intravenous furosemide treatment taking place between the study-entry and the end-of-study visits. The first prediction made by our TRUE-HF pVO2 model required 30 d of data. Hence, this objective was evaluated only among patients free from unplanned healthcare utilization in the first 31 d of our study.
Secondary analysis AUROC was computed using, for each patient, the maximum percentage drop in model prediction, measured from the first model prediction, over the days preceding any event or the end-of-study clinic visit (censoring). DeLong’s one-tailed test was performed to evaluate the a priori hypothesis that continuous monitoring methods would surpass static measures in discriminative performance. Multiple testing correction was applied using the Benjamini–Hochberg method (target false discovery rate = 0.05). Landmark-based, time-dependent AUROC analysis was performed during external validation to assess the model’s ability to predict outcomes based on the largest percentage drop identified up to each landmark time t59. This analysis was repeated at 10-d intervals with 30-d outcome windows. Qualitative results for the trends observed with TRUE-HF predictions were created using LOWESS-smoothed trajectories, generated by forward-filling censored or missing data across the entire study window using the last observed value before censoring or event. Data preprocessing and model development were performed in Python using Pandas (v1.5.3), NumPy (v1.21.3) and PyTorch (v2.0.0). Correlation analyses, m.a.e. calculations and Simon and Makuch’s Kaplan–Meier estimations were conducted with SciPy (v1.7.1) and lifelines (v0.28.0). AUROC, DeLong’s test, the Benjamini–Hochberg correction and Cox’s proportional hazards models were implemented in R using the pROC (v1.18.5), survival (v3.6-4) and timeROC (v0.4) packages. TRUE-HF model architecture definitions, wearable data preprocessing and inference workflows have been made publicly available at https://github.com/mcintoshML/TRUEHF.
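For readers unfamiliar with the Benjamini–Hochberg step-up procedure mentioned above, the following is a minimal Python sketch. The study applied this correction in R; the function name here is hypothetical and the code is an illustration of the standard procedure, not the authors' implementation.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is rejected at the target false discovery rate."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis whose rank is at or below that largest rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

Note that a hypothesis can be rejected even if its own p-value exceeds its rank threshold, provided a larger-ranked p-value passes; this is what distinguishes the step-up rule from a simple per-test cut-off.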
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
The TRUE-HF study (NCT05008692) was conducted under approved protocols from the University Health Network (UHN) Research Ethics Board (Toronto, Canada; REB no. 20-5205). Written informed consent was obtained from all participants before enrollment. Details of the study protocol are available21 and a summary is provided herein. The statistical analysis plan is included in Supplementary information. Retrospective analyses of the All of Us Research Program used de-identified data accessed through its controlled-access platform. The program operates under a centralized institutional review board and all participants provided written informed consent for data collection and secondary research use. Analyses complied with program policies and applicable ethical and regulatory requirements.
The TRUE-HF study (NCT05008692) enrolled outpatients with HF receiving care at the UHN (Fig. 1). Eligible participants were aged ≥18 years and could adequately comprehend English independently or with a caregiver’s help. Research coordinators provided study information at the time of informed consent and contacted patients for follow-up. During the initial enrollment visit (study entry), we provided patients guidance on setting up their Apple Watch. During this session, we also educated patients on how to use Apple Watch. The appropriate electronic case report form (eCRF) and electrocardiogram (ECG) applications were downloaded and available on iPhone. Apple Watch ECG has only been validated for patients aged >22 years; therefore, only participants older than 22 years were asked to download the Apple ECG app. Participants interacted with an eCRF iOS mobile application. The data gathered in the application were not used to treat the patient. All patients underwent CPET, comprehensive bloodwork, clinical examination and a supervised 6MWT during study-entry and end-of-study clinic visits.
The external validation cohort was constructed using data from the NIH All of Us Research Program v8 to validate the unplanned healthcare utilization experiments. Initially, 1,664 participants within the All of Us dataset had a documented diagnosis of HF and were available for analysis using wearable data from Fitbit devices. Among these, 400 individuals were excluded due to incomplete or missing EHR data necessary for event adjudication and accurate cohort characterization (Extended Data Fig. 2). To align the All of Us cohort’s clinical severity with the TRUE-HF population, we restricted the cohort to include only patients with documented prior unplanned healthcare utilization. We defined unplanned events as inpatient hospitalizations (excluding planned procedures) or intravenous furosemide administration40. For each participant, the study-entry date was set as the later of the following two dates: the discharge date of their qualifying unplanned healthcare event or the first day of available wearable sensor data collection. To ensure adequate data availability, we excluded participants with <30 d of wearable data after study entry. Participants were then filtered to ensure adequate wearable data coverage, defined as at least 40% daily measurement coverage during the observational period after study entry, resulting in the exclusion of additional participants.
Finally, we defined the observation endpoint (‘end-of-study’ visit) as either the occurrence of a second unplanned healthcare utilization event within 120 d or the completion of a follow-up period that matched the TRUE-HF cohort median duration (approximately 94.5 d), with a maximum of 120 d. Participants who did not meet either endpoint criterion were excluded, resulting in a final external validation cohort comprising 193 individuals. Patient demographics are summarized in Extended Data Table 2.
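The cohort construction rules above (study-entry date, minimum data duration and daily coverage) can be sketched as follows. All function names are hypothetical, and two interpretations are assumptions of this sketch rather than statements from the source: "<30 d of wearable data" is read as the span from study entry to the last day with data, and coverage is read as the fraction of calendar days in the observation period that have at least one measurement.

```python
from datetime import date, timedelta
from typing import List


def study_entry_date(discharge: date, first_wearable_day: date) -> date:
    """Study entry is the later of the qualifying discharge and the first wearable day."""
    return max(discharge, first_wearable_day)


def coverage_fraction(days_with_data: List[date], start: date, end: date) -> float:
    """Fraction of calendar days in [start, end] that have any wearable data."""
    total_days = (end - start).days + 1
    if total_days <= 0:
        return 0.0
    observed = {d for d in days_with_data if start <= d <= end}
    return len(observed) / total_days


def is_eligible(days_with_data: List[date], entry: date, end: date,
                min_days: int = 30, min_coverage: float = 0.40) -> bool:
    """Apply the duration and coverage filters (interpretation assumed, see lead-in)."""
    after_entry = [d for d in days_with_data if d >= entry]
    if not after_entry:
        return False
    span = (max(after_entry) - entry).days + 1
    if span < min_days:
        return False
    return coverage_fraction(after_entry, entry, end) >= min_coverage
```

For example, a participant with data on every other day of a 94-d observation period has 50% coverage and passes the filter, whereas one with data every fifth day (about 20%) does not.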
The following data were collected, with informed consent from study participants, from Apple Watch through HealthKit during the approximate 90-d free-living period (that is, excluding days of baseline and end-of-study follow-up clinic visits): step count, exercise time, distance traveled, stand time, active energy burned, basal energy burned, heart rate, heart rate variability and O2 saturation. These variables were selected because they were the most frequently and consistently recorded during the interim analysis of the monitoring period and details of each are defined in Apple HealthKit41.
With wearable data and incorporating patient-specific clinical information such as sex, race, age, prescription dosages, weight and height, our model predicts an individual patient’s cardiopulmonary fitness and changes in their fitness over time. All nine wearable-derived features presented in Fig. 1b were included as model inputs. Our new method leveraged a contextualized DL model to retain and analyze temporal trends across 30 d of patient-wearable data, providing near-continuous daily monitoring through next-day predictions. To achieve this, our model incorporated three distinct components: (1) it contextualized temporal representations of the data using a bottom-up approach, extending from 90-min intervals to full-day aggregation; (2) it integrated patient-specific clinical information directly into the wearable data features, allowing for adaptive feature calculations; and (3) it explicitly considered the temporal constraints, recognizing that daily activities are influenced by preceding days, and used this to make ongoing predictions for each day. The TRUE-HF model (Extended Data Fig. 1) processed 30 d of 90-min summaries of wearable data, starting with sequential 90-min summaries and assembling them into larger time windows, allowing the model to learn temporal relationships at different temporal resolutions while improving processing efficiency. This approach is embodied in the model architecture, a bottom-up variant of the transformer model, which optimizes the feature map and reduces the temporal resolution through pooling37,42,43.
HealthKit data are first tokenized through one-dimensional convolutions to collapse features along the temporal domain44. The input is then processed through a TRUE-HF block consisting of a transformer layer followed by a pooling layer. The transformer layer learns relationships within the temporal resolution in each TRUE-HF block. Subsequently, the pooling layer aggregates pairs of consecutive time points (that is, 90–180 min). Using four consecutive TRUE-HF blocks, we effectively analyzed time resolutions of 90 min, 180 min, 360 min, 720 min and 1,440 min (daily resolution). The final prediction layer then aggregated information across the previous 30 d, inclusive, to predict the current day’s measurement. To enhance the model’s understanding of wearable data, TRUE-HF incorporates patient-specific clinical information (described above), enabling it to learn different operations based on input attributes rather than treating all inputs equally. We achieved this by augmenting TRUE-HF with a feature-wise linear modulation block that modulates activations in the neural networks based on clinical details45. The model used only demographic information from the baseline clinic visit to maintain temporal causality.
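The resolution hierarchy described above follows from simple arithmetic: a day contains sixteen 90-min summaries, and four pairwise pooling steps reduce them to a single daily token. The sketch below illustrates only that pooling schedule. It assumes mean pooling over scalar tokens and omits the transformer layers entirely; the source does not state the pooling operator, and the function names are hypothetical.

```python
from typing import List, Tuple


def pool_pairs(tokens: List[float]) -> List[float]:
    """Aggregate consecutive pairs of time points (mean pooling is an assumption)."""
    if len(tokens) % 2:
        raise ValueError("need an even number of tokens to pool in pairs")
    return [(tokens[i] + tokens[i + 1]) / 2.0 for i in range(0, len(tokens), 2)]


def resolution_pyramid(day_tokens: List[float], base_minutes: int = 90,
                       n_blocks: int = 4) -> List[Tuple[int, List[float]]]:
    """Return (minutes per token, tokens) at each level of the hierarchy."""
    levels = [(base_minutes, list(day_tokens))]
    for _ in range(n_blocks):
        minutes, tokens = levels[-1]
        # Each block halves the number of tokens and doubles their time span.
        levels.append((minutes * 2, pool_pairs(tokens)))
    return levels


# One day of sixteen 90-min summaries (placeholder values, not study data).
for minutes, tokens in resolution_pyramid([float(i) for i in range(16)]):
    print(minutes, len(tokens))
```

In the actual architecture each level would hold learned feature vectors rather than scalars, with a transformer layer applied before every pooling step.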
Finally, by incorporating a causal self-attention mechanism, we explicitly constrained temporal learning to move forward only (autocorrelation) in the TRUE-HF framework42,46,47. This mechanism introduces autoregressive properties into temporal learning. It ensures that predictions for any given day are influenced only by data from that day and preceding days. We leveraged causal attention to enable additional semi-supervised training, as described below47.
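The effect of causal masking can be shown with a small, dependency-free sketch: each time step's attention weights are a softmax over its own and earlier positions only, so later positions receive zero weight. This is a generic illustration of causal self-attention, not the TRUE-HF code; the function name is hypothetical and the input is a plain matrix of raw attention scores.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax restricted to positions at or before each time step."""
    n = len(scores)
    weights = []
    for t in range(n):
        visible = scores[t][: t + 1]           # step t may only see steps 0..t
        peak = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions (t+1 .. n-1) get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - t - 1))
    return weights


# With equal scores, step t spreads its weight evenly over steps 0..t.
for row in causal_attention_weights([[0.0] * 4 for _ in range(4)]):
    print(row)
```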
All iterative model training was performed exclusively within the first 154 patients, whereas the final 63 patients were used solely for held-out testing. We excluded the days of the study-entry and end-of-study clinic visits from training. To mitigate the lack of daily clinical CPET and 6MWT outcome labels, we utilized semi-supervised learning targeting linear approximated values of each test across the study48–50. In our study, explicit daily labels were absent, and only baseline and end-of-study clinic visits provided clinical status for our target outcomes (CPET pVO2 or clinical 6MWT). To establish a reasonable approximation of our target outcome (pVO2 or clinical 6MWT) for each of these 30-d windows, we used linear interpolation of clinical outcomes recorded at the initial study-entry assessment and follow-up visits, yielding daily outcomes (Supplementary Fig. 2). The final TRUE-HF model was an ensemble of ten models, each trained using the same TRUE-HF framework but with different random seeds. The TRUE-HF model uses exclusively wearable data and clinical data to predict future states, never past states. The average prediction from these models was used to derive TRUE-HF predictions.
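The two ideas in this paragraph, linearly interpolated daily targets and averaging over an ensemble, can be sketched as below. The function names and values are hypothetical, and the sketch assumes evenly spaced days between the study-entry and end-of-study measurements.

```python
from typing import List


def interpolated_daily_labels(entry_value: float, end_value: float, n_days: int) -> List[float]:
    """Linearly interpolate a clinical measurement between two visits, one value per day."""
    if n_days < 2:
        raise ValueError("need at least two days to interpolate")
    step = (end_value - entry_value) / (n_days - 1)
    return [entry_value + step * d for d in range(n_days)]


def ensemble_mean(predictions: List[float]) -> float:
    """Average the predictions of the ensemble members for one day."""
    return sum(predictions) / len(predictions)


# Illustrative values only (mL/kg/min), not study data.
print(interpolated_daily_labels(20.0, 17.0, 4))
print(ensemble_mean([16.0, 18.0]))
```

The interpolated values serve only as approximate training targets; they are not measurements.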
The All of Us dataset provided per-min wearable measurements exclusively for HeartRate and StepCount, whereas critical features from the original TRUE-HF model (ActiveEnergyBurned, BasalEnergyBurned, Distance, AppleStandPlusTime, AppleExerciseTime, OxygenSaturation and HeartRateVariabilitySDNN) were unavailable. To address these differences, we employed a knowledge-distillation approach, specifically a teacher–student training strategy51,52. In this approach, predictions generated by a more comprehensive teacher model (TRUE-HF model) served as training targets (pseudo-labels) for a streamlined student model that accommodated the reduced feature set. Given the substantial feature gap between the original TRUE-HF model and the All of Us-compatible variant, we introduced a ‘teacher-assistant’ model to facilitate knowledge transfer and mitigate performance degradation53. This teacher-assistant model retained all original wearable features but used a reduced clinical feature set aligned with the All of Us cohort. For each training batch, pseudo-labels were generated from a randomly selected member of the original TRUE-HF ten-model ensemble. Subsequently, the ensemble of trained teacher-assistant models provided pseudo-labels to train the final All of Us-compatible TRUE-HF-RS model, which relies exclusively on HeartRate, StepCount and the reduced clinical feature set. All TRUE-HF and TRUE-HF-RS models were trained exclusively on the training set of the TRUE-HF cohort (n = 154). The All of Us Research Program was used exclusively for external validation.
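The pseudo-labelling step of this teacher–student scheme can be sketched without any deep learning framework. In the sketch below, models are stand-in callables, and all names (`pseudo_labels_for_batch`, `student_view`) are hypothetical; it shows only how one ensemble member is drawn per batch and how the student sees a reduced feature set.

```python
import random
from typing import Callable, List, Sequence

# A model is represented as any callable mapping a feature vector to a prediction.
Model = Callable[[Sequence[float]], float]


def pseudo_labels_for_batch(batch: List[Sequence[float]], teachers: List[Model],
                            rng: random.Random) -> List[float]:
    """Draw one teacher at random for the whole batch and label every sample with it."""
    teacher = rng.choice(teachers)
    return [teacher(x) for x in batch]


def student_view(x: Sequence[float], kept_indices: Sequence[int]) -> List[float]:
    """Keep only the features available to the student (for example, two of nine)."""
    return [x[i] for i in kept_indices]


# Stand-in teachers returning constants; real teachers would be trained networks.
teachers = [lambda x: 1.0, lambda x: 2.0]
batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
targets = pseudo_labels_for_batch(batch, teachers, random.Random(0))
inputs = [student_view(x, [0, 2]) for x in batch]
print(inputs, targets)
```

The student would then be fitted to map `inputs` to `targets`; in the two-stage scheme described above, the same pattern is applied first from teacher to teacher-assistant and then from teacher-assistant to student.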
We compared our TRUE-HF pVO2 and TRUE-HF 6MWTD against clinically measured CPET pVO2 and 6MWTD, as well as to Apple VO2Max and sixMinuteWalkTestDistance, respectively. To predict the CPET value from the end-of-study clinic visit, we used wearable data collected over the 30 d preceding it. This ensured that all model inputs reflected free-living conditions, unaffected by structured CPET or tests conducted on the visit day. Apple VO2Max and Apple sixMinuteWalkTestDistance were collected through our iOS mobile application. Note, certain conditions or medications that limit heart rate may cause an overestimation of the Apple VO2Max algorithm—as communicated in Apple’s user interface19,54.
We examined the impact of removing structured monthly exercise sessions. All wearable data collected during these sessions were masked by excluding measurements within a 90-min window around the start and end times. A 30-d window was chosen before final analyses (described in the statistical analysis plan). We also evaluated how input window length affected accuracy by comparing shorter windows (10 d or 20 d) with the 30-d window using retrained models and zero-shot inference. Saliency analyses on the combined TRUE-HF model quantified feature importance, averaging saliency values daily for visualization. To assess feature contributions, we trained and compared (1) a fully connected neural network with clinical baseline variables and (2) a model using only wearable data.
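The masking of structured exercise sessions can be sketched as a timestamp filter. The source does not specify the exact window geometry, so this sketch assumes that everything from 90 min before a session's start to 90 min after its end is excluded; the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]
Session = Tuple[datetime, datetime]


def mask_exercise_sessions(samples: List[Sample], sessions: List[Session],
                           margin: timedelta = timedelta(minutes=90)) -> List[Sample]:
    """Drop samples that fall inside any session window widened by `margin` (assumed rule)."""
    def is_masked(ts: datetime) -> bool:
        return any(start - margin <= ts <= end + margin for start, end in sessions)

    return [(ts, value) for ts, value in samples if not is_masked(ts)]


# Placeholder data: one session from 10:00 to 11:00 masks 08:30 through 12:30.
day = datetime(2023, 1, 1)
session = (day.replace(hour=10), day.replace(hour=11))
samples = [(day.replace(hour=h), float(h)) for h in (8, 9, 10, 12, 13)]
print(mask_exercise_sessions(samples, [session]))
```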
The analyses and methods were prespecified before data evaluation and performed only on the held-out test set (n = 63), consisting of the last 63 patients who successfully completed an end-of-study clinic visit. We followed the STROBE and MI-CLAIM reporting guidelines55,56. The primary analysis compared our model’s observed and predicted CPET pVO2 values. We used Spearman’s coefficient to measure the rank-based association between observed and predicted values. Pearson’s r was employed to quantify linear correlation. Together, these two correlation measures capture complementary aspects of the model’s predictive fidelity. The m.a.e. was chosen for its interpretability in the same units as the outcome. AUROC was used to evaluate the diagnostic accuracy of correctly predicting a meaningful drop in outcome measurements. To estimate AUROC CIs, we used 1,000 stratified bootstrap resamples. Differences between models were assessed with DeLong’s test, with a two-sided P < 0.05 considered significant.
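A stratified bootstrap for an AUROC confidence interval can be sketched in plain Python as follows. The study computed AUROC in R with pROC, so this is an illustration rather than the authors' code; the percentile method for the interval and the function names are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(scores: Sequence[float], labels: Sequence[int],
                            n_resamples: int = 1000, alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile CI; positives and negatives are resampled separately (stratified)."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats: List[float] = []
    for _ in range(n_resamples):
        boot_pos = [rng.choice(pos) for _ in pos]   # same class sizes in every resample
        boot_neg = [rng.choice(neg) for _ in neg]
        stats.append(auroc(boot_pos + boot_neg, [1] * len(boot_pos) + [0] * len(boot_neg)))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Stratifying keeps the number of events fixed in every resample, which avoids degenerate resamples with no positives when events are rare.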
The secondary analysis used time-varying, extended, Simon and Makuch’s Kaplan–Meier cumulative risk curves to assess the association between an observed drop of ≥10% in a patient’s TRUE-HF daily pVO2 and unplanned healthcare utilization57. In this analysis, the cohort was stratified into two groups: (1) patients with a ≥10% reduction in daily pVO2; and (2) patients with <10% reduction in daily pVO2. We constructed cumulative incidence curves for unplanned healthcare utilization and accounted for the time-varying nature of our independent variable. To assess the probability of event-free survival between these groups, we constructed cumulative risk curves to account for the continuous change in our daily pVO2 covariate, facilitating a longitudinal comparison of clinical outcomes related to unplanned healthcare utilization. An extended Cox’s proportional hazards model quantified the HR and the strength of association between time-varying occurrence of pVO2 drops (scaled in 10% drops) in TRUE-HF daily pVO2 and subsequent clinical events58. Sensitivity analyses involving minimal covariate adjustment were performed. Due to the limited sample size, these analyses used single covariate adjustments rather than comprehensive multivariate adjustments.
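Time-varying analyses of this kind are commonly prepared by splitting each patient's follow-up into intervals in which the exposure is constant (the start-stop or counting-process layout). The sketch below shows that data preparation for a binary exposure, with a patient counted as unexposed until the first ≥10% drop and exposed afterwards. The layout, the binary simplification and the function name are assumptions of this sketch; the study's Cox model used drops scaled in 10% units.

```python
from typing import List, Optional, Tuple

# Each row is (start_day, stop_day, exposed, event_at_stop).
Row = Tuple[int, int, int, int]


def split_follow_up(drop_day: Optional[int], end_day: int, event: bool) -> List[Row]:
    """Split one patient's follow-up at the day of the first >=10% drop, if any."""
    if drop_day is None or drop_day >= end_day:
        # Never exposed during follow-up: a single unexposed interval.
        return [(0, end_day, 0, int(event))]
    if drop_day <= 0:
        # Exposed from the start of follow-up.
        return [(0, end_day, 1, int(event))]
    # Unexposed until the drop (no event in that interval), exposed afterwards.
    return [(0, drop_day, 0, 0), (drop_day, end_day, 1, int(event))]


# A patient whose first drop is on day 20 and who has an event on day 45.
print(split_follow_up(20, 45, True))
```

Attributing the pre-drop time to the unexposed group is what prevents immortal-time bias when comparing the two groups.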
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41591-026-04247-3.