Browse the corpus

Walk the Even Hospital Database by book and chapter — the raw source passages that ground Ask, DDx, and the rest.

76 passages

abstractpubmed· Abstract· item 41495409

A multimodal sleep foundation model for disease prediction. Sleep is a fundamental biological process with broad implications for physical and mental health, yet its complex relationship with disease remains poorly understood. Polysomnography (PSG), the gold standard for sleep analysis, captures rich physiological signals but is underutilized due to challenges in standardization, generalizability and multimodal integration. To address these challenges, we developed SleepFM, a multimodal sleep foundation model trained with a new contrastive learning approach that accommodates multiple PSG configurations. Trained on a curated dataset of over 585,000 hours of PSG recordings from approximately 65,000 participants across several cohorts, SleepFM produces latent sleep representations that capture the physiological and temporal structure of sleep and enable accurate prediction of future disease risk. From one night of sleep, SleepFM accurately predicts 130 conditions with a C-Index of at least 0.75 (Bonferroni-corrected P < 0.01), including all-cause mortality (C-Index, 0.84), dementia (0.85), myocardial infarction (0.81), heart failure (0.80), chronic kidney disease (0.79), stroke (0.78) and atrial fibrillation (0.78). Moreover, the model demonstrates strong transfer learning performance on a dataset from the Sleep Heart Health Study, a dataset that was excluded from pretraining, and performs competitively with specialized sleep-staging models such as U-Sleep and YASA on common sleep analysis tasks, achieving mean F1 scores of 0.70-0.78 for sleep staging and accuracies of 0.69 and 0.87 for classifying sleep apnea severity and presence. This work shows that foundation models can learn the language of sleep from multimodal sleep recordings, enabling scalable, label-efficient analysis and disease prediction.

fulltextpubmed· Main· item 41495409

Sleep is a complex process characterized by intricate interactions across physiological systems, including brain, heart, respiratory and muscle activity1. PSG, the gold standard for sleep evaluation, captures these interactions through recordings of several modalities, including brain activity signals (BAS, including electroencephalogram (EEG) and electrooculogram (EOG)), electrocardiography (ECG), electromyography (EMG) and respiratory signals2. Sleep disorders affect millions of people and are increasingly recognized as indicators of, and contributors to, various health conditions3. Sleep disturbances often precede the clinical onset of numerous conditions, such as psychiatric disorders4, neurodegenerative diseases5 and cardiovascular disorders6. These associations highlight the important role sleep plays in maintaining overall health and underscore its predictive potential across a wide spectrum of diseases. However, most existing studies have focused on identifying links between sleep and specific diseases using isolated metrics or manual annotations, leaving much of the complexity of sleep physiology, as captured in PSG, underutilized.

fulltextpubmed· Main· item 41495409

Recent advances in deep learning have enabled the use of PSG’s multimodal data for tasks ranging from sleep staging and apnea detection to predicting conditions such as atrial fibrillation, biological aging and narcolepsy3,7–10. Despite this progress, current approaches face key limitations: they focus on individual outcomes, depend on supervised learning with expert-labeled data and are trained on relatively small datasets (2,500–15,913 recordings)3,7,9–11. Manual annotations are time consuming and prone to inter-rater variability, making scaling difficult. Moreover, existing models lack flexibility across recording environments, generalize poorly across cohorts and often fail to exploit the richness of multimodal sleep signals. There remains a need for robust, generalizable architectures and systematic evaluation of sleep’s predictive value across a broad range of health conditions.

fulltextpubmed· Main· item 41495409

Foundation models have emerged as a transformative approach in machine learning, enabling robust representation learning from large-scale, unlabeled data12. By leveraging self-supervised learning, these models can be fine-tuned efficiently for diverse applications. In biomedicine, foundation models have demonstrated remarkable capabilities in analyzing complex, heterogeneous datasets, driving advances in disease prediction, patient stratification and therapeutic discovery13,14. Their ability to extract meaningful patterns from large-scale data has addressed many challenges associated with the diverse and high-dimensional nature of clinical datasets.

fulltextpubmed· Main· item 41495409

Despite these successes, their application to sleep remains limited. Sleep data, particularly from PSG, presents unique challenges due to its complexity and variability, including differences in the number and types of recording channels across clinical cohorts. Most sleep studies have focused narrowly on sleep-specific outcomes, constraining the broader potential of foundation models for disease prediction. In preliminary work, we explored self-supervised learning on PSG data in a smaller cohort of participants11. Although this effort highlighted the potential of foundation models for analyzing sleep data, it targeted primarily sleep-specific outcomes and lacked the flexibility to accommodate the diverse configurations of PSG recordings. These limitations emphasize the need for models that can generalize across heterogeneous datasets and systematically uncover the role of sleep in predicting a wider range of diseases.

fulltextpubmed· Main· item 41495409

In this paper we present SleepFM, a foundation model trained on over 585,000 h of PSG data from 65,000+ participants. SleepFM captures the diverse information present in multimodal sleep recordings, integrating EEG, ECG, EMG and respiratory signals. Its channel-agnostic architecture enables joint learning across several modalities, producing representations that generalize across environments. We also introduce a new leave-one-out (LOO) contrastive learning (CL) (LOO-CL) algorithm that aligns information across modalities during pretraining while remaining resilient to missing or heterogeneous channels during inference. Our model uses 5–25 times more data than previously trained supervised sleep3,7,9,10 or biosignal models15,16. Inspired by phenome-wide association studies (PheWAS)17, we examined whether sleep characteristics, as captured by SleepFM, can predict the onset of a wide range of diseases. Leveraging electronic health record (EHR) disease codes, we develop a framework to systematically explore predictive associations between multimodal sleep and diverse health conditions.

fulltextpubmed· Dataset and SleepFM architecture· item 41495409

We describe our dataset and training procedures in detail in Methods. Briefly, we used PSG data from four primary cohorts: Stanford Sleep Clinic (SSC)11, BioSerenity18,19, the Multi-Ethnic Study of Atherosclerosis (MESA)20,21 and the Outcomes of Sleep Disorders in Older Men (MrOS)20,22. SSC includes 35,052 studies from participants aged 1–100 years; BioSerenity adds 18,900 studies from people aged 7–90 years; MESA and MrOS contribute 2,237 and 3,930 PSGs, respectively, from older adults. Together, these cohorts span 65,000 participants and more than 585,000 h of sleep recordings. We further evaluated generalization using the Sleep Heart Health Study (SHHS)20,23, a multicenter dataset of 6,441 adults aged 40 years and older, held out from pretraining and used solely for transfer learning. Dataset distributions postfiltering are shown in Table 1. Demographics for SSC and BioSerenity appear in Extended Data Tables 1 and 2, whereas details for SHHS, MrOS and MESA are available in their respective publications.

Table 1. Distribution of PSG recordings across cohorts and data splits

Split          SSC     BioSerenity  MESA   MrOS   SHHS   Total
Train          24,137  18,869       1,747  3,340  3,291  51,384
Validation     764     100          10     18     500    1,392
Test           5,019   –            150    286    2,000  7,455
Temporal test  5,132   –            –      –      –      5,132
Total          35,052  18,969       1,907  3,644  5,791  65,363

The model was first pretrained on SSC, BioSerenity, MESA and MrOS data, following which these same recordings were used for task-specific fine-tuning. The SHHS dataset is reserved exclusively for evaluating transfer learning capabilities and was used only during fine-tuning, not during pretraining. The temporal test set consists of SSC recordings from 2020 onwards, used to evaluate model robustness to temporal distribution shifts. Dashes (–) indicate that no data is available for that split.

fulltextpubmed· Dataset and SleepFM architecture· item 41495409

Our preprocessing pipeline begins by resampling all signals to 128 Hz for consistency across cohorts. Signals are then segmented into 5-s windows, which serve as the model’s fundamental input tokens. The architecture includes one-dimensional (1D) convolutional layers for feature extraction, followed by channel-agnostic attention pooling to address variability in channel number and order across cohorts. A transformer block captures temporal dependencies over a 5-min context window. During pretraining, we use a multimodal CL objective to align representations across all modalities. The robustness of the model stems from its channel-agnostic design, enabling it to accommodate missing channels, varying channel counts and heterogeneous signal types. For downstream tasks, we leverage the pretrained model’s embeddings through lightweight fine-tuning. The token embeddings from different modalities are pooled again and processed by a two-layer long short-term memory (LSTM) network before passing through task-specific output heads. For patient-level prediction tasks (for example, disease prediction), an additional temporal pooling layer before the output layer compresses all token embeddings into a single 128-dimensional embedding.
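The windowing arithmetic described above can be illustrated with a minimal sketch. It assumes the signal has already been resampled to 128 Hz; the function names, the non-overlapping windows and the choice to drop a trailing partial window are illustrative assumptions, not details taken from the paper.

```python
def segment_into_tokens(signal, sample_rate_hz=128, window_s=5):
    """Split a 1D signal into non-overlapping 5-s windows (tokens).

    Assumption: the signal is already resampled to `sample_rate_hz`, and any
    trailing partial window is dropped.
    """
    samples_per_token = sample_rate_hz * window_s  # 128 Hz * 5 s = 640 samples
    n_tokens = len(signal) // samples_per_token
    return [
        signal[k * samples_per_token:(k + 1) * samples_per_token]
        for k in range(n_tokens)
    ]


def group_into_contexts(tokens, context_s=300, window_s=5):
    """Group consecutive tokens into 5-min context windows (60 tokens each).

    Assumption: contexts do not overlap and a trailing partial context is dropped.
    """
    tokens_per_context = context_s // window_s  # 300 s / 5 s = 60 tokens
    n_contexts = len(tokens) // tokens_per_context
    return [
        tokens[k * tokens_per_context:(k + 1) * tokens_per_context]
        for k in range(n_contexts)
    ]
```

Under these assumptions each token holds 640 samples per channel and each 5-min context holds 60 tokens.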

fulltextpubmed· Dataset and SleepFM architecture· item 41495409

To evaluate model performance across tasks, we use appropriate task-specific metrics. For classification tasks such as sex classification, we report area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC); for sleep apnea classification we show confusion matrices and report accuracy; for age estimation, we use mean absolute error (MAE) and Pearson correlation. Sleep staging is evaluated using the F1 score, which is well suited for class-imbalanced settings. For disease prediction, we report AUROC and Harrell’s concordance index (C-Index), a standard survival analysis metric that measures the proportion of correctly ranked risk pairs. All metrics range from 0 to 1, with higher values indicating better performance; 95% confidence intervals (CIs) are computed using bootstrapping.
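To make "the proportion of correctly ranked risk pairs" concrete, here is a minimal pure-Python sketch of Harrell's C-Index. The function name and the handling of ties (pairs with tied times are skipped, tied risk scores count as half) are assumptions for illustration; the paper does not state which implementation was used.

```python
def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored data.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected event)

    A pair (i, j) is comparable when patient i had an observed event strictly
    before patient j's follow-up time. Assumption: pairs with tied times are
    skipped and tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking.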

fulltextpubmed· SleepFM supports standard sleep analysis tasks· item 41495409

After pretraining SleepFM, we assessed the general utility of its learned representations by fine-tuning on four common benchmark tasks: age estimation, sex classification, sleep stage classification and sleep apnea classification. Although these tasks are not the main focus of our work, they are useful validations showing that the model captures fundamental sleep patterns. For all tasks, we trained lightweight LSTM-based heads on top of the frozen multimodal embeddings derived from entire nights of PSG data.

fulltextpubmed· SleepFM supports standard sleep analysis tasks· item 41495409

For age estimation, we assessed the ability of the model to predict chronological age. Overall performance is shown in Extended Data Fig. 1, with the model achieving a MAE of 7.33 years and a correlation coefficient of 0.88. Performance varied across age groups, with higher accuracy in pediatric and middle-aged participants and greater error in elderly adults, suggesting that age prediction is more challenging at the extremes of the age spectrum. Sex classification yielded an AUROC of 0.86 (0.85–0.87) and AUPRC of 0.90 (0.89–0.91). For sleep stage classification, we fine-tuned an LSTM-based classifier to distinguish Wake, Stage 1, Stage 2, Stage 3 and rapid eye movement (REM) using 5-s windows, a more granular resolution than the standard 30-s epochs, which has been shown to improve precision in certain conditions (for example, narcolepsy10). As shown in Supplementary Fig. 1, SleepFM performs well on Wake, Stage 2 and REM, with expected confusion in transitional stages like Stage 1, consistent with known human scoring variability. We report results across SSC, MESA, MrOS and SHHS, where SleepFM achieves competitive performance compared to the state-of-the-art sleep staging models U-Sleep7, YASA24, GSSC25 and STAGES10, as shown in Extended Data Tables 3 and 4. Furthermore, we compare SleepFM to three PhysioEx26 models on the public datasets DCSM27 and HMC28 in a fully external validation setting, achieving an F1 score of 0.68 on DCSM (outperforming all models) and 0.55 on HMC (Supplementary Table 1).
Although the source alone has little impact, using several datasets for pretraining and fine-tuning improves generalization, boosting macro F1 by around 0.1 (Supplementary Tables 2, 3 and 4), consistent with previous work26.

fulltextpubmed· SleepFM supports standard sleep analysis tasks· item 41495409

For sleep apnea classification, we performed patient-level severity classification to distinguish between four commonly used severity groups on the basis of the apnea–hypopnea index (AHI): none (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30) and severe (AHI ≥ 30). Across MESA, MrOS and SHHS, we observe competitive performance, with a severity classification accuracy of 0.69 and a presence classification accuracy (none/mild versus moderate/severe) of 0.87. The confusion matrix for apnea classification is shown in Fig. 1.

Fig. 1. Overview of SleepFM framework.
a, PSG setup and dataset statistics across several sleep centers. Bars show the number of independent PSG recordings (participants) per cohort and the corresponding total recording hours.
b, Multimodal contrastive pretraining: raw signals from each modality are encoded by a CNN, channel embeddings are pooled within modality and a temporal transformer with temporal pooling yields sequence-level representations for LOO-CL. C: channels, S: sequence length, D: embedding dimension.
c, Fine-tuning using frozen embeddings for downstream tasks (sleep staging, apnea detection, disease prediction). Eight hours of multimodal embeddings are aggregated to patient-level representations, concatenated with age and sex, and passed to an LSTM followed by a fully connected layer.
d, Evaluation across representative tasks and clinical applications. Left and middle: confusion matrices for sleep staging (SHHS) and AHI categories (SSC) shown as row-normalized percentages. Right: disease prediction performance on the Stanford cohort (n = 5,019 participants). Box plots summarize 1,000 patient-level bootstrap resamples: faint dots (individual bootstrap draws), and vertical line with end caps (95% bootstrap percentile CI). Numeric labels are means. Number of positive samples for each disease: CKD (354), death (224), dementia (221), HF (283) and stroke (297).
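The AHI thresholds given in the text map directly onto a small lookup, sketched below. The function names are hypothetical; the cut-offs and the presence grouping (none/mild versus moderate/severe) are the ones stated in the passage.

```python
def ahi_severity(ahi):
    """Map an apnea-hypopnea index to the four severity groups in the text."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def apnea_present(ahi):
    """Binary presence label: none/mild versus moderate/severe."""
    return ahi_severity(ahi) in ("moderate", "severe")
```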

fulltextpubmed· SleepFM enables comprehensive disease prediction from sleep data· item 41495409

To enable disease prediction, we paired SSC data with EHRs, extracting all diagnostic codes (International Classification of Diseases, ninth revision (ICD-9) and International Classification of Diseases, tenth revision (ICD-10)) and their timestamps. These codes were mapped to phecodes—a hierarchical system of 1,868 disease categories designed for PheWAS29. The timestamp of each phecode was defined as the earliest among its corresponding ICD codes. Positive cases were defined as patients whose first phecode instance occurred more than 7 days after the sleep study, avoiding trivial associations. We excluded phecodes with prevalence below 1.5% to ensure statistical power, resulting in 1,041 phecodes for evaluation. For model fine-tuning, we used a multilabel extension of the Cox proportional hazards (CoxPH) loss, averaging independent losses computed for each label.
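The multilabel CoxPH objective mentioned above can be sketched as one Cox negative log partial likelihood per label, averaged across labels. This is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names, the Breslow-style risk sets, the normalization by the number of observed events and the skipping of labels without events are all assumptions.

```python
import math


def cox_ph_loss(risks, times, events):
    """Negative log partial likelihood for one label.

    risks  : predicted log-hazard per patient
    times  : time to event or censoring
    events : 1 if the event was observed, 0 if censored

    Assumptions: the risk set of patient i is every patient with
    time >= times[i], and the loss is divided by the number of events.
    No log-sum-exp stabilization is applied, so very large risk scores
    could overflow.
    """
    total = 0.0
    n_events = 0
    for i in range(len(risks)):
        if not events[i]:
            continue
        risk_set_sum = sum(
            math.exp(risks[j]) for j in range(len(risks)) if times[j] >= times[i]
        )
        total += risks[i] - math.log(risk_set_sum)
        n_events += 1
    if n_events == 0:
        return None  # assumption: labels without events are skipped
    return -total / n_events


def multilabel_cox_ph_loss(risks_per_label, times_per_label, events_per_label):
    """Average the independent per-label Cox losses."""
    losses = [
        cox_ph_loss(r, t, e)
        for r, t, e in zip(risks_per_label, times_per_label, events_per_label)
    ]
    losses = [loss for loss in losses if loss is not None]
    if not losses:
        raise ValueError("no label has an observed event")
    return sum(losses) / len(losses)
```

Assigning a higher risk score to the patient whose event occurs earlier lowers the loss, which is the ranking behavior the C-Index later measures.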

fulltextpubmed· SleepFM enables comprehensive disease prediction from sleep data· item 41495409

Figure 2 illustrates the performance of SleepFM across disease categories on the test set. Although performance varies across categories, SleepFM demonstrates strong results in several areas, including neoplasms, pregnancy complications, circulatory conditions and mental disorders. Overall, 130 future diseases achieved a C-Index and AUROC of at least 0.75 on held-out participants (Bonferroni-corrected P < 0.01), as summarized in Supplementary Table 5. AUROC was calculated using a 6-year horizon, meaning a condition is considered positive if the patient develops the disease within 6 years of their PSG study. The 6-year horizon for AUROC calculation was chosen to balance performance and account for both long-term and short-term conditions. Supplementary Fig. 2 shows AUROC values across 1–6 year horizons for several conditions.

Fig. 2. Performance of SleepFM on the held-out test set (n = 5,019) as stratified by disease category. Individual dots represent a disease within a category. The results are evaluated using two metrics: the C-Index, which measures the model’s ability to rank patient risk accurately, and the 6-year AUROC, which assesses the model’s discrimination performance by evaluating its ability to distinguish between patients who experience the event of interest and those who do not within a 6-year prediction window. For reference, the horizontal dashed line indicates a threshold of 0.75.
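A fixed-horizon AUROC of the kind described above can be sketched in two steps: turn time-to-event data into binary labels at the horizon, then compute the AUROC as the fraction of positive/negative pairs ranked correctly. The function names are hypothetical, and excluding patients censored before the horizon is an assumption; the passage does not say how censoring was handled.

```python
def horizon_labels(times_years, events, horizon_years=6.0):
    """Binary labels at a fixed horizon.

    1    : event observed within the horizon
    0    : followed to the horizon or beyond without an event by then
    None : censored before the horizon (assumption: excluded from AUROC)
    """
    labels = []
    for t, e in zip(times_years, events):
        if e and t <= horizon_years:
            labels.append(1)
        elif t >= horizon_years:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half; entries labeled None are ignored.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Changing `horizon_years` from 1 to 6 would give the kind of per-horizon comparison the passage attributes to Supplementary Fig. 2.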

fulltextpubmed· SleepFM enables comprehensive disease prediction from sleep data· item 41495409

The model showed high accuracy for mild cognitive impairment (AUROC 0.84 (0.80–0.88)), aligning with studies showing sleep disturbances as early markers of cognitive decline30. Strong performance was observed for Parkinson’s disease (0.93 (0.89–0.96)), where sleep disorders are increasingly recognized as potential early indicators31, and developmental delays and disorders (0.84 (0.79–0.87)). Among circulatory conditions, the model effectively predicted hypertensive heart disease (0.88 (0.85–0.91)) and intracranial hemorrhage (0.82 (0.73–0.90)), consistent with established links between sleep disorders and cardiovascular risk32. In the Neoplasm category, the model showed strong predictive performance for several cancers: prostate cancer (0.90 (0.87–0.93)), breast cancer (0.90 (0.86–0.93)) and melanomas of skin (0.83 (0.76–0.90)). These findings align with existing literature linking sleep patterns to cancer risk33,34.

fulltextpubmed· SleepFM enables comprehensive disease prediction from sleep data· item 41495409

Drawing on sleep expertise and previous literature, we identified 14 conditions with strong potential links to sleep patterns. Previous studies associate sleep regularity with mortality35, prolonged sleep with early neurodegeneration36 and sleep disturbances with dementia37, stroke38 and cardiovascular outcomes9. Related phecodes were grouped into unified disease categories in consultation with a medical doctor (Supplementary Table 6). Results for selected conditions, including death, stroke, heart failure (HF) and dementia, are shown in Extended Data Fig. 2. SleepFM demonstrates strong predictive performance, with particularly high accuracy for death (AUROC 0.84 (0.80–0.88)), HF (0.83 (0.79–0.86)), chronic kidney disease (CKD) (0.82 (0.79–0.85)), dementia (0.87 (0.84–0.91)) and stroke (0.81 (0.78–0.85)). All reported associations are statistically significant (P < 0.01, Bonferroni-corrected).

fulltextpubmed· SleepFM enables comprehensive disease prediction from sleep data· item 41495409

To better understand the physiological basis of disease prediction, we analyzed model performance stratified by both sleep stages and signal modalities. We found that although most sleep stages contribute similarly to disease prediction, certain stages such as Stage 1/2 and REM can offer slightly better predictive power for specific conditions, including cardiovascular and neurodegenerative diseases. Likewise, different signal modalities showed nuanced differences, with BAS signals better capturing mental and neurological conditions, respiratory signals more predictive of respiratory and metabolic disorders, and electrocardiogram (EKG) signals more informative for circulatory diseases. Although these differences align with known physiology, the overall predictive performance was highest when combining all modalities. Full results and condition-specific breakdowns are provided in Supplementary Figs. 3 and 4 and Supplementary Tables 7 and 8. Furthermore, we trained separate SleepFM models on each modality to directly assess modality-level importance. Performance comparisons stratified by disease category, presented in Supplementary Tables 9 and 10, further confirm that combining all modalities yields the optimal performance.

fulltextpubmed· SleepFM demonstrates robust generalization across time and cohorts· item 41495409

We evaluate the generalization capabilities of SleepFM across temporal distribution shifts and external site validation. For temporal generalization, we test the model on a separate cohort comprising Stanford patients from 2020 onwards. All model pretraining and training was done on data from before 2020. Despite the limited follow-up period, SleepFM maintains strong predictive performance. Extended Data Fig. 3 shows results for our 14 selected conditions, with particularly robust and statistically significant performance (Bonferroni-corrected P < 0.01) for death (0.83 (0.73–0.91)), HF (0.80 (0.75–0.85)) and dementia (0.83 (0.76–0.89)). Comprehensive temporal-split performance across all disease phenotypes and categories is provided in Supplementary Figs. 5 and 6. Supplementary Fig. 7 further reports temporal-split performance comparisons with baseline models, stratified by disease category. To assess cross-site generalization, we evaluate SleepFM’s transfer learning capabilities on SHHS—a dataset entirely excluded from the pretraining phase. We use the pretrained model to extract embeddings and then fine-tune it on a subset of SHHS. Specifically, the SHHS fine-tuning set includes 3,291 participants, and the test set includes 2,000 participants. Due to differences in task availability between SSC and SHHS, our evaluation focuses on six overlapping cardiovascular conditions. This setup mimics real-world deployment scenarios where foundation models must be adapted to new clinical sites with minimal supervision.

fulltextpubmed· SleepFM demonstrates robust generalization across time and cohorts· item 41495409

As shown in Fig. 3, SleepFM demonstrates strong transfer learning performance across key outcomes. For example, the model achieves statistically significant predictive accuracy (Bonferroni-corrected P < 0.01) for stroke (0.82 (0.76–0.87)), congestive HF (0.85 (0.82–0.88)) and mortality related to cardiovascular disease (0.88 (0.83–0.91)).

Fig. 3. SleepFM prediction performance on the SHHS test set (n = 2,000 participants). Due to differences in available outcome data between SHHS and Stanford datasets, evaluation was limited to a subset of conditions. Results demonstrate transfer learning capabilities across these key clinical outcomes, including stroke, congestive HF and cardiovascular disease-related mortality. Each panel uses barplots derived from 1,000 patient-level bootstrapping: faint points are individual bootstrap draws, and the vertical line with end caps marks the 95% bootstrap percentile CI. Numbers above bars report the mean. Metrics are C-Index (top) and AUROC at 6 years (bottom). The number of positive samples for each outcome is as follows: angina (704), cardiovascular disease death (128), congestive HF (190), coronary heart disease death (80), myocardial infarction (103) and stroke (95). All conditions are statistically significant with a P value <0.01 after Bonferroni correction.
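The patient-level bootstrap with a percentile CI, as described in the figure captions, can be sketched generically. The function name, the fixed seed and the nearest-rank percentile rule are assumptions; only the 1,000 resamples, the 95% level and the reporting of the mean come from the text.

```python
import random


def bootstrap_percentile_ci(patients, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Patient-level bootstrap: mean and percentile CI of a statistic.

    patients  : list with one entry per patient (any per-patient record)
    statistic : function mapping a resampled list to a number, e.g. a metric

    Assumption: percentiles use a simple nearest-rank rule on sorted draws.
    """
    rng = random.Random(seed)
    n = len(patients)
    draws = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping the cohort size fixed.
        sample = [patients[rng.randrange(n)] for _ in range(n)]
        draws.append(statistic(sample))
    draws.sort()
    lo = draws[int((alpha / 2) * (n_resamples - 1))]
    hi = draws[int((1 - alpha / 2) * (n_resamples - 1))]
    mean = sum(draws) / len(draws)
    return mean, lo, hi
```

For a metric such as AUROC or C-Index, each patient entry would carry that patient's label (or time and event) together with the predicted score, and `statistic` would compute the metric on the resampled list.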

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

We compare SleepFM against two supervised baselines: Demographics and End-to-End PSG. The demographics baseline is a multilayer perceptron (MLP) trained on structured clinical features (age, sex, race/ethnicity and body mass index (BMI)). This baseline includes more demographic variables than the SleepFM-based models, which only use age and sex. The End-to-End PSG model is trained directly on raw PSG data using the same architecture and parameter count as SleepFM, and it includes age and sex but does not use any pretraining. From Fig. 4, we observe that the percentage difference in AUROC between SleepFM and both baseline models ranges from 5% to 17%. The magnitude of improvement varies across disease categories; for example, gains are more pronounced in neurological and hematopoietic conditions, whereas in neoplasm-related conditions the improvements are comparatively modest. Supplementary Fig. 8 reports the overall test-set performance comparison between SleepFM and the baseline models across all disease phenotypes.

Fig. 4. Performance improvements of SleepFM over baseline models across disease categories on Stanford test set (n = 5,019 participants). SleepFM and the End-to-End PSG model include age and sex demographic features, whereas the demographics-only model includes age, sex, BMI and race/ethnicity. Each box shows the distribution of disease-level percentage improvements of SleepFM relative to each baseline within the indicated disease category. Improvements are shown for both C-Index (top) and 6-year AUROC (bottom) metrics. Boxes represent the interquartile range (IQR), with whiskers extending to 1.5× IQR and outliers shown as points. Diamonds denote the mean improvement within each category. The horizontal dashed line at zero indicates no improvement.
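As a worked illustration of a percentage improvement, the sketch below assumes the relative form, (model − baseline) / baseline × 100; the passage does not spell out the exact formula, so this definition and the function name are assumptions. The example values are category-averaged AUROCs from Table 2 (Demo column versus the last SleepFM column).

```python
def percent_improvement(model_score, baseline_score):
    """Relative improvement of a model over a baseline, in percent.

    Assumption: improvement is measured relative to the baseline score.
    """
    return 100.0 * (model_score - baseline_score) / baseline_score


# Category-averaged AUROCs taken from Table 2.
neurological = percent_improvement(0.72, 0.62)  # about 16.1%
neoplasms = percent_improvement(0.76, 0.73)     # about 4.1%
```

Under this definition the neurological gain is roughly four times the neoplasm gain, in line with the text's remark that improvements are more pronounced for neurological conditions and comparatively modest for neoplasms.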

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

Next, we evaluated three different variants of SleepFM using identical training configurations, as shown in Table 2 and Extended Data Table 5. SleepFM-LSTM (without Demo) uses SleepFM embeddings with a two-layer LSTM fine-tuning head but no demographic features. SleepFM-Linear uses SleepFM embeddings with a simple linear prediction head and includes age and sex. Finally, SleepFM-LSTM combines the pretrained SleepFM embeddings with a two-layer LSTM head and includes age and sex.

Table 2. Comparison of category-averaged AUROC across SleepFM variants and baselines

| Category | Demo | E2E-PSG | SleepFM-1 | SleepFM-2 | SleepFM-3 |
|---|---|---|---|---|---|
| Circulatory system | 0.74 (0.73, 0.74) | 0.74 (0.73, 0.75) | 0.78 (0.77, 0.78) | 0.79 (0.78, 0.80) | 0.79 (0.78, 0.80) |
| Dermatologic | 0.64 (0.63, 0.65) | 0.63 (0.62, 0.64) | 0.68 (0.67, 0.69) | 0.71 (0.70, 0.72) | 0.70 (0.70, 0.71) |
| Digestive | 0.63 (0.62, 0.64) | 0.64 (0.63, 0.65) | 0.69 (0.69, 0.70) | 0.72 (0.71, 0.73) | 0.72 (0.71, 0.73) |
| Endocrine/metabolic | 0.68 (0.68, 0.69) | 0.67 (0.66, 0.68) | 0.74 (0.73, 0.75) | 0.75 (0.74, 0.76) | 0.75 (0.74, 0.76) |
| Hematopoietic | 0.64 (0.63, 0.66) | 0.66 (0.64, 0.67) | 0.73 (0.72, 0.75) | 0.75 (0.73, 0.76) | 0.74 (0.73, 0.76) |
| Infectious diseases | 0.62 (0.61, 0.64) | 0.62 (0.60, 0.63) | 0.67 (0.65, 0.69) | 0.70 (0.68, 0.71) | 0.70 (0.68, 0.71) |
| Injuries and poisonings | 0.62 (0.61, 0.63) | 0.63 (0.61, 0.64) | 0.68 (0.67, 0.69) | 0.70 (0.69, 0.71) | 0.70 (0.69, 0.71) |
| Mental disorders | 0.66 (0.65, 0.67) | 0.66 (0.66, 0.67) | 0.72 (0.71, 0.73) | 0.74 (0.73, 0.75) | 0.74 (0.74, 0.75) |
| Musculoskeletal | 0.68 (0.67, 0.68) | 0.68 (0.67, 0.69) | 0.70 (0.69, 0.71) | 0.72 (0.72, 0.73) | 0.72 (0.71, 0.73) |
| Neoplasms | 0.73 (0.71, 0.74) | 0.73 (0.71, 0.74) | 0.73 (0.72, 0.74) | 0.76 (0.75, 0.77) | 0.76 (0.75, 0.77) |
| Neurological | 0.62 (0.61, 0.63) | 0.63 (0.62, 0.64) | 0.70 (0.69, 0.71) | 0.72 (0.71, 0.73) | 0.72 (0.71, 0.73) |
| Respiratory | 0.63 (0.62, 0.64) | 0.64 (0.63, 0.65) | 0.69 (0.68, 0.70) | 0.69 (0.69, 0.70) | 0.70 (0.69, 0.71) |
| Sense organs | 0.66 (0.65, 0.67) | 0.67 (0.66, 0.68) | 0.71 (0.70, 0.72) | 0.73 (0.72, 0.74) | 0.73 (0.72, 0.74) |
| Symptoms | 0.65 (0.64, 0.66) | 0.66 (0.64, 0.67) | 0.72 (0.71, 0.73) | 0.75 (0.74, 0.76) | 0.75 (0.74, 0.76) |

Category-averaged 6-year AUROC (mean (95% CI)) comparing SleepFM variants with two baselines across disease categories on Stanford cohort (n = 5,019). The Demographics baseline (Demo) uses only structured clinical features (age, sex, BMI and race/ethnicity).

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

The End-to-End PSG baseline (E2E-PSG) is trained directly on raw PSG signals with age and sex, without any pretraining. SleepFM-1 denotes SleepFM-LSTM (without Demo), using two LSTM layers in the fine-tuning prediction module and no demographic features. SleepFM-2 denotes SleepFM-Linear, a linear prediction module on SleepFM embeddings with age and sex. SleepFM-3 denotes SleepFM-LSTM, which uses two LSTM layers in the fine-tuning prediction module with age and sex. Values are averaged within each category across conditions. Uncertainty is estimated by nonparametric bootstrapping (n = 1,000 resamples): for each resample, conditions within a category are sampled with replacement and the category mean is computed; 95% CIs are the 2.5th–97.5th percentiles across resamples.
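The bootstrap behind the Table 2 intervals resamples conditions within a category, not patients. A minimal sketch of that procedure follows. The per-condition AUROC values are hypothetical, and the function name and the simple rank-based percentile rule are assumptions made for illustration rather than the authors' code.

```python
import random


def category_mean_ci(condition_scores, n_boot=1000, seed=0):
    """Percentile bootstrap CI for a category mean.

    Each resample draws len(condition_scores) conditions with replacement
    and records the mean of the draw; the CI is the 2.5th to 97.5th
    percentile of those means.
    """
    rng = random.Random(seed)
    k = len(condition_scores)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(condition_scores) for _ in range(k)]
        means.append(sum(sample) / k)
    means.sort()
    lo = means[int(0.025 * (n_boot - 1))]  # simple rank-based percentile
    hi = means[int(0.975 * (n_boot - 1))]
    return sum(condition_scores) / k, lo, hi


# Hypothetical per-condition 6-year AUROCs for one disease category.
scores = [0.70, 0.74, 0.68, 0.81, 0.77, 0.72, 0.69, 0.75]
mean, lo, hi = category_mean_ci(scores)
print(f"{mean:.2f} ({lo:.2f}, {hi:.2f})")
```

Because the resampling unit is the condition, these intervals describe how stable the category average is across its member conditions, which is a different question from the patient-level uncertainty of any single condition's AUROC.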

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

As seen in Table 2, the demographics-only baseline performs well, reflecting the fact that many diseases are associated strongly with age, sex, BMI and race/ethnicity. For example, in the Neoplasm category, older age is a strong predictor of cancer risk. Nevertheless, all SleepFM-based models, including the SleepFM-LSTM (without Demo) variant, consistently outperform the demographics and End-to-End PSG baselines across most disease categories. This demonstrates the benefit of using pretrained SleepFM embeddings for disease prediction. Furthermore, compared to the supervised demographics baseline, SleepFM-LSTM (without Demo) gains more than 5 AUROC points in 9 out of 14 disease categories, whereas SleepFM-Linear and SleepFM-LSTM gain more than 5 AUROC points in 12 out of 14. As seen from the 95% CI bars, these improvements are robust, with most differences being larger than the uncertainty intervals. Finally, SleepFM-Linear performs comparably to SleepFM-LSTM, suggesting that the strength of the model lies in the pretrained embeddings rather than the complexity of the downstream head. Percentage improvement comparisons across models are provided in Supplementary Fig. 9, and a scatterplot comparison of all disease phenotypes across different fine-tuning architectures on top of SleepFM is shown in Supplementary Fig. 10.

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

To further examine disease-specific performance, full results are provided in Supplementary Tables 11, 12 and 13, and clinician-selected conditions are presented in Supplementary Fig. 11. These comparisons show that SleepFM achieves substantial gains across several neurological, mental, circulatory, endocrine/metabolic and respiratory conditions. For neurological and mental disorders, SleepFM attains higher C-Index scores for senile dementia (0.99 (0.98–1.00) versus 0.87 (0.75–0.96)), myoneural disorders (0.81 (0.73–0.88) versus 0.42 (0.28–0.55)) and developmental delays (0.80 (0.77–0.84) versus 0.58 (0.51–0.64)). For circulatory diseases, SleepFM outperforms in atherosclerosis (0.92 (0.88–0.95) versus 0.74 (0.64–0.89)) and acute pulmonary heart disease (0.80 (0.75–0.85) versus 0.74 (0.68–0.80)). Improvements in endocrine/metabolic conditions include diabetes type 2 with circulatory complications (0.87 (0.83–0.91) versus 0.79 (0.74–0.85)) and diabetic retinopathy (0.81 (0.77–0.85) versus 0.75 (0.69–0.80)). For respiratory conditions, SleepFM achieves a higher C-Index in respiratory insufficiency (0.79 (0.72–0.85) versus 0.59 (0.51–0.67)) and respiratory failure (0.77 (0.73–0.80) versus 0.70 (0.65–0.74)). These findings highlight the versatility of SleepFM in predicting a broad range of diseases beyond what is captured by demographics alone.

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

Similarly, full comparisons with the End-to-End PSG model are provided in Supplementary Table 14. This comparison highlights the value of foundation model pretraining: although both models share similar architecture and input signals, SleepFM benefits from self-supervised pretraining, enabling more robust and informative representations. This advantage is reflected in consistent performance gains across neurological, circulatory, endocrine/metabolic and respiratory conditions. For neurological and mental disorders, SleepFM outperforms the end-to-end model in myoneural disorders (0.84 (0.75–0.91) versus 0.54 (0.40–0.69)), developmental delays (0.84 (0.79–0.87) versus 0.61 (0.52–0.69)) and speech/language disorders (0.83 (0.74–0.90) versus 0.71 (0.60–0.83)). For circulatory conditions, improvements are observed in atherosclerosis of native arteries of the extremities (0.95 (0.92–0.98) versus 0.65 (0.61–0.69)), atherosclerosis of the extremities (0.84 (0.75–0.90) versus 0.78 (0.71–0.85)) and acute pulmonary heart disease (0.84 (0.77–0.90) versus 0.76 (0.69–0.83)). In endocrine/metabolic disorders, SleepFM demonstrates stronger performance for predicting diabetes with circulatory complications (0.89 (0.85–0.93) versus 0.79 (0.70–0.87)), neurological manifestations (0.86 (0.81–0.90) versus 0.73 (0.67–0.78)) and diabetic retinopathy (0.84 (0.79–0.89) versus 0.76 (0.69–0.82)). Respiratory conditions also benefit, with better performance in predicting respiratory insufficiency (0.82 (0.72–0.91) versus 0.64 (0.54–0.73)) and respiratory failure (0.76 (0.71–0.82) versus 0.68 (0.62–0.74)). In predicting all-cause mortality, SleepFM achieves an AUROC of 0.85 (0.80–0.89), outperforming both the Demographics baseline and the End-to-End PSG model, which achieve an AUROC of 0.78 (0.72–0.82).

fulltextpubmed· SleepFM surpasses supervised baselines in disease prediction· item 41495409

Finally, we compare fine-tuning scalability by evaluating SleepFM alongside two baseline models as we increase the amount of fine-tuning data and measure performance on the same test set. These results are shown in Extended Data Fig. 4 for SHHS and Extended Data Fig. 5 and Supplementary Fig. 12 for SSC. For both datasets, the key observation is that SleepFM consistently outperforms the supervised baselines, with its performance improving steadily as more data are used, remaining above the baseline curves for nearly all conditions. For SHHS, SleepFM surpasses the Demographics baseline in five out of six conditions across all data percentages, with particularly large improvements in smaller dataset splits. For example, SleepFM trained on just 10% of the data outperforms the Demographics baseline trained on five times more data across all conditions in SSC and four out of six conditions in SHHS (for example, cardiovascular disease death, congestive HF, myocardial infarction and stroke). SleepFM also outperforms the End-to-End PSG baseline in five out of six conditions, although the gap is slightly smaller than with the Demographics baseline. SleepFM exhibits stable scaling behavior across data percentages, with smoother performance improvements, whereas the baseline models show greater variability.

fulltextpubmed· Discussion· item 41495409

This study presents a large-scale foundation model for sleep analysis, developed on more than 585,000 h of PSG data from 65,000 participants. Our work makes several contributions. First, we address challenges in sleep analysis by leveraging self-supervised learning to train a foundation model that learns from unlabeled data and is agnostic to channel type and number, enabling broad exploration of sleep data across diverse clinical settings. Second, through extensive evaluation across 1,041 disease phenotypes, we demonstrate sleep’s broad predictive power for diverse health outcomes. The model shows strong performance in predicting death (C-Index 0.84), dementia (0.85), HF (0.80) and CKD (0.79). Third, we demonstrated transfer learning capabilities through strong performance on the SHHS dataset. Despite SHHS being entirely excluded from pretraining, our model maintains robust predictive power for key outcomes such as stroke (C-Index 0.81), congestive HF (0.83) and death related to cardiovascular disease (0.86). Finally, SleepFM achieves competitive performance on standard sleep analysis tasks, including sleep staging and apnea detection, with mean F1 scores ranging from 0.70 to 0.78 across cohorts—comparable to state-of-the-art models such as U-Sleep7, GSSC25, STAGES10 and YASA24. Furthermore, in a fully external validation setting, SleepFM outperforms all models on DCSM (F1 = 0.68) and is competitive with the PhysioEx26 models. For apnea classification, SleepFM achieves 87% accuracy in MESA, MrOS and SHHS, comparable to state-of-the-art models8.

fulltextpubmed· Discussion· item 41495409

SleepFM predicts all-cause mortality more accurately than both the Demographics-based model and the End-to-End PSG model, achieving a higher C-Index of 0.84 (0.81–0.87), compared to 0.79 (0.75–0.82). This indicates that pretraining efficiently captures subtle mortality-related signals in the PSG data. Research shows a strong association between all-cause mortality and sleep-related factors, including high arousal burden39, low REM sleep40, sleep-disordered breathing41, hypoxemia and low sleep efficiency42. Increased ‘brain age’ derived from EEG has also been identified as an important predictor of mortality3. SleepFM probably integrates these multifactorial contributors, capturing respiratory events, sleep fragmentation, arousal burden and sleep efficiency, along with markers of cardiovascular, metabolic and other diseases.

fulltextpubmed· Discussion· item 41495409

Predictive and prognostic models for neurological and mental disorders are advancing rapidly, offering the potential for earlier and more individualized treatment. Among the top conditions predicted by SleepFM were Alzheimer’s disease and Parkinson’s disease, with C-Indices of 0.91 (0.87–0.98) and 0.89 (0.85–0.92), respectively. Sleep disorders are associated strongly with preclinical Alzheimer’s disease43, including abnormalities in non-REM sleep, such as reduced slow-wave activity44, REM sleep disturbances45 and decreased spindle activity46. In early Alzheimer’s disease, REM sleep abnormalities have been linked to basal forebrain cholinergic lesions, which probably contribute to cognitive decline47. Similarly, Parkinson’s disease is frequently preceded by REM sleep behavior disorder, characterized by REM sleep without atonia and abnormalities in BAS and ECG patterns48. Recent studies have also shown that respiratory signals can capture phenotypes specific to Parkinson’s disease49.

fulltextpubmed· Discussion· item 41495409

Consistent with these findings, SleepFM identified BAS as the strongest predictor of neurological and mental disorders, whereas respiratory signals were particularly effective in predicting senile dementia. Most studies in this domain rely on imaging modalities such as magnetic resonance imaging (MRI) and functional MRI (fMRI) to predict dementia. For example, one study using hippocampal MRI achieved a C-Index of 0.86 (ref. 50), whereas another using fMRI reported an AUROC of 0.82 for predicting dementia up to 9 years in advance51. Although direct performance comparisons are challenging due to differences in sample distributions, the ability of SleepFM to leverage PSG data to predict neurological and mental disorders underscores its potential as an alternative to imaging-based approaches. Other established biomarkers for Alzheimer’s disease, such as amyloid PET, decreased cerebrospinal fluid β-amyloid42 and increased cerebrospinal fluid phosphorylated tau (for example, p-tau129)52,53, have been used widely for diagnosis and prognosis. More recently, plasma p-tau217 has emerged as a promising less invasive marker54. Sleep biomarkers from PSG data offer a complementary, noninvasive tool for the prognosis of dementia and mild cognitive impairment.

fulltextpubmed· Discussion· item 41495409

SleepFM accurately modeled cardiovascular disease in both the SSC and SHHS datasets, leveraging data-driven methods commonly used in prognostic modeling of cardiovascular disease, particularly with ECG data55 and lead II ECG from PSG studies9. Foundation models have demonstrated state-of-the-art performance with ECG data in various cross-sectional tasks15. For predicting cardiovascular mortality over 10 years, a previous study reported an AUROC of 0.84 (0.78–0.89) in a subset of SHHS participants with sleep apnea, whereas SleepFM achieved a slightly higher AUROC of 0.88 (0.83–0.91). Similarly, for atrial fibrillation, earlier work reported an AUROC of 0.82 (ref. 9), which aligns with SleepFM’s performance of 0.81 (0.78–0.84). Our ablation study further demonstrated that both ECG and respiratory signals contribute to the prediction of circulatory system phenotypes, suggesting that SleepFM integrates information on sleep apnea and heart activity in ways that are consistent with known disease mechanisms56.

fulltextpubmed· Discussion· item 41495409

Most disease categories, including neurological, circulatory, hematopoietic, mental disorders and endocrine/metabolic, were predicted with notably improved performance by SleepFM compared to the Demographics-based and End-to-End PSG baseline models. Many of these diseases are either associated with sleep (for example, type 2 diabetes57) or influenced directly by the signal modalities (for example, heart arrhythmia). Disrupted and unhealthy sleep contributes to dysfunction across several physiological systems, increasing the risk of diseases such as obesity, type 2 diabetes, hypertension, stroke and cardiovascular disease58. Sleep-specific conditions, including sleep apnea56 and, less conclusively, periodic leg movements59, are also linked to cardiovascular outcomes. Furthermore, specific EEG waveforms, such as coupled slow-wave and spindle activity, have been identified as markers of next-day blood glucose regulation60.

fulltextpubmed· Discussion· item 41495409

Despite these promising results, several limitations should be acknowledged. Although our dataset is large, it consists primarily of patients referred for sleep studies due to suspected sleep disorders or other medical conditions requiring overnight monitoring. This selection bias means our cohort is not representative of the general population, as people without sleep complaints or those with limited access to specialized sleep clinics are underrepresented. The model’s performance shows some degradation in temporal test sets, highlighting the challenge of maintaining predictive accuracy over time as clinical practices and patient populations evolve. Furthermore, interpreting the predictions made by SleepFM is inherently challenging due to the complexity of the features learned by a deep model during training. To mitigate this, we stratified the model’s performance across sleep stages and data modalities, and conducted evaluations on temporal test sets and unseen datasets to gain insights into its behavior. However, further work is needed to enhance case-level interpretability and understand the specific sleep patterns and features driving these predictions.

fulltextpubmed· Discussion· item 41495409

In building our model, we selected hyperparameters for SleepFM based on previous work and ensured all training converged in loss; more extensive hyperparameter searches may further boost performance. Furthermore, although we evaluated SleepFM’s transfer learning performance on an independent dataset, SHHS, only a subset of the full 1,041 conditions could be assessed in this sample due to limited diagnostic overlap with SSC; this prevented a comprehensive evaluation of generalization across the full spectrum of diseases. Our sleep apnea analysis was limited to binary and four-class classification on the basis of AHI thresholds; we did not explore more granular formulations such as AHI regression or event detection, which we leave for future research. Similarly, although SleepFM achieves competitive results on sleep staging tasks across most datasets, it lags behind specialized sleep staging models on certain external validation datasets (for example, HMC). Further specialized modeling may be necessary to optimize SleepFM for sleep staging.
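For readers unfamiliar with the AHI-based labels mentioned above, the sketch below shows how binary and four-class apnea targets are commonly derived from the apnea-hypopnea index (events per hour of sleep). The cut-offs of 5, 15 and 30 are the widely used clinical convention and are an assumption here; the excerpt does not state which thresholds the authors applied, and the function names are hypothetical.

```python
def apnea_present(ahi, threshold=5.0):
    """Binary label: True when AHI meets the (assumed) presence threshold."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    return ahi >= threshold


def apnea_severity(ahi):
    """Four-class label using the conventional 5 / 15 / 30 cut-offs (assumed)."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


for ahi in (2.0, 9.5, 22.0, 41.3):
    print(ahi, apnea_present(ahi), apnea_severity(ahi))
```

Binning in this way discards within-class differences (an AHI of 31 and 80 are both "severe"), which is one reason the regression and event-detection formulations named above are listed as future work.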

fulltextpubmed· Discussion· item 41495409

This study underscores the potential of sleep-based foundation models for risk stratification and longitudinal health monitoring. By integrating several physiological signals and leveraging large-scale pretraining, SleepFM performs consistently well across diverse disease categories and outperforms supervised baselines. Its stable performance across fine-tuning splits suggests that pretraining may improve model generalizability, particularly in clinical contexts with limited labeled data. These results suggest that SleepFM can complement existing risk assessment tools and help identify early signs of diseases. As wearable sleep technologies continue to advance, models such as SleepFM may offer opportunities for noninvasive, real-time health monitoring. Future efforts should explore how combining sleep-based models with data from EHRs, omics and imaging can further enhance their utility.