Browse the corpus

Walk the Even Hospital Database by book and chapter — the raw source passages that ground Ask, DDx, and the rest.

121 passages

abstractpubmed· Abstract· item 42135531

Advancing conversational diagnostic AI with multimodal reasoning. Real-world clinical practice is inherently multimodal, relying on the synthesis of patient history with visual information such as medical imagery and clinical documents. Although large language models (LLMs) have shown promise in diagnostic dialogue, their evaluation has been largely restricted to text-only interactions, failing to capture the complexity of modern remote care delivery. Here we introduce a multimodal extension of the Articulate Medical Intelligence Explorer (multimodal AMIE), capable of gathering, interpreting and reasoning about multimodal data within a diagnostic conversation. To achieve this, we developed a state-aware dialogue framework that dynamically guides history-taking based on diagnostic uncertainty and evolving patient states, emulating the structured reasoning of experienced clinicians. We evaluated this updated, state-aware version of multimodal AMIE against primary care physicians (PCPs) in a randomized, blinded exploratory study comprising 105 simulated telehealth consultations, which included dermatology photographs, electrocardiograms and clinical documents. As assessed by 18 specialist physicians, multimodal AMIE outperformed PCPs not only in diagnostic accuracy but also in conversation quality, including history-taking and empathy. Specifically, multimodal AMIE demonstrated superior performance on 29 of 32 evaluation axes, including seven of nine metrics that assess multimodal reasoning. These results validate the efficacy of state-aware reasoning in bridging the gap between text and visual information and demonstrate the potential for artificial intelligence (AI) systems to augment clinicians in complex, multimodal diagnostic settings.

fulltextpubmed· Main· item 42135531

Healthcare delivery faces considerable challenges globally from aging populations and increasing care fragmentation through to clinician burnout and misalignment between clinical and financial incentives [1]. This drives increased wait times for providers, delayed care and potentially worsened morbidity and mortality [2]. Even in high-income countries, there are many regions where access to primary care is limited and challenging [3].

fulltextpubmed· Main· item 42135531

Recently, we have witnessed the promise of generative AI systems in healthcare, especially those based on LLMs, as potential tools to augment healthcare delivery and address some of these challenges. Research systems such as the AMIE, an LLM-based conversational diagnostic AI system, have demonstrated physician-like capabilities to steer text-based clinical conversations, attaining superior or similar performance compared to PCPs across a broad array of measurements for consultation quality in studies involving patient-actors [4,5]. Although the evidence base for such autonomous diagnostic systems in real-world clinical settings remains limited, some conversational AI systems for triage and care navigation have shown promising results in specific applications. For instance, Zeltzer et al. [6] conducted an evaluation of an AI system deployed in a real-world virtual urgent care setting for common symptoms and showed that the initial AI recommendations were generally concordant or rated by expert adjudicators as better than the diagnosis and management recommendations from the treating physicians. Furthermore, studies on other conversational systems have also shown high patient acceptability, a crucial factor for successful adoption [7,8].

fulltextpubmed· Main· item 42135531

Despite the early promise of this technology, LLM-based medical AI systems have predominantly been studied and implemented as text-only chatbots. This represents a substantial deviation from how multimodal information is routinely used in care. As demonstrated during the COVID-19 pandemic [9], to augment in-person care through visits at lower costs and increased patient convenience [10], it is integral for remote care participants to be able to converse about and interpret multimodal medical information. This can take the form of images such as patient-provided skin photographs, electrocardiogram (ECG) tracings or laboratory reports, which can be shared via ubiquitous messaging platforms [11,12] that combine text and multimodal chat or even more commonly can occur via real-time video consultations. We also note that these data (images or other attachments) provided by remote care participants are distinct from data obtained directly in the clinical environment, such as radiology imaging.

fulltextpubmed· Main· item 42135531

Multimodal medical tests are essential in effective care and can meaningfully guide the course of a consultation. However, evidence validating the use of LLMs for diagnostic conversations involving such multimodal data is scarce, revealing an important discrepancy between clinical needs and current technology. Patients may struggle to precisely articulate the results of an investigation using text alone, potentially omitting crucial details such as precise laboratory test values, and essential clinical information commonly resides only in non-textual formats [11]. For example, visual data such as photographs are paramount for remote assessment of dermatological conditions, and clinical documents, such as blood reports or specialist letters, contain objective findings that can be otherwise impractical to relay accurately via transcription to text chat. A text-only approach prevents AI from leveraging these rich, supplementary objective sources of information, hindering its ability to form a complete clinical picture.

fulltextpubmed· Main· item 42135531

An overreliance on text-only input not only increases the risk of diagnostic errors but also risks exacerbating disparities in access to telehealth [13], presenting barriers for individuals with lower digital literacy and language proficiency. By contrast, instant messaging apps that allow users to send text, voice and video messages (or share photo, audio and video messages) are commonplace with billions of users. Although real-time video call is more common in telemedicine, such multimedia instant messaging platforms enable considerably richer communication than text-only conversation and have seen reported use as tools for remote clinical consultations in which doctors and patients can exchange multimodal medical data (laboratory results, videos, photographs and more) [12,14].

fulltextpubmed· Main· item 42135531

The ability of LLMs to request, interpret and reason about multimodal medical data during clinical conversations has not been investigated or evaluated. To address this, we introduce multimodal AMIE, which advances the conversational diagnostic capabilities of the original system [4] by integrating multimodal medical perception. We introduce a state-aware reasoning framework designed to orchestrate complex clinical conversation flows, which dynamically adapts responses based on intermediate model outputs, reflecting the evolving patient state and diagnostic uncertainty. This allows multimodal AMIE to strategically request, interpret and integrate multimodal artifacts, including skin images, ECGs and clinical documents, into the dialogue to refine diagnoses in a manner that emulates the diagnostic reasoning process of experienced clinicians in a telehealth setting. To enable rapid iteration and validation of design choices, we developed a simulation environment for generating realistic patient scenarios and conducting automated turn-by-turn dialogue assessments. Finally, to evaluate the performance of our finalized system, we conducted a randomized, blinded human evaluation that emulates an objective structured clinical examination (OSCE), a standardized assessment method used in both medical schools and physician licensing examinations worldwide. We compared multimodal AMIE to PCPs in multimodal text-based chats and evaluated their behaviors on 105 carefully designed multimodal clinical scenarios with validated patient-actors.
We note, however, that our study is not a randomized clinical trial with prespecified endpoints and preregistered statistical analysis. Rather, it is an exploratory study investigating the properties of multimodal diagnostic dialogue. Furthermore, we designed a dedicated multimodal understanding and handling (MUH) rubric to rigorously compare AI and human performance in artifact interpretation over numerous dimensions. Figure 1 provides an overview of our system’s key components, our evaluation methodology and key findings.

fulltextpubmed· Main· item 42135531

Fig. 1. Overview of key contributions. This figure provides a schematic overview of the key components enabling and evaluating multimodal diagnostic conversations within the AMIE system, facilitated through a multimodal chat interface. a, Multimodal state-aware reasoning. Multimodal AMIE employs a novel state-aware dialogue phase transition framework built on the publicly available Gemini 2.0 Flash model. This dynamically controls the conversation flow through history-taking, diagnosis and management phases, guided by intermediate outputs reflecting patient state and diagnostic uncertainty, allowing strategic requests and interpretation of multimodal artifacts. b, Simulation environment. A comprehensive simulation framework enables rapid development and automated evaluation. This framework involves generating realistic patient scenarios grounded in real images and metadata, using Gemini 2.0 Flash, simulating turn-by-turn multimodal dialogues between multimodal AMIE and patient agents, and using an auto-rater agent for assessment against clinical criteria. c, Randomized comparative study. The performance of multimodal AMIE was evaluated against PCPs in a randomized, double-blind, OSCE-style study. Trained patient-actors conducted synchronous chat consultations based on 105 diverse multimodal scenarios involving artifacts such as skin photographs, ECGs and clinical documents. d, Evaluation results. Specialist evaluations demonstrated that multimodal AMIE attains similar or superior performance to PCPs in handling and reasoning about multimodal data over multiple evaluation axes alongside strong performance in diagnostic accuracy and overall consultation quality. UI, user interface.
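The phase-transition idea in panel a can be sketched as a small state machine. This is only an illustration under stated assumptions, not the authors' implementation: the `PatientState` fields, the `uncertainty` proxy, the `next_phase` function and the 0.3 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = "history-taking"
    DIAGNOSIS = "diagnosis"
    MANAGEMENT = "management"


@dataclass
class PatientState:
    """Hypothetical intermediate output tracked across dialogue turns."""
    # Probability assigned to each candidate diagnosis (assumed to sum to ~1).
    ddx: dict[str, float] = field(default_factory=dict)
    # Artifacts (e.g. "skin photo", "ECG") the agent still wants to see.
    pending_artifacts: list[str] = field(default_factory=list)
    # Whether the differential has been communicated to the patient.
    diagnosis_shared: bool = False


def uncertainty(state: PatientState) -> float:
    """Crude uncertainty proxy: 1 minus the top diagnosis probability."""
    return 1.0 - max(state.ddx.values(), default=0.0)


def next_phase(phase: Phase, state: PatientState, threshold: float = 0.3) -> Phase:
    """Advance to the next phase only when the current one has done its job."""
    if phase is Phase.HISTORY_TAKING:
        # Keep gathering history while artifacts are outstanding or the
        # differential is still too uncertain (threshold is an assumption).
        if state.pending_artifacts or uncertainty(state) > threshold:
            return Phase.HISTORY_TAKING
        return Phase.DIAGNOSIS
    if phase is Phase.DIAGNOSIS:
        return Phase.MANAGEMENT if state.diagnosis_shared else Phase.DIAGNOSIS
    return Phase.MANAGEMENT
```

In this sketch an outstanding artifact request or a flat differential keeps the agent in history-taking, which mirrors the caption's description of transitions being guided by patient state and diagnostic uncertainty.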

fulltextpubmed· Results· item 42135531

To evaluate the clinical utility of multimodal AMIE, we conducted a randomized, blinded OSCE-style study comparing the system against board-certified PCPs in multimodal text chat consultations. These findings indicate that, within the context of this randomized, blinded exploratory study, multimodal AMIE performance was similar to or higher than PCP performance across the assessed axes, as rated by specialists and patient-actors, including history-taking, diagnostic accuracy, management reasoning, communication skills, empathy and the understanding and handling of multimodal data. Subsequently, we report findings from our automated evaluation framework, using controlled ablation studies. (Here and throughout the text, the precise meaning of terms such as ‘matching’ or ‘exceeding’ that we use when comparing multimodal AMIE to PCPs, including the magnitude of difference and associated statistical analysis, is described in the corresponding results figures and Methods section.)

fulltextpubmed· Results· item 42135531

In our remote OSCE-style evaluation of simulated consultations using multimodal text chat, the performance of multimodal AMIE was rated as similar or superior to that of PCPs across all assessed axes of comparison. Below, we report performance on an objective measure of diagnostic accuracy and subjective but blinded measures of patient experience and other key attributes of effective clinical consultations rated by specialist physicians. A detailed statistical breakdown of the consultation characteristics, including dialogue length and verbosity by participant type and modality, is provided in Supplementary Section 2.1.

fulltextpubmed· Results· item 42135531

Multimodal AMIE achieved higher diagnostic accuracy compared to PCPs in the OSCE-style simulated clinical consultations that both multimodal AMIE and PCPs performed with patient-actors using multimodal text chat. As illustrated in Fig. 2a, which plots the top-k diagnostic accuracy, the differential diagnosis (DDx) of multimodal AMIE was both more accurate and more comprehensive.
Fig. 2. Comparison of OSCE metrics between PCPs and multimodal AMIE. a, Top-k DDx accuracy. Both multimodal AMIE and PCPs have the opportunity to submit a DDx list (with at least three and up to 10 plausible items, ordered by likelihood) for each dialogue. The top-k accuracy was computed using an auto-rater that compares each of the 10 diagnoses with the ground truth. Multimodal AMIE outperformed PCPs in diagnostic accuracy on average (P < 0.001, n = 210 conversations). b, Subgroup analysis of DDx accuracy. Top-3 accuracy was compared across subgroups within three axes as defined in the MUH rubric: (1) image quality (n = 204 conversations; 115 high quality, 89 low quality), (2) image-grounded reasoning (n = 204 conversations; 143 ‘Yes’, 61 ‘No’) and (3) hallucination of image findings in consultation (n = 204 conversations; 135 ‘No hallucination’, 56 ‘Hallucination’, 13 ‘Significant hallucination’) (see Extended Data Table 1 for details). c, Distributions of relative performance across patient scenarios. The proportion of scenarios for which PCPs were rated more favorably than multimodal AMIE and vice versa is presented for three categories in the OSCE rubric, namely, MUH rubric (left), non-multimodal specialist metrics (middle) and non-multimodal patient-centric metrics (right). Each specialist and patient-actor separately rated two dialogues (one with PCP and the other with multimodal AMIE) for the same scenario, and their relative ranking was derived from their scores. For the specialist metrics, each scenario was evaluated by three different expert physicians, yielding three pairwise rankings that were aggregated by taking the majority vote.

fulltextpubmed· Results· item 42135531

The ‘Tie’ label was assigned to cases where two or more specialists assigned the same ratings to both PCP and multimodal AMIE consultations or the cases for which the opinions on the relative performance were equally divided. Significant differences between multimodal AMIE and PCP segments, calculated using a one-sided χ² test comparing only the proportions of preference (ignoring ties), are indicated by asterisks (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant). n = 210 conversations for all patient-centric metrics and n = 204 for all expert ratings. Extended Data Figs. 1 and 2 show the underlying distributions of ratings for the same set of metrics. For all panels, the error bars center on the mean and represent 95% confidence intervals estimated based on 10,000 bootstrap samples.
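The per-scenario aggregation described in the caption (pairwise rankings from scores, a majority vote over three specialists, and the two ‘Tie’ conditions) can be sketched as follows. This is a minimal illustration, assuming that higher scores are better and that each rater contributes one `(pcp_score, amie_score)` pair; the function names are hypothetical and the significance test is not reproduced here.

```python
from collections import Counter


def pairwise_preference(pcp_score: float, amie_score: float) -> str:
    """Turn one rater's two scores into a preference label."""
    if amie_score > pcp_score:
        return "AMIE"
    if pcp_score > amie_score:
        return "PCP"
    return "Tie"


def aggregate_scenario(ratings: list[tuple[float, float]]) -> str:
    """Aggregate (pcp_score, amie_score) pairs from several raters."""
    votes = Counter(pairwise_preference(p, a) for p, a in ratings)
    # Two or more raters gave both dialogues the same rating -> Tie.
    if votes["Tie"] >= 2:
        return "Tie"
    # Opinions on relative performance are equally divided -> Tie.
    if votes["AMIE"] == votes["PCP"]:
        return "Tie"
    return "AMIE" if votes["AMIE"] > votes["PCP"] else "PCP"


def preference_share(labels: list[str]) -> float:
    """Fraction of non-tied scenarios in which AMIE was preferred."""
    decided = [label for label in labels if label != "Tie"]
    return sum(label == "AMIE" for label in decided) / len(decided)
```

For example, `aggregate_scenario([(3, 4), (2, 5), (4, 3)])` yields `"AMIE"` because two of the three raters scored the AMIE dialogue higher.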

fulltextpubmed· Results· item 42135531

Multimodal AMIE consistently outperformed PCPs in accuracy across the full range of k (1–10 diagnoses). Multimodal AMIE’s leading diagnosis (top-1) and broader DDx list (top-2 to top-10) more frequently contained the ground truth condition compared to the lists generated by PCPs. This overall difference in diagnostic accuracy was statistically significant (P < 0.001, based on a mixed-effects model accounting for scenario difficulty; see ‘Statistical analysis’ section). We note that although we analyze accuracy up to top-10, neither multimodal AMIE nor PCPs always report 10 differential diagnoses.
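Top-k accuracy over DDx lists of varying length, together with a bootstrap confidence interval, can be sketched as below. The sketch assumes the auto-rater's output has already been reduced to one list of booleans per dialogue (one flag per ranked diagnosis, `True` where it matches the ground truth); that data layout, the function names and the choice of a percentile bootstrap are assumptions, since the passage does not specify them.

```python
import random


def top_k_hit(matches: list[bool], k: int) -> bool:
    """True if any of the first k ranked diagnoses matched the ground truth."""
    # Slicing handles lists shorter than k, so a 3-item DDx is scored as-is.
    return any(matches[:k])


def top_k_accuracy(all_matches: list[list[bool]], k: int) -> float:
    """Share of dialogues whose top-k list contains the ground truth."""
    return sum(top_k_hit(m, k) for m in all_matches) / len(all_matches)


def bootstrap_ci(all_matches, k, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for top-k accuracy (resampling dialogues)."""
    rng = random.Random(seed)
    hits = [top_k_hit(m, k) for m in all_matches]
    n = len(hits)
    stats = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Because short lists are simply sliced, accuracy at large k stops improving once every list has been exhausted, which is consistent with the note that not every participant reports 10 diagnoses.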

fulltextpubmed· Results· item 42135531

This work advances from prior comparison of multimodal AMIE in simulated consultations by incorporating the ability for patients to upload multimodal artifacts into text chat. To understand the extent to which our results specifically reflected this new capability, and to explore the extent to which the comprehension and utilization of multimodal data contributed to the superior overall diagnostic accuracy of multimodal AMIE, we performed specific subgroup analyses using specialist physicians’ blinded consultation ratings (Fig. 2b). First, regarding the impact of image quality (Fig. 2b, left), when independent specialists rated the provided artifact quality as low, the top-3 DDx accuracy decreased for both multimodal AMIE and PCPs as anticipated. However, multimodal AMIE demonstrated a significantly smaller performance drop than that of PCPs. Second, regarding the impact of artifact use in reasoning (Fig. 2b, middle), when specialists judged that the consulting agent (multimodal AMIE or PCP) appropriately used the visual artifact, diagnostic accuracy improved similarly for both. Finally, we examined consultations where specialists identified instances of the clinician or multimodal AMIE reporting findings not actually present in the artifact, which we refer to here as ‘hallucination’ (Fig. 2b, right). Such misreporting negatively impacted diagnostic accuracy for both but significantly more so for PCPs compared to multimodal AMIE.

fulltextpubmed· Results· item 42135531

We also analyzed performance variation across the different types of multimodal data used in our OSCE scenarios: skin photographs (Skin Condition Image Network (SCIN)), ECG tracings (PTB-XL) and clinical documents (Fig. 3a). Here, we used a monotonic regression model on k (number of diagnoses in the differential used to compute accuracy), with additional effects estimated for experimental arm and modalities. Both multimodal AMIE and PCPs were more accurate for scenarios based on clinical documents (P < 0.001). Multimodal AMIE retained statistically significantly higher accuracy across all modalities. For the full statistical analysis details, see Supplementary Section 2.2.
Fig. 3. Variation in performance across modality types. a, The average top-k DDx accuracy is shown as a function of the size of the differentials (k) for three artifact types. Error bars center on the mean and represent 95% confidence intervals derived from 10,000 bootstrap samples. b, The panel shows the distributions of specialist physicians’ ratings (DDx appropriateness and management plan appropriateness) and patient-actors’ ratings (‘patient happy to return in future’); this analysis attempts to capture the overall user satisfaction of the consultation. The distributions of ratings are displayed separately for the sessions with PCPs and multimodal AMIE and for modality types. c, The distributions of scores for three MUH metrics are displayed. For illustration purposes, all responses from five-point rating scales were mapped to a generic five-point scale ranging from ‘Very favorable’ to ‘Very unfavorable’. For Yes/No questions as indicated by ‘(Y/N)’, a (positive) ‘Yes’ response was mapped to the same color as ‘Favorable’ and a (negative) ‘No’ response to the same color as ‘Unfavorable’. The ‘NA’ rating indicates responses that are ‘not applicable’, ‘cannot rate’ or ‘does not apply’. The McNemar test (two-sided) was used for each evaluation axis, and P values were adjusted using the FDR correction. Asterisks represent statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant).
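The per-axis testing described in the caption (a two-sided McNemar test followed by an FDR adjustment) can be sketched with the standard library alone. Two choices here are assumptions, because the passage does not state them: the exact binomial form of the McNemar test, and the Benjamini-Hochberg procedure as the FDR correction. The inputs `b` and `c` are the two discordant-pair counts for one evaluation axis (for example, scenarios rated favorable for one arm only).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A typical use would compute one `mcnemar_exact` P value per evaluation axis and pass the whole list to `benjamini_hochberg` before applying the asterisk thresholds.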

fulltextpubmed· Results· item 42135531

We compared multimodal AMIE and PCPs on the patient-centric quality metrics that were collected upon conclusion of the consultations. Patient-actors rated the interactions with multimodal AMIE as similar to or higher than those with board-certified PCPs across the assessed dimensions in Fig. 2c (right) (see Extended Data Fig. 1 for full ratings). This included aspects such as being polite, listening, explaining conditions, involving patients in decisions, appearing honest/trustworthy, building rapport, showing empathy and managing patient concerns (General Medical Council Patient Questionnaire (GMCPQ) and Practical Assessment of Clinical Examination Skills (PACES) criteria, P < 0.01 for all axes). We also asked patient-actors two questions specifically about the nature of multimodal interaction (Extended Data Table 1). The two questions, rated from 1 to 5, asked how well the doctor addressed the patients’ questions about image artifacts and how well the doctor explained the findings from the artifacts. Figure 2c (left) shows that multimodal AMIE was rated more highly than clinicians on both questions (P < 0.01).

fulltextpubmed· Results· item 42135531

A group of 18 specialists in dermatology, cardiology and general practice (internal medicine) evaluated multimodal AMIE and PCPs in terms of numerous attributes of effective diagnostic conversations, including multimodal reasoning, history-taking, diagnostic accuracy, management reasoning, communication skills and empathy. Aggregated results are presented in Fig. 2c (see Extended Data Fig. 2 for full ratings). To account for potential interdisciplinary variations, we also present results by specialty (Fig. 3). The conversations held by multimodal AMIE were consistently rated more highly by specialists overall and across three disciplines. This is reflected in all items in our assessment, across diagnosis and management and quality of history-taking. The positive assessments of the diagnostic dialogues that multimodal AMIE conducted were also supported by results in the MUH component of our assessment: specialists assigned higher ratings on average to multimodal AMIE’s interpretation of the multimodal artifacts, its reasoning about these artifacts and the way it handled patient questions and concerns about the multimodal artifacts.

fulltextpubmed· Results· item 42135531

This result is also reflected in all three disciplines studied. Multimodal AMIE provided more appropriate diagnoses and management plans, and patients were happier to return for a second visit with multimodal AMIE in all three fields (Fig. 3b; all P < 0.001). Similarly, in all three disciplines, multimodal AMIE interpreted artifacts and reasoned about them more accurately, with no increase in hallucination or misreporting (Fig. 3c; all P < 0.001, P < 0.05 and P > 0.05, respectively). This higher performance across all disciplines was broadly mirrored in the top-k accuracy (Fig. 3a), although the superiority was statistically significant only for clinical documents, potentially due to the smaller sample size. In Supplementary Section 4.2, we provide qualitative examples from the three disciplines of dialogues between multimodal AMIE and a patient-actor and between a PCP and a patient-actor where multimodal AMIE has correctly identified the most probable diagnosis. For each discipline, we use the same patient scenario for the dialogue with multimodal AMIE and with the PCP, to facilitate a qualitative comparison.

fulltextpubmed· Results· item 42135531

Beyond the OSCE study, this section details automated evaluations of multimodal AMIE using simulated dialogues and auto-rating (detailed in the ‘Auto-rating simulated diagnostic conversations’ section and Fig. 4) to quantify diagnostic performance across different system configurations.

Fig. 4. Overview of multimodal dialogue simulation and the evaluation framework. This figure illustrates the three key components of the system. Step 1 (patient scenario generation): Comprehensive patient profiles are created, including condition, demographics, symptoms and medical history, using information derived from web searches and real-world datasets—for example, PTB-XL (for cardiology) and SCIN (for dermatology). These profiles are then used to generate detailed patient scenarios, outlining the patient’s presentation and expectations. Step 2 (dialogue simulation): A doctor agent and a patient agent engage in a text-based consultation with multimodal artifact upload. The doctor agent is instructed to provide empathetic and clinically accurate responses while the patient agent responds truthfully based on the generated scenario. Step 3 (auto-rating): An auto-rater agent evaluates the simulated dialogue based on predefined criteria, including management appropriateness, information-gathering effectiveness and the presence of hallucinations. Qualitative feedback is also provided by the auto-rater to explain the scores. Mx, management.

fulltextpubmed· Results· item 42135531

To further understand the contributions of different system components to overall performance, we conducted several ablation studies using our automated evaluation framework. This section details experiments comparing the full multimodal AMIE system against simpler baselines and evaluating the specific impact of test-time reasoning, history-taking and robustness to variations in patient presentation on conversation quality, as assessed by the auto-rater. One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach.

fulltextpubmed· Results· item 42135531

Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· Results· item 42135531

Across all datasets (PAD-UFES-20, SCIN, Clinical Documents and PTB-XL; Fig. 5a, Extended Data Table 2 and Supplementary Section 3.4), multimodal AMIE with reasoning consistently outperformed the vanilla baseline on key diagnostic metrics. Specifically, Extended Data Table 2 (also see Supplementary Section 3.4) shows that multimodal AMIE with reasoning achieved higher DDx accuracy. Notable top-1 accuracy gains occurred on Clinical Documents (0.89 to 0.98), PAD-UFES-20 (0.75 to 0.84) and PTB-XL (0.20 to 0.28). Reasoning also improved information-gathering scores (for example, Clinical Documents: 4.00 to 4.17). Hallucination rates remained negligible in both settings, although they were sometimes higher with state-aware reasoning. Management plan appropriateness also improved (for example, SCIN: 3.69 to 3.89).

Fig. 5. Auto-rater evaluation to quantify the value of state-aware reasoning and dialogue-based interaction. a, State-aware reasoning ablation compares the performance of multimodal AMIE with its state-aware reasoning against a baseline without it, using the dialogue simulation environment across four datasets: skin photographs (SCIN, n = 885), skin photographs (PAD-UFES-20, n = 1,290), ECG traces (PTB-XL, n = 2,724) and Clinical Documents (created by our clinical providers, n = 138). The metrics include top-k DDx accuracy, management plan appropriateness (scaled), quality of information gathering (scaled) and non-hallucination rate (the proportion of cases without any hallucination). b, Dialogue ablation aims to quantify the value of dialogue-based interaction by comparing the top-k DDx accuracy of the baseline LLM that infers the differentials directly from the image artifacts (denoted by ‘Images-only’) against that of the multimodal AMIE system, which leverages the combination of provided images and conversational history (‘Images + Dialogue (with multimodal AMIE)’). Error bars center on the mean and represent 95% confidence intervals derived from 10^4 bootstrap samples. Significance markers denote P values obtained from a two-sided Mann−Whitney U-test comparing the two conditions within each panel/metric. Asterisks represent statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant). Mx plan, management plan.
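
The error bars described in the Fig. 5 caption are bootstrap confidence intervals around a mean. A minimal sketch of one common way to compute such an interval is given below; the percentile method, the default number of resamples and the function name are assumptions made for illustration, since the caption does not specify the bootstrap variant.

```python
import random


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). `n_boot` and `seed` are illustrative
    defaults, not values taken from the paper.
    """
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return sum(values) / n, means[lo_idx], means[hi_idx]
```

For a top-k accuracy metric, `values` would be a list of 0/1 outcomes, one per simulated dialogue. The two-sided Mann-Whitney U-test named in the caption is not reimplemented here; SciPy provides it as `scipy.stats.mannwhitneyu`.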

fulltextpubmed· Results· item 42135531

To understand the additional performance gained with history-taking versus analyzing multimodal artifacts alone, we conducted an ablation comparing two settings: (1) ‘Images-only’, where the base model diagnosed from images without dialogue, omitting history-taking, and (2) ‘Images + Dialogue’, where multimodal AMIE used images, full dialogues and its state-aware reasoning at inference. Results are shown in Fig. 5b. The ‘Images-only’ setting yielded markedly lower accuracy; for instance, top-1 accuracy on PAD-UFES-20 dropped significantly without dialogue. Conversely, the ‘Images + Dialogue’ setting consistently improved performance across all datasets (SCIN, PAD-UFES-20, PTB-XL and Clinical Documents).

fulltextpubmed· Results· item 42135531

To evaluate system robustness against variations in patient presentation, we used LLM-driven augmentation to introduce variations to the original patient scenarios across three axes: personality styles (affecting interaction dynamics), demographics (testing fairness) and semantic changes to background/symptoms (evaluating resilience to minor history variations). The results, detailed in Extended Data Fig. 3 (more details in Supplementary Section 3.5), demonstrate that the performance of multimodal AMIE across key auto-rater metrics—including diagnostic accuracy, information gathering, hallucination rate and management plan appropriateness—remained highly consistent between the original and augmented scenarios.

fulltextpubmed· Results· item 42135531

In addition to inference-time strategies, we also explored the potential of training-time adaptations through domain-specific supervised fine-tuning (SFT) to enhance model performance. We conducted experiments comparing our base model to a version fine-tuned on specialized medical dialogue and Q&A datasets. Although these experiments revealed that SFT could yield statistically significant improvements in diagnostic accuracy for highly specialized tasks such as ECG analysis, this came at the cost of a notable decrease in performance on other crucial aspects of the consultation, including management plan appropriateness (Extended Data Fig. 4). Additional details on these critical tradeoffs are presented in Supplementary Section 3.6.

fulltextpubmed· Comparison of multimodal AMIE and PCPs on OSCE assessment· item 42135531

In our remote OSCE-style evaluation of simulated consultations using multimodal text chat, the performance of multimodal AMIE was rated as similar or superior to that of PCPs across all assessed axes of comparison. Below, we report performance on an objective measure of diagnostic accuracy and subjective but blinded measures of patient experience and other key attributes of effective clinical consultations rated by specialist physicians. A detailed statistical breakdown of the consultation characteristics, including dialogue length and verbosity by participant type and modality, is provided in Supplementary Section 2.1.

fulltextpubmed· Diagnostic accuracy of multimodal AMIE and PCPs· item 42135531

Multimodal AMIE achieved higher diagnostic accuracy compared to PCPs in the OSCE-style simulated clinical consultations that both multimodal AMIE and PCPs performed with patient-actors using multimodal text chat. As illustrated in Fig. 2a, which plots the top-k diagnostic accuracy, the differential diagnosis (DDx) of multimodal AMIE was both more accurate and more comprehensive.

Fig. 2. Comparison of OSCE metrics between PCPs and multimodal AMIE. a, Top-k DDx accuracy. Both multimodal AMIE and PCPs have the opportunity to submit a DDx list (with at least three and up to 10 plausible items, ordered by likelihood) for each dialogue. The top-k accuracy was computed using an auto-rater that compares each of the up to 10 diagnoses with the ground truth. Multimodal AMIE outperformed PCPs in diagnostic accuracy on average (P < 0.001, n = 210 conversations). b, Subgroup analysis of DDx accuracy. Top-3 accuracy was compared across subgroups within three axes as defined in the MUH rubric: (1) image quality (n = 204 conversations; 115 high quality, 89 low quality), (2) image-grounded reasoning (n = 204 conversations; 143 ‘Yes’, 61 ‘No’) and (3) hallucination of image findings in consultation (n = 204 conversations; 135 ‘No hallucination’, 56 ‘Hallucination’, 13 ‘Significant hallucination’) (see Extended Data Table 1 for details). c, Distributions of relative performance across patient scenarios. The proportions of scenarios for which PCPs were rated more favorably than multimodal AMIE and vice versa are presented for three categories in the OSCE rubric: MUH rubric (left), non-multimodal specialist metrics (middle) and non-multimodal patient-centric metrics (right). Each specialist and patient-actor separately rated two dialogues (one with PCP and the other with multimodal AMIE) for the same scenario, and their relative ranking was derived from their scores. For the specialist metrics, each scenario was evaluated by three different expert physicians, yielding three pairwise rankings that were aggregated by taking the majority vote.
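
The two aggregation steps in the Fig. 2 caption (top-k DDx accuracy and the majority vote over three pairwise rankings) can be sketched as below. This is an illustration under stated assumptions, not the study's code: the auto-rater's match judgments are taken as given boolean inputs, the function names and vote labels are hypothetical, and falling back to a tie when no label has a strict majority is an assumption the caption does not address.

```python
from collections import Counter


def top_k_accuracy(match_flags: list[list[bool]], k: int) -> float:
    """Fraction of dialogues with a correct diagnosis among the first k DDx items.

    match_flags[i][j] is True when item j of dialogue i's ranked DDx list was
    judged to match the ground truth (that judgment is an input here).
    """
    if not match_flags:
        raise ValueError("need at least one dialogue")
    hits = sum(any(flags[:k]) for flags in match_flags)
    return hits / len(match_flags)


def majority_preference(votes: list[str]) -> str:
    """Aggregate pairwise rankings such as 'AMIE', 'PCP' or 'tie' by majority."""
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: without a strict majority, the scenario is reported as a tie.
    return label if count > len(votes) / 2 else "tie"
```

Because DDx lists may hold between three and 10 items, slicing with `flags[:k]` simply uses the whole list when it is shorter than k.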

fulltextpubmed· Conversation quality rated by patient-actors· item 42135531

We compared multimodal AMIE and PCPs on the patient-centric quality metrics that were collected upon conclusion of the consultations. Patient-actors rated the interactions with multimodal AMIE as similar to or higher than those with board-certified PCPs across the assessed dimensions in Fig. 2c (right) (see Extended Data Fig. 1 for full ratings). This included aspects such as being polite, listening, explaining conditions, involving patients in decisions, appearing honest/trustworthy, building rapport, showing empathy and managing patient concerns (General Medical Council Patient Questionnaire (GMCPQ) and Practical Assessment of Clinical Examination Skills (PACES) criteria, P < 0.01 for all axes). We also asked patient-actors two questions specifically about the nature of multimodal interaction (Extended Data Table 1). The two questions, rated from 1 to 5, asked how well the doctor addressed the patients’ questions about image artifacts and how well the doctor explained the findings from the artifacts. Figure 2c (left) shows that multimodal AMIE was rated more highly than clinicians on both questions (P < 0.01).

fulltextpubmed· Specialist evaluation of the performance and robustness of multimodal AMIE· item 42135531

A group of 18 specialists in dermatology, cardiology and general practice (internal medicine) evaluated multimodal AMIE and PCPs in terms of numerous attributes of effective diagnostic conversations, including multimodal reasoning, history-taking, diagnostic accuracy, management reasoning, communication skills and empathy. Aggregated results are presented in Fig. 2c (see Extended Data Fig. 2 for full ratings). To account for potential interdisciplinary variations, we also present results by specialty (Fig. 3). The conversations held by multimodal AMIE were consistently rated more highly by specialists overall and across three disciplines. This is reflected in all items in our assessment, across diagnosis and management and quality of history-taking. The positive assessments of the diagnostic dialogues that multimodal AMIE conducted were also supported by results in the MUH component of our assessment: specialists assigned higher ratings on average to multimodal AMIE’s interpretation of the multimodal artifacts, its reasoning about these artifacts and the way it handled patient questions and concerns about the multimodal artifacts.

fulltextpubmed· LLM-as-a-judge for automated evaluations· item 42135531

Beyond the OSCE study, this section details automated evaluations of multimodal AMIE using simulated dialogues and auto-rating (detailed in the ‘Auto-rating simulated diagnostic conversations’ section and Fig. 4) to quantify diagnostic performance across different system configurations.

Fig. 4. Overview of multimodal dialogue simulation and the evaluation framework. This figure illustrates the three key components of the system. Step 1 (patient scenario generation): Comprehensive patient profiles are created, including condition, demographics, symptoms and medical history, using information derived from web searches and real-world datasets—for example, PTB-XL (for cardiology) and SCIN (for dermatology). These profiles are then used to generate detailed patient scenarios, outlining the patient’s presentation and expectations. Step 2 (dialogue simulation): A doctor agent and a patient agent engage in a text-based consultation with multimodal artifact upload. The doctor agent is instructed to provide empathetic and clinically accurate responses while the patient agent responds truthfully based on the generated scenario. Step 3 (auto-rating): An auto-rater agent evaluates the simulated dialogue based on predefined criteria, including management appropriateness, information-gathering effectiveness and the presence of hallucinations. Qualitative feedback is also provided by the auto-rater to explain the scores. Mx, management.

fulltextpubmed· Ablation studies on state-aware reasoning· item 42135531

To further understand the contributions of different system components to overall performance, we conducted several ablation studies using our automated evaluation framework. This section details experiments comparing the full multimodal AMIE system against simpler baselines and evaluating the specific impact of test-time reasoning, history-taking and robustness to variations in patient presentation on conversation quality, as assessed by the auto-rater. One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach. Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· State-aware reasoning improves dialogue quality· item 42135531

One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach. Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· History-taking improves diagnostic accuracy· item 42135531

To understand the additional performance gained with history-taking versus analyzing multimodal artifacts alone, we conducted an ablation comparing two settings: (1) ‘Images-only’, where the base model diagnosed from images without dialogue, omitting history-taking, and (2) ‘Images + Dialogue’, where multimodal AMIE used images, full dialogues and its state-aware reasoning at inference. Results are shown in Fig. 5b. The ‘Images-only’ setting yielded markedly lower accuracy; for instance, top-1 accuracy on PAD-UFES-20 dropped significantly without dialogue. Conversely, the ‘Images + Dialogue’ setting consistently improved performance across all datasets (SCIN, PAD-UFES-20, PTB-XL and Clinical Documents).

fulltextpubmed· Robust diagnostic conversation with patient scenario augmentations· item 42135531

To evaluate system robustness against variations in patient presentation, we used LLM-driven augmentation to introduce variations to the original patient scenarios across three axes: personality styles (affecting interaction dynamics), demographics (testing fairness) and semantic changes to background/symptoms (evaluating resilience to minor history variations). The results, detailed in Extended Data Fig. 3 (more details in Supplementary Section 3.5), demonstrate that the performance of multimodal AMIE across key auto-rater metrics—including diagnostic accuracy, information gathering, hallucination rate and management plan appropriateness—remained highly consistent between the original and augmented scenarios.

fulltextpubmed· Tradeoffs of domain-specific supervised fine-tuning for base model· item 42135531

In addition to inference-time strategies, we also explored the potential of training-time adaptations through domain-specific supervised fine-tuning (SFT) to enhance model performance. We conducted experiments comparing our base model to a version fine-tuned on specialized medical dialogue and Q&A datasets. Although these experiments revealed that SFT could yield statistically significant improvements in diagnostic accuracy for highly specialized tasks such as ECG analysis, this came at the cost of a notable decrease in performance on other crucial aspects of the consultation, including management plan appropriateness (Extended Data Fig. 4). Additional details on these critical tradeoffs are presented in Supplementary Section 3.6.

fulltextpubmed· Discussion· item 42135531

In this work, as illustrated in Fig. 1, we sought to address a key unmet need for medical AI systems to be able to understand, reason about and appropriately use information from multimodal medical data in the course of diagnostic conversations with patients. To contextualize these capabilities, it is important to distinguish domain-specific systems such as multimodal AMIE from general-purpose multiagent frameworks such as AutoGen15 and Magentic-One16. Although those offer broad flexibility, AMIE is explicitly engineered for the safety-critical reasoning required in medical dialogue. This specialization addresses the field’s shift from static benchmarks toward dynamic, process-oriented evaluations of the complete patient journey (for example, MedR-Bench17 and Agent Hospital18). By combining artifact perception with dynamic reasoning in standard chat interfaces—already ubiquitous in global remote care19—AMIE bridges a critical gap toward real-world clinical deployment. Integrating multimodal perception into clinical dialogue, our study demonstrates that adding these reasoning capabilities into multimodal AMIE maintained the high standards of diagnostic accuracy and conversational quality previously established in text-only settings4. Across the multimodal OSCE evaluations, multimodal AMIE achieved diagnostic accuracy and core conversational metrics—such as empathy, clarity and history-taking—that were higher than those of PCPs, including in scenarios requiring interpretation of images or documents.

fulltextpubmed· Discussion· item 42135531

Although foundation models such as Gemini 2.0 provide multimodal capabilities (see ‘Validation of base model perception of medical artifacts’ section), a key challenge is effectively weaving artifact interpretation into the flow of clinical dialogue without degrading the interaction. Multimodal AMIE achieved this via our state-aware reasoning framework (Fig. 6 and ‘Multimodal state-aware reasoning’ section), enabling the system to handle multimodal data mid-interaction while preserving clinically sound, empathetic dialogue. This capability is essential for real-world telehealth where multimodal data exchange is routine13,20. The modalities tested reflect the expanding breadth of data present in telehealth; for instance, clinical photographs and reports are readily accessible to consumers. Additionally, ECGs are common in primary care21 and are increasingly available on consumer-grade wearable devices, which enable recording of both single-lead and, with guidance, multi-lead configurations22.

Fig. 6. Multimodal state-aware reasoning. Illustrated here is multimodal AMIE’s state-aware dialogue phase transition framework, which structures the diagnostic conversation. The system progresses through three distinct phases, each with a specific goal: (1) History-taking (gathering comprehensive patient information), (2) (Differential) Diagnosis & management (formulating and presenting DDx and management plan) and (3) Answer follow-up questions (addressing remaining concerns). Within each phase, multimodal AMIE maintains an internal state—its dynamic understanding of the patient’s situation, evolving diagnoses (DDx) and knowledge gaps—derived from the dialogue history and any multimodal inputs. This state guides specific actions, such as asking targeted questions, requesting multimodal data (if needed), generating internal summaries and differential diagnoses or providing explanations. Transitions between phases are triggered automatically when the system assesses that the objectives of the current phase (for example, sufficient information gathered and DDx presented) have been met, based on its internal state evaluation. This mechanism enforces a structured yet flexible dialogue flow, inspired by the methodical approach of experienced clinicians.
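
The three-phase flow in the Fig. 6 caption can be pictured as a small state machine. The sketch below is only an illustration of that idea: the field names, the `next_phase` function and the concrete transition criteria are hypothetical, and in the described system the internal state and the decision to transition are produced by the model's own assessment rather than by fixed rules like these.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS_AND_MANAGEMENT = auto()
    FOLLOW_UP = auto()


@dataclass
class DialogueState:
    """Hypothetical stand-in for the internal state described in the caption."""
    phase: Phase = Phase.HISTORY_TAKING
    knowledge_gaps: list[str] = field(default_factory=list)  # open questions
    ddx: list[str] = field(default_factory=list)  # evolving differential
    ddx_presented: bool = False


def next_phase(state: DialogueState) -> Phase:
    """Advance only when the current phase's objective appears to be met."""
    if state.phase is Phase.HISTORY_TAKING:
        # Assumed criterion: no open gaps and at least one candidate diagnosis.
        if not state.knowledge_gaps and state.ddx:
            return Phase.DIAGNOSIS_AND_MANAGEMENT
    elif state.phase is Phase.DIAGNOSIS_AND_MANAGEMENT:
        # Assumed criterion: the DDx and plan have been presented to the patient.
        if state.ddx_presented:
            return Phase.FOLLOW_UP
    return state.phase  # otherwise stay put; FOLLOW_UP is terminal
```

A dialogue loop would update the state after each patient turn (for example, removing a gap once it has been answered) and then call `next_phase` to decide whether to keep asking questions or move on.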

fulltextpubmed· Discussion· item 42135531

Notably, multimodal AMIE demonstrated proficiency in reasoning about patient-provided multimodal artifacts, as indicated by specialist ratings. Assessed blinded using our MUH rubric, specialist physicians gave multimodal AMIE higher scores compared to PCPs on axes encompassing artifact engagement, interpretation and artifact-grounded reasoning.

fulltextpubmed· Discussion· item 42135531

Moreover, patient-actors rated multimodal AMIE significantly higher than PCPs in addressing questions regarding artifacts and explaining findings derived from visual evidence. Although caution is warranted when generalizing to broader remote care (given that clinician performance might be constrained by the text chat modality versus video), multimodal AMIE was rated higher in these communicative tasks. This suggests that patients valued multimodal AMIE’s systematic approach to verbalizing visual findings, potentially more than the implicit assessments offered by clinicians. In one such case, after the patient sends images, multimodal AMIE summarizes the images and relates findings to the conversation history. By contrast, the PCP sometimes did not directly address the shared artifacts (see Supplementary Information for dialogue examples). Explicitly addressing images, as multimodal AMIE did, may provide reassurance that the images were transmitted correctly and that the doctor reached a conclusion based on specific evidence. We regard this explicit acknowledgment as vital for multimodal interaction, as patients expect explanations referring to the visual evidence provided.

fulltextpubmed· Discussion· item 42135531

The scenarios were designed to ensure that successful diagnosis required appropriate use of multimodal information. Subgroup analyses supported the validity of this design. Both multimodal AMIE and PCP diagnostic accuracy dropped in cases where image quality was deemed inferior; if performance remained high irrespective of image quality, it would suggest that the consultation did not depend on the visual upload. Instead, the parallel drop in performance confirms that multimodal inputs were critical to the diagnostic process, validating these scenarios as effective tests of multimodal reasoning. Further support was seen in how diagnostic accuracy improved for both multimodal AMIE and PCPs when they were rated as having appropriately used visual information. These findings lend credence to multimodal AMIE’s potential robustness to lower-quality images and ability to overcome intermediate misinterpretations.

fulltextpubmed· Discussion· item 42135531

Ensuring robustness, reliability and equity is essential for responsible diagnostic AI, although achieving this requires extensive further study. Our study includes extensive dialogue-level error analysis through our rubrics (for example, artifact-grounded reasoning, hallucination checks and DDx appropriateness), but we acknowledge that the lack of a large-scale, fine-grained analysis of the model’s internal reasoning states is a limitation of our present work. Our evaluations also undertook initial steps to explore robustness. Automated evaluations using LLM-driven augmentations showed that the core metrics of multimodal AMIE remained largely stable with respect to variations in simulated patient personality, demographics and minor semantic changes. Specialist evaluations further examined factors influencing accuracy. Notably, multimodal AMIE demonstrated greater robustness in diagnostic accuracy against specialist-rated image quality than PCPs. Although these preliminary findings regarding stability are encouraging, further research is necessary to ensure reliable performance across diverse populations. Finally, we acknowledge that although the clinical documents and in-house generated ECG trace images were private, the skin images and underlying raw ECG signals were publicly available and may have been encountered during pretraining. We sought to mitigate this risk by integrating these artifacts into unique, synthetic patient scenarios; notably, our perception test results suggest that the influence of data leakage on performance is limited. Nevertheless, validating on fully private or prospectively collected datasets is one of the key directions for future research.

fulltextpubmed· Discussion· item 42135531

Considering the role of model specialization, adapting foundation models like Gemini [23] for medicine involves strategies ranging from training-time adaptations such as supervised fine-tuning (SFT) [24] to inference-time approaches [25]. Training-time specialization can boost performance on specific tasks (for example, ECG analysis) that are underrepresented in pretraining [26]. However, this requires substantial resources and carries the risk of catastrophic forgetting, potentially degrading general conversational competence. To investigate this, we compared fine-tuned Gemini 2.0 Flash with the original model. Although experiments displayed potential gains on specific tasks, they highlighted performance degradation on other aspects, such as management plan appropriateness.

fulltextpubmed· Discussion· item 42135531

Consequently, we focused on leveraging a strong, general-purpose base model enhanced with a domain-specific inference-time strategy. We selected Gemini 2.0 Flash for its higher out-of-the-box performance in multimodal understanding and conversational fluency compared to other variants. This strong baseline provided a robust foundation for the state-aware reasoning framework (Fig. 6). Although training-time specialization remains an area for future investigation, our findings highlight the effectiveness of inference-time enhancements, a strategy also proven successful for management reasoning in related work [5]. This state-based approach builds on human-centered research showing that user–clinician conversations unfold in phases [8]. This aligns with clinician feedback preferring thorough information gathering to avoid premature assessments. However, a limitation of this structured approach is potential rigidity when critical information (for example, allergy and contraindication) emerges late in the dialogue, suggesting a need for more fluid state transitions.

fulltextpubmed· Discussion· item 42135531

Regarding the limitations of chat-based interactions, our evaluation used a synchronous chat interface, mirroring ubiquitous instant messaging often used for medical conversation [12,19]. Although accessible, this modality presents limitations compared to video or in-person visits. Chat restricts non-verbal cues, limits dynamic visual assessments and precludes physical examinations, all of which provide crucial diagnostic information. Furthermore, video calls may foster stronger rapport and trust [27,28]. Consequently, the scope of conditions and depth of assessment in this format are constrained. Additionally, the potential for unblinding due to systematic stylistic differences between multimodal AMIE and PCPs poses a challenge for comparative studies. These limitations must be considered when interpreting findings.

fulltextpubmed· Discussion· item 42135531

Regarding real-world validation and future directions, although our findings demonstrate the potential of multimodal AMIE in simulated consultations, considerable research is essential before clinical translation. Our study is not a randomized clinical trial but, rather, represents an exploratory investigation into the properties of multimodal diagnostic dialogue. Before real-world clinical deployment, it will be important in future work to conduct a full randomized clinical trial, conforming with full CONSORT criteria, including prespecified endpoints and preregistered statistical analysis conducted by an independent contract research organization. Future studies must rigorously evaluate the performance, safety and reliability of multimodal AMIE under the complexities of actual healthcare delivery, including its impact on workflows and health equity. Furthermore, we acknowledge that comprehensive clinical care often relies on multimodal clinical data beyond those currently covered in this work (for example, ECGs, skin images and scanned clinical documents). Extending this framework to support additional modalities, such as radiology or pathology imaging, is a valuable direction for future research to broaden the system’s diagnostic scope. We emphasize that multimodal AMIE remains an evolving research system.

fulltextpubmed· Discussion· item 42135531

In conclusion, this work advances AMIE to dynamically integrate multimodal reasoning within diagnostic conversations. Using a novel state-aware inference-time reasoning technique built on Gemini 2.0 Flash, multimodal AMIE matched or surpassed PCPs in the OSCE study across diagnostic accuracy and consultation quality, as rated by specialists and patient-actors. Although further research is essential before real-world translation, these findings demonstrate progress toward AI systems that are capable of comprehensive clinical interactions.

fulltextpubmed· Methods· item 42135531

This work advances the diagnostic capabilities of the multimodal AMIE system [4,5] by integrating multimodal perception, enabling it to conduct diagnostic conversations that are more clinically realistic and that incorporate various forms of medical data beyond text. Our approach centers on a state-aware dialogue phase transition framework, which leverages the multimodal reasoning capabilities of Gemini 2.0 Flash [29]. This framework guides multimodal AMIE through structured phases of history-taking, diagnosis and management and follow-up, dynamically adapting the conversation based on intermediate model outputs that reflect the evolving patient state and diagnostic hypotheses. To rigorously evaluate the system’s performance during model development and in comparison to human clinicians, we employed a two-pronged evaluation strategy. First, we developed an automated evaluation pipeline involving perception tests on isolated medical artifacts and simulated dialogues assessed by an auto-rater across key clinical dimensions, such as diagnostic accuracy and information gathering. Second, we conducted an expert evaluation using an OSCE-style methodology to assess the capabilities of multimodal AMIE in realistic, simulated patient encounters involving multimodal data.

fulltextpubmed· Methods· item 42135531

Real clinical diagnostic dialogues follow a structured yet flexible path. Clinicians methodically gather information, form potential diagnoses, strategically request and interpret further details (including multimodal data such as skin photographs or ECGs), continually update their assessment based on new evidence and eventually formulate a management plan. This process requires a clinician to adapt the line of questioning based on evolving hypotheses and uncertainty while ensuring that all critical information is considered [30].

fulltextpubmed· Methods· item 42135531

Given the rapid advancements in LLM capabilities and their increasing proficiency in following complex instructions, one might achieve considerable progress toward emulating this process using a sophisticated system prompt alone. However, we hypothesize that, for a safety-critical and highly dynamic task like multimodal diagnosis, building an explicit state-aware reasoning system layered on top of the LLM offers critical advantages. Such a system provides greater control over the dialogue flow, enables more reliable tracking of the diagnostic state and uncertainty, facilitates more deliberate integration of multimodal inputs and ultimately leads to higher-quality, more dependable clinical reasoning than relying solely on complex prompting (validated in the ‘Automated evaluations of multimodal AMIE configurations’ section).

fulltextpubmed· Methods· item 42135531

Therefore, the multimodal AMIE system implements this state-aware dialogue phase transition framework to manage the diagnostic process. This framework dynamically controls the progression of multimodal AMIE through three distinct phases: (1) history-taking, (2) diagnosis and management and (3) follow-up. Transitions between phases, and actions within each phase, are driven by intermediate model outputs representing the evolving patient state and diagnostic hypotheses (Fig. 6). Crucially, each phase builds upon the accumulated dialogue history, which contextually incorporates various forms of patient data, including text, images (such as skin photographs or ECG tracings) and clinical documents (such as laboratory reports or prior consultation notes). This state-aware approach allows multimodal AMIE to emulate the structured yet adaptive reasoning process of clinicians (see examples in Supplementary Figs. 5 and 6).
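The three-phase progression described above can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Phase`, `DialogueState` and `next_phase` are hypothetical, and the two boolean flags stand in for the intermediate model outputs that the paper says drive transitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = 1
    DIAGNOSIS_AND_MANAGEMENT = 2
    FOLLOW_UP = 3


@dataclass
class DialogueState:
    """Hypothetical container for the intermediate outputs that drive transitions."""
    phase: Phase = Phase.HISTORY_TAKING
    history: list = field(default_factory=list)  # text turns and artifact references
    sufficient_information: bool = False          # assumed output of a decision module
    ddx_and_plan_presented: bool = False          # assumed marker for end of phase 2


def next_phase(state: DialogueState) -> Phase:
    """Advance only when the current phase's objective is marked as met."""
    if state.phase is Phase.HISTORY_TAKING and state.sufficient_information:
        return Phase.DIAGNOSIS_AND_MANAGEMENT
    if state.phase is Phase.DIAGNOSIS_AND_MANAGEMENT and state.ddx_and_plan_presented:
        return Phase.FOLLOW_UP
    return state.phase
```

In this sketch the phases only move forward, which mirrors the rigidity the Discussion notes when critical information emerges late; a more fluid design would also allow a return to history-taking.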

fulltextpubmed· Methods· item 42135531

Patient profile initialization: A structured patient profile is initialized and dynamically updated throughout the interaction. This profile acts as a condensed, evolving record of known patient information, encompassing the chief complaint; history of present illness; demographics (age, sex and race); positive and negative symptoms; past medical, family and social/travel histories; medications; other relevant details; and a prioritized list of knowledge gaps. Initially, this profile may contain minimal information.
Evolving DDx generation: An internal, evolving DDx is generated. This DDx is not initially presented to the patient. DDx generation begins after an initial interaction period, allowing baseline information collection. The frequency of DDx updates is configurable.
Continuation decision: A key decision point is whether to continue gathering history or transition to presenting a diagnosis. This decision uses a set of criteria, including assessing if sufficient information exists to formulate a reasonable differential diagnosis. A decision module, querying Gemini 2.0 Flash, determines if current information is sufficient to proceed or if more targeted questions are needed. This module considers the dialogue history and preliminary DDx.

fulltextpubmed· Methods· item 42135531

Targeted question and multimodal data request generation: If history-taking continues, the system generates focused questions to address information gaps identified in the patient profile and DDx uncertainty. It is important to note that this internal measure of uncertainty, derived from the model’s iterative DDx generation, is used as a pragmatic heuristic to guide the history-taking dialogue toward more relevant lines of questioning. It is not presented to the user nor is it a formally calibrated diagnostic probability intended for direct clinical interpretation. Crucially, the system is designed to recognize when multimodal data are necessary and to strategically request them. For instance, based on reported symptoms such as a rash, multimodal AMIE will prompt the user to upload skin photographs. Its reasoning extends to requesting additional views if needed (for example, ‘Could you please provide a photo from a different angle or in better lighting?’ or ‘Do you have photos showing how the rash looked previously?’). Similarly, for reported cardiac symptoms, it might request an ECG tracing if available. Upon receiving an artifact, multimodal AMIE generates a detailed description of it, following instructions that specify which key aspects to examine for salient findings (for example, for skin: lesion morphology, distribution and color; for ECGs: heart rate, rhythm, key waveforms and intervals). This explicit prompting for descriptive details ensures that the model extracts salient features from the artifact to inform the ongoing conversation and diagnostic reasoning. The generated question or request is presented to the patient.

fulltextpubmed· Methods· item 42135531

Iterative refinement: The process is iterative. New information from patient responses and data uploads is incorporated into the dialogue history, updating the patient profile and internal DDx. The patient summary is periodically refreshed to reflect the current understanding.
Once the continuation decision indicates readiness, the system transitions to phase 2.
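The history-taking loop described above (profile, internal DDx, continuation decision, targeted question, refinement) can be sketched as follows. This is an illustrative sketch only: `PatientProfile`, `history_taking` and every callable parameter are hypothetical stand-ins, the profile fields are a subset of those listed in the text, and the model-backed steps (profile updates, DDx generation and the sufficiency decision) are injected as plain functions rather than real calls to any LLM API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PatientProfile:
    """Condensed record of known information (field names are illustrative)."""
    chief_complaint: str = ""
    positive_symptoms: List[str] = field(default_factory=list)
    negative_symptoms: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    knowledge_gaps: List[str] = field(default_factory=list)  # most urgent first


def history_taking(
    profile: PatientProfile,
    ask_patient: Callable[[str], str],
    update_profile: Callable[[PatientProfile, str, str], None],
    generate_ddx: Callable[[PatientProfile, list], List[str]],
    is_sufficient: Callable[[list, List[str]], bool],
    warmup_turns: int = 2,
    ddx_every: int = 1,
    max_turns: int = 20,
) -> Tuple[list, List[str]]:
    """Run phase 1 until the decision module reports readiness for phase 2."""
    history: list = []
    ddx: List[str] = []
    for turn in range(1, max_turns + 1):
        # Target the highest-priority knowledge gap, if any remain.
        if profile.knowledge_gaps:
            question = f"Could you tell me more about: {profile.knowledge_gaps[0]}?"
        else:
            question = "Is there anything else you would like to mention?"
        answer = ask_patient(question)
        history.append((question, answer))
        update_profile(profile, question, answer)
        # The internal DDx starts after a warm-up period, at a configurable rate.
        if turn >= warmup_turns and (turn - warmup_turns) % ddx_every == 0:
            ddx = generate_ddx(profile, history)
        # Continuation decision: stop once the (stubbed) module says enough is known.
        if ddx and is_sufficient(history, ddx):
            break
    return history, ddx
```

The `warmup_turns` and `ddx_every` parameters correspond to the statements that DDx generation begins after an initial interaction period and that its update frequency is configurable; the default values and the `max_turns` cap are arbitrary assumptions.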

fulltextpubmed· Methods· item 42135531

Patient profile (optional update): The patient profile may be further refined.
DDx validation (subphase): The system enters a DDx validation phase. Focused questions are generated to gather specific evidence that supports or refutes potential diagnoses within the internal DDx. A decision module determines when the DDx is sufficiently validated to present to the patient.
DDx presentation: The system presents a ranked DDx (5–10 conditions). Crucially, the explanation for each diagnosis is grounded in evidence from the entire interaction, explicitly referencing and explaining findings from the provided multimodal data (for example, ‘Based on the photo you sent, the circular shape and central clearing of the rash are characteristic of…,’ or ‘The ECG shows specific changes in the ST segment which support the possibility of…’).
Management plan formulation: After presenting the DDx, the system formulates a management plan based on dialogue history, patient profile and the presented DDx. The plan includes recommendations for investigations, testing and/or treatment.
Management plan delivery: The system delivers the management plan, potentially iteratively across several turns, allowing for clarification and addressing patient concerns.
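As one way to picture the DDx presentation step, the sketch below formats a ranked list and attaches artifact-grounded evidence to each entry. It is a hypothetical illustration: `DdxEntry` and `present_ddx` are invented names, and the 5 to 10 bound is taken from the text while everything else is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DdxEntry:
    condition: str
    rationale: str
    artifact_evidence: Tuple[str, ...] = ()  # findings taken from uploaded artifacts


def present_ddx(entries: List[DdxEntry], min_items: int = 5, max_items: int = 10) -> str:
    """Render a ranked DDx, most probable first, citing artifact findings if present."""
    if not min_items <= len(entries) <= max_items:
        raise ValueError(f"expected {min_items} to {max_items} conditions, got {len(entries)}")
    lines = []
    for rank, entry in enumerate(entries, start=1):
        line = f"{rank}. {entry.condition}: {entry.rationale}"
        if entry.artifact_evidence:
            line += " (evidence: " + "; ".join(entry.artifact_evidence) + ")"
        lines.append(line)
    return "\n".join(lines)
```

Keeping the artifact findings as a separate field makes it straightforward to check that an explanation actually refers to the uploaded photo or ECG rather than only to the text history.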

fulltextpubmed· Methods· item 42135531

The presentation of the refined DDx and management plan signals the transition to phase 3.
Patient profile (optional update): The patient profile may continue to be updated with information from follow-up questions.
Plan communication and question answering: The system addresses remaining patient questions, ensuring that the patient understands the proposed management plan, potentially referencing the multimodal artifacts again to clarify points. Responses are guided by the dialogue history, patient profile, presented DDx and management plan.

fulltextpubmed· Methods· item 42135531

Dialogue termination: Dialogue continues until patient questions are addressed, the management plan is communicated and a natural conclusion is reached.
Upon reaching this termination state, the system proceeds to synthesize the interaction into a structured summary.

fulltextpubmed· Methods· item 42135531

Once the dialogue reaches a natural conclusion (phase 3 completion), the system automatically generates a structured post-questionnaire. This process leverages the full multimodal dialogue history and the final internal state (patient profile, final DDx and management plan) to produce a comprehensive summary suitable for clinical review. Specifically, multimodal AMIE is prompted to:

fulltextpubmed· Methods· item 42135531

Finalize DDx: Based on the entire interaction, including all textual exchanges and interpretations of multimodal artifacts, generate a final, ranked DDx listing the most probable condition and several plausible alternatives. Formulate the management plan: Leveraging both the established DDx and a constrained web search process for grounding in current medical knowledge and guidelines, detail the recommended management plan. This includes proposed in-visit and ordered investigations, specific actions or recommendations for the patient and necessary escalation level (for example, video call or in-person visit) with justification and follow-up requirements (necessity, timeframe and reason). This retrieval-based approach is used exclusively during this offline step to ensure that recommendations are informed by up-to-date information. The specific methodology is detailed in Supplementary Section 5. Extract salient artifact findings: Identify and list the key clinical findings observed in any provided images (skin photographs, ECGs and documents) that were relevant to the diagnostic and management reasoning.

fulltextpubmed· Methods· item 42135531

Formulate the management plan: Leveraging both the established DDx and a constrained web search process for grounding in current medical knowledge and guidelines, detail the recommended management plan. This includes proposed in-visit and ordered investigations, specific actions or recommendations for the patient and necessary escalation level (for example, video call or in-person visit) with justification and follow-up requirements (necessity, timeframe and reason). This retrieval-based approach is used exclusively during this offline step to ensure that recommendations are informed by up-to-date information. The specific methodology is detailed in Supplementary Section 5. Extract salient artifact findings: Identify and list the key clinical findings observed in any provided images (skin photographs, ECGs and documents) that were relevant to the diagnostic and management reasoning. The structured post-questionnaire serves as a standardized record of multimodal AMIE’s clinical assessment, reasoning and recommendations based on the completed multimodal consultation; it is used both as an artifact rated by clinicians in the OSCE evaluation and as the standardized output scored by the auto-rater in simulated dialogues.

fulltextpubmed· Methods· item 42135531

-questionnaire serves as a standardized record of multimodal AMIE’s clinical assessment, reasoning and recommendations based on the completed multimodal consultation; it is used both as an artifact rated by clinicians in the OSCE evaluation and as the standardized output scored by the auto-rater in simulated dialogues. We established an automated evaluation framework to support rapid iteration and rigorous assessment of the multimodal AMIE system. This included evaluating perception capabilities on isolated medical artifacts and using a simulation environment with auto-raters to assess complete multimodal diagnostic dialogues, perform component ablations and guide model selection. A critical precursor to effective multimodal diagnostic conversation is ensuring that our underlying models possess a fundamental capability: accurate perception of diverse medical artifacts. Although LLMs have demonstrated remarkable progress in understanding and generating text, their ability to reliably ‘see’ and interpret medical images and documents (akin to a clinician’s initial visual assessment) remains less explored in the context of conversational DDx. Without robust perceptual grounding, even the most sophisticated conversational framework would be limited in its ability to meaningfully integrate multimodal data into history-taking and diagnostic reasoning.

fulltextpubmed· Methods· item 42135531

s— akin to a clinician’s initial visual assessment—remains less explored in the context of conversational DDx. Without robust perceptual grounding, even the most sophisticated conversational framework would be limited in its ability to meaningfully integrate multimodal data into history-taking and diagnostic reasoning. Therefore, we first conducted a suite of perception tests designed to isolate and evaluate the visual understanding of our base models when presented with common medical artifacts: smartphone-captured skin images, ECG tracings and clinical documents. The objective was not to achieve state-of-the-art performance on complex diagnostic tasks per se but, rather, to establish a baseline confidence in whether our models could reliably discern key visual features and clinical information from these modalities in isolation. For skin images, we used the SCIN dataset31 and PAD-UFES-20 (ref. 32), assessing the ability of the model to describe lesion morphology, color and distribution. For ECGs, we employed PTB-XL33 and ECG-QA benchmark34, prompting the model to provide probable diagnoses or answer expert-validated questions based on ECG images generated from raw signals. Finally, for clinical documents, we created the ClinicalDoc-QA dataset that consists of question-answering tasks based on a large collection of deidentified clinical notes and patient records generated by physicians to evaluate the model’s comprehension of medical information from clinical documents. Additional dataset details can be found in Supplementary Section 3.1.

fulltextpubmed· Methods· item 42135531

nicalDoc-QA dataset that consists of question-answering tasks based on a large collection of deidentified clinical notes and patient records generated by physicians to evaluate the model’s comprehension of medical information from clinical documents. Additional dataset details can be found in Supplementary Section 3.1. To ensure accurate processing of artifacts, we verified the perceptual capabilities of our base model, Gemini 2.0 Flash. Our objective was not to achieve state-of-the-art performance on isolated benchmarks but, rather, to confirm a sufficient perceptual foundation for our state-aware reasoning framework. Overall, this verification (detailed in Extended Data Fig. 5 and Supplementary Section 3.1.2) demonstrated that Gemini 2.0 Flash possessed the necessary foundational perceptual grounding across our target modalities. This provided the confidence required to build multimodal AMIE’s advanced reasoning and dialogue functions upon this base model. This is quantitatively validated by our ablation studies (Fig. 5) and ultimately demonstrated by the superior diagnostic accuracy of multimodal AMIE in the human-evaluated OSCE study (Fig. 2). The outcomes of these perception tests, presented in the ‘Automated evaluations of multimodal AMIE configurations’ section and Supplementary Section 3.1.2, provide an essential confidence check on our base LLM, Gemini 2.0 Flash, and also enable us to explore critical questions such as the impact of history-taking in addition to multimodal perception.

fulltextpubmed· Methods· item 42135531

tests, presented in the ‘Automated evaluations of multimodal AMIE configurations’ section and Supplementary Section 3.1.2, provide an essential confidence check on our base LLM, Gemini 2.0 Flash, and also enable us to explore critical questions such as the impact of history-taking in addition to multimodal perception. The following subsections detail our three-step approach illustrated in Fig. 4 to creating a simulation environment for auto-evaluation. Step (1) encompasses patient profile and scenario generation, which is then used in step (2) for the turn-by-turn multimodal dialogue generation process used to mimic real-world telemedicine interactions, and, finally, the simulated dialogue is used in step (3) by the auto-rater for automated assessment of multimodal AMIE’s conversational abilities.

fulltextpubmed· Methods· item 42135531

ofile and scenario generation, which is then used in step (2) for the turn-by-turn multimodal dialogue generation process used to mimic real-world telemedicine interactions, and, finally, the simulated dialogue is used in step (3) by the auto-rater for automated assessment of multimodal AMIE’s conversational abilities. Generating realistic synthetic dialogues requires detailed patient profiles and corresponding clinical scenarios. Our methodology involves compiling comprehensive patient metadata, including symptoms, demographics and medical history, tailored to specific medical domains. Dermatology: We used the SCIN and PAD-UFES-20 datasets, which provide rich, preexisting patient attributes (for example, age, sex, symptoms, skin tone and medical and social history), requiring no synthetic imputation. Cardiology: We started with the PTB-XL ECG dataset. As it includes only age and sex, we employed Gemini 2.0 Flash, using its web search tool, to impute clinically plausible symptoms, social/family/medical history and cardiovascular risk factors associated with the ECG’s indicated condition. This grounded the synthetic profiles in the model’s medical knowledge and real-world examples from the web. Clinical documents: For this domain, we collaborated with the same external clinical partners who developed our OSCE scenarios. They created realistic patient profiles, document artifacts and corresponding metadata specifically for our simulation needs.

fulltextpubmed· Methods· item 42135531

Dermatology: We used the SCIN and PAD-UFES-20 datasets, which provide rich, preexisting patient attributes (for example, age, sex, symptoms, skin tone and medical and social history), requiring no synthetic imputation. Cardiology: We started with the PTB-XL ECG dataset. As it includes only age and sex, we employed Gemini 2.0 Flash, using its web search tool, to impute clinically plausible symptoms, social/family/medical history and cardiovascular risk factors associated with the ECG’s indicated condition. This grounded the synthetic profiles in the model’s medical knowledge and real-world examples from the web. Clinical documents: For this domain, we collaborated with the same external clinical partners who developed our OSCE scenarios. They created realistic patient profiles, document artifacts and corresponding metadata specifically for our simulation needs.

fulltextpubmed· Methods· item 42135531

Cardiology: We started with the PTB-XL ECG dataset. As it includes only age and sex, we employed Gemini 2.0 Flash, using its web search tool, to impute clinically plausible symptoms, social/family/medical history and cardiovascular risk factors associated with the ECG’s indicated condition. This grounded the synthetic profiles in the model’s medical knowledge and real-world examples from the web. Clinical documents: For this domain, we collaborated with the same external clinical partners who developed our OSCE scenarios. They created realistic patient profiles, document artifacts and corresponding metadata specifically for our simulation needs. Once comprehensive metadata were established for all domains, we used Gemini 2.0 Flash to generate detailed clinical scenarios. These scenarios provide context for the patient agent in the simulation, outlining the initial presentation, details to share only upon questioning, patient expectations and concerns, desired outcomes and potential questions for the doctor. We guided Gemini using few-shot examples, ensuring scenario diversity (for example, varied ethnicities and occupations) and clinical realism. Although the raw images are from public datasets, we augmented them with unique, synthetically generated scenarios. We designed this approach to mitigate potential data leakage by ensuring that the model encounters these images within novel clinical contexts. Crucially, the scenarios deliberately omit the final diagnosis, requiring the AI agent to deduce it through the simulated conversation and analysis of any provided multimodal artifacts.

fulltextpubmed· Methods· item 42135531

ed this approach to mitigate potential data leakage by ensuring that the model encounters these images within novel clinical contexts. Crucially, the scenarios deliberately omit the final diagnosis, requiring the AI agent to deduce it through the simulated conversation and analysis of any provided multimodal artifacts. We simulate dialogues turn by turn between a doctor agent (representing multimodal AMIE) and a patient agent, mimicking a telemedical interaction. The doctor agent (multimodal AMIE) is instructed to be empathetic and clinically accurate. It uses the state-aware dialogue phase transition framework (Fig. 6) to navigate history-taking, diagnosis and management and follow-up phases. A key capability is its multimodal nature: it can strategically request and analyze relevant medical artifacts (such as skin images or ECGs) based on patient-reported symptoms during the conversation. The patient agent simulates a patient strictly following a predefined scenario from step 1, which details their profile and clinical scenario. It is prompted to respond truthfully using concise, casual language appropriate for an online consultation while adhering to specific rules regarding politeness and pacing the release of information to avoid overwhelming the doctor agent.

fulltextpubmed· Methods· item 42135531

edefined scenario from step 1, which details their profile and clinical scenario. It is prompted to respond truthfully using concise, casual language appropriate for an online consultation while adhering to specific rules regarding politeness and pacing the release of information to avoid overwhelming the doctor agent. In each dialogue turn, the doctor agent generates a question or statement based on the conversation history and its internal state. The patient agent then formulates a response according to its scenario instructions. If the scenario dictates and the doctor agent requests it, the patient agent can upload a relevant medical artifact (for example, an image). Multimodal AMIE then analyzes this artifact, integrating the findings into its ongoing diagnostic reasoning and subsequent dialogue turns. This turn-by-turn exchange continues until the conversation reaches a natural conclusion (for example, patient questions are resolved) or hits a predefined maximum turn limit. After the exchange is concluded, the doctor agent generates the structured post-questionnaire. The output is a complete simulated dialogue transcript, capturing the dynamic exchange of text and multimodal data. Auto-rating (automated evaluation) is crucial for the iterative development and safety assessment of conversational models. Although human evaluation is valuable, it is often constrained by cost, time and scalability. Auto-raters provide a mechanism for rapid, scalable and consistent performance assessment across essential characteristics.
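The turn-by-turn exchange described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `Turn`, `simulate_dialogue`, the scripted stand-in agents and the default `max_turns=20` are all hypothetical names and values, and the real doctor and patient agents are LLM-driven rather than scripted. The post-questionnaire generation step that follows the exchange is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    speaker: str                    # "doctor" or "patient"
    text: str
    artifact: Optional[str] = None  # e.g. file name of an uploaded image


# Hypothetical agent interfaces (assumptions, not from the paper):
# doctor_step(transcript)  -> (utterance, requests_artifact, conversation_done)
# patient_step(transcript, artifact_requested) -> (utterance, artifact or None)
DoctorStep = Callable[[List[Turn]], Tuple[str, bool, bool]]
PatientStep = Callable[[List[Turn], bool], Tuple[str, Optional[str]]]


def simulate_dialogue(doctor_step: DoctorStep,
                      patient_step: PatientStep,
                      max_turns: int = 20) -> List[Turn]:
    """Alternate doctor and patient turns until the doctor signals a natural
    conclusion or the maximum number of turn pairs is reached."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        doctor_text, requests_artifact, done = doctor_step(transcript)
        transcript.append(Turn("doctor", doctor_text))
        if done:
            break
        patient_text, artifact = patient_step(transcript, requests_artifact)
        transcript.append(Turn("patient", patient_text, artifact))
    return transcript


def scripted_doctor(transcript: List[Turn]) -> Tuple[str, bool, bool]:
    """Scripted stand-in for the doctor agent: ask, request an image, close."""
    script = [
        ("What brings you in today?", False, False),
        ("Could you upload a photo of the rash?", True, False),
        ("Thanks, that answers my questions. Take care.", False, True),
    ]
    doctor_turns = sum(t.speaker == "doctor" for t in transcript)
    return script[min(doctor_turns, len(script) - 1)]


def scripted_patient(transcript: List[Turn],
                     artifact_requested: bool) -> Tuple[str, Optional[str]]:
    """Scripted stand-in for the patient agent: upload only when asked."""
    if artifact_requested:
        return "Sure, here it is.", "rash.jpg"
    return "I have an itchy rash on my arm.", None


transcript = simulate_dialogue(scripted_doctor, scripted_patient)
```

Passing a small `max_turns` value exercises the second stopping condition, the predefined turn limit, without the doctor agent ever signalling completion.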

fulltextpubmed· Methods· item 42135531

d evaluation) is crucial for the iterative development and safety assessment of conversational models. Although human evaluation is valuable, it is often constrained by cost, time and scalability. Auto-raters provide a mechanism for rapid, scalable and consistent performance assessment across essential characteristics. We employ an auto-rater based on Gemini 2.0 Flash to evaluate the simulated dialogues. The auto-rater, which is given access to the ground truth condition, assesses various aspects, from quantitative metrics such as diagnostic accuracy to qualitative dimensions such as information gathering and safety. The specific criteria and scoring methods used by our auto-rater are detailed below (Extended Data Table 3): DDx accuracy: Measures if the patient’s ground truth condition is present within the top-1, top-3 or top-10 diagnoses listed by multimodal AMIE’s generated post-questionnaire. Evaluation accounts for synonyms and uses binary scoring (1 if present in top-n, 0 otherwise), averaged across dialogues. Gathering information: Assesses the effectiveness of the model in eliciting information, including its use of open-ended questions, active listening, addressing patient concerns and summarization. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent). Management plan appropriateness: Evaluates the suitability of the proposed management plan (including recommended actions) relative to best medical practices and the patient’s specific situation. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent). Hallucination: Detects instances of fabricated information stated by the model (for example, mentioning details not provided by the patient, making incorrect assumptions and claiming access to non-existent data or medications). This safety-critical metric excludes assessments of the diagnostic or management reasoning itself and is evaluated with a binary score (1 = hallucination present, 0 = absent).

fulltextpubmed· Methods· item 42135531

DDx accuracy: Measures if the patient’s ground truth condition is present within the top-1, top-3 or top-10 diagnoses listed by multimodal AMIE’s generated post-questionnaire. Evaluation accounts for synonyms and uses binary scoring (1 if present in top-n, 0 otherwise), averaged across dialogues. Gathering information: Assesses the effectiveness of the model in eliciting information, including its use of open-ended questions, active listening, addressing patient concerns and summarization. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent). Management plan appropriateness: Evaluates the suitability of the proposed management plan (including recommended actions) relative to best medical practices and the patient’s specific situation. Rated on a five-point Likert scale (from 1 = very poor to 5 = excellent). Hallucination: Detects instances of fabricated information stated by the model (for example, mentioning details not provided by the patient, making incorrect assumptions and claiming access to non-existent data or medications). This safety-critical metric excludes assessments of the diagnostic or management reasoning itself and is evaluated with a binary score (1 = hallucination present, 0 = absent).
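The top-n DDx accuracy defined above (binary hit per dialogue, averaged across dialogues) can be illustrated with a short sketch. In the study the synonym matching is performed by the LLM auto-rater; the explicit `synonyms` lookup table, the function names and the example dialogues below are hypothetical stand-ins used only to make the arithmetic concrete.

```python
from typing import Dict, List, Optional, Sequence


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so string comparison is lenient."""
    return " ".join(name.lower().split())


def top_n_hit(ranked_ddx: Sequence[str], truth: str, n: int,
              synonyms: Optional[Dict[str, List[str]]] = None) -> int:
    """Return 1 if the ground truth (or a listed synonym) appears in the
    first n entries of the ranked DDx, otherwise 0."""
    synonyms = synonyms or {}
    accepted = {normalize(truth)}
    accepted.update(normalize(s) for s in synonyms.get(truth, []))
    return int(any(normalize(d) in accepted for d in ranked_ddx[:n]))


def top_n_accuracy(dialogues: Sequence[dict], n: int,
                   synonyms: Optional[Dict[str, List[str]]] = None) -> float:
    """Average the binary top-n hits across all dialogues."""
    hits = [top_n_hit(d["ddx"], d["truth"], n, synonyms) for d in dialogues]
    return sum(hits) / len(hits)


# Hypothetical example data, not taken from the study.
example_dialogues = [
    {"ddx": ["Atopic dermatitis", "Psoriasis", "Tinea corporis"],
     "truth": "Eczema"},
    {"ddx": ["Atrial flutter", "Atrial fibrillation"],
     "truth": "Atrial fibrillation"},
]
example_synonyms = {"Eczema": ["atopic dermatitis"]}

top1 = top_n_accuracy(example_dialogues, 1, example_synonyms)  # 0.5
top3 = top_n_accuracy(example_dialogues, 3, example_synonyms)  # 1.0
```

In the first example dialogue the hit comes only through the synonym table, which is why the evaluation needs some form of synonym handling before the binary score is assigned.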

fulltextpubmed· Methods· item 42135531

example, mentioning details not provided by the patient, making incorrect assumptions and claiming access to non-existent data or medications). This safety-critical metric excludes assessments of the diagnostic or management reasoning itself and is evaluated with a binary score (1 = hallucination present, 0 = absent). This simulation framework advances beyond previous text-only versions4 by generating multimodal patient scenarios grounded in real-world datasets (SCIN, PTB-XL and Clinical Documents) and simulating realistic patient−clinician interactions involving these multimodal medical data. As detailed in Supplementary Section 3.3, calibration analysis confirms good alignment with human expert judgments (Supplementary Fig. 4), thus enabling automated evaluation of the system’s multimodal reasoning capabilities. To compare multimodal AMIE’s capabilities in undertaking multimodal diagnostic conversations to those of PCPs, we extended the remote OSCE study design introduced by Tu et al.4. The quality of dialogues was assessed using a set of rubrics and metrics reflecting the perspective of patients and specialist physicians. We also introduced a new evaluation rubric specifically to assess the ability to use multimodal medical data effectively in the context of clinical consultations.

fulltextpubmed· Methods· item 42135531

ced by Tu et al.4. The quality of dialogues was assessed using a set of rubrics and metrics reflecting the perspective of patients and specialist physicians. We also introduced a new evaluation rubric specifically to assess the ability to use multimodal medical data effectively in the context of clinical consultations. The OSCE is a standardized practical assessment widely used in healthcare education to objectively evaluate clinical skills and competencies by simulating real-world practice35,36. Unlike traditional knowledge-based examinations, the OSCE assesses practical skills of real-world clinical encounters, typically involving candidates rotating through a series of timed stations where they encounter a trained patient-actor portraying a specific clinical scenario. Test-takers perform designated tasks such as taking a medical history, conducting a physical examination, interpreting results or counseling the patient. Examiners observe these interactions and score the test-taker’s performance against detailed, predefined checklists that assess crucial skills such as history-taking, examination technique, clinical reasoning and communication.

fulltextpubmed· Methods· item 42135531

history, conducting a physical examination, interpreting results or counseling the patient. Examiners observe these interactions and score the test-taker’s performance against detailed, predefined checklists that assess crucial skills such as history-taking, examination technique, clinical reasoning and communication. In this work, we designed and conducted a virtual analogue of the OSCE adapted for multimodal text chat. In this setting, patient-actors engaged in blinded, synchronous chat conversations with either the multimodal AMIE system or PCPs, conducted through a chat interface (Supplementary Fig. 1) that allowed exchange of both text and uploaded images as is now commonplace for mobile chat applications (see Extended Data Fig. 6 for an overview). Within the virtual consultations, patient-actors were instructed to upload images such as skin photographs, laboratory tests or ECG tracings, emulating how popular text chat platforms have been reportedly used as a means for remote consultation.

fulltextpubmed· Methods· item 42135531

e for mobile chat applications (see Extended Data Fig. 6 for an overview). Within the virtual consultations, patient-actors were instructed to upload images such as skin photographs, laboratory tests or ECG tracings, emulating how popular text chat platforms have been reportedly used as a means for remote consultation. In collaboration with two organizations that routinely perform OSCE assessments in Canada and India, we developed 105 case scenarios. The scenarios were centered around three types of image artifacts that are commonly available to patients in telemedical primary care: (1) smartphone-captured skin images, (2) ECG tracings and (3) clinical documents. We selected these modalities as the most likely to occur in primary telecare: patients can readily photograph their skin, scan documents they received or report with ECGs collected via consumer tools22. Extended Data Table 4 provides an overview and examples of image artifacts. Scenarios contained different representations of these image artifacts to represent varying levels of image quality, including photographs of a given skin concern from different angles as well as screenshots and smartphone photographs of ECG tracings and clinical documents.

fulltextpubmed· Methods· item 42135531

an overview and examples of image artifacts. Scenarios contained different representations of these image artifacts to represent varying levels of image quality, including photographs of a given skin concern from different angles as well as screenshots and smartphone photographs of ECG tracings and clinical documents. We selected cases for dermatology from the SCIN dataset31. SCIN contains representative skin images along with rich metadata crowdsourced from real internet users with skin concerns. ECG tracings were taken from PTB-XL33, the largest publicly available dataset for this modality. Clinical documents were crafted by OSCE laboratories for the purpose of this study. Notably, we included normal cases, too. Scenarios were designed such that both requesting and interpreting images as well as taking the patient’s history were required—while either alone was insufficient—in order to form a confident diagnosis. To this end, for both skin photographs and ECG-based scenarios, we selected challenging images with high annotation ambiguity in their diagnosis labels as detailed in Extended Data Table 4. Furthermore, dermatologists, cardiologists and internists (medical specialists in internal medicine) ensured that the accompanying text-based component of scenarios (for example, family/medical history and symptoms) was complementary in the sense that both image and textual information were required to arrive at an accurate diagnosis. Nevertheless, we note that, although the scenarios are consistent with the artifacts and the diagnosis, there is no guarantee that they reflect the true case history, as they were created post hoc. Scenarios were crafted to match case metadata provided in the SCIN and PTB-XL datasets (such as age, sex and ethnicity) when available. For Clinical Documents, we selected conditions for which diagnosis formation can be guided based on the results of laboratory reports.

fulltextpubmed· Methods· item 42135531

true case history, as they were created post hoc. Scenarios were crafted to match case metadata provided in the SCIN and PTB-XL datasets (such as age, sex and ethnicity) when available. For Clinical Documents, we selected conditions for which diagnosis formation can be guided based on the results of laboratory reports. Lastly, to simulate variability in image quality in real-world care settings, for half of the ECG and clinical document cases, we used smartphone photographs of a computer screen showing the artifacts (Extended Data Table 4), as in previous work37. Extended Data Fig. 6 provides an overview of our OSCE study design. For each case scenario, the same patient-actor performed one conversation with multimodal AMIE and a qualified PCP each via a synchronous chat interface (Supplementary Fig. 1), in a blinded and randomized order. (Although we conducted a randomized, blinded evaluation, our study is not a randomized clinical trial.) The chat interface supported text-based communication and sharing of images from the scenarios while displaying the scenario pack information to patient-actors throughout the conversation.

fulltextpubmed· Methods· item 42135531

inded and randomized order. (Although we conducted a randomized, blinded evaluation, our study is not a randomized clinical trial.) The chat interface supported text-based communication and sharing of images from the scenarios while displaying the scenario pack information to patient-actors throughout the conversation. After each consultation, patient-actors completed a questionnaire to rate their experience to represent the patient-actor perspective. Separately, a different questionnaire was completed by both multimodal AMIE (using offline generation) and PCPs to summarize key clinical findings and next steps. In particular, the questionnaire asked for a DDx list (at least three and up to 10 plausible items, ordered by likelihood), a management plan and a description of salient image findings. Lastly, a group of specialist physicians assessed the performance of multimodal AMIE and PCPs in a blinded fashion (Supplementary Fig. 2), based on their consultation transcripts and responses to the post-questionnaires using various rubrics, representing the specialist physician perspective.

fulltextpubmed· Methods· item 42135531

of salient image findings. Lastly, a group of specialist physicians assessed the performance of multimodal AMIE and PCPs in a blinded fashion (Supplementary Fig. 2), based on their consultation transcripts and responses to the post-questionnaires using various rubrics, representing the specialist physician perspective. A large part of the collected patient-centric as well as specialist metrics is derived from the evaluation rubric introduced in the previous work4 for their direct applicability in the assessment of multimodal diagnostic dialogues. These criteria are designed primarily to evaluate the consultation quality, the appropriateness of diagnostic and management decisions, the accuracy of clinical reasoning and various aspects of effective communication skills (such as the ability to elicit information and manage patient concerns). The questions in this rubric were derived from consideration of authoritative assessment schemes of clinical interactions such as the PACES, used by the Royal College of Physicians in the UK for examining history-taking skills38; the GMCPQ (https://edwebcontent.ed.ac.uk/sites/default/files/imports/fileManager/patient_questionnaire%20pdf_48210488.pdf); and approaches to Patient-Centered Communication Best Practices (PCCBP)39.

fulltextpubmed· Methods· item 42135531

interactions such as the PACES, used by the Royal College of Physicians in the UK for examining history-taking skills38; the GMCPQ (https://edwebcontent.ed.ac.uk/sites/default/files/imports/fileManager/patient_questionnaire%20pdf_48210488.pdf); and approaches to Patient-Centered Communication Best Practices (PCCBP)39. The MUH rubric was designed to assess competence in handling and interpreting multimodal artifacts in the context of clinical consultations, including the ability to understand medical image artifacts; to use that understanding to guide the conversation and inform a clinically accurate assessment; and to communicate the salient findings and address the patients’ questions in an appropriate manner. Details are provided in Extended Data Table 1.

fulltextpubmed· Methods· item 42135531

onsultations, including the ability to understand medical image artifacts; to use that understanding to guide the conversation and inform a clinically accurate assessment; and to communicate the salient findings and address the patients’ questions in an appropriate manner. Details are provided in Extended Data Table 1. This study involved 19 board-certified PCPs and 25 validated patient-actors, with participants split between India and Canada (10/9 for PCPs and 10/10 for patient-actors for India/Canada, respectively). Informed consent was obtained from each participant before their participation. To ensure consistent and high-quality interactions, all patient-actors and PCPs completed a standardized training prior to participation. This training included an interactive workshop to familiarize them with the chat interface and the specifics of the OSCE interaction format based on detailed instructional guides. The PCPs had a median post-residency experience of 6 years with an interquartile range of 3.5−11.5 years. For quality assessment of the multimodal AMIE/PCP consultations and their post-questionnaire responses, we recruited 18 independent specialist physicians from India and North America over three medical specialities (dermatology, cardiology and internal medicine) to ensure diverse clinical viewpoints as well as requisite expertise. These specialists were independent of the study team and the patient-actor cohort, had a median post-residency experience of 5 years (interquartile range, 4−8) and were assigned evaluation tasks matching the medical specialty of the scenarios (for example, dermatology scenarios were evaluated by dermatologists). For each of 105 scenarios, each assigned patient-actor performed two sessions, one with a PCP and another with multimodal AMIE in a randomized order, yielding a total of 210 consultations. Each of these conversations was evaluated by three independent specialists.

fulltextpubmed· Methods· item 42135531

We analyzed the results from the remote OSCE study using a mixed-effect approach to account for differences in the scenarios using random effects. Most rating choice options for patient-actors and specialist physicians were ordinal (for example, ranging from ‘Very unfavorable’ to ‘Very favorable’), and we model them using a cumulative ordinal model with logit link function. For binary ratings (for example, ‘Yes/No’), we use a Bernoulli model with logit link. For diagnostic accuracy, we again use a Bernoulli model for ‘Correct/Incorrect’. In all models, the scenario is used as a random intercept to model differences in the quality of the patient vignettes or the difficulty of the diagnostic task. Both of these factors can affect the flow of conversation and should, therefore, be considered when modeling ratings of these conversations. The experimental arm (PCP/multimodal AMIE) is modeled as a fixed effect. For diagnostic accuracy, we also model the fixed effect of the differential size (k) using monotonic regression (Supplementary Section 2.2).
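As one way to make the ordinal model concrete, a cumulative logit model with a fixed effect for the experimental arm and a scenario-level random intercept could be written as below. The symbols are notation assumed here for illustration (they do not appear in the text), and the exact parameterization used in the study may differ.

```latex
\operatorname{logit} P\left(Y_{ij} \le k\right) = \theta_k - \left(\beta\, x_{ij} + u_j\right),
\qquad u_j \sim \mathcal{N}\left(0, \sigma^2\right),
\qquad k = 1, \dots, K-1
```

Here $Y_{ij}$ is the ordinal rating of conversation $i$ in scenario $j$ with $K$ ordered categories, $\theta_1 < \dots < \theta_{K-1}$ are category thresholds, $x_{ij}$ is 1 for multimodal AMIE and 0 for a PCP, $\beta$ is the fixed effect of the arm and $u_j$ is the random intercept for scenario $j$. Under the same assumed notation, the binary (Bernoulli) case reduces to $\operatorname{logit} P(Y_{ij} = 1) = \alpha + \beta\, x_{ij} + u_j$.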

fulltextpubmed· Methods· item 42135531

Here, clinical scenarios were modeled as random intercepts to allow generalization of findings to the broader population of medical cases rather than restricting inference to the specific vignettes sampled. We explicitly note that this study is a comparative system evaluation, not a randomized controlled trial. Additionally, because clinical utility requires simultaneous success in accuracy, empathy and trust, we treated these as co-equal outcomes adjusted via false discovery rate (FDR) rather than designating a single primary endpoint.

fulltextpubmed· Methods· item 42135531

In some analyses, we also estimate preference by comparing ordinal or binary metrics (for example, in Fig. 3c). Here, we model the proportion of preference for multimodal AMIE and PCPs using the one-sided χ2 test. For ablation results (for example, in Fig. 5), we test differences in metrics using Mann–Whitney U-tests. All P values were corrected for FDR using the Benjamini–Hochberg method as implemented in the Python library ‘statsmodels’. These corrections were applied to all statistical estimates of the experimental arm on either patient-actor or expert ratings, both when estimating parametric ordinal models and when estimating preference (Fig. 2). In all figures, we display confidence intervals that show the 2.5th and 97.5th percentiles of bootstrapped distributions of the displayed quantity. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
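The two generic procedures named in this passage, Benjamini–Hochberg FDR correction and percentile bootstrap intervals, can be sketched in plain Python as follows. This is an illustrative standard-library re-implementation, not the authors' analysis code: the study reports using ‘statsmodels’ (whose `multipletests` function with `method="fdr_bh"` performs this correction), and the function names, resample count and seed below are assumptions.

```python
import random
import statistics


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bootstrap_ci(values, n_resamples=2000, seed=0):
    """Percentile bootstrap interval (2.5th, 97.5th) for the mean.

    n_resamples and seed are illustrative defaults, not study settings.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper
```

A typical use would be to collect the raw P values for all arm effects, pass them through `benjamini_hochberg` once, and report intervals from `bootstrap_ci` alongside each displayed quantity.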

fulltextpubmed· Multimodal state-aware reasoning· item 42135531

Real clinical diagnostic dialogues follow a structured yet flexible path. Clinicians methodically gather information, form potential diagnoses, strategically request and interpret further details (including multimodal data such as skin photographs or ECGs), continually update their assessment based on new evidence and eventually formulate a management plan. This process requires a clinician to adapt the line of questioning based on evolving hypotheses and uncertainty while ensuring that all critical information is considered30. Given the rapid advancements in LLM capabilities and their increasing proficiency in following complex instructions, one might achieve considerable progress toward emulating this process using a sophisticated system prompt alone. However, we hypothesize that, for a safety-critical and highly dynamic task like multimodal diagnosis, building an explicit state-aware reasoning system layered on top of the LLM offers critical advantages. Such a system provides greater control over the dialogue flow, enables more reliable tracking of the diagnostic state and uncertainty, facilitates more deliberate integration of multimodal inputs and ultimately leads to higher-quality, more dependable clinical reasoning rather than relying solely on complex prompting (validated in the ‘Automated evaluations of multimodal AMIE configurations’ section).

fulltextpubmed· Phase 1: history-taking—building a comprehensive picture· item 42135531

Patient profile initialization: A structured patient profile is initialized and dynamically updated throughout the interaction. This profile acts as a condensed, evolving record of known patient information, encompassing the chief complaint; history of present illness; demographics (age, sex and race); positive and negative symptoms; past medical, family and social/travel histories; medications; other relevant details; and a prioritized list of knowledge gaps. Initially, this profile may contain minimal information.
Evolving DDx generation: An internal, evolving DDx is generated. This DDx is not initially presented to the patient. DDx generation begins after an initial interaction period, allowing baseline information collection. The frequency of DDx updates is configurable.
Continuation decision: A key decision point is whether to continue gathering history or transition to presenting a diagnosis. This decision uses a set of criteria, including assessing if sufficient information exists to formulate a reasonable differential diagnosis. A decision module, querying Gemini 2.0 Flash, determines if current information is sufficient to proceed or if more targeted questions are needed. This module considers the dialogue history and preliminary DDx.
Targeted question and multimodal data request generation: If history-taking continues, the system generates focused questions to address information gaps identified in the patient profile and DDx uncertainty. It is important to note that this internal measure of uncertainty, derived from the model’s iterative DDx generation, is used as a pragmatic heuristic to guide the history-taking dialogue toward more relevant lines of questioning. It is not presented to the user nor is it a formally calibrated diagnostic probability intended for direct clinical interpretation. Crucially, the system is designed to recognize when multimodal data are necessary and to strategically request them. For instance, based on reported symptoms such as a rash, multimodal AMIE will prompt the user to upload skin photographs.
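The phase 1 steps above can be pictured with a small data structure and decision function. This is a minimal sketch under stated assumptions: the field names, the action strings and the `has_enough_information` callback (a stand-in for the LLM-backed decision module) are all hypothetical and are not taken from the AMIE implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PatientProfile:
    """Condensed, evolving record of what is known about the patient."""
    chief_complaint: str = ""
    history_of_present_illness: str = ""
    demographics: Dict[str, str] = field(default_factory=dict)  # age, sex, race
    positive_symptoms: List[str] = field(default_factory=list)
    negative_symptoms: List[str] = field(default_factory=list)
    histories: Dict[str, str] = field(default_factory=dict)  # medical, family, social/travel
    medications: List[str] = field(default_factory=list)
    knowledge_gaps: List[str] = field(default_factory=list)  # highest priority first


def next_history_action(
    profile: PatientProfile,
    ddx: List[str],
    has_enough_information: Callable[[PatientProfile, List[str]], bool],
) -> str:
    """Pick the next history-taking step (hypothetical action labels)."""
    # Only consider leaving history-taking once a preliminary DDx exists.
    if ddx and has_enough_information(profile, ddx):
        return "transition_to_diagnosis"
    # Otherwise target the highest-priority open knowledge gap.
    if profile.knowledge_gaps:
        return f"ask_about:{profile.knowledge_gaps[0]}"
    return "ask_open_question"
```

Keeping the sufficiency check behind a callback mirrors the description of a separate decision module, so the rule could be swapped without touching the profile structure.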

fulltextpubmed· Phase 2: diagnosis and management—from data to actionable plan· item 42135531

Patient profile (optional update): The patient profile may be further refined.
DDx validation (subphase): The system enters a DDx validation phase. Focused questions are generated to gather specific evidence that supports or refutes potential diagnoses within the internal DDx. A decision module determines when the DDx is sufficiently validated to present to the patient.
DDx presentation: The system presents a ranked DDx (5−10 conditions). Crucially, the explanation for each diagnosis is grounded in evidence from the entire interaction, explicitly referencing and explaining findings from the provided multimodal data (for example, ‘Based on the photo you sent, the circular shape and central clearing of the rash are characteristic of…,’ or ‘The ECG shows specific changes in the ST segment which support the possibility of…’).
Management plan formulation: After presenting the DDx, the system formulates a management plan based on dialogue history, patient profile and the presented DDx. The plan includes recommendations for investigations, testing and/or treatment.
Management plan delivery: The system delivers the management plan, potentially iteratively across several turns, allowing for clarification and addressing patient concerns.
The presentation of the refined DDx and management plan signals the transition to phase 3.

fulltextpubmed· Phase 3: answer follow-up questions— ensuring patient understanding· item 42135531

Patient profile (optional update): The patient profile may continue to be updated with information from follow-up questions.
Plan communication and question answering: The system addresses remaining patient questions, ensuring that the patient understands the proposed management plan, potentially referencing the multimodal artifacts again to clarify points. Responses are guided by the dialogue history, patient profile, presented DDx and management plan.
Dialogue termination: Dialogue continues until patient questions are addressed, the management plan is communicated and a natural conclusion is reached.
Upon reaching this termination state, the system proceeds to synthesize the interaction into a structured summary.
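Taken together, the three phases form a simple forward-only state machine. The sketch below illustrates that control flow only; the enum members, the `advance` function and its boolean flags are assumed names, and in the described system each flag would come from a model-backed decision rather than a plain argument.

```python
from enum import Enum, auto


class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS_AND_MANAGEMENT = auto()
    FOLLOW_UP = auto()
    CONCLUDED = auto()


def advance(
    phase: Phase,
    *,
    history_sufficient: bool = False,
    plan_delivered: bool = False,
    questions_resolved: bool = False,
) -> Phase:
    """Return the next dialogue phase; stay in place if no criterion is met."""
    if phase is Phase.HISTORY_TAKING and history_sufficient:
        return Phase.DIAGNOSIS_AND_MANAGEMENT
    # Presenting the refined DDx and management plan moves the dialogue to phase 3.
    if phase is Phase.DIAGNOSIS_AND_MANAGEMENT and plan_delivered:
        return Phase.FOLLOW_UP
    # Once questions are resolved, the dialogue ends and the summary is generated.
    if phase is Phase.FOLLOW_UP and questions_resolved:
        return Phase.CONCLUDED
    return phase
```

Because transitions only move forward, a flag that belongs to a later phase has no effect on an earlier one, which keeps the dialogue from skipping history-taking.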

fulltextpubmed· After dialogue conclusion: structured post-questionnaire generation· item 42135531

Once the dialogue reaches a natural conclusion (phase 3 completion), the system automatically generates a structured post-questionnaire. This process leverages the full multimodal dialogue history and the final internal state (patient profile, final DDx and management plan) to produce a comprehensive summary suitable for clinical review. Specifically, multimodal AMIE is prompted to:
Finalize DDx: Based on the entire interaction, including all textual exchanges and interpretations of multimodal artifacts, generate a final, ranked DDx listing the most probable condition and several plausible alternatives.
Formulate the management plan: Leveraging both the established DDx and a constrained web search process for grounding in current medical knowledge and guidelines, detail the recommended management plan. This includes proposed in-visit and ordered investigations, specific actions or recommendations for the patient and necessary escalation level (for example, video call or in-person visit) with justification and follow-up requirements (necessity, timeframe and reason). This retrieval-based approach is used exclusively during this offline step to ensure that recommendations are informed by up-to-date information. The specific methodology is detailed in Supplementary Section 5.
Extract salient artifact findings: Identify and list the key clinical findings observed in any provided images (skin photographs, ECGs and documents) that were relevant to the diagnostic and management reasoning.
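One possible shape for the resulting post-questionnaire record, with a basic completeness check, is sketched below. The class, its field names and the `problems` method are hypothetical. The 3 to 10 item bound on the DDx is taken from the study-design description of the questionnaire, while the other checks are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PostQuestionnaire:
    """Structured summary produced after the dialogue ends (hypothetical schema)."""
    ranked_ddx: List[str]          # most probable condition first
    management_plan: str           # investigations, actions, follow-up
    escalation: str                # for example "video call" or "in-person visit"
    artifact_findings: List[str]   # salient findings from images and documents

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the record looks complete."""
        issues = []
        if not 3 <= len(self.ranked_ddx) <= 10:
            issues.append("DDx should list between 3 and 10 conditions")
        if len(set(self.ranked_ddx)) != len(self.ranked_ddx):
            issues.append("DDx contains duplicate conditions")
        if not self.management_plan.strip():
            issues.append("management plan is empty")
        return issues
```

Returning a list of issues rather than raising on the first one lets a reviewer see every gap in a generated summary at once.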

fulltextpubmed· Automatic evaluations and system validation· item 42135531

We established an automated evaluation framework to support rapid iteration and rigorous assessment of the multimodal AMIE system. This included evaluating perception capabilities on isolated medical artifacts and using a simulation environment with auto-raters to assess complete diagnostic multimodal dialogues and perform component ablations and determine optimal model selection. A critical precursor to effective multimodal diagnostic conversation is ensuring that our underlying models possess a fundamental capability: accurate perception of diverse medical artifacts. Although LLMs have demonstrated remarkable progress in understanding and generating text, their ability to reliably ‘see’ and interpret medical images and documents— akin to a clinician’s initial visual assessment—remains less explored in the context of conversational DDx. Without robust perceptual grounding, even the most sophisticated conversational framework would be limited in its ability to meaningfully integrate multimodal data into history-taking and diagnostic reasoning. Therefore, we first conducted a suite of perception tests designed to isolate and evaluate the visual understanding of our base models when presented with common medical artifacts: smartphone-captured skin images, ECG tracings and clinical documents. The objective was not to achieve state-of-the-art performance on complex diagnostic tasks per se but, rather, to establish a baseline confidence in whether our models could reliably discern key visual features and clinical information from these modalities in isolation.

fulltextpubmed· Automatic evaluations and system validation· item 42135531

For skin images, we used the SCIN dataset31 and PAD-UFES-20 (ref. 32), assessing the ability of the model to describe lesion morphology, color and distribution. For ECGs, we employed PTB-XL33 and ECG-QA benchmark34, prompting the model to provide probable diagnoses or answer expert-validated questions based on ECG images generated from raw signals. Finally, for clinical documents, we created the ClinicalDoc-QA dataset that consists of question-answering tasks based on a large collection of deidentified clinical notes and patient records generated by physicians to evaluate the model’s comprehension of medical information from clinical documents. Additional dataset details can be found in Supplementary Section 3.1.

fulltextpubmed· Automatic evaluations and system validation· item 42135531

example, mentioning details not provided by the patient, making incorrect assumptions and claiming access to non-existent data or medications). This safety-critical metric excludes assessments of the diagnostic or management reasoning itself and is evaluated with a binary score (1 = hallucination present, 0 = absent). This simulation framework advances beyond previous text-only versions4 by generating multimodal patient scenarios grounded in real-world datasets (SCIN, PTB-XL and Clinical Documents) and simulating realistic patient−clinician interactions involving these multimodal medical data. As detailed in Supplementary Section 3.3, calibration analysis confirms good alignment with human expert judgments (Supplementary Fig. 4), thus enabling automated evaluation of the system’s multimodal reasoning capabilities.

fulltextpubmed· Auto-rating simulated diagnostic conversations· item 42135531

The following subsections detail our three-step approach illustrated in Fig. 4 to creating a simulation environment for auto-evaluation. Step (1) encompasses patient profile and scenario generation, which is then used in step (2) for the turn-by-turn multimodal dialogue generation process used to mimic real-world telemedicine interactions, and, finally, the simulated dialogue is used in step (3) by the auto-rater for automated assessment of multimodal AMIE’s conversational abilities.

fulltextpubmed· Step 1: Patient profile and scenario generation· item 42135531

Generating realistic synthetic dialogues requires detailed patient profiles and corresponding clinical scenarios. Our methodology involves compiling comprehensive patient metadata, including symptoms, demographics and medical history, tailored to specific medical domains.
Dermatology: We used the SCIN and PAD-UFES-20 datasets, which provide rich, preexisting patient attributes (for example, age, sex, symptoms, skin tone and medical and social history), requiring no synthetic imputation.
Cardiology: We started with the PTB-XL ECG dataset. As it includes only age and sex, we employed Gemini 2.0 Flash, using its web search tool, to impute clinically plausible symptoms, social/family/medical history and cardiovascular risk factors associated with the ECG’s indicated condition. This grounded the synthetic profiles in the model’s medical knowledge and real-world examples from the web.
Clinical documents: For this domain, we collaborated with the same external clinical partners who developed our OSCE scenarios. They created realistic patient profiles, document artifacts and corresponding metadata specifically for our simulation needs.

fulltextpubmed· Step 1: Patient profile and scenario generation· item 42135531

Once comprehensive metadata were established for all domains, we used Gemini 2.0 Flash to generate detailed clinical scenarios. These scenarios provide context for the patient agent in the simulation, outlining the initial presentation, details to share only upon questioning, patient expectations and concerns, desired outcomes and potential questions for the doctor. We guided Gemini using few-shot examples, ensuring scenario diversity (for example, varied ethnicities and occupations) and clinical realism. Although the raw images are from public datasets, we augmented them with unique, synthetically generated scenarios. We designed this approach to mitigate potential data leakage by ensuring that the model encounters these images within novel clinical contexts. Crucially, the scenarios deliberately omit the final diagnosis, requiring the AI agent to deduce it through the simulated conversation and analysis of any provided multimodal artifacts.

fulltextpubmed· Step 2: Turn-by-turn multimodal dialogue generation· item 42135531

We simulate dialogues turn by turn between a doctor agent (representing multimodal AMIE) and a patient agent, mimicking a telemedical interaction. The doctor agent (multimodal AMIE) is instructed to be empathetic and clinically accurate. It uses the state-aware dialogue phase transition framework (Fig. 6) to navigate history-taking, diagnosis and management and follow-up phases. A key capability is its multimodal nature: it can strategically request and analyze relevant medical artifacts (such as skin images or ECGs) based on patient-reported symptoms during the conversation. The patient agent simulates a patient strictly following a predefined scenario from step 1, which details their profile and clinical scenario. It is prompted to respond truthfully using concise, casual language appropriate for an online consultation while adhering to specific rules regarding politeness and pacing the release of information to avoid overwhelming the doctor agent. In each dialogue turn, the doctor agent generates a question or statement based on the conversation history and its internal state. The patient agent then formulates a response according to its scenario instructions. If the scenario dictates and the doctor agent requests it, the patient agent can upload a relevant medical artifact (for example, an image). Multimodal AMIE then analyzes this artifact, integrating the findings into its ongoing diagnostic reasoning and subsequent dialogue turns.

fulltextpubmed· Step 2: Turn-by-turn multimodal dialogue generation· item 42135531

This turn-by-turn exchange continues until the conversation reaches a natural conclusion (for example, patient questions are resolved) or hits a predefined maximum turn limit. After the exchange is concluded, the doctor agent generates the structured post-questionnaire. The output is a complete simulated dialogue transcript, capturing the dynamic exchange of text and multimodal data.
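The turn-by-turn exchange with its two stopping conditions can be sketched as a short loop. Everything here is illustrative: `doctor_turn`, `patient_turn` and `is_concluded` are hypothetical callbacks standing in for the model-backed agents and the conclusion check, artifact uploads are omitted, and the default turn limit is an arbitrary placeholder rather than the study's setting.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs


def simulate_dialogue(
    doctor_turn: Callable[[Transcript], str],
    patient_turn: Callable[[Transcript], str],
    is_concluded: Callable[[Transcript], bool],
    max_turns: int = 20,  # placeholder limit, not the study's value
) -> Transcript:
    """Alternate doctor and patient messages until conclusion or the turn limit."""
    transcript: Transcript = []
    for _ in range(max_turns):
        # The doctor agent speaks first, conditioned on the history so far.
        transcript.append(("doctor", doctor_turn(transcript)))
        if is_concluded(transcript):
            break
        # The patient agent answers according to its scenario.
        transcript.append(("patient", patient_turn(transcript)))
    return transcript
```

With stub agents, for example a doctor callback that returns a closing message once the transcript holds four entries, the loop stops as soon as the conclusion check fires and otherwise ends at `max_turns`.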

fulltextpubmed· Step 3: Criteria and scoring with auto-raters· item 42135531

Auto-rating (automated evaluation) is crucial for the iterative development and safety assessment of conversational models. Although human evaluation is valuable, it is often constrained by cost, time and scalability. Auto-raters provide a mechanism for rapid, scalable and consistent performance assessment across essential characteristics.

fulltextpubmed· Expert evaluation: multimodal virtual OSCE· item 42135531

To compare multimodal AMIE’s capabilities in undertaking multimodal diagnostic conversations to those of PCPs, we extended the remote OSCE study design introduced by Tu et al.4. The quality of dialogues was assessed using a set of rubrics and metrics reflecting the perspective of patients and specialist physicians. We also introduced a new evaluation rubric specifically to assess the ability to use multimodal medical data effectively in the context of clinical consultations. The OSCE is a standardized practical assessment widely used in healthcare education to objectively evaluate clinical skills and competencies by simulating real-world practice35,36. Unlike traditional knowledge-based examinations, the OSCE assesses practical skills of real-world clinical encounters, typically involving candidates rotating through a series of timed stations where they encounter a trained patient-actor portraying a specific clinical scenario. Test-takers perform designated tasks such as taking a medical history, conducting a physical examination, interpreting results or counseling the patient. Examiners observe these interactions and score the test-taker’s performance against detailed, predefined checklists that assess crucial skills such as history-taking, examination technique, clinical reasoning and communication.

fulltextpubmed· Scenario packs· item 42135531

In collaboration with two organizations that routinely perform OSCE assessments in Canada and India, we developed 105 case scenarios. The scenarios were centered around three types of image artifacts that are commonly available to patients in telemedical primary care: (1) smartphone-captured skin images, (2) ECG tracings and (3) clinical documents. We selected these modalities as the most likely to occur in primary telecare: patients can readily photograph their skin, scan documents they received or report with ECGs collected via consumer tools22. Extended Data Table 4 provides an overview and examples of image artifacts. Scenarios contained different representations of these image artifacts to represent varying levels of image quality, including photographs of a given skin concern from different angles as well as screenshots and smartphone photographs of ECG tracings and clinical documents.

fulltextpubmed· Scenario packs· item 42135531

true case history, as they were created post hoc. Scenarios were crafted to match case metadata provided in the SCIN and PTB-XL datasets (such as age, sex and ethnicity) when available. For Clinical Documents, we selected conditions for which diagnosis formation can be guided based on the results of laboratory reports. Lastly, to simulate variability in image quality in real-world care settings, for half of the ECG and clinical document cases, we used smartphone photographs of a computer screen showing the artifacts (Extended Data Table 4), as in previous work37.

fulltextpubmed· Study design· item 42135531

Extended Data Fig. 6 provides an overview of our OSCE study design. For each case scenario, the same patient-actor held one conversation each with multimodal AMIE and a qualified PCP via a synchronous chat interface (Supplementary Fig. 1), in a blinded and randomized order. (Although we conducted a randomized, blinded evaluation, our study is not a randomized clinical trial.) The chat interface supported text-based communication and sharing of images from the scenarios while displaying the scenario pack information to patient-actors throughout the conversation. After each consultation, patient-actors completed a questionnaire rating their experience, representing the patient-actor perspective. Separately, a different questionnaire was completed by both multimodal AMIE (using offline generation) and PCPs to summarize key clinical findings and next steps. In particular, the questionnaire asked for a DDx list (at least three and up to 10 plausible items, ordered by likelihood), a management plan and a description of salient image findings. Lastly, a group of specialist physicians assessed the performance of multimodal AMIE and PCPs in a blinded fashion (Supplementary Fig. 2), based on their consultation transcripts and responses to the post-questionnaires using various rubrics, representing the specialist physician perspective.

fulltextpubmed· Participants· item 42135531

This study involved 19 board-certified PCPs and 25 validated patient-actors, with participants split between India and Canada (10/9 for PCPs and 10/10 for patient-actors for India/Canada, respectively). Informed consent was obtained from each participant before their participation. To ensure consistent and high-quality interactions, all patient-actors and PCPs completed standardized training prior to participation. This training included an interactive workshop to familiarize them with the chat interface and the specifics of the OSCE interaction format based on detailed instructional guides. The PCPs had a median post-residency experience of 6 years with an interquartile range of 3.5−11.5 years. For quality assessment of the multimodal AMIE/PCP consultations and their post-questionnaire responses, we recruited 18 independent specialist physicians from India and North America over three medical specialties (dermatology, cardiology and internal medicine) to ensure diverse clinical viewpoints as well as requisite expertise. These specialists were independent of the study team and the patient-actor cohort, had a median post-residency experience of 5 years (interquartile range, 4−8) and were assigned evaluation tasks matching the medical specialty of the scenarios (for example, dermatology scenarios were evaluated by dermatologists). For each of 105 scenarios, each assigned patient-actor performed two sessions, one with a PCP and another with multimodal AMIE in a randomized order, yielding a total of 210 consultations. Each of these conversations was evaluated by three independent specialists.
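The counts in this design can be checked with a short sketch that builds a per-scenario schedule with a randomized arm order. The passage does not describe how the randomization was implemented, so `build_schedule`, its seed and the record layout are hypothetical. The total of 630 specialist evaluations is simple arithmetic from the figures above (210 consultations × 3 specialists), not a number quoted from the source.

```python
import random

ARMS = ("PCP", "multimodal AMIE")


def build_schedule(n_scenarios=105, n_raters=3, seed=0):
    """Build a hypothetical consultation schedule with randomized arm order."""
    rng = random.Random(seed)
    consultations = []
    for scenario_id in range(n_scenarios):
        order = list(ARMS)
        rng.shuffle(order)  # which arm the patient-actor meets first
        for position, arm in enumerate(order, start=1):
            consultations.append(
                {"scenario": scenario_id, "arm": arm, "position": position}
            )
    # Each consultation is rated by n_raters independent specialists.
    n_ratings = len(consultations) * n_raters
    return consultations, n_ratings


if __name__ == "__main__":
    schedule, n_ratings = build_schedule()
    print(len(schedule), n_ratings)
```

Every scenario contributes exactly one consultation per arm, so only the order within a scenario is random.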

fulltextpubmed· Statistical analysis· item 42135531

We analyzed the results from the remote OSCE study using a mixed-effect approach to account for differences in the scenarios using random effects. Most rating choice options for patient-actors and specialist physicians were ordinal (for example, ranging from ‘Very unfavorable’ to ‘Very favorable’), and we model them using a cumulative ordinal model with logit link function. For binary ratings (for example, ‘Yes/No’), we use a Bernoulli model with logit link. For diagnostic accuracy, we again use a Bernoulli model for ‘Correct/Incorrect’. In all models, the scenario is used as a random intercept to model differences in the quality of the patient vignettes or the difficulty of the diagnostic task. Both of these factors can affect the flow of conversation and should, therefore, be considered when modeling ratings of these conversations. The experimental arm (PCP/multimodal AMIE) is modeled as a fixed effect. For diagnostic accuracy, we also model the fixed effect of the differential size (k) using monotonic regression (Supplementary Section 2.2). Here, clinical scenarios were modeled as random intercepts to allow generalization of findings to the broader population of medical cases rather than restricting inference to the specific vignettes sampled. We explicitly note that this study is a comparative system evaluation, not a randomized controlled trial. Additionally, because clinical utility requires simultaneous success in accuracy, empathy and trust, we treated these as co-equal outcomes adjusted via false discovery rate (FDR) rather than designating a single primary endpoint.
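To make the model structure concrete, the sketch below shows how a cumulative ordinal model with a logit link turns a linear predictor into rating-category probabilities. It assumes the common parameterization P(Y ≤ c) = logistic(τ_c − η), with η equal to the fixed arm effect plus the scenario's random intercept; the passage does not state the sign convention or the fitting software, and the thresholds and effect sizes in the example are made-up numbers, not estimates from the study.

```python
import math


def logistic(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(thresholds, arm_effect, is_amie, scenario_intercept=0.0):
    """Category probabilities under an assumed cumulative logit model.

    thresholds: increasing cut points tau_1 < ... < tau_{K-1}
    arm_effect: fixed effect of the experimental arm (applied when is_amie)
    scenario_intercept: random intercept drawn for the scenario
    """
    eta = arm_effect * (1.0 if is_amie else 0.0) + scenario_intercept
    # Cumulative probabilities P(Y <= c); the final category closes at 1.
    cumulative = [logistic(tau - eta) for tau in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


if __name__ == "__main__":
    taus = [-2.0, -1.0, 1.0, 2.0]  # five ordered rating levels (illustrative)
    print(category_probs(taus, arm_effect=1.0, is_amie=False))
    print(category_probs(taus, arm_effect=1.0, is_amie=True))
```

Under this convention a positive arm effect shifts probability mass toward the more favorable categories, and a scenario-level intercept shifts both arms of the same scenario by the same amount. The Bernoulli models for binary ratings and diagnostic accuracy correspond to the special case with a single threshold.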

fulltextpubmed· Online content· item 42135531

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41591-026-04371-0.

fulltextpubmed· Supplementary information· item 42135531

Supplementary Information: Supplementary Figs. 1–21 and Supplementary Information Sections 1–6. Reporting Summary.