Browse the corpus

Walk the Even Hospital Database by book and chapter — the raw source passages that ground Ask, DDx, and the rest.

121 passages

abstractpubmed· Abstract· item 42135531

Advancing conversational diagnostic AI with multimodal reasoning. Real-world clinical practice is inherently multimodal, relying on the synthesis of patient history with visual information such as medical imagery and clinical documents. Although large language models (LLMs) have shown promise in diagnostic dialogue, their evaluation has been largely restricted to text-only interactions, failing to capture the complexity of modern remote care delivery. Here we introduce a multimodal extension of the Articulate Medical Intelligence Explorer (multimodal AMIE), capable of gathering, interpreting and reasoning about multimodal data within a diagnostic conversation. To achieve this, we developed a state-aware dialogue framework that dynamically guides history-taking based on diagnostic uncertainty and evolving patient states, emulating the structured reasoning of experienced clinicians. We evaluated this updated, state-aware version of multimodal AMIE against primary care physicians (PCPs) in a randomized, blinded exploratory study comprising 105 simulated telehealth consultations, which included dermatology photographs, electrocardiograms and clinical documents. As assessed by 18 specialist physicians, multimodal AMIE outperformed PCPs not only in diagnostic accuracy but also in conversation quality, including history-taking and empathy. Specifically, multimodal AMIE demonstrated superior performance on 29 of 32 evaluation axes, including seven of nine metrics that assess multimodal reasoning. These results validate the efficacy of state-aware reasoning in bridging the gap between text and visual information and demonstrate the potential for artificial intelligence (AI) systems to augment clinicians in complex, multimodal diagnostic settings.

fulltextpubmed· Main· item 42135531

Healthcare delivery faces considerable challenges globally from aging populations and increasing care fragmentation through to clinician burnout and misalignment between clinical and financial incentives1. This drives increased wait times for providers, delayed care and potentially worsened morbidity and mortality2. Even in high-income countries, there are many regions where access to primary care is limited and challenging3.

fulltextpubmed· Main· item 42135531

Recently, we have witnessed the promise of generative AI systems in healthcare, especially those based on LLMs, as potential tools to augment healthcare delivery and address some of these challenges. Research systems such as the AMIE, an LLM-based conversational diagnostic AI system, have demonstrated physician-like capabilities to steer text-based clinical conversations, attaining superior or similar performance compared to PCPs across a broad array of measurements for consultation quality in studies involving patient-actors4,5. Although the evidence base for such autonomous diagnostic systems in real-world clinical settings remains limited, some conversational AI systems for triage and care navigation have shown promising results in specific applications. For instance, Zeltzer et al.6 conducted an evaluation of an AI system deployed in a real-world virtual urgent care setting for common symptoms and showed that the initial AI recommendations were generally concordant or rated by expert adjudicators as better than the diagnosis and management recommendations from the treating physicians. Furthermore, studies on other conversational systems have also shown high patient acceptability, a crucial factor for successful adoption7,8.

fulltextpubmed· Main· item 42135531

Despite the early promise of this technology, LLM-based medical AI systems have predominantly been studied and implemented as text-only chatbots. This represents a substantial deviation from how multimodal information is routinely used in care. As demonstrated during the COVID-19 pandemic9, to augment in-person care through visits at lower costs and increased patient convenience10, it is integral for remote care participants to be able to converse about and interpret multimodal medical information. This can take the form of images such as patient-provided skin photographs, electrocardiogram (ECG) tracings or laboratory reports, which can be shared via ubiquitous messaging platforms11,12 that combine text and multimodal chat or even more commonly can occur via real-time video consultations. We also note that these data (images or other attachments) provided by remote care participants are distinct from data obtained directly in the clinical environment, such as radiology imaging.

fulltextpubmed· Main· item 42135531

Multimodal medical tests are essential in effective care and can meaningfully guide the course of a consultation. However, evidence validating the use of LLMs for diagnostic conversations involving such multimodal data is scarce, revealing an important discrepancy between clinical needs and current technology. Patients may struggle to precisely articulate the results of an investigation using text alone, potentially omitting crucial details such as precise laboratory test values, and essential clinical information commonly resides only in non-textual formats11. For example, visual data such as photographs are paramount for remote assessment of dermatological conditions, and clinical documents, such as blood reports or specialist letters, contain objective findings that can be otherwise impractical to relay accurately via transcription to text chat. A text-only approach prevents AI from leveraging these rich, supplementary objective sources of information, hindering its ability to form a complete clinical picture.

fulltextpubmed· Main· item 42135531

An overreliance on text-only input not only increases the risk of diagnostic errors but also risks exacerbating disparities in access to telehealth13, presenting barriers for individuals with lower digital literacy and language proficiency. By contrast, instant messaging apps that allow users to send text, voice and video messages (or share photo, audio and video messages) are commonplace with billions of users. Although real-time video call is more common in telemedicine, such multimedia instant messaging platforms enable considerably richer communication than text-only conversation and have seen reported use as tools for remote clinical consultations in which doctors and patients can exchange multimodal medical data (laboratory results, videos, photographs and more)12,14.

fulltextpubmed· Main· item 42135531

The ability of LLMs to request, interpret and reason about multimodal medical data during clinical conversations has not been investigated or evaluated. To address this, we introduce multimodal AMIE, advancing the conversational diagnostic capabilities of the original system4 by integrating multimodal medical perception. We introduce a state-aware reasoning framework designed to orchestrate complex clinical conversation flows, which dynamically adapts responses based on intermediate model outputs, reflecting the evolving patient state and diagnostic uncertainty. This allows multimodal AMIE to strategically request, interpret and integrate multimodal artifacts, including skin images, ECGs and clinical documents, into the dialogue to refine diagnoses in a manner that emulates the diagnostic reasoning process of experienced clinicians in a telehealth setting. To enable rapid iteration and validation of design choices, we developed a simulation environment for generating realistic patient scenarios and conducting automated turn-by-turn dialogue assessments. Finally, to evaluate the performance of our finalized system, we conducted a randomized, blinded human evaluation that emulates an objective structured clinical examination (OSCE), a standardized assessment method used in both medical schools and physician licensing examinations worldwide. We compared multimodal AMIE to PCPs in multimodal text-based chats and evaluated their behaviors on 105 carefully designed multimodal clinical scenarios with validated patient-actors. We note, however, that our study is not a randomized clinical trial with prespecified endpoints and preregistered statistical analysis. Rather, it is an exploratory study investigating the properties of multimodal diagnostic dialogue. Furthermore, we designed a dedicated multimodal understanding and handling (MUH) rubric to rigorously compare AI and human performance in artifact interpretation over numerous dimensions. Figure 1 provides an overview of our system’s key components, our evaluation methodology and key findings.

fulltextpubmed· Main· item 42135531

Fig. 1. Overview of key contributions. This figure provides a schematic overview of the key components enabling and evaluating multimodal diagnostic conversations within the AMIE system, facilitated through a multimodal chat interface. a, Multimodal state-aware reasoning. Multimodal AMIE employs a novel state-aware dialogue phase transition framework built on the publicly available Gemini 2.0 Flash model. This dynamically controls the conversation flow through history-taking, diagnosis and management phases, guided by intermediate outputs reflecting patient state and diagnostic uncertainty, allowing strategic requests and interpretation of multimodal artifacts. b, Simulation environment. A comprehensive simulation framework enables rapid development and automated evaluation. This framework involves generating realistic patient scenarios grounded in real images and metadata, using Gemini 2.0 Flash, simulating turn-by-turn multimodal dialogues between multimodal AMIE and patient agents, and using an auto-rater agent for assessment against clinical criteria. c, Randomized comparative study. The performance of multimodal AMIE was evaluated against PCPs in a randomized, double-blind, OSCE-style study. Trained patient-actors conducted synchronous chat consultations based on 105 diverse multimodal scenarios involving artifacts such as skin photographs, ECGs and clinical documents. d, Evaluation results. Specialist evaluations demonstrated that multimodal AMIE attains similar or superior performance to PCPs in handling and reasoning about multimodal data over multiple evaluation axes alongside strong performance in diagnostic accuracy and overall consultation quality. UI, user interface.
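The phase-transition idea in panel a can be illustrated with a toy state machine. This is only a sketch under stated assumptions: the source does not describe the framework's internals, so the `DialogueState` fields, the scalar uncertainty score, the 0.3 threshold and the action names are all hypothetical stand-ins for the intermediate model outputs mentioned above.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Tuple


class Phase(Enum):
    HISTORY_TAKING = "history_taking"
    DIAGNOSIS = "diagnosis"
    MANAGEMENT = "management"


@dataclass(frozen=True)
class DialogueState:
    # Hypothetical state; the real system derives this from model outputs.
    phase: Phase = Phase.HISTORY_TAKING
    diagnostic_uncertainty: float = 1.0  # assumed scale: 1.0 = maximally uncertain
    artifact_requested: bool = False


def step(state: DialogueState, threshold: float = 0.3) -> Tuple[str, DialogueState]:
    """Pick the next action and the resulting state (threshold is an assumption)."""
    if state.phase is Phase.HISTORY_TAKING:
        if state.diagnostic_uncertainty > threshold:
            if not state.artifact_requested:
                # Still uncertain and no artifact yet: ask for one.
                return "request_artifact", replace(state, artifact_requested=True)
            # Still uncertain: keep gathering history.
            return "ask_follow_up_question", state
        # Uncertainty is low enough: move on to the diagnosis phase.
        return "present_differential", replace(state, phase=Phase.DIAGNOSIS)
    if state.phase is Phase.DIAGNOSIS:
        return "propose_management_plan", replace(state, phase=Phase.MANAGEMENT)
    return "answer_patient_questions", state
```

In this sketch the uncertainty score is updated outside `step`, for example after each patient reply, which keeps the transition rule itself easy to inspect.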

fulltextpubmed· Results· item 42135531

To evaluate the clinical utility of multimodal AMIE, we conducted a randomized, blinded OSCE-style study comparing the system against board-certified PCPs in multimodal text chat consultations. These findings indicate that, within the context of this randomized, blinded exploratory study, multimodal AMIE performance was similar to or higher than PCP performance across the assessed axes, as rated by specialists and patient-actors, including history-taking, diagnostic accuracy, management reasoning, communication skills, empathy and the understanding and handling of multimodal data. Subsequently, we report findings from our automated evaluation framework, using controlled ablation studies. (Here and throughout the text, the precise meaning of terms such as ‘matching’ or ‘exceeding’ that we use when comparing multimodal AMIE to PCPs, including the magnitude of difference and associated statistical analysis, is described in the corresponding results figures and Methods section.)

fulltextpubmed· Results· item 42135531

In our remote OSCE-style evaluation of simulated consultations using multimodal text chat, the performance of multimodal AMIE was rated as similar or superior to that of PCPs across all assessed axes of comparison. Below, we report performance on an objective measure of diagnostic accuracy and subjective but blinded measures of patient experience and other key attributes of effective clinical consultations rated by specialist physicians. A detailed statistical breakdown of the consultation characteristics, including dialogue length and verbosity by participant type and modality, is provided in Supplementary Section 2.1.

fulltextpubmed· Results· item 42135531

Multimodal AMIE achieved higher diagnostic accuracy compared to PCPs in the OSCE-style simulated clinical consultations that both multimodal AMIE and PCPs performed with patient-actors using multimodal text chat. As illustrated in Fig. 2a, which plots the top-k diagnostic accuracy, the differential diagnosis (DDx) of multimodal AMIE was both more accurate and more comprehensive.
Fig. 2. Comparison of OSCE metrics between PCPs and multimodal AMIE. a, Top-k DDx accuracy. Both multimodal AMIE and PCPs have the opportunity to submit a DDx list (with at least three and up to 10 plausible items, ordered by likelihood) for each dialogue. The top-k accuracy was computed using an auto-rater that compares each of the 10 diagnoses with the ground truth. Multimodal AMIE outperformed PCPs in diagnostic accuracy on average (P < 0.001, n = 210 conversations). b, Subgroup analysis of DDx accuracy. Top-3 accuracy was compared across subgroups within three axes as defined in the MUH rubric: (1) image quality (n = 204 conversations; 115 high quality, 89 low quality), (2) image-grounded reasoning (n = 204 conversations; 143 ‘Yes’, 61 ‘No’) and (3) hallucination of image findings in consultation (n = 204 conversations; 135 ‘No hallucination’, 56 ‘Hallucination’, 13 ‘Significant hallucination’) (see Extended Data Table 1 for details). c, Distributions of relative performance across patient scenarios. The proportion of scenarios for which PCPs were rated more favorably than multimodal AMIE and vice versa is presented for three categories in the OSCE rubric, namely MUH rubric (left), non-multimodal specialist metrics (middle) and non-multimodal patient-centric metrics (right). Each specialist and patient-actor separately rated two dialogues (one with PCP and the other with multimodal AMIE) for the same scenario, and their relative ranking was derived from their scores. For the specialist metrics, each scenario was evaluated by three different expert physicians, yielding three pairwise rankings that were aggregated by taking the majority vote.

fulltextpubmed· Results· item 42135531

The ‘Tie’ label was assigned to cases where two or more specialists assigned the same ratings to both PCP and multimodal AMIE consultations or the cases for which the opinions on the relative performance were equally divided. Significant differences between multimodal AMIE and PCP segments, calculated using a one-sided χ² test comparing only the proportions of preference (ignoring ties), are indicated by asterisks (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant). n = 210 conversations for all patient-centric metrics and n = 204 for all expert ratings. Extended Data Figs. 1 and 2 show the underlying distributions of ratings for the same set of metrics. For all panels, the error bars center on the mean and represent 95% confidence intervals estimated based on 10,000 bootstrap samples.
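The majority-vote aggregation and ‘Tie’ rule described in the caption can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function names, the `(pcp_score, amie_score)` tuple layout and the assumption that a higher score is more favorable are all hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def pairwise_ranking(pcp_score: float, amie_score: float) -> str:
    """Rank one specialist's pair of ratings (assumes higher = more favorable)."""
    if amie_score > pcp_score:
        return "AMIE"
    if pcp_score > amie_score:
        return "PCP"
    return "Tie"


def aggregate_scenario(ratings: Sequence[Tuple[float, float]]) -> str:
    """Aggregate (pcp_score, amie_score) pairs from the specialists of one scenario."""
    votes = Counter(pairwise_ranking(pcp, amie) for pcp, amie in ratings)
    # Two or more specialists gave identical ratings to both consultations.
    if votes["Tie"] >= 2:
        return "Tie"
    if votes["AMIE"] > votes["PCP"]:
        return "AMIE"
    if votes["PCP"] > votes["AMIE"]:
        return "PCP"
    # Opinions on relative performance were equally divided.
    return "Tie"
```

For example, ratings of `[(3, 4), (2, 5), (4, 3)]` give two votes for multimodal AMIE and one for the PCP, so the scenario is counted for multimodal AMIE; scenarios labeled ‘Tie’ would then be excluded before comparing preference proportions.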

fulltextpubmed· Results· item 42135531

Multimodal AMIE consistently outperformed PCPs in diagnostic accuracy across the full range of k (1–10 diagnoses). Multimodal AMIE’s leading diagnosis (top-1) and broader DDx list (top-2 to top-10) more frequently contained the ground truth condition compared to the lists generated by PCPs. This overall difference in diagnostic accuracy was statistically significant (P < 0.001, based on a mixed-effects model accounting for scenario difficulty; see ‘Statistical analysis’ section). We note that although we analyze accuracy up to top-10, neither multimodal AMIE nor PCPs always report 10 differential diagnoses.
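As a rough illustration of how top-k accuracy behaves when DDx lists are shorter than 10, the sketch below computes top-k accuracy and a bootstrap confidence interval. It makes several assumptions not stated in the source: the auto-rater's judgments are represented as one list of booleans per dialogue (in rank order), resampling is done over dialogues, and the interval is a simple percentile bootstrap.

```python
import random
from typing import List, Sequence, Tuple


def top_k_hit(matches: Sequence[bool], k: int) -> bool:
    """True if any of the first k ranked diagnoses matched the ground truth.

    Lists shorter than k are handled naturally: slicing stops at the list end.
    """
    return any(matches[:k])


def top_k_accuracy(dialogues: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of dialogues whose top-k list contains the ground truth."""
    return sum(top_k_hit(m, k) for m in dialogues) / len(dialogues)


def bootstrap_ci(
    dialogues: Sequence[Sequence[bool]],
    k: int,
    n_boot: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% interval from resampling dialogues with replacement."""
    rng = random.Random(seed)
    n = len(dialogues)
    stats: List[float] = sorted(
        top_k_accuracy([dialogues[rng.randrange(n)] for _ in range(n)], k)
        for _ in range(n_boot)
    )
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return lower, upper
```

Because a hit at rank 4 counts toward every k of 4 or more, top-k accuracy can only stay level or rise as k grows, even when some lists contain fewer than 10 items.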

fulltextpubmed· Results· item 42135531

This work advances from prior comparison of multimodal AMIE in simulated consultations by incorporating the ability for patients to upload multimodal artifacts into text chat. To understand the extent to which our results specifically reflected this new capability, and to explore the extent to which the comprehension and utilization of multimodal data contributed to the superior overall diagnostic accuracy of multimodal AMIE, we performed specific subgroup analyses using specialist physicians’ blinded consultation ratings (Fig. 2b). First, regarding the impact of image quality (Fig. 2b, left), when independent specialists rated the provided artifact quality as low, the top-3 DDx accuracy decreased for both multimodal AMIE and PCPs as anticipated. However, multimodal AMIE demonstrated a significantly smaller performance drop than that of PCPs. Second, regarding the impact of artifact use in reasoning (Fig. 2b, middle), when specialists judged that the consulting agent (multimodal AMIE or PCP) appropriately used the visual artifact, diagnostic accuracy improved similarly for both. Finally, we examined consultations where specialists identified instances of the clinician or multimodal AMIE reporting findings not actually present in the artifact, which we refer to here as ‘hallucination’ (Fig. 2b, right). Such misreporting negatively impacted diagnostic accuracy for both but significantly more so for PCPs compared to multimodal AMIE.

fulltextpubmed· Results· item 42135531

We also analyzed performance variation across the different types of multimodal data used in our OSCE scenarios: skin photographs (Skin Condition Image Network (SCIN)), ECG tracings (PTB-XL) and clinical documents (Fig. 3a). Here, we used a monotonic regression model on k (number of diagnoses in the differential used to compute accuracy), with additional effects estimated for experimental arm and modalities. Both multimodal AMIE and PCPs were more accurate for scenarios based on clinical documents (P < 0.001). Multimodal AMIE retained statistically significantly higher accuracy across all modalities. For the full statistical analysis details, see Supplementary Section 2.2.
Fig. 3. Variation in performance across modality types. a, The average top-k DDx accuracy is shown as a function of the size of the differentials (k) for three artifact types. Error bars center on the mean and represent 95% confidence intervals derived from 10,000 bootstrap samples. b, The panel shows the distributions of specialist physicians’ ratings (DDx appropriateness and management plan appropriateness) and patient-actors’ ratings (‘patient happy to return in future’); this analysis attempts to capture the overall user satisfaction of the consultation. The distributions of ratings are displayed separately for the sessions with PCPs and multimodal AMIE and for modality types. c, The distributions of scores for three MUH metrics are displayed. For illustration purposes, all responses from five-point rating scales were mapped to a generic five-point scale ranging from ‘Very favorable’ to ‘Very unfavorable’. For Yes/No questions as indicated by ‘(Y/N)’, a (positive) ‘Yes’ response was mapped to the same color as ‘Favorable’ and a (negative) ‘No’ response to the same color as ‘Unfavorable’. The ‘NA’ rating indicates responses that are ‘not applicable’, ‘cannot rate’ or ‘does not apply’. The McNemar test (two-sided) was used for each evaluation axis, and P values were adjusted using the FDR correction. Asterisks represent statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant).
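The per-axis testing procedure named in the caption (two-sided McNemar test followed by FDR adjustment) can be sketched in plain Python. Two choices here are assumptions, since the source does not specify them: the exact binomial form of the McNemar test is used rather than the chi-squared approximation, and the FDR correction is taken to be Benjamini-Hochberg. The inputs `b` and `c` are the two discordant-pair counts for one evaluation axis.

```python
from math import comb
from typing import List, Sequence


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Binomial(n, 0.5) probability of a split at least as extreme as min(b, c).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A typical use would compute one `mcnemar_exact` P value per evaluation axis, pass the whole list to `benjamini_hochberg`, and assign asterisks from the adjusted values.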

fulltextpubmed· Results· item 42135531

We compared multimodal AMIE and PCPs on the patient-centric quality metrics that were collected upon conclusion of the consultations. Patient-actors rated the interactions with multimodal AMIE as similar to or higher than those with board-certified PCPs across the assessed dimensions in Fig. 2c (right) (see Extended Data Fig. 1 for full ratings). This included aspects such as being polite, listening, explaining conditions, involving patients in decisions, appearing honest/trustworthy, building rapport, showing empathy and managing patient concerns (General Medical Council Patient Questionnaire (GMCPQ) and Practical Assessment of Clinical Examination Skills (PACES) criteria, P < 0.01 for all axes). We also asked patient-actors two questions specifically about the nature of multimodal interaction (Extended Data Table 1). The two questions, rated from 1 to 5, asked how well the doctor addressed the patients’ questions about image artifacts and how well the doctor explained the findings from the artifacts. Figure 2c (left) shows that multimodal AMIE was rated more highly than clinicians on both questions (P < 0.01).

fulltextpubmed· Results· item 42135531

A group of 18 specialists in dermatology, cardiology and general practice (internal medicine) evaluated multimodal AMIE and PCPs in terms of numerous attributes of effective diagnostic conversations, including multimodal reasoning, history-taking, diagnostic accuracy, management reasoning, communication skills and empathy. Aggregated results are presented in Fig. 2c (see Extended Data Fig. 2 for full ratings). To account for potential interdisciplinary variations, we also present results by specialty (Fig. 3). The conversations held by multimodal AMIE were consistently rated more highly by specialists overall and across three disciplines. This is reflected in all items in our assessment, across diagnosis and management and quality of history-taking. The positive assessments of the diagnostic dialogues that multimodal AMIE conducted were also supported by results in the MUH component of our assessment: specialists assigned higher ratings on average to multimodal AMIE’s interpretation of the multimodal artifacts, its reasoning about these artifacts and the way it handled patient questions and concerns about the multimodal artifacts.

fulltextpubmed· Results· item 42135531

This result is also reflected in all three disciplines studied. Multimodal AMIE provided more appropriate diagnoses and management plans, and patients were happier to return for a second visit with multimodal AMIE across our three fields (Fig. 3b; all P < 0.001). Similarly, in all three disciplines, multimodal AMIE interpreted artifacts and reasoned about them more accurately, with no increase in hallucination or mis-reporting (Fig. 3c; all P < 0.001, P < 0.05 and P > 0.05, respectively). This higher performance across all disciplines was broadly mirrored in the top-k accuracy (Fig. 3a), although the superiority was statistically significant only for clinical documents, potentially due to the smaller sample size. In Supplementary Section 4.2, we provide qualitative examples, drawn from the three disciplines, of dialogues between multimodal AMIE and a patient-actor and between a PCP and a patient-actor where multimodal AMIE has correctly identified the most probable diagnosis. For each discipline, we use the same patient scenario for the dialogue with multimodal AMIE and with the PCP, to facilitate a qualitative comparison.

fulltextpubmed· Results· item 42135531

Beyond the OSCE study, this section details automated evaluations of multimodal AMIE using simulated dialogues and auto-rating (detailed in the ‘Auto-rating simulated diagnostic conversations’ section and Fig. 4) to quantify diagnostic performance across different system configurations.

Fig. 4. Overview of multimodal dialogue simulation and the evaluation framework. This figure illustrates the three key components of the system. Step 1 (patient scenario generation): Comprehensive patient profiles are created, including condition, demographics, symptoms and medical history, using information derived from web searches and real-world datasets—for example, PTB-XL (for cardiology) and SCIN (for dermatology). These profiles are then used to generate detailed patient scenarios, outlining the patient’s presentation and expectations. Step 2 (dialogue simulation): A doctor agent and a patient agent engage in a text-based consultation with multimodal artifact upload. The doctor agent is instructed to provide empathetic and clinically accurate responses while the patient agent responds truthfully based on the generated scenario. Step 3 (auto-rating): An auto-rater agent evaluates the simulated dialogue based on predefined criteria, including management appropriateness, information-gathering effectiveness and the presence of hallucinations. Qualitative feedback is also provided by the auto-rater to explain the scores. Mx, management.

fulltextpubmed· Results· item 42135531

To further understand the contributions of different system components to overall performance, we conducted several ablation studies using our automated evaluation framework. This section details experiments comparing the full multimodal AMIE system against simpler baselines and evaluating the specific impact of test-time reasoning, history-taking and robustness to variations in patient presentation on conversation quality, as assessed by the auto-rater. One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach.

fulltextpubmed· Results· item 42135531

Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· Results· item 42135531

Across all datasets (PAD-UFES-20, SCIN, Clinical Documents and PTB-XL; see Fig. 5a, Extended Data Table 2 and Supplementary Section 3.4), multimodal AMIE with reasoning consistently outperformed the vanilla baseline on key diagnostic metrics. Specifically, Extended Data Table 2 (also see Supplementary Section 3.4) shows that multimodal AMIE with reasoning achieved higher DDx accuracy. Notable top-1 accuracy gains occurred on Clinical Documents (0.89 to 0.98), PAD-UFES-20 (0.75 to 0.84) and PTB-XL (0.20 to 0.28). Reasoning also improved information-gathering scores (for example, Clinical Documents: 4.00 to 4.17). Hallucination rates remained negligible in both settings, although they were sometimes higher with state-aware reasoning. Management plan appropriateness also improved (for example, SCIN: 3.69 to 3.89).

Fig. 5. Auto-rater evaluation to quantify the value of state-aware reasoning and dialogue-based interaction. a, State-aware reasoning ablation compares the performance of multimodal AMIE with its state-aware reasoning against a baseline without it, using the dialogue simulation environment across four datasets: skin photographs (SCIN, n = 885), skin photographs (PAD-UFES-20, n = 1,290), ECG traces (PTB-XL, n = 2,724) and Clinical Documents (created by our clinical providers, n = 138). The metrics include top-k DDx accuracy, management plan appropriateness (scaled), quality of information gathering (scaled) and non-hallucination rate (the proportion of cases without any hallucination). b, Dialogue ablation aims to quantify the value of dialogue-based interaction by comparing the top-k DDx accuracy of the baseline LLM that infers the differentials directly from the image artifacts (denoted by ‘Images-only’) against that of the multimodal AMIE system, which leverages the combination of provided images and conversational history (‘Images + Dialogue (with multimodal AMIE)’). Error bars center on the mean and represent 95% confidence intervals derived from 104 bootstrap samples. Significance markers denote P values obtained from a two-sided Mann−Whitney U-test comparing the two conditions within each panel/metric.

fulltextpubmed· Results· item 42135531

Asterisks represent statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001; NS, not significant). Mx plan, management plan.
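The Fig. 5 caption reports error bars as 95% confidence intervals around the mean, derived from bootstrap samples. A minimal sketch of one common way to compute such an interval, the percentile bootstrap, follows. The caption does not state which bootstrap variant was used, so the percentile method is an assumption, and the function name, default resample count and example data are hypothetical.

```python
import random


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean.

    Returns (mean, lower, upper). Each resample draws len(values) items
    with replacement and records the resample mean.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    means = []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return sum(values) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-case top-1 correctness (1.0 = correct, 0.0 = incorrect).
correct = [1.0] * 70 + [0.0] * 30
mean, lower, upper = bootstrap_ci(correct)
print(f"top-1 accuracy = {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Resampling whole cases rather than individual ratings keeps each case's outcome intact, which matters when accuracy is a per-case binary value.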

fulltextpubmed· Results· item 42135531

To understand the additional performance gained with history-taking versus analyzing multimodal artifacts alone, we conducted an ablation comparing two settings: (1) ‘Images-only’, where the base model diagnosed from images without dialogue, omitting history-taking, and (2) ‘Images + Dialogue’, where multimodal AMIE used images, full dialogues and its state-aware reasoning at inference. Results are shown in Fig. 5b. The ‘Images-only’ setting yielded markedly lower accuracy; for instance, top-1 accuracy on PAD-UFES-20 dropped significantly without dialogue. Conversely, the ‘Images + Dialogue’ setting consistently improved performance across all datasets (SCIN, PAD-UFES-20, PTB-XL and Clinical Documents).

fulltextpubmed· Results· item 42135531

To evaluate system robustness against variations in patient presentation, we used LLM-driven augmentation to introduce variations to the original patient scenarios across three axes: personality styles (affecting interaction dynamics), demographics (testing fairness) and semantic changes to background/symptoms (evaluating resilience to minor history variations). The results, detailed in Extended Data Fig. 3 (more details in Supplementary Section 3.5), demonstrate that the performance of multimodal AMIE across key auto-rater metrics—including diagnostic accuracy, information gathering, hallucination rate and management plan appropriateness—remained highly consistent between the original and augmented scenarios.

fulltextpubmed· Results· item 42135531

In addition to inference-time strategies, we also explored the potential of training-time adaptations through domain-specific supervised fine-tuning (SFT) to enhance model performance. We conducted experiments comparing our base model to a version fine-tuned on specialized medical dialogue and Q&A datasets. Although these experiments revealed that SFT could yield statistically significant improvements in diagnostic accuracy for highly specialized tasks such as ECG analysis, this came at the cost of a notable decrease in performance on other crucial aspects of the consultation, including management plan appropriateness (Extended Data Fig. 4). Additional details on these critical tradeoffs are presented in Supplementary Section 3.6.

fulltextpubmed· Comparison of multimodal AMIE and PCPs on OSCE assessment· item 42135531

In our remote OSCE-style evaluation of simulated consultations using multimodal text chat, the performance of multimodal AMIE was rated as similar or superior to that of PCPs across all assessed axes of comparison. Below, we report performance on an objective measure of diagnostic accuracy and subjective but blinded measures of patient experience and other key attributes of effective clinical consultations rated by specialist physicians. A detailed statistical breakdown of the consultation characteristics, including dialogue length and verbosity by participant type and modality, is provided in Supplementary Section 2.1.

fulltextpubmed· Diagnostic accuracy of multimodal AMIE and PCPs· item 42135531

Multimodal AMIE achieved higher diagnostic accuracy compared to PCPs in the OSCE-style simulated clinical consultations that both multimodal AMIE and PCPs performed with patient-actors using multimodal text chat. As illustrated in Fig. 2a, which plots the top-k diagnostic accuracy, the differential diagnosis (DDx) of multimodal AMIE was both more accurate and more comprehensive.

Fig. 2. Comparison of OSCE metrics between PCPs and multimodal AMIE. a, Top-k DDx accuracy. Both multimodal AMIE and PCPs have the opportunity to submit a DDx list (with at least three and up to 10 plausible items, ordered by likelihood) for each dialogue. The top-k accuracy was computed using an auto-rater that compares each of the up to 10 diagnoses with the ground truth. Multimodal AMIE outperformed PCPs in diagnostic accuracy on average (P < 0.001, n = 210 conversations). b, Subgroup analysis of DDx accuracy. Top-3 accuracy was compared across subgroups within three axes as defined in the MUH rubric: (1) image quality (n = 204 conversations; 115 high quality, 89 low quality), (2) image-grounded reasoning (n = 204 conversations; 143 ‘Yes’, 61 ‘No’) and (3) hallucination of image findings in consultation (n = 204 conversations; 135 ‘No hallucination’, 56 ‘Hallucination’, 13 ‘Significant hallucination’) (see Extended Data Table 1 for details). c, Distributions of relative performance across patient scenarios. The proportion of scenarios for which PCPs were rated more favorably than multimodal AMIE and vice versa for three categories in the OSCE rubric are presented—namely, MUH rubric (left), non-multimodal specialist metrics (middle) and non-multimodal patient-centric metrics (right). Each specialist and patient-actor separately rated two dialogues (one with PCP and the other with multimodal AMIE) for the same scenario, and their relative ranking was derived from their scores. For the specialist metrics, each scenario was evaluated by three different expert physicians, yielding three pairwise rankings that were aggregated by taking the majority vote.
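Two computations named in the Fig. 2 caption, top-k DDx accuracy and majority-vote aggregation of three pairwise rankings, can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline. In the study an auto-rater judges whether a listed diagnosis matches the ground truth; here a case-insensitive string comparison stands in for it. The caption does not say how a three-way split among raters is resolved, so returning "tie" in that case is an assumption. All names and example diagnoses are hypothetical.

```python
from collections import Counter


def simple_match(candidate: str, truth: str) -> bool:
    """Stand-in for the auto-rater: case-insensitive exact match."""
    return candidate.strip().lower() == truth.strip().lower()


def top_k_accuracy(ddx_lists, truths, k, is_match=simple_match):
    """Fraction of cases whose ground truth appears in the first k DDx items."""
    hits = sum(
        any(is_match(dx, truth) for dx in ddx[:k])
        for ddx, truth in zip(ddx_lists, truths)
    )
    return hits / len(truths)


def majority_preference(rankings):
    """Aggregate per-rater preferences ('AMIE', 'PCP' or 'tie') by majority.

    Returns 'tie' when no label wins more than half of the votes
    (an assumption; the source does not describe this case).
    """
    label, count = Counter(rankings).most_common(1)[0]
    return label if count > len(rankings) / 2 else "tie"


# Hypothetical DDx lists (ordered by likelihood) and ground-truth labels.
ddx_lists = [
    ["eczema", "psoriasis", "tinea corporis"],
    ["atrial fibrillation", "sinus tachycardia"],
    ["melanoma"],
]
truths = ["psoriasis", "Atrial Fibrillation", "basal cell carcinoma"]

print(top_k_accuracy(ddx_lists, truths, k=1))
print(top_k_accuracy(ddx_lists, truths, k=3))
print(majority_preference(["AMIE", "AMIE", "PCP"]))
```

Passing the matcher in as a parameter keeps the accuracy calculation separate from the judgment of clinical equivalence, which is the part an auto-rater or a clinician would supply.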

fulltextpubmed· Conversation quality rated by patient-actors· item 42135531

We compared multimodal AMIE and PCPs on the patient-centric quality metrics that were collected upon conclusion of the consultations. Patient-actors rated the interactions with multimodal AMIE as similar to or higher than those with board-certified PCPs across the assessed dimensions in Fig. 2c (right) (see Extended Data Fig. 1 for full ratings). This included aspects such as being polite, listening, explaining conditions, involving patients in decisions, appearing honest/trustworthy, building rapport, showing empathy and managing patient concerns (General Medical Council Patient Questionnaire (GMCPQ) and Practical Assessment of Clinical Examination Skills (PACES) criteria, P < 0.01 for all axes). We also asked patient-actors two questions specifically about the nature of multimodal interaction (Extended Data Table 1). The two questions, rated from 1 to 5, asked how well the doctor addressed the patients’ questions about image artifacts and how well the doctor explained the findings from the artifacts. Figure 2c (left) shows that multimodal AMIE was rated more highly than clinicians on both questions (P < 0.01).

fulltextpubmed· Specialist evaluation of the performance and robustness of multimodal AMIE· item 42135531

A group of 18 specialists in dermatology, cardiology and general practice (internal medicine) evaluated multimodal AMIE and PCPs in terms of numerous attributes of effective diagnostic conversations, including multimodal reasoning, history-taking, diagnostic accuracy, management reasoning, communication skills and empathy. Aggregated results are presented in Fig. 2c (see Extended Data Fig. 2 for full ratings). To account for potential interdisciplinary variations, we also present results by specialty (Fig. 3). The conversations held by multimodal AMIE were consistently rated more highly by specialists overall and across three disciplines. This is reflected in all items in our assessment, across diagnosis and management and quality of history-taking. The positive assessments of the diagnostic dialogues that multimodal AMIE conducted were also supported by results in the MUH component of our assessment: specialists assigned higher ratings on average to multimodal AMIE’s interpretation of the multimodal artifacts, its reasoning about these artifacts and the way it handled patient questions and concerns about the multimodal artifacts.

fulltextpubmed· LLM-as-a-judge for automated evaluations· item 42135531

Beyond the OSCE study, this section details automated evaluations of multimodal AMIE using simulated dialogues and auto-rating (detailed in the ‘Auto-rating simulated diagnostic conversations’ section and Fig. 4) to quantify diagnostic performance across different system configurations.

Fig. 4. Overview of multimodal dialogue simulation and the evaluation framework. This figure illustrates the three key components of the system. Step 1 (patient scenario generation): Comprehensive patient profiles are created, including condition, demographics, symptoms and medical history, using information derived from web searches and real-world datasets—for example, PTB-XL (for cardiology) and SCIN (for dermatology). These profiles are then used to generate detailed patient scenarios, outlining the patient’s presentation and expectations. Step 2 (dialogue simulation): A doctor agent and a patient agent engage in a text-based consultation with multimodal artifact upload. The doctor agent is instructed to provide empathetic and clinically accurate responses while the patient agent responds truthfully based on the generated scenario. Step 3 (auto-rating): An auto-rater agent evaluates the simulated dialogue based on predefined criteria, including management appropriateness, information-gathering effectiveness and the presence of hallucinations. Qualitative feedback is also provided by the auto-rater to explain the scores. Mx, management.

fulltextpubmed· Ablation studies on state-aware reasoning· item 42135531

To further understand the contributions of different system components to overall performance, we conducted several ablation studies using our automated evaluation framework. This section details experiments comparing the full multimodal AMIE system against simpler baselines and evaluating the specific impact of test-time reasoning, history-taking and robustness to variations in patient presentation on conversation quality, as assessed by the auto-rater. One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach. Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· State-aware reasoning improves dialogue quality· item 42135531

One of the key elements of our system is the state-aware dialogue phase transition framework (detailed in the ‘Multimodal state-aware reasoning’ section). To assess its contribution, we compared its diagnostic performance and dialogue quality against a baseline model lacking this explicit reasoning structure. We hypothesized that the dynamic phase transitions and uncertainty-driven questioning employed by our framework would lead to more accurate diagnoses compared to a simpler approach. Using our simulation environment and auto-rater, we evaluated conversations generated for synthetic scenarios derived from PAD-UFES-20, SCIN, PTB-XL and ClinicalDoc-QA datasets. We compared two conditions: (1) multimodal AMIE with the full test-time reasoning framework and (2) a ‘vanilla’ baseline using the same Gemini 2.0 Flash model but without the explicit state-aware reasoning framework, relying only on domain-specific instructions and lacking the dynamic phase transitions and uncertainty-based questioning components.

fulltextpubmed· History-taking improves diagnostic accuracy· item 42135531

To understand the additional performance gained with history-taking versus analyzing multimodal artifacts alone, we conducted an ablation comparing two settings: (1) ‘Images-only’, where the base model diagnosed from images without dialogue, omitting history-taking, and (2) ‘Images + Dialogue’, where multimodal AMIE used images, full dialogues and its state-aware reasoning at inference. Results are shown in Fig. 5b. The ‘Images-only’ setting yielded markedly lower accuracy; for instance, top-1 accuracy on PAD-UFES-20 dropped significantly without dialogue. Conversely, the ‘Images + Dialogue’ setting consistently improved performance across all datasets (SCIN, PAD-UFES-20, PTB-XL and Clinical Documents).

fulltextpubmed· Robust diagnostic conversation with patient scenario augmentations· item 42135531

To evaluate system robustness against variations in patient presentation, we used LLM-driven augmentation to introduce variations to the original patient scenarios across three axes: personality styles (affecting interaction dynamics), demographics (testing fairness) and semantic changes to background/symptoms (evaluating resilience to minor history variations). The results, detailed in Extended Data Fig. 3 (more details in Supplementary Section 3.5), demonstrate that the performance of multimodal AMIE across key auto-rater metrics—including diagnostic accuracy, information gathering, hallucination rate and management plan appropriateness—remained highly consistent between the original and augmented scenarios.

fulltextpubmed· Tradeoffs of domain-specific supervised fine-tuning for base model· item 42135531

In addition to inference-time strategies, we also explored the potential of training-time adaptations through domain-specific supervised fine-tuning (SFT) to enhance model performance. We conducted experiments comparing our base model to a version fine-tuned on specialized medical dialogue and Q&A datasets. Although these experiments revealed that SFT could yield statistically significant improvements in diagnostic accuracy for highly specialized tasks such as ECG analysis, this came at the cost of a notable decrease in performance on other crucial aspects of the consultation, including management plan appropriateness (Extended Data Fig. 4). Additional details on these critical tradeoffs are presented in Supplementary Section 3.6.

fulltextpubmed· Discussion· item 42135531

In this work, as illustrated in Fig. 1, we sought to address a key unmet need for medical AI systems to be able to understand, reason about and appropriately use information from multimodal medical data in the course of diagnostic conversations with patients. To contextualize these capabilities, it is important to distinguish domain-specific systems such as multimodal AMIE from general-purpose multiagent frameworks such as AutoGen15 and Magentic-One16. Although those offer broad flexibility, AMIE is explicitly engineered for the safety-critical reasoning required in medical dialogue. This specialization addresses the field’s shift from static benchmarks toward dynamic, process-oriented evaluations of the complete patient journey (for example, MedR-Bench17 and Agent Hospital18). By combining artifact perception with dynamic reasoning in standard chat interfaces—already ubiquitous in global remote care19—AMIE bridges a critical gap toward real-world clinical deployment. Integrating multimodal perception into clinical dialogue, our study demonstrates that adding these reasoning capabilities into multimodal AMIE maintained the high standards of diagnostic accuracy and conversational quality previously established in text-only settings4. Across the multimodal OSCE evaluations, multimodal AMIE achieved diagnostic accuracy and core conversational metrics—such as empathy, clarity and history-taking—that were higher than those of PCPs, including in scenarios requiring interpretation of images or documents.

fulltextpubmed· Discussion· item 42135531

Although foundation models such as Gemini 2.0 provide multimodal capabilities (see ‘Validation of base model perception of medical artifacts’ section), a key challenge is effectively weaving artifact interpretation into the flow of clinical dialogue without degrading the interaction. Multimodal AMIE achieved this via our state-aware reasoning framework (Fig. 6 and ‘Multimodal state-aware reasoning’ section), enabling the system to handle multimodal data mid-interaction while preserving clinically sound, empathetic dialogue. This capability is essential for real-world telehealth where multimodal data exchange is routine13,20. The modalities tested reflect the expanding breadth of data present in telehealth; for instance, clinical photographs and reports are readily accessible to consumers. Additionally, ECGs are common in primary care21, but they are increasingly available on consumer-grade wearable devices, which enable recording of both single-lead and, with guidance, multi-lead configurations22.

Fig. 6. Multimodal state-aware reasoning. Illustrated here is multimodal AMIE’s state-aware dialogue phase transition framework, which structures the diagnostic conversation. The system progresses through three distinct phases, each with a specific goal: (1) History-taking (gathering comprehensive patient information), (2) (Differential) Diagnosis & management (formulating and presenting DDx and management plan) and (3) Answer follow-up questions (addressing remaining concerns). Within each phase, multimodal AMIE maintains an internal state—its dynamic understanding of the patient’s situation, evolving diagnoses (DDx) and knowledge gaps—derived from the dialogue history and any multimodal inputs. This state guides specific actions, such as asking targeted questions, requesting multimodal data (if needed), generating internal summaries and differential diagnoses or providing explanations. Transitions between phases are triggered automatically when the system assesses that the objectives of the current phase (for example, sufficient information gathered and DDx presented) have been met, based on its internal state evaluation.

fulltextpubmed· Discussion· item 42135531

This mechanism enforces a structured yet flexible dialogue flow, inspired by the methodical approach of experienced clinicians.
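The three-phase structure described in the Fig. 6 caption can be pictured as a small state machine. The sketch below is a loose illustration only. In the described system the model itself assesses whether a phase's objectives are met from its internal state; here simple boolean checks stand in for that assessment. The class, field and function names are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = 1
    DIAGNOSIS_AND_MANAGEMENT = 2
    FOLLOW_UP = 3


@dataclass
class DialogueState:
    """Hypothetical internal state tracked across the conversation."""
    phase: Phase = Phase.HISTORY_TAKING
    knowledge_gaps: list = field(default_factory=list)  # open questions
    ddx: list = field(default_factory=list)             # evolving differential
    ddx_presented: bool = False


def objectives_met(state: DialogueState) -> bool:
    """Placeholder for the system's own judgment of phase completion."""
    if state.phase is Phase.HISTORY_TAKING:
        # Enough information gathered: a differential exists, no gaps remain.
        return bool(state.ddx) and not state.knowledge_gaps
    if state.phase is Phase.DIAGNOSIS_AND_MANAGEMENT:
        return state.ddx_presented
    return False  # the follow-up phase is terminal in this sketch


def advance(state: DialogueState) -> Phase:
    """Move to the next phase only when the current objectives are met."""
    if objectives_met(state):
        state.phase = Phase(state.phase.value + 1)
    return state.phase


state = DialogueState(knowledge_gaps=["onset and duration of rash"])
print(advance(state))  # a gap remains, so the phase does not change
state.knowledge_gaps.clear()
state.ddx = ["contact dermatitis", "eczema"]
print(advance(state))  # objectives met, so the dialogue moves on
```

Keeping the completion check separate from the transition makes it possible to swap the boolean placeholder for a model-based assessment without changing the control flow.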

fulltextpubmed· Discussion· item 42135531

Notably, multimodal AMIE demonstrated proficiency in reasoning about patient-provided multimodal artifacts, as indicated by specialist ratings. Assessed blinded using our MUH rubric, specialist physicians gave multimodal AMIE higher scores compared to PCPs on axes encompassing artifact engagement, interpretation and artifact-grounded reasoning.

fulltextpubmed· Discussion· item 42135531

Moreover, patient-actors rated multimodal AMIE significantly higher than PCPs in addressing questions regarding artifacts and explaining findings derived from visual evidence. Although caution is warranted when generalizing to broader remote care, given that clinician performance might be constrained by the text chat modality versus video, multimodal AMIE was rated higher in these communicative tasks. This suggests that patients valued multimodal AMIE’s systematic approach to verbalizing visual findings, potentially more than the implicit assessments offered by clinicians. In one such case, after the patient sends images, multimodal AMIE summarizes the images and relates findings to the conversation history. By contrast, the PCP sometimes did not directly address the shared artifacts (see Supplementary Information for dialogue examples). Explicitly addressing images, as multimodal AMIE did, may provide reassurance that the images were transmitted correctly and that the doctor reached a conclusion based on specific evidence. We regard this explicit acknowledgment as vital for multimodal interaction, as patients expect explanations referring to the visual evidence provided.

fulltextpubmed· Discussion· item 42135531

The scenarios were designed to ensure that successful diagnosis required appropriate use of multimodal information. Subgroup analyses supported the validity of this design. Both multimodal AMIE and PCP diagnostic accuracy dropped in cases where image quality was deemed inferior; if performance remained high irrespective of image quality, it would suggest that the consultation did not depend on the visual upload. Instead, the parallel drop in performance confirms that multimodal inputs were critical to the diagnostic process, validating these scenarios as effective tests of multimodal reasoning. Further support was seen in how diagnostic accuracy improved for both multimodal AMIE and PCPs when they were rated as having appropriately used visual information. These findings lend credence to multimodal AMIE’s potential robustness to lower-quality images and ability to overcome intermediate misinterpretations.

fulltextpubmed· Discussion· item 42135531

Ensuring robustness, reliability and equity is essential for responsible diagnostic AI, although achieving this requires extensive further study. Our study includes extensive dialogue-level error analysis through our rubrics (for example, artifact-grounded reasoning, hallucination checks and DDx appropriateness), but we acknowledge that the absence of a large-scale, fine-grained analysis of the model’s internal reasoning states is a limitation of our present work. Our evaluations also undertook initial steps to explore robustness. Automated evaluations using LLM-driven augmentations showed that the core metrics of multimodal AMIE remained largely stable with respect to variations in simulated patient personality, demographics and minor semantic changes. Specialist evaluations further examined factors influencing accuracy. Notably, multimodal AMIE demonstrated greater robustness in diagnostic accuracy against specialist-rated image quality than PCPs. Although these preliminary findings regarding stability are encouraging, further research is necessary to ensure reliable performance across diverse populations. Finally, we acknowledge that although the clinical documents and in-house generated ECG trace images were private, the skin images and underlying raw ECG signals were publicly available and may have been encountered during pretraining. We sought to mitigate this risk by integrating these artifacts into unique, synthetic patient scenarios; notably, our perception test results suggest that the influence of data leakage on performance is limited. Nevertheless, validating on fully private or prospectively collected datasets is one of the key directions for future research.

fulltextpubmed· Discussion· item 42135531

Considering the role of model specialization, adapting foundation models like Gemini23 for medicine involves strategies ranging from training-time adaptations such as SFT24 to inference-time approaches25. Training-time specialization can boost performance on specific tasks (for example, ECG analysis) that are underrepresented in pretraining26. However, this requires substantial resources and carries the risk of catastrophic forgetting, potentially degrading general conversational competence. To investigate this, we compared fine-tuned Gemini 2.0 Flash with the original model. Although experiments displayed potential gains on specific tasks, they highlighted performance degradation on other aspects, such as management plan appropriateness.

fulltextpubmed· Discussion· item 42135531

Consequently, we focused on leveraging a strong, general-purpose base model enhanced with a domain-specific inference-time strategy. We selected Gemini 2.0 Flash for its higher out-of-the-box performance in multimodal understanding and conversational fluency compared to other variants. This strong baseline provided a robust foundation for the state-aware reasoning framework (Fig. 6). Although training-time specialization remains an area for future investigation, our findings highlight the effectiveness of inference-time enhancements, a strategy also proven successful for management reasoning in related work5. This state-based approach builds on human-centered research showing that user-clinician conversations unfold in phases8. This aligns with clinician feedback preferring thorough information gathering to avoid premature assessments. However, a limitation of this structured approach is potential rigidity when critical information (for example, allergy and contraindication) emerges late in the dialogue, suggesting a need for more fluid state transitions.

fulltextpubmed· Discussion· item 42135531

Regarding the limitations of chat-based interactions, our evaluation used a synchronous chat interface, mirroring ubiquitous instant messaging often used for medical conversation12,19. Although accessible, this modality presents limitations compared to video or in-person visits. Chat restricts non-verbal cues, limits dynamic visual assessments and precludes physical examinations, all of which provide crucial diagnostic information. Furthermore, video calls may foster stronger rapport and trust27,28. Consequently, the scope of conditions and depth of assessment in this format are constrained. Additionally, the potential for unblinding due to systematic stylistic differences between multimodal AMIE and PCPs poses a challenge for comparative studies. These limitations must be considered when interpreting findings.

fulltextpubmed· Discussion· item 42135531

Highlighting the importance of real-world validation and future directions, although our findings demonstrate the potential of multimodal AMIE in simulated consultations, considerable research is essential before clinical translation. Our study is not a randomized clinical trial but, rather, represents an exploratory investigation into the properties of multimodal diagnostic dialogue. Before real-world clinical deployment, it will be important in future work to conduct a full randomized clinical trial, conforming with full CONSORT criteria, including prespecified endpoints and preregistered statistical analysis conducted by an independent contract research organization. Future studies must rigorously evaluate the performance, safety and reliability of multimodal AMIE under the complexities of actual healthcare delivery, including its impact on workflows and health equity. Furthermore, we acknowledge that comprehensive clinical care often relies on multimodal clinical data beyond those currently covered in this work (for example, ECGs, skin images and scanned clinical documents). Extending this framework to support additional modalities, such as radiology or pathology imaging, is a valuable direction for future research to broaden the system’s diagnostic scope. We emphasize that multimodal AMIE remains an evolving research system.

fulltextpubmed· Discussion· item 42135531

In conclusion, this work advances AMIE to dynamically integrate multimodal reasoning within diagnostic conversations. Achieved through a novel state-aware inference-time reasoning technique using Gemini 2.0 Flash, the performance of multimodal AMIE within the OSCE study matched or surpassed PCPs across diagnostic accuracy and consultation quality as rated by specialists and patient-actors. Although further research is essential before real-world translation, these findings demonstrate progress toward AI systems that are capable of comprehensive clinical interactions.