Browse the corpus
Walk the Even Hospital Database by book and chapter — the raw source passages that ground Ask, DDx, and the rest.
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications.
BACKGROUND AND OBJECTIVES: General purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting.
METHODS: We compiled 23 984 articles from Neurosurgery Publications journals, yielding 78 853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263 064 training samples across 3 formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter Large Language and Visual Assistant-Next model. In a blinded, randomized deployment trial at NYU Langone Health (August 30-November 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or a Health Insurance Portability and Accountability Act-compliant GPT-4o end point as a diagnostic copilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy, assessed through user ratings and presence of the correct diagnosis within the VLM-provided differential, respectively.
RESULTS: CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, P = .235), but only achieved 46.81% accuracy on human-generated questions vs GPT-4o's 65.70% (P < 10⁻¹⁵). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases vs 57.89% for GPT-4o (P = .230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, P = .626).
CONCLUSION: Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized artificial intelligence models. However, low clinical utilization suggests that chatbot interfaces may not align with specialist workflows, indicating a need for alternative artificial intelligence integration strategies.
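The trial percentages in the RESULTS can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the abstract reports group sizes and percentages but not per-arm case counts, so the counts it derives are inferences (the nearest whole number consistent with each percentage), and all function and variable names are hypothetical.

```python
# Figures taken from the abstract: group sizes and reported percentages.
TOTAL_CONSULTS = 959
EVALUATED = {"CNS-Obsidian": 32, "GPT-4o": 38}
REPORTED_PCT = {
    "positive_rating": {"CNS-Obsidian": 40.62, "GPT-4o": 57.89},
    "correct_in_differential": {"CNS-Obsidian": 59.38, "GPT-4o": 65.79},
}


def utilization_pct(evaluated: int, total: int) -> float:
    """Share of all consults in which a model was actually used, in percent."""
    return 100.0 * evaluated / total


def implied_count(pct: float, n: int) -> int:
    """Nearest whole number of cases consistent with `pct` of `n`.

    This is an inference: the abstract does not state these counts.
    """
    return round(pct * n / 100.0)


def implied_counts() -> dict:
    """Inferred case counts for every outcome and trial arm."""
    return {
        outcome: {arm: implied_count(pct, EVALUATED[arm]) for arm, pct in arms.items()}
        for outcome, arms in REPORTED_PCT.items()
    }


if __name__ == "__main__":
    evaluated_total = sum(EVALUATED.values())
    print(f"utilization: {utilization_pct(evaluated_total, TOTAL_CONSULTS):.1f}%")
    for outcome, arms in implied_counts().items():
        for arm, count in arms.items():
            n = EVALUATED[arm]
            print(f"{outcome} / {arm}: ~{count}/{n} = {100.0 * count / n:.2f}%")
```

Each inferred count reproduces its reported percentage to within rounding, which suggests the percentages and group sizes in the abstract are internally consistent.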