Young Jin Kim’s Post


Principal Researcher at Microsoft GenAI

Look how strong #Phi-3.5-MoE is! On a completely unseen dataset (it did not exist at training time), covering low-resource multilingual queries in a new healthcare domain, it holds its ground against much larger and proprietary models. A new paper from MSR India presents an extensive evaluation of language models for multilingual healthcare scenarios. https://lnkd.in/gNSX3gj8

Sunayana Sitaram

Principal Researcher at Microsoft Research India, Writing Assistance and Language Intelligence (WALI) - making AI more inclusive for everyone on the planet

🎉 New Pariksha alert! 🎊 I am so proud to share our latest work, Health Pariksha, an extensive assessment of 24 LLMs, examining their performance on data collected from Indian patients interacting with a medical chatbot in Indian English and four other Indic languages. This work was done in collaboration with Varun Gumma, Mohit Jain, Ananditha Raghunath and Karya (human annotation).

Highlights of our work:
- Multilingual evaluation: The study evaluates LLM responses to 750 questions posed by patients using a medical chatbot, covering five languages: Indian English, Hindi, Kannada, Tamil, and Telugu. Our dataset is unique, containing code-mixed queries such as "Agar operation ke baad pain ho raha hai, to kya karna hai?" and "Can I eat before the kanna operation", and culturally relevant queries such as "Can I eat chapati/puri/non veg after surgery?".
- Responses validated by doctors: We used doctor-validated responses as the ground truth for evaluating model responses.
- Uniform RAG framework: All models were assessed using a uniform Retrieval Augmented Generation (RAG) framework, ensuring a consistent and fair comparison.
- Uncontaminated dataset: The dataset is absent from the training data of the evaluated models, providing a reliable basis for assessment.
- Specialized metrics: The evaluation was based on four metrics - factual correctness, semantic similarity, coherence, and conciseness - plus a combined overall metric, chosen in consultation with domain experts and doctors. Both automated techniques and human evaluators were employed to ensure comprehensive assessment.

Key findings:
- Performance variability: The study finds significant performance variability among models, with some smaller models outperforming larger ones.
- Language-specific performance: Indic models do not consistently perform well on Indic language queries, and factual correctness is generally lower for non-English queries. This shows that there is still work to be done to build models that can answer questions reliably in Indian languages.
- Locally grounded, non-translated datasets: Our dataset includes many instances of code-switching, Indian English colloquialisms, and culturally specific questions that cannot be obtained by translating existing datasets, particularly with automated translation. While models handled code-switching to a certain extent, responses to culturally relevant questions varied greatly. This underscores the importance of collecting datasets from target populations when building solutions.

Check out the rest of the leaderboards in our paper (link in comments).
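For readers curious how a per-metric evaluation like this rolls up into one leaderboard number: the paper's exact aggregation isn't given in this post, but a minimal sketch might combine the four axes (factual correctness, semantic similarity, coherence, conciseness) into a weighted overall score. The function name and the equal default weighting below are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of rolling per-metric ratings into an overall score.
# Metric names follow the post; equal weighting is an assumption.

METRICS = ("factual_correctness", "semantic_similarity", "coherence", "conciseness")

def overall_score(scores, weights=None):
    """Weighted mean of per-metric scores, each assumed to lie in [0, 1]."""
    if weights is None:
        weights = {m: 1.0 for m in METRICS}  # equal weights by default
    total = sum(weights[m] for m in METRICS)
    return sum(scores[m] * weights[m] for m in METRICS) / total

# Example: one model response rated on the four axes.
response_scores = {
    "factual_correctness": 0.9,
    "semantic_similarity": 0.8,
    "coherence": 1.0,
    "conciseness": 0.7,
}
print(round(overall_score(response_scores), 3))  # 0.85
```

Weighting factual correctness more heavily would be a natural variant in a healthcare setting, where a fluent but wrong answer is worse than a terse correct one.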

Ankit S.

LLM Arch Assoc Director and Tech Lead @Accenture | Ph.D. CMU LTI | Deep Learning | Machine Learning | AI | AGI

2mo

Amazing!

