I am currently a first-year PhD student at the National University of Singapore, supervised by Prof. Mengling Feng.
Reviewer for NeurIPS, IJCAI, and ACM TIST, with over 10 papers reviewed.
Research Interests: Trustworthy and Agentic LLMs • Multi-modal Intelligence in Healthcare
Looking for a summer '26 internship! 🙏

Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng
Conference on Empirical Methods in Natural Language Processing (EMNLP) 2025
Detecting LLM-generated text in specialized, high-stakes domains such as medicine and law is crucial for combating misinformation and ensuring authenticity. We propose DivScore, a zero-shot detection framework that uses normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. Experiments show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall at a 0.1% false-positive-rate threshold.

Zhihui Chen, et al.
arXiv preprint 2025
Recent advances in multimodal large language models have enabled remarkable medical image editing capabilities. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built specifically for medical image editing under strict anatomical and clinical constraints. We introduce Med-Banana-50K, a comprehensive 50K-image dataset for instruction-based medical image editing spanning three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types. The dataset is constructed by leveraging Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images. What distinguishes Med-Banana-50K from general-domain editing datasets is our systematic approach to medical quality control: we employ LLM-as-Judge with a medically grounded rubric and history-aware iterative refinement of up to five rounds.