
TL;DR

Mu-SHROOM is a non-English-centric SemEval-2025 shared task to advance the SOTA in hallucination detection for content generated with LLMs. We’ve annotated hallucinated content in 10 different languages from top-tier LLMs. Participate in as many languages as you’d like by accurately identifying spans of hallucinated content. Stay informed by joining our Google group or our Slack, or by following our Twitter account!

Full Invitation

We are excited to announce the Mu-SHROOM shared task on hallucination detection. We invite participants to detect hallucination spans in the outputs of instruction-tuned LLMs in a multilingual context.

About

This shared task builds upon our previous iteration, SHROOM, with three key improvements: an LLM-centered setup, multilingual annotations, and hallucination-span prediction.

LLMs frequently produce “hallucinations,” where models generate plausible but incorrect outputs, while existing metrics prioritize fluency over correctness. This is an issue of growing concern as these models are increasingly adopted by the public.

With Mu-SHROOM, we want to advance the state-of-the-art in detecting hallucinated content. This new iteration of the shared task is held in a multilingual and multimodel context: we provide data produced by a variety of open-weights LLMs in 10 different languages (Arabic (modern standard), Chinese (Mandarin), English, Finnish, French, German, Hindi, Italian, Spanish, and Swedish).

Participants are invited to take part in any of the available languages and are expected to develop systems that can accurately identify and mitigate hallucinations in generated content. As is usual with SemEval shared tasks, participants will be invited to submit system description papers, with the option to present them in poster format during the next SemEval workshop (co-located with an upcoming *ACL conference). Participants who elect to write a system description paper will be asked to review their peers’ submissions (max 2 papers per author).
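To make the span-prediction setup concrete, here is a small hypothetical illustration in Python. The field names and the (start, end) character-offset convention are assumptions for exposition, not the official data schema:

```python
# Hypothetical illustration of hallucination-span prediction.
# Field names and the (start, end) offset convention are assumptions,
# not the official Mu-SHROOM data schema.
example = {
    "model_output_text": "Kuopio is the capital of Finland.",
    # "the capital of Finland" covers characters 10..31 (end-exclusive 32).
    # The claim is false (Helsinki is the capital), so a good system
    # should flag exactly that span as hallucinated.
    "predicted_hallucination_spans": [(10, 32)],
}

start, end = example["predicted_hallucination_spans"][0]
print(example["model_output_text"][start:end])  # -> "the capital of Finland"
```

Systems are then scored by how closely their predicted spans match the annotated ones (see the evaluation metrics below).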

Key Dates:

All deadlines are “anywhere on Earth” (23:59 UTC-12).

  - Dev set available by: 02.09.2024
  - Test set available by: 01.01.2025
  - Evaluation phase ends: 31.01.2025
  - System description papers due: 28.02.2025 (TBC)
  - Notification of acceptance: 31.03.2025 (TBC)
  - Camera-ready due: 21.04.2025 (TBC)
  - SemEval workshop: Summer 2025 (co-located with an upcoming *ACL conference)

Evaluation Metrics:

Participants will be ranked along two (character-level) metrics, both sketched in code after this list:

  1. intersection-over-union of characters marked as hallucinations in the gold reference vs. predicted as such
  2. how well the probability assigned by the participants’ system that a character is part of a hallucination correlates with the empirical probabilities observed in our annotations.
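
As a rough reference for how such scoring can be computed, here is a minimal sketch of both metrics, assuming gold and predicted spans are given as (start, end) character offsets and per-character probabilities as parallel lists. The function names are illustrative, and the choice of Spearman correlation is an assumption rather than a statement about the official scorer:

```python
# Minimal sketch of the two character-level metrics. The span format,
# function names, and choice of Spearman correlation are assumptions
# for illustration; this is not the official Mu-SHROOM scorer.
from scipy.stats import spearmanr

def char_set(spans):
    """Expand (start, end) spans into the set of character indices covered."""
    return {i for start, end in spans for i in range(start, end)}

def char_iou(gold_spans, pred_spans):
    """Intersection-over-union of gold vs. predicted hallucinated characters."""
    gold, pred = char_set(gold_spans), char_set(pred_spans)
    if not gold and not pred:
        return 1.0  # neither side marks anything: count as perfect agreement
    return len(gold & pred) / len(gold | pred)

def prob_correlation(empirical_probs, predicted_probs):
    """Correlation between a system's per-character probability that a
    character is hallucinated and the empirical probability observed
    across annotators (one value per character of the model output)."""
    rho, _ = spearmanr(empirical_probs, predicted_probs)
    return rho

# Gold annotations mark characters 5..11; the system predicts 5..14:
print(char_iou([(5, 12)], [(5, 15)]))  # 7 shared / 10 union = 0.7
```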

Rankings and submissions will be done separately per language: you are welcome to focus only on the languages you are interested in!

How to Participate:

  - Register: please register your team before making a submission at https://mushroomeval.pythonanywhere.com
  - Submit results: use our platform to submit your results before 31.01.2025
  - Submit your system description: system description papers should be submitted by 28.02.2025 (TBC; further details will be announced at a later date)

Want to be kept in the loop? Join our Google group mailing list or the shared task Slack! You can also follow us on Twitter. We look forward to your participation and to the exciting research that will emerge from this task.

Best regards,
Raúl Vázquez and Timothee Mickus
On behalf of all the Mu-SHROOM organizers