Kaggle 奥数AIMO赛题：QwQ baseline

文章目录[隐藏]

赛题背景
赛题任务
赛题数据集
QwQ模型
Baseline思路

赛题名称：AI Mathematical Olympiad - Progress Prize 2

赛题类型：通过大模型完成数学题目的解答

赛题任务：大模型、自然语言处理

https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2

赛题背景

数学推理能力是AI的一个重要里程碑。数学推理是解决许多复杂问题的基础，从工程奇迹到复杂的金融模型。然而，当前AI在这一领域的能力有限。

AI数学奥林匹克（AIMO）奖是一个1000万美元的基金，旨在推动开放开发能够与国际数学奥林匹克（IMO）顶级人类选手表现相当的AI模型。

赛题任务

在第二届AIMO进步奖比赛中，参赛者的主要任务是开发算法和模型，以解决110道高难度的数学问题。这些问题涵盖了代数、组合数学、几何和数论四个领域，难度相当于国家级奥林匹克水平，并且特别设计为对现有AI技术具有挑战性。

赛题数据集

本次比赛的数据包含110道数学问题，风格与AIME（美国数学邀请赛）类似。

每个问题的答案是一个介于0到999之间的非负整数。您应通过将问题解决方案取模1000来得到这个数字。例如，如果您认为某个问题的解决方案是2034，您的预测答案应为34。

问题的难度大致相当于国家级奥林匹克水平，尽管有些问题稍简单，有些则稍难。

所有问题均为纯文本格式，数学符号使用LaTeX表示。请参阅“概述”部分中的“语言和符号说明”了解使用的符号约定详情。尽管有些问题可能涉及几何，但任何问题中都不使用图表。

公开测试集：包含50道问题。
私有测试集：包含另外50道不同的问题。
参考数据：提供10道问题作为参考，称为“参考数据”。以下提供了包含这些参考问题完整解决方案的PDF文件。

QwQ模型

https://qwenlm.github.io/zh/blog/qwq-32b-preview/

QwQ-32B-Preview 是由 Qwen 团队开发的实验性研究模型，专注于增强 AI 推理能力。作为预览版本，它展现了令人期待的分析能力，同时也存在以下局限：

Kaggle 奥数AIMO赛题：QwQ baseline

QwQ-32B-Preview 在数学和编程领域表现出色，但在其他领域仍有提升空间。当模型有足够的时间思考、质疑和反思时，它对数学和编程的理解就会深化。

GPQA：65.2%，展示了研究生水平的科学推理能力；
AIME：50.0%，证明了强大的数学问题解决技能；
MATH-500：90.6%，体现了在各类数学主题上的全面理解；
LiveCodeBench：50.0%，验证了在实际编程场景中的出色表现。

Baseline思路

https://www.kaggle.com/code/itahiro/qwen-qwq-32b-preview-deepreasoning-6a5856

加载QwQ模型

fromvllmimportLLM,SamplingParamsos.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"os.environ["TOKENIZERS_PARALLELISM"]="false"llm_model_pth='/kaggle/input/m/shelterw/qwen2.5/transformers/qwq-32b-preview-awq/1'llm=LLM(llm_model_pth,#dtype="half",#Thedatatypeforthemodelweightsandactivations#max_num_seqs=128,#Maximumnumberofsequencesperiteration.Defaultis256max_model_len=32768,#4096*10,#Modelcontextlengthtrust_remote_code=True,#Trustremotecode(e.g.,fromHuggingFace)whendownloadingthemodelandtokenizertensor_parallel_size=4,#ThenumberofGPUstousefordistributedexecutionwithtensorparallelismgpu_memory_utilization=0.96,#Theratio(between0and1)ofGPUmemorytoreserveforthemodel)

设定多样提示词

thoughts=['Pleaseusechainedreasoningtoputtheanswerin\boxed{}.','Pleasereflectandverifywhilereasoningandputtheanswerin\boxed{}.','Solvethefollowingproblemusingconciseandclearreasoningbyplacingtheanswerin\boxed{}.','Youareahelpfulandreflectivemathsassistant,pleasereasonstepbysteptoputtheanswerin\boxed{}.','Youarethesmartestmathsexpertintheworld,pleasespikethisquestionandputtheanswerin\boxed{}.']

提取代码进行验证

defextract_python_code_list(text):pattern=r'```pythons*(.*?)s*```'ans=[]matches=re.findall(pattern,text,re.DOTALL)forminmatches:ans.append(m)returnans

classPythonREPL:def__init__(self,timeout=8):self.timeout=timeoutdef__call__(self,query):withtempfile.TemporaryDirectory()astemp_dir:temp_file_path=os.path.join(temp_dir,"tmp.py")withopen(temp_file_path,"w",encoding="utf-8")asf:f.write(query)try:result=subprocess.run(["python3",temp_file_path],capture_output=True,check=False,text=True,timeout=self.timeout,)exceptsubprocess.TimeoutExpired:returnFalse,f"Executiontimedoutafter{self.timeout}seconds."stdout=result.stdout.strip()stderr=result.stderr.strip()ifresult.returncode==0:returnTrue,stdoutelse:returnFalse,""

逐个回答问题

defbatch_message_execute_and_get_answer(list_of_messages,round_idx)->tuple

],list[int]]:#提取python代码，执行并获取答案，直接返回答案，不需要返回新的messageans=[]formessagesinlist_of_messages:python_code=extract_python_code(messages[-1]['content'])python_code=process_python_code(python_code)try:success,output=PythonREPL()(python_code)ifsuccess:patten=r'(d+)'matches=re.findall(patten,output)ifmatches:formatchinmatches:ans.append(int(match)%1000)ans.append(int(match)%1000)#代码权重高于自然语言，所以添加两次exceptExceptionase:output=str(e)print(f'pythoncodeoutput:{output}')returnans

【竞赛报名/项目咨询+微信：mollywei007】