1. Benchmarks

Reasoning, conversation, and Q&A benchmarks: HellaSwag, BIG-Bench Hard, SQuAD, IFEval, MuSR, MMLU-PRO, MT-Bench

Domain-specific benchmarks: GPQA, MedQA, PubMedQA

Math benchmarks: GSM8K, MATH, MathEval

Security-related benchmarks: PyRIT, Purple Llama CyberSecEval

2. Domestic and International On-Device Large Models

As for the models themselves, because on-device large models are more often