Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

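This paper critically evaluates 23 large language model (LLM) benchmarks and identifies significant limitations, including bias, difficulty in measuring reasoning ability, adaptability problems, inconsistent implementation, and the complexity of prompt engineering. The study calls for standardized methodologies, regulatory certainty, and ethical guidelines, and advocates a shift from static benchmarks to dynamic behavioral profiling to address the complexity and risks posed by generative AI. As AI advances, strengthening evaluation protocols and establishing universally accepted benchmarks becomes essential.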

This post is part of an LLM article series and is a translation of "Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence".

Abstract

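The rapid rise in popularity of large language models (LLMs) with emerging capabilities has spurred public curiosity about evaluating and comparing different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we undertook a study to critically assess 23 state-of-the-art LLM benchmarks using our novel unified evaluation framework, through the lenses of people, process, and technology, under the pillars of functionality and security. Our research uncovered significant limitations, including biases, difficulty in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in a single comprehensive assessment. Our discussion emphasizes the urgent need for standardized methodologies, regulatory certainty, and ethical guidelines in light of advances in artificial intelligence (AI), including advocating an evolution from static benchmarks to dynamic behavioral profiling to accurately capture the complex behaviors and potential risks of LLMs. Our study highlights the need for a paradigm shift in LLM evaluation methodologies and underlines the importance of collaborative efforts to develop universally accepted benchmarks and to improve the integration of AI systems into society.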

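1 Introduction

2 Background and Related Work

3 Unified Evaluation Framework for LLM Benchmarks

4 Technological Aspects

5 Processual Elements

6 Human Dynamics

7 Discussion

8 Conclusion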

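This study rigorously examined the prevailing methodologies in LLM benchmarks and found significant inadequacies spanning technology, process, and human dynamics, which may undermine the accuracy, comprehensiveness, and security of these benchmarks. Unlike the rigorous, consensus-driven benchmarking practices of the automotive and aviation industries,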
