Hotpatching a C Function on x86

March 31, 2016

In this post I’m going to do a silly, but interesting, exercise that should never be done in any program that actually matters. I’m going write a program that changes one of its function definitions while it’s actively running and using that function. Unlike last time, this won’t involve shared libraries, but it will require x86_64 and GCC. Most of the time it will work with Clang, too, but it’s missing an important compiler option that makes it stable.

If you want to see it all up front, here’s the full source: hotpatch.c

Here’s the function that I’m going to change:

void
hello(void)
{
    puts("hello");
}

It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:

void
hello(void)
{
    static int x;
    printf("goodbye %d\n", x++);
}

I was only going change the string, but I figured I should make it a little more interesting.

Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a farmore complex problem.

But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with:

  • I want to hotpatch this function while it is being used by another thread withoutany synchronization. It may even be executing the function at the same time I clobber its first instructions with my jump. If it’s in between these instructions, disaster will strike.

The solution is the ms_hook_prologue function attribute. This tells GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP that I can safely clobber. This idea originated in Microsoft’s Win32 API, hence the “ms” in the name.

  • The prologue NOP needs to be updated atomically. I can’t let the other thread see a half-written instruction or, again, disaster. On x86 this means I have an alignment requirement. Since I’m overwriting an 8-byte instruction, I’m specifically going to need 8-byte alignment to get an atomic write.

The solution is the aligned function attribute, ensuring the hotpatch prologue is properly aligned.

  • The final problem is that there must be exactly one copy of this function in the compiled program. It must never be inlined or cloned, since these won’t be hotpatched.

As you might have guessed, this is primarily fixed with the noinline function attribute. Since GCC may also clone the function and call that instead, so it also needs the nocloneattribute.

Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.

What does the function look like now?

__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))
void
hello(void)
{
    __asm("");
    puts("hello");
}

And what does the assembly look like?

$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 <puts@plt>

It’s 8-byte aligned and it has the 8-byte NOP: that lea instruction does nothing. It copies rsp into itself and changes no flags. Why not 8 1-byte NOPs? I need to replace exactly one instruction with exactly one other instruction. I can’t have another thread in between those NOPs.

Hotpatching

Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.

void
hotpatch(void *target, void *replacement)
{
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);
}

It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.

There are a number of things going on here, so let’s go through them one at a time:

Make the function writeable

The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.

It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry hugepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.

What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.

At the end of the function, I mprotect() the page back to non-writeable.

Create the instruction

I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work within anyway. Looking that up in the Intel manual, I see this:

Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.

To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.

I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.

Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.

    *(uint64_t *)target = instruction.value;

Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even through I’ve clobbered it, but that’s fine.

Trying it out

Here’s what my test program looks like:

void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        hello();
        usleep(100000);
    }
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    getchar();
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;
}

I fire off the other thread to keep it pinging at hello(). In the main thread, it waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the “worker” thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.

Further Reading

These related articles have been shared with me since publishing this article:

 

 

 

安全帽与口罩检测数据集 一、基础信息 数据集名称:安全帽与口罩检测数据集 图片数量: - 训练集:1690张图片 - 验证集:212张图片 - 测试集:211张图片 - 总计:2113张实际场景图片 分类类别: - HelmetHelmet:戴安全帽的人员,用于安全防护场景的检测。 - personwithmask:戴口罩的人员,适用于公共卫生监测。 - personwith_outmask:未戴口罩的人员,用于识别未遵守口罩佩戴规定的情况。 标注格式:YOLO格式,包含边界框和类别标签,适用于目标检测任务。 数据格式:JPEG/PNG图片,来源于实际监控和场景采集,细节清晰。 二、适用场景 工业安全监控系统开发: 数据集支持目标检测任务,帮助构建自动检测人员是否佩戴安全帽的AI模型,适用于建筑工地、工厂等环境,提升安全管理效率。 公共卫生管理应用: 集成至公共场所监控系统,实时监测口罩佩戴情况,为疫情防控提供自动化支持,辅助合规检查。 智能安防与合规检查: 用于企业和机构的自动化安全审计,减少人工干预,提高检查准确性和响应速度。 学术研究与AI创新: 支持计算机视觉目标检测领域的研究,适用于安全与健康相关的AI模型开发和论文发表。 三、数据集优势 精准标注与实用性: 每张图片均经过标注,边界框定位准确,类别定义清晰,确保模型训练的高效性和可靠性。 场景多样性与覆盖性: 包含安全帽和口罩相关类别,覆盖工业、公共场所以及多种实际环境,样本丰富,提升模型的泛化能力和适应性。 任务适配性强: 标注兼容主流深度学习框架(如YOLO),可直接用于目标检测任务,便于快速集成和部署。 实际应用价值突出: 专注于工业安全和公共健康领域,为自动化监控、合规管理以及疫情防护提供可靠数据支撑,具有较高的社会和经济价值。
内容概要:本文围绕FOC电机控制代码实现与调试技巧在计算机竞赛中的应用,系统阐述了从基础理论到多场景优化的完整技术链条。文章深入解析了磁链观测器、前馈控制、代码可移植性等关键概念,并结合FreeRTOS多任务调度、滑动窗口滤波、数据校验与热仿真等核心技巧,展示了高实时性与稳定性的电机控制系统设计方法。通过服务机器人、工业机械臂、新能源赛车等典型应用场景,论证了FOC在复杂系统协同中的关键技术价值。配套的千行级代码案例聚焦分层架构与任务同步机制,强化工程实践能力。最后展望数字孪生、低代码平台与边缘AI等未来趋势,体现技术前瞻性。; 适合人群:具备嵌入式开发基础、熟悉C语言与实时操作系统(如FreeRTOS)的高校学生或参赛开发者,尤其适合参与智能车、机器人等综合性竞赛的研发人员(经验1-3年为佳)。; 使用场景及目标:① 掌握FOC在多任务环境下的实时控制实现;② 学习抗干扰滤波、无传感器控制、跨平台调试等竞赛实用技术;③ 提升复杂机电系统的问题分析与优化能力; 阅读建议:此资源强调实战导向,建议结合STM32等开发平台边学边练,重点关注任务优先级设置、滤波算法性能权衡与观测器稳定性优化,并利用Tracealyzer等工具进行可视化调试,深入理解代码与系统动态行为的关系。
【场景削减】拉丁超立方抽样方法场景削减(Matlab代码实现)内容概要:本文介绍了基于拉丁超立方抽样(Latin Hypercube Sampling, LHS)方法的场景削减技术,并提供了相应的Matlab代码实现。该方法主要用于处理不确定性问题,特别是在电力系统、可再生能源等领域中,通过对大量可能场景进行高效抽样并削减冗余场景,从而降低计算复杂度,提高优化调度等分析工作的效率。文中强调了拉丁超立方抽样在保持样本代表性的同时提升抽样精度的优势,并结合实际科研背景阐述了其应用场景与价值。此外,文档还附带多个相关科研方向的Matlab仿真案例和资源下载链接,涵盖风电、光伏、电动汽车、微电网优化等多个领域,突出其实用性和可复现性。; 适合人群:具备一定Matlab编程基础,从事电力系统、可再生能源、优化调度等相关领域研究的研究生、科研人员及工程技术人员。; 使用场景及目标:①应用于含高比例可再生能源的电力系统不确定性建模;②用于风电、光伏出力等随机变量的场景生成与削减;③支撑优化调度、风险评估、低碳运行等研究中的数据预处理环节;④帮助科研人员快速实现LHS抽样与场景削减算法,提升仿真效率与模型准确性。; 阅读建议:建议读者结合文中提供的Matlab代码进行实践操作,理解拉丁超立方抽样的原理与实现步骤,并参考附带的其他科研案例拓展应用思路;同时注意区分场景生成与场景削减两个阶段,确保在实际项目中正确应用该方法。
道路坑洞目标检测数据集 一、基础信息 • 数据集名称:道路坑洞目标检测数据集 • 图片数量: 训练集:708张图片 验证集:158张图片 总计:866张图片 • 训练集:708张图片 • 验证集:158张图片 • 总计:866张图片 • 分类类别: CirEllPothole CrackPothole IrrPothole • CirEllPothole • CrackPothole • IrrPothole • 标注格式:YOLO格式,包含边界框和类别标签,适用于目标检测任务。 • 数据格式:图片为常见格式(如JPEG/PNG),来源于相关数据采集。 二、适用场景 • 智能交通监控系统开发:用于自动检测道路坑洞,实现实时预警和维护响应,提升道路安全。 • 自动驾驶与辅助驾驶系统:帮助车辆识别道路缺陷,避免潜在事故,增强行驶稳定性。 • 城市基础设施管理:用于道路状况评估和定期检查,优化维护资源分配和规划。 • 学术研究与创新:支持计算机视觉在公共安全和交通领域的应用,推动算法优化和模型开发。 三、数据集优势 • 精准标注与类别覆盖:标注高质量,包含三种常见坑洞类型(CirEllPothole、CrackPothole、IrrPothole),覆盖不同形态道路缺陷。 • 数据多样性:数据集涵盖多种场景,提升模型在复杂环境下的泛化能力和鲁棒性。 • 任务适配性强:标注兼容主流深度学习框架(如YOLO),可直接用于目标检测任务,支持快速模型迭代。 • 实际应用价值:专注于道路安全与维护,为智能交通和城市管理提供可靠数据支撑,促进效率提升。
废物分类实例分割数据集 一、基础信息 数据集名称:废物分类实例分割数据集 图片数量: - 训练集:2,658张图片 - 验证集:316张图片 - 测试集:105张图片 - 总计:2,974张图片(训练集 + 验证集) 分类类别: - 电子产品(electronics) - 玻璃瓶(gbottle) - 口罩(mask) - 金属(metal) - 塑料袋(pbag) - 塑料瓶(pbottle) - 废物(waste) 标注格式:YOLO格式,包含多边形点坐标,适用于实例分割任务。 数据格式:JPEG图片,来源于实际场景,涵盖多种废物物品。 二、适用场景 智能废物分类系统开发: 数据集支持实例分割任务,帮助构建能够自动识别和分割废物物品的AI模型,辅助垃圾分类和回收管理。 环境监测与环保应用: 集成至智能垃圾桶或监控系统,提供实时废物识别功能,促进环保和资源回收。 学术研究与技术创新: 支持计算机视觉与环境保护交叉领域的研究,助力开发高效的废物处理AI解决方案。 教育与培训: 数据集可用于高校或培训机构,作为学习实例分割技术和AI在环境应用中实践的重要资源。 三、数据集优势 类别多样性与覆盖广: 包含7个常见废物和可回收物品类别,如电子产品、玻璃瓶、口罩、金属、塑料袋、塑料瓶和废物,涵盖日常生活中的多种物品,提升模型的泛化能力。 精准标注与高质量: 每张图片均使用YOLO格式进行多边形点标注,确保分割边界精确,适用于实例分割任务。 任务导向性强: 标注兼容主流深度学习框架,可直接用于实例分割模型的训练和评估。 实用价值突出: 专注于废物分类和回收管理,为智能环保系统提供关键数据支撑,推动可持续发展。
### 关于SFTTrainer的RuntimeError问题分析 在解决 `RuntimeError` 问题时,需要明确错误的具体上下文以及触发条件。以下是对问题的详细分析和可能的解决方案: #### 错误背景 根据提供的引用内容[^1]和[^2],虽然它们并未直接涉及 `SFTTrainer` 的 `RuntimeError`,但可以从其中提取一些通用的调试思路和技术背景。例如: - 引用中提到的 Oracle 和 PostgreSQL 相关问题表明,运行时错误可能与环境配置、依赖库或历史遗留文件有关。 - 如果 `SFTTrainer` 是一个基于 Python 的深度学习框架(如 Hugging Face Transformers),则其 `RuntimeError` 很可能与模型加载、数据处理或 GPU 内存分配相关。 #### 常见原因及解决方案 以下是可能导致 `RuntimeError` 的常见原因及其对应的解决方案: 1. **GPU内存不足** - 如果使用了 GPU 加速训练,但显存不足以容纳模型和数据,则会抛出类似 `Cannot allocate memory of size 0` 的错误。 - 解决方案:减少批量大小(batch size)或启用梯度累积以降低显存消耗。 ```python # 示例代码:调整批量大小和梯度累积 trainer = SFTTrainer( model=model, train_dataset=train_dataset, eval_dataset=eval_dataset, args=TrainingArguments( output_dir="./results", per_device_train_batch_size=4, # 减少批量大小 gradient_accumulation_steps=4 # 启用梯度累积 ) ) ``` 2. **模型加载失败** - 如果模型文件损坏或路径错误,可能会导致运行时错误。 - 确保模型文件完整且路径正确。如果使用预训练模型,可以尝试重新下载。 ```python from transformers import AutoModelForCausalLM # 确保模型加载成功 model = AutoModelForCausalLM.from_pretrained("path/to/model", trust_remote_code=True) ``` 3. **数据格式不匹配** - 输入数据格式与模型期望的格式不一致也会引发运行时错误。 - 检查数据预处理步骤,确保输入张量形状和类型符合模型要求。 ```python # 示例代码:检查数据预处理 from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer") inputs = tokenizer(train_texts, return_tensors="pt", padding=True, truncation=True) ``` 4. **依赖库版本冲突** - 如果使用的依赖库版本不兼容,可能会导致运行时错误。 - 检查并更新相关依赖库至推荐版本。 ```bash pip install --upgrade transformers accelerate torch ``` 5. **环境配置问题** - 类似于引用中提到的 `/opt/ORCLcluster/lib` 文件误导系统的问题,某些环境变量或配置文件可能会影响程序运行。 - 清理不必要的环境变量,并确保配置文件正确无误。 #### 调试建议 为了更准确地定位问题,可以采取以下措施: - 打印详细的错误信息和堆栈跟踪。 - 使用调试工具(如 PyCharm 或 VS Code)逐步排查问题。 - 在代码中添加日志记录以捕获关键变量值。 --- ###
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值