Understand Nova's 16 Operations with One Diagram - Master OpenStack in 5 Minutes a Day (44)

This article systematically reviews the management operations for instances in OpenStack Nova, grouped into two categories: routine operations and failure handling. Routine operations cover launching, stopping, and resizing instances; failure handling addresses planned maintenance as well as unexpected failures, with solutions such as migration and recovery.

In the previous posts we walked through quite a few instance operations. Some overlap in function while each has its own use cases, so now is a good time to summarize them systematically.

As shown in the diagram above, we group instance management by operational scenario into two categories: routine operations and failure handling.

Routine Operations

Among the routine operations, Launch, Start, Reboot, Shut Off, and Terminate are self-explanatory.
The following operations are worth reviewing in more detail:

Resize
Adjust the resources allocated to an instance by applying a different flavor.
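As a minimal sketch of the API, assuming the Python openstacksdk with a hypothetical cloud profile `mycloud` in clouds.yaml, a server named `vm1`, and a target flavor `m1.large`:

```python
import openstack

# Connect using a (hypothetical) cloud profile from clouds.yaml.
conn = openstack.connect(cloud="mycloud")

server = conn.compute.find_server("vm1")
flavor = conn.compute.find_flavor("m1.large")  # the new flavor to apply

# Resize puts the instance into VERIFY_RESIZE; it then has to be
# confirmed (or undone with revert_server_resize if something is wrong).
conn.compute.resize_server(server, flavor.id)
conn.compute.wait_for_server(server, status="VERIFY_RESIZE")
conn.compute.confirm_server_resize(server)
```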

Lock/Unlock
Protect an instance against accidental operations.
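A sketch under the same assumptions (openstacksdk, hypothetical `mycloud` profile and `vm1` server):

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")

conn.compute.lock_server(server)    # destructive actions are now rejected
# ... period during which accidental changes must be prevented ...
conn.compute.unlock_server(server)  # normal operations are allowed again
```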

Pause/Suspend/Resume
Pause the current instance and resume it later.
The difference between Pause and Suspend is that Pause keeps the instance's runtime state in the compute node's memory, while Suspend saves it to disk.
The advantage of Pause is that Resume is faster than resuming from Suspend; the drawback is that if the compute node reboots, the in-memory state is lost and the instance can no longer be resumed. Suspend does not have this problem.
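The contrast is easy to see in code; a sketch assuming openstacksdk and the hypothetical `mycloud`/`vm1` names again:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")

# Pause: state stays in the compute node's RAM -> fast resume,
# but the state is lost if the node reboots.
conn.compute.pause_server(server)
conn.compute.unpause_server(server)  # the reverse of pause is unpause

# Suspend: state is written to disk -> slower resume,
# but it survives a compute node reboot.
conn.compute.suspend_server(server)
conn.compute.resume_server(server)
```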

Snapshot
Back up an instance to Glance. The resulting image can be used for failure recovery or as a template for deploying new instances.
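A sketch with openstacksdk; the image name `vm1-backup` is made up here:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")

# Upload a snapshot of the instance to Glance as a new image
# (recent SDK releases return the new Image object).
image = conn.compute.create_server_image(server, name="vm1-backup")
print(image.id)  # usable for recovery, or as a template for new instances
```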

Failure Handling

Failure handling covers two scenarios: planned and unplanned.

Planned refers to maintenance performed in a pre-arranged time window, such as periodic server microcode upgrades or adding and replacing hardware.
Unplanned refers to unexpected, sudden failures, such as OS file corruption caused by a forced power-off, a server losing power, or hardware faults.

Planned Maintenance Handling

For planned maintenance, instances can be migrated to other compute nodes during the maintenance window.
The following operations are involved:

Migrate
Move an instance to another compute node.
The instance is shut off before migration. Both shared and non-shared storage are supported.
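Migrate is an admin operation; a sketch assuming openstacksdk with admin credentials behind the hypothetical `mycloud` profile:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # admin credentials assumed
server = conn.compute.find_server("vm1")

# Cold migration: the instance is shut off and moved, then the result
# must be confirmed, just like a resize.
conn.compute.migrate_server(server)
conn.compute.wait_for_server(server, status="VERIFY_RESIZE")
conn.compute.confirm_server_resize(server)
```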

Live Migrate
Unlike Migrate, Live Migrate moves an instance online without stopping it, preserving business continuity. It also supports both shared and non-shared storage (Block Migration).
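A sketch of the live variant, again assuming openstacksdk and admin credentials:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # admin credentials assumed
server = conn.compute.find_server("vm1")

# host=None lets the scheduler choose the target node;
# block_migration=True copies local disks for non-shared storage.
conn.compute.live_migrate_server(server, host=None, block_migration=True)
```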

Shelve/Unshelve
Shelve saves the instance to Glance; it can later be redeployed with Unshelve.
After a successful Shelve, the instance is deleted from its original compute node.
Unshelve schedules the instance onto a node again, which may not be the original one.
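A sketch assuming openstacksdk and a deployment where shelved instances are offloaded immediately (Nova's default `shelved_offload_time = 0`):

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")

# Shelve snapshots the instance to Glance and frees the compute node.
conn.compute.shelve_server(server)
conn.compute.wait_for_server(server, status="SHELVED_OFFLOADED")

# Unshelve schedules it again, possibly onto a different node.
conn.compute.unshelve_server(server)
```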

Unplanned Failure Handling

Unplanned failures are further divided into two categories by the scope of their impact: instance failures and compute node failures.

Instance Failures

An instance failure is confined to the operating system level of a single instance: the system can no longer boot normally.
The following operations can be used to repair the instance:

Rescue/Unrescue
Boot from a designated rescue disk into Rescue mode and repair the damaged system disk. After a successful repair, boot the instance normally with Unrescue.
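A sketch with openstacksdk; the rescue image ID below is purely illustrative:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")

# Boot from a rescue image so the damaged system disk can be mounted
# and repaired; image_ref is optional (the instance's base image is
# used by default) and the ID here is made up.
conn.compute.rescue_server(server, image_ref="rescue-image-id")
# ... log in to the rescue environment and repair the system disk ...
conn.compute.unrescue_server(server)
```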

Rebuild
If Rescue cannot repair the instance, the only option is to restore it from an existing backup with Rebuild. Instance backups are created with Snapshot, so a backup policy with regular snapshots is needed.
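A sketch assuming openstacksdk and a snapshot named `vm1-backup` created earlier; note that the exact parameter names of `rebuild_server` have varied across SDK releases:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # hypothetical cloud profile
server = conn.compute.find_server("vm1")
backup = conn.image.find_image("vm1-backup")  # a snapshot taken earlier

# Re-provision the instance's disk from the backup image.
conn.compute.rebuild_server(server, image=backup.id)
```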

Compute Node Failures

An instance failure is limited to one specific instance, and the compute node itself keeps working. When a compute node fails, however, OpenStack can no longer communicate with the node's nova-compute, and all instances running on it are affected. In that case, the only option is to rebuild the instances on other healthy nodes with Evacuate.

Evacuate
Rebuild an instance on another compute node from its image files on shared storage.
Planning shared storage in advance is therefore key.
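A sketch assuming a recent openstacksdk release that exposes `evacuate_server`, plus admin credentials in the hypothetical `mycloud` profile:

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # admin credentials assumed
server = conn.compute.find_server("vm1")

# Rebuild the instance on another node from its files on shared
# storage; host=None lets the scheduler choose the target node.
conn.compute.evacuate_server(server, host=None)
```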

Summary

By this point we have studied the architecture of OpenStack Nova, discussed its key components such as Nova API, Scheduler, and Compute, dissected each Nova operation in detail through case studies, and finally summarized the purposes and use cases of these operations in one diagram.

Nova is the most important OpenStack project and sits at the center of OpenStack.
The other projects, Keystone, Glance, Cinder, and Neutron, all exist to serve Nova, so make sure you understand it well.

In the next post we will study Cinder, the OpenStack block storage service.
