Error and Fault Detection Mechanisms in Fault Tolerance

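This post covers how error detection mechanisms are classified and how they are applied at the memory and circuit levels. It explains the difference between hard errors and soft errors, divides soft errors into transient errors and intermittent errors, and discusses how techniques such as ECC improve system reliability.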

The material here comes from the paper "Survey of Error and Fault Detection Mechanisms".

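A figure from the paper shows the current research directions and classification of error detection mechanisms:

[Figure: classification of error detection mechanisms, from the paper]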



ERROR:

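Errors are divided into hard errors and soft errors. Hard errors generally come from manufacturing and design defects. Soft errors have two sources:

1) High-energy particles cause single event upsets (SEUs), and lower transistor operating voltages reduce the noise margin of integrated circuits, which makes chips more susceptible to transient faults. These are called transient errors.

2) Variations during manufacturing and operation cause temporal timing violations. These are called intermittent errors.

In practice, an intermittent error shows up as transient errors occurring more often than the threshold the system's reliability target allows.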
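The threshold view of intermittent errors can be sketched as a sliding-window counter. This is only an illustration: the class name `ErrorRateMonitor`, the window length, and the threshold value are all assumptions made for the example, not something defined in the paper.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: counts detected errors inside a sliding time window."""

    def __init__(self, window, threshold):
        self.window = window        # window length, in arbitrary time units (assumed)
        self.threshold = threshold  # max errors tolerated per window (assumed)
        self.events = deque()       # timestamps of recent errors

    def record(self, t):
        """Record an error at time t and classify the current behaviour."""
        self.events.append(t)
        # Drop errors that have fallen out of the window.
        while self.events and self.events[0] <= t - self.window:
            self.events.popleft()
        if len(self.events) > self.threshold:
            return "intermittent"
        return "transient"
```

With `window=10` and `threshold=2`, isolated errors are reported as transient, while a third error inside the same window is reported as intermittent.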


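Detection mechanisms in memory

Programs and the OS reside in memory (including the L1 and L2 caches), so memory errors are a major cause of program and system unreliability, for example when an instruction sequence gets corrupted. The usual mechanism for handling memory errors is an ECC (error checking and correcting) code.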
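To make the idea of ECC concrete, here is a minimal sketch of a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is a textbook example chosen for illustration; real ECC memory typically uses wider codes, and the function names below are made up for this sketch.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2, 4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the corrected position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a soft error at position 5
data, position = hamming74_decode(word)
```

After the simulated bit flip, `data` is again `[1, 0, 1, 1]` and `position` is `5`. A code like this cannot correct two flipped bits in the same codeword.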


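How the circuit level handles the two kinds of error above (transient and intermittent)

For high-radiation environments, there are fault-tolerant circuits known as hardened circuits (radiation-hardened integrated circuits), as well as circuit monitoring techniques that watch the current and the supply voltage to tell whether an unexpected event has occurred. Both detect and handle transient errors.

Intermittent timing errors caused by variation fall under intermittent errors. Tunable Replica Circuits, Razor Flip-Flops, Transition Detectors, and Temporal Redundancy are the techniques that address them.

Triple Modular Redundancy (TMR) is another way to catch errors, this time from the circuit-architecture side. The input signal is processed independently by three identical modules, each module hands its result to a voter, and the voter decides on the output. Its drawbacks are that it can only handle an error in a single module, and that it has no reconfiguration strategy for repairing the faulty module.

Techniques like TMR belong to redundancy, a technique widely used in fault tolerance and an effective way to detect and correct errors and faults. The ECC used in memory above, along with the common parity check code, is a form of information redundancy. Compared with the high cost of hardware redundancy, it needs only a few extra storage bytes and a little computation, or a small amount of extra encoding circuitry.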
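The voting step of TMR can be sketched in software as a bitwise majority function. This is a simplified model, not a circuit description: the three "modules" are plain Python callables, and `good` and `faulty` are hypothetical stand-ins invented for the example.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three integer outputs: each bit follows at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def tmr(modules, x):
    """Run three modules on the same input and vote on the result.

    Returns the voted output and the indices of modules that disagree with it.
    """
    outputs = [m(x) for m in modules]
    voted = majority_vote(*outputs)
    disagreeing = [i for i, out in enumerate(outputs) if out != voted]
    return voted, disagreeing


def good(x):
    return x + 1


def faulty(x):
    # Hypothetical fault: bit 2 of the result is flipped.
    return (x + 1) ^ 0b100


single_fault = tmr([good, faulty, good], 5)
double_fault = tmr([faulty, faulty, good], 5)
```

With one faulty module, `single_fault` is `(6, [1])`: the voter masks the error and points at module 1. With two modules failing in the same way, `double_fault` is `(2, [2])`: the wrong value wins the vote and the healthy module is the one flagged, which is the single-module limitation described above.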
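A single parity bit is about the cheapest form of information redundancy. The sketch below uses even parity over one byte; the helper names are made up for the example.

```python
def even_parity_bit(byte):
    """Return the bit that makes the total number of 1s (data + parity) even."""
    return bin(byte).count("1") % 2


def parity_ok(byte, parity):
    """Check a stored byte against its stored parity bit."""
    return even_parity_bit(byte) == parity


stored = 0b10110010
parity = even_parity_bit(stored)          # one extra bit stored next to 8 data bits

one_flip = stored ^ 0b00000100            # a single-bit error
two_flips = stored ^ 0b00000110           # two bit errors in the same byte

detected_one = not parity_ok(one_flip, parity)
detected_two = not parity_ok(two_flips, parity)
```

Here `detected_one` is `True` and `detected_two` is `False`: parity detects any odd number of flipped bits but cannot locate or correct them, and it misses an even number of flips. ECC codes spend more check bits to get correction as well.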


