Author: Ivy Guo, AMD engineer. Source: AMD Developer Community
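XilSEM, the SEM solution for Versal, differs substantially from the earlier SEM IP. The SEM IP is built in the FPGA's PL and relies on the underlying ICAP and FRAME_ECC hard blocks for scanning and correction. XilSEM, by contrast, is a firmware library running on the PMC, and its error injection and correction mechanisms are precompiled C code. The types of bit errors the two detect are nevertheless largely the same: typically Correctable errors, Uncorrectable errors, and CRC errors in CRAM cells. In addition, Versal devices contain NPI registers, and XilSEM's scan coverage extends to NPI as well.
This article briefly introduces these error types and, using the VCK190 evaluation board, walks through the XilSEM report produced when each error is detected.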
Correctable Error:
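Through its ECC scheme and the physical layout of CRAM frames, the UltraScale/UltraScale+ family can correct up to 16 bit errors within one frame. Versal is structured differently from UltraScale+, and as a rule of thumb only a 1-bit error per frame is assumed to be correctable. To test a correctable error on Versal we therefore inject a single-bit error. The error can be injected at any position within a frame of any Row. Valid positions are generally Row: 0~3, Qword: 0~24, Bit: 0~127. In Qword 12, bits 0~23 and bits 48~71 hold syndrome data used for ECC checking. Injecting there does not affect user logic while still verifying the injection, so this location is the usual choice for testing.
The source file is located at: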
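The ranges above can be turned into a small pre-check before building an injection request. The following is a minimal sketch, not part of XilSEM: the `InjectLoc` type and both helper names are hypothetical, and the bounds are simply the ones quoted in the text (the frame address is not checked, because its limit depends on the device and frame type).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds quoted in the article; assumed to apply to the target device. */
enum {
    MAX_ROW   = 3,    /* Row:   0..3   */
    MAX_QWORD = 24,   /* Qword: 0..24  */
    MAX_BIT   = 127,  /* Bit:   0..127 */
    ECC_QWORD = 12    /* Qword that holds the syndrome bits */
};

/* Hypothetical helper type; field names are illustrative only. */
typedef struct {
    uint32_t frame;
    uint32_t qword;
    uint32_t bit;
    uint32_t row;
} InjectLoc;

/* True when the location lies inside the Row/Qword/Bit ranges above. */
static bool loc_is_valid(const InjectLoc *loc)
{
    assert(loc != NULL);
    return loc->row <= MAX_ROW && loc->qword <= MAX_QWORD && loc->bit <= MAX_BIT;
}

/* True when the location falls on syndrome bits:
 * Qword 12, bits 0..23 or bits 48..71. */
static bool loc_is_syndrome(const InjectLoc *loc)
{
    if (!loc_is_valid(loc) || loc->qword != ECC_QWORD)
        return false;
    return loc->bit <= 23 || (loc->bit >= 48 && loc->bit <= 71);
}
```

With these helpers, a location such as Qword 12, Bit 5 is reported as a syndrome position, while Qword 3, Bit 4 is valid but falls outside the syndrome bits.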
https://github.com/Xilinx/embeddedsw/blob/xlnx_rel_v2025.1/lib/sw_services/xilsem/examples/xsem_cram_example.c
The function used for error injection is:
XStatus Xsem_CfrApiInjctCorErr()
The example injects at Frame 0, Qword 3, Bit 4, Row 0:
XSemCfrErrInjData ErrData = {0, 3, 4, 0};
Enable the XILSEM_CORR_ERRINJ_ENABLE macro, build the example, and download it to the board. The resulting log is:
CRAM scan is configured for immediate start
[XSem_CmdCfrReadFrameEcc] Success Reading Cfr Frame ECC Values of Row = 0x00 Frame = 0x0000000F Over IPI
Received Segment 0 ECC Value = 0x00222A0A
Received Segment 1 ECC Value = 0x008A0200
[XSem_CfrApiStopScan] Success: Stop
Type: 0 Total Frames : 34111
Type: 1 Total Frames : 3520
Type: 2 Total Frames : 12800
Type: 3 Total Frames : 11
Type: 4 Total Frames : 5
Type: 5 Total Frames : 1
Type: 6 Total Frames : 0
Golden CRC for ROW_0 : 1BEB3B4E
[XSem_CfrApiStartScan] Success: Start
[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI
Received CRAM Configuration = 0x00000026
Received NPI Configuration = 0x00005012
[XSem_CfrApiStopScan] Success: Stop
starting Correctable Error injection
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiStartScan] Success: Start
Received CRAM Correcatble error event
-----------------------------------------------------
-----------------Print Report------------------------
-----------------------------------------------------
[SUCCESS] Correctable error detected
[SUCCESS] Correction Done
Total Corrected Error count = 1
[SUCCESS] Error Count increased by 1 as expected
[SUCCESS] Received Correctable error event notification
Error is located in Row = 0, FAR = 0x000000, Qword = 3, Bit = 4
CRAM Scan Status:2014805
Uncorrectable Error:
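When more than one bit in a frame is in error, XilSEM's correction capability is generally exceeded. XilSEM can still detect the error but cannot correct it. To simulate this scenario, inject two errors into the same frame: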
ErrData[0]: Frame Address : 0, Quadword: 0, Bit position: 2, Row: 0
ErrData[1]: Frame Address : 0, Quadword: 0, Bit position: 4, Row: 0
The error address is Frame 0, Qword 0, Row 0, at two alternating bits, 2 and 4.
XSemCfrErrInjData ErrData[2] = {{0, 0, 2, 0},
{0, 0, 4, 0},
};
The resulting log is:
[XSem_CfrApiStartScan] Success: Start
[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI
Received CRAM Configuration = 0x00000026
Received NPI Configuration = 0x00005012
[XSem_CfrApiStopScan] Success: Stop
starting UnCorrectable Error injection
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiErrNjct] Success: Inject
Received CRAM Uncorrectable error event
[XSem_CfrApiStartScan] Success: Start
-----------------------------------------------------
-----------------Print Report------------------------
-----------------------------------------------------
[SUCCESS] UnCorrectable error detected
[SUCCESS] No increase in Error Count
[SUCCESS] Received Uncorrectable error event notification
CRAM Scan Status:2010211
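The two logs so far show one flipped bit being corrected and two flipped bits in the same frame being detected but not corrected. The sketch below shows why that behavior is typical of ECC. It uses a textbook extended Hamming (8,4) SECDED code as a stand-in; the actual ECC code used for Versal CRAM frames is not described in this article, and every name in the sketch is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } EccResult;

/* Returns 1 when an odd number of the low 8 bits are set. */
static unsigned parity8(unsigned v)
{
    v &= 0xFFu;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Encode 4 data bits. Bits 1..7 form Hamming(7,4) with parity at
 * positions 1, 2, 4 and data at 3, 5, 6, 7; bit 0 is overall parity. */
static uint8_t secded_encode(uint8_t data)
{
    assert(data < 16);
    unsigned d0 = data & 1u, d1 = (data >> 1) & 1u;
    unsigned d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;   /* covers positions 1, 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* covers positions 2, 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* covers positions 4, 5, 6, 7 */
    unsigned cw = (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4)
                | (d1 << 5) | (d2 << 6) | (d3 << 7);
    return (uint8_t)(cw | parity8(cw));  /* make total parity even */
}

/* Check a codeword in place and repair it when exactly one bit flipped. */
static EccResult secded_check(uint8_t *cw)
{
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((*cw >> pos) & 1u)
            syndrome ^= pos;           /* XOR of positions of set bits */
    unsigned overall = parity8(*cw);   /* 0 while total parity is even */

    if (syndrome == 0 && overall == 0)
        return ECC_OK;
    if (overall == 1) {
        /* Odd number of flips: treated as a single error at 'syndrome'
         * (syndrome 0 means the overall parity bit itself flipped). */
        *cw ^= (uint8_t)(1u << syndrome);
        return ECC_CORRECTED;
    }
    return ECC_UNCORRECTABLE;          /* even number of flips: detect only */
}
```

Flipping one bit of an encoded word and calling `secded_check` restores the original word. Flipping bits 2 and 4 leaves the overall parity even with a non-zero syndrome, so the code can only report the error, which mirrors the two-bit injection above.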
CRC Error:
To simulate a CRC error, inject three or more errors at alternating bit positions within the same Qword. This triggers a CRC check error.
Locations:
* ErrData[0]: Frame Address : 0, Quadword: 0, Bit position: 0, Row: 0
* ErrData[1]: Frame Address : 0, Quadword: 0, Bit position: 2, Row: 0
* ErrData[2]: Frame Address : 0, Quadword: 0, Bit position: 4, Row: 0
* ErrData[3]: Frame Address : 0, Quadword: 0, Bit position: 6, Row: 0
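Because a CRC error cannot be corrected, XilSEM issues the required reports and then enters the IDLE state, just as it does for an Uncorrectable error.
The corresponding code: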
XSemCfrErrInjData ErrData[4] = {{0, 0, 0, 0},
{0, 0, 2, 0},
{0, 0, 4, 0},
{0, 0, 6, 0},
};
The log from the VCK190 is:
CRAM scan is configured for immediate start
[XSem_CmdCfrReadFrameEcc] Success Reading Cfr Frame ECC Values of Row = 0x00 Frame = 0x0000000F Over IPI
Received Segment 0 ECC Value = 0x00222A0A
Received Segment 1 ECC Value = 0x008A0200
[XSem_CfrApiStopScan] Success: Stop
Type: 0 Total Frames : 34111
Type: 1 Total Frames : 3520
Type: 2 Total Frames : 12800
Type: 3 Total Frames : 11
Type: 4 Total Frames : 5
Type: 5 Total Frames : 1
Type: 6 Total Frames : 0
Golden CRC for ROW_0 : 1BEB3B4E
[XSem_CfrApiStartScan] Success: Start
[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI
Received CRAM Configuration = 0x00000026
Received NPI Configuration = 0x00005012
[XSem_CfrApiStopScan] Success: Stop
starting UnCorrectable Error injection
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiErrNjct] Success: Inject
[XSem_CfrApiStartScan] Success: Start
Received CRAM CRC error event
-----------------------------------------------------
-----------------Print Report------------------------
-----------------------------------------------------
[SUCCESS] CRC error detected
[SUCCESS] No increase in Error Count
[SUCCESS] Received CRC error event notification
CRAM Scan Status:2010411
As the log shows, the injection ran four times in a row and triggered an uncorrectable CRC error.
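The log above prints a golden CRC for the row, and the CRC error is reported when the scanned contents no longer match it. A minimal sketch of that compare-against-golden idea follows. It uses the common reflected CRC-32 (polynomial 0xEDB88320) purely as a stand-in, since the article does not state which CRC XilSEM uses or exactly what data it covers. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32, computed bit by bit (no lookup table). */
static uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Flip one bit of a byte buffer; bit 0 is the LSB of byte 0. */
static void flip_bit(uint8_t *buf, size_t len, size_t bit)
{
    assert(bit < 8 * len);
    buf[bit / 8] ^= (uint8_t)(1u << (bit % 8));
}

/* Non-zero when the buffer still matches the previously recorded CRC. */
static int crc_matches(const uint8_t *buf, size_t len, uint32_t golden)
{
    return crc32_bytes(buf, len) == golden;
}
```

Recording `crc32_bytes()` over a 16-byte buffer as the golden value and then flipping bits 0, 2, 4, and 6 makes `crc_matches()` fail. The CRC shows that the data changed but carries no information about which bits to repair, which is consistent with the error being reported as uncorrectable.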
NPI Error:
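When the XilSEM library scans the NPI registers, it uses a detection strategy based on SHA (Secure Hash Algorithm). Specifically, the library periodically reads back the NPI register contents and computes their hash with SHA. Comparing the current hash with the expected or previously recorded hash reveals whether the register contents have changed abnormally (that is, whether an SEU has occurred). To run the NPI scan, the cryptographic acceleration function of the device's PMC must therefore be working: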
https://adaptivesupport.amd.com/s/article/000033536
The source file is located at:
https://github.com/Xilinx/embeddedsw/blob/xlnx_rel_v2025.1/lib/sw_services/xilsem/examples/xsem_npi_example.c
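As noted in the source file, the demo injects an error into the Golden SHA of the first descriptor. In testing, the injected SHA error can be corrected by running the error injection sequence again. A SHA error that occurs in the field is uncorrectable. After detecting the error, XilSEM reports it and stops the NPI scan. The CRAM scan is not affected.
The log from the VCK190 is: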
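The behavior described above (compare against a golden value, stop scanning on a mismatch, and undo the injection by injecting again) can be modeled in a few lines. This is only a sketch under stated assumptions: the digest is 32-bit FNV-1a standing in for SHA to keep the example short, the injection is modeled as toggling the lowest bit of the golden value, and the `NpiScanModel` type and all function names are hypothetical rather than XilSEM APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest (32-bit FNV-1a), NOT SHA. */
static uint32_t digest(const uint32_t *regs, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (regs[i] >> (8 * b)) & 0xFFu;
            h *= 16777619u;
        }
    }
    return h;
}

/* Hypothetical model of one descriptor under scan. */
typedef struct {
    uint32_t golden;      /* reference digest captured at start */
    uint32_t scan_count;  /* number of passing scans */
    bool     stopped;     /* set once a mismatch has been reported */
} NpiScanModel;

static void scan_init(NpiScanModel *m, const uint32_t *regs, size_t n)
{
    assert(m != NULL && regs != NULL);
    m->golden = digest(regs, n);
    m->scan_count = 0;
    m->stopped = false;
}

/* Toggle the lowest bit of the golden value; a second call restores it. */
static void inject_golden_error(NpiScanModel *m)
{
    assert(m != NULL);
    m->golden ^= 1u;
}

/* One scan pass. Returns true when the contents match the golden digest.
 * On a mismatch the model stops, so scan_count no longer increments. */
static bool scan_once(NpiScanModel *m, const uint32_t *regs, size_t n)
{
    assert(m != NULL && regs != NULL);
    if (m->stopped)
        return false;
    if (digest(regs, n) != m->golden) {
        m->stopped = true;
        return false;
    }
    m->scan_count++;
    return true;
}
```

In this model the register contents never change. Only the reference value is corrupted, which is why a second injection is enough to bring the golden value back.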
-----------------------------------------
NPI Scan Fail Count for register 0: 0
NPI Scan Fail Count for register 1: 0
NPI Scan Fail Count for register 2: 0
NPI Scan Fail Count for register 3: 0
NPI Scan Fail Count for register 4: 0
NPI Scan Fail Count for register 5: 0
NPI Scan Fail Count for register 6: 0
NPI Scan Fail Count for register 7: 0
SHA mismatch Err details recorded as
ErrInfo[0]: 0
ErrInfo[1]: 0
NPI Scan Count = 5E85
HBCount = 266608
[main] Success: Get Golden SHA
[main] Success: Stop
[main] Success: Inject
[main] Success: Start
[main] Success: Get Golden SHA
-----------------------------------------------------
-----------------Print Report------------------------
-----------------------------------------------------
[SUCCESS] SHA comparison error detected
[SUCCESS] Scan counter not incrementing
[SUCCESS] Heartbeat counter not incrementing
[SUCCESS] Received CRC error event notification
----------------------------------------------------
Total Descriptor Count = 1
----------------------------------------------------
Descriptor information before injecting error:
Descriptor 1
Type: Static
Golden SHA: 0x16533699
----------------------------------------------------
Descriptor information after injecting error:
Descriptor 1
Type: Static
Golden SHA: 0x16533698
----------------------------------------------------
-------------- Test Report --------------
Failed Command Count : 0
NPI examples ran successfully
If the cryptographic accelerator is not enabled on your board, the log is:
[ALERT] Received Cryptographic Accelerator Disabled event notification from XilSEM
[main] Success: Start
[main] ERROR: NPI Scan count not incrementing.
[main] Success: Get Golden SHA
[main] Success: Stop
[main] Success: Inject
[ALERT] Received Cryptographic Accelerator Disabled event notification from XilSEM
[main] Success: Start
[main] ERROR: Timeout occurred waiting for error.
-----------------------------------------------------
-----------------Print Report------------------------
-----------------------------------------------------
[SUCCESS] Scan counter not incrementing
[SUCCESS] Heartbeat counter not incrementing
[FAILURE] No CRC error event notification received
----------------------------------------------------
Total Descriptor Count = 1
----------------------------------------------------
Descriptor information before injecting error:
Descriptor 1
Type: Static
Golden SHA: 0x00000000
----------------------------------------------------
Descriptor information after injecting error:
Descriptor 1
Type: Static
Golden SHA: 0x00000000
----------------------------------------------------
-------------- Test Report --------------
Failed Command Count : 4
NPI examples Failed
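These are the four XilSEM error types that can be tested in a lab environment. The injection tests are a good way to become familiar with XilSEM's reports and behavior. Other errors, such as Fatal Errors or Internal Errors, have many possible causes and cannot be simulated. Refer to the section Listening for SEU Detection in UG643 (https://docs.amd.com/r/en-US/oslib_rm) to put appropriate handling in place in your system.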