XilSEM Injectable Error Types and Log Examples


Author: Ivy Guo, AMD engineer. Source: AMD Developer Community

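XilSEM, the SEM solution for Versal, differs substantially from the earlier SEM IP. The SEM IP is built in the FPGA's PL and relies on the ICAP and FRAME_ECC hard primitives to scan and correct. XilSEM, by contrast, is a firmware library running on the PMC, and its error injection and correction mechanisms are precompiled C code. The bit error types the two detect are nevertheless largely the same: typically Correctable errors, Uncorrectable errors, and CRC errors in the CRAM cells. In addition, Versal devices contain NPI registers, and the XilSEM scan covers NPI as well.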

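This article briefly introduces these error types and, using the VCK190 development board, walks through the XilSEM report produced when each error is detected.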

Correctable Error:

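Through its ECC scheme and the physical placement of bits within a CRAM frame, the UltraScale/UltraScale+ family can correct up to 16 bit errors in one frame. Versal is structured differently from UltraScale+, and as a rule only a 1-bit error per frame should be counted as correctable. To test a correctable error on Versal we therefore inject a single bit. The error can be injected at any position in one frame of any Row. The valid positions are generally Row 0 to 3, Qword 0 to 24, Bit 0 to 127. Bits 0 to 23 and bits 48 to 71 of Qword 12 hold syndrome data used for the ECC check. Injecting there does not affect user logic yet still verifies the injection, so these positions are the usual choice for testing.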
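The ranges above can be checked before an injection is attempted. The following is a minimal sketch, not part of XilSEM: `InjLoc`, `inj_loc_in_range`, and `inj_loc_is_syndrome` are hypothetical names, and the limits are simply the values quoted in the text, which may differ on other devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of an injection location; not the XilSEM type. */
typedef struct {
    uint32_t frame;
    uint32_t qword;
    uint32_t bit;
    uint32_t row;
} InjLoc;

/* Limits quoted in the text: Row 0..3, Qword 0..24, Bit 0..127. */
enum { MAX_ROW = 3, MAX_QWORD = 24, MAX_BIT = 127, SYNDROME_QWORD = 12 };

/* True when row, qword, and bit fall inside the quoted ranges.
 * The frame address is not checked: the text gives no limit for it. */
bool inj_loc_in_range(const InjLoc *loc)
{
    assert(loc != NULL);
    return loc->row <= MAX_ROW && loc->qword <= MAX_QWORD && loc->bit <= MAX_BIT;
}

/* True when the bit lies in the syndrome area: Qword 12, bits 0..23 or 48..71. */
bool inj_loc_is_syndrome(const InjLoc *loc)
{
    assert(loc != NULL);
    if (loc->qword != SYNDROME_QWORD) {
        return false;
    }
    return loc->bit <= 23 || (loc->bit >= 48 && loc->bit <= 71);
}
```

A test harness could call `inj_loc_in_range` first and reject the location early, then use `inj_loc_is_syndrome` to confirm that the chosen bit stays clear of user logic.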

The source file is located at:

https://github.com/Xilinx/embeddedsw/blob/xlnx_rel_v2025.1/lib/sw_services/xilsem/examples/xsem_cram_example.c

The function used for error injection is:

XStatus Xsem_CfrApiInjctCorErr()

The example injects at Frame 0, Qword 3, Bit 4, Row 0:

XSemCfrErrInjData ErrData = {0, 3, 4, 0};
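A positional initializer such as this one depends on member order, which is easy to get wrong when editing the location. As a minimal sketch, the same values can be written with C designated initializers. `InjLocation` and its member names are hypothetical stand-ins that only assume the Frame, Qword, Bit, Row order stated above. They are not the XilSEM definition, so check the library header for the real member names before applying this to `XSemCfrErrInjData`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror type: member names are assumptions, order follows the text. */
typedef struct {
    uint32_t frame;
    uint32_t qword;
    uint32_t bit;
    uint32_t row;
} InjLocation;

/* Same values as {0, 3, 4, 0}, with each member named explicitly. */
InjLocation example_location(void)
{
    InjLocation loc = {
        .frame = 0,
        .qword = 3,
        .bit   = 4,
        .row   = 0,
    };
    return loc;
}

/* Returns 1 when two locations match member by member, 0 otherwise. */
int inj_location_equal(const InjLocation *a, const InjLocation *b)
{
    assert(a != NULL && b != NULL);
    return a->frame == b->frame && a->qword == b->qword &&
           a->bit == b->bit && a->row == b->row;
}
```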

Enable the XILSEM_CORR_ERRINJ_ENABLE macro, build the example, and download it to the board. The resulting log is:

CRAM scan is configured for immediate start

[XSem_CmdCfrReadFrameEcc] Success Reading Cfr Frame ECC Values of Row = 0x00  Frame = 0x0000000F  Over IPI

Received Segment 0 ECC Value = 0x00222A0A

Received Segment 1 ECC Value = 0x008A0200

[XSem_CfrApiStopScan] Success: Stop

Type: 0 Total Frames : 34111

Type: 1 Total Frames : 3520

Type: 2 Total Frames : 12800

Type: 3 Total Frames : 11

Type: 4 Total Frames : 5

Type: 5 Total Frames : 1

Type: 6 Total Frames : 0

Golden CRC for ROW_0 : 1BEB3B4E

[XSem_CfrApiStartScan] Success: Start

[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI

Received CRAM Configuration = 0x00000026

Received NPI Configuration = 0x00005012

[XSem_CfrApiStopScan] Success: Stop

starting Correctable Error injection

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiStartScan] Success: Start

Received CRAM Correcatble error event

-----------------------------------------------------

-----------------Print Report------------------------

-----------------------------------------------------

[SUCCESS] Correctable error detected

[SUCCESS] Correction Done

Total Corrected Error count = 1

[SUCCESS] Error Count increased by 1 as expected

[SUCCESS] Received Correctable error event notification

Error is located in Row = 0, FAR = 0x000000, Qword = 3, Bit = 4

CRAM Scan Status:2014805
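When logs like this are captured from the UART, the "Error is located in ..." line can be parsed to confirm that the reported location matches the injected one. The sketch below assumes the exact line format printed above. `ErrLocation` and `parse_error_location` are hypothetical helper names, not part of XilSEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fields printed by the report line; names are illustrative. */
typedef struct {
    unsigned row;
    unsigned frame_addr; /* value printed after "FAR =" */
    unsigned qword;
    unsigned bit;
} ErrLocation;

/* Returns 1 when the line matches the report format and all four
 * fields were read, 0 otherwise. %x also accepts the "0x" prefix. */
int parse_error_location(const char *line, ErrLocation *out)
{
    assert(line != NULL && out != NULL);
    int fields = sscanf(line,
                        "Error is located in Row = %u, FAR = %x, Qword = %u, Bit = %u",
                        &out->row, &out->frame_addr, &out->qword, &out->bit);
    return fields == 4;
}
```

For the correctable-error run above, the parsed result should equal the injected location (Frame 0, Qword 3, Bit 4, Row 0).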

Uncorrectable Error:

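When more than one bit in a frame is in error, this generally exceeds XilSEM's correction capability. XilSEM can then detect the error but cannot correct it. To simulate this scenario, we inject 2 errors into the same frame: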

   ErrData[0]: Frame Address : 0, Quadword: 0, Bit position: 2, Row: 0

   ErrData[1]: Frame Address : 0, Quadword: 0, Bit position: 4, Row: 0

The error address is Frame 0, Qword 0, Row 0, at two alternating bits, 2 and 4:

XSemCfrErrInjData ErrData[2] = {{0, 0, 2, 0},
                                {0, 0, 4, 0},
                               };

This produces the following log:

[XSem_CfrApiStartScan] Success: Start

[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI

Received CRAM Configuration = 0x00000026

Received NPI Configuration = 0x00005012

[XSem_CfrApiStopScan] Success: Stop

starting UnCorrectable Error injection

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiErrNjct] Success: Inject

Received CRAM Uncorrectable error event

[XSem_CfrApiStartScan] Success: Start

-----------------------------------------------------

-----------------Print Report------------------------

-----------------------------------------------------

[SUCCESS] UnCorrectable error detected

[SUCCESS] No increase in Error Count

[SUCCESS] Received Uncorrectable error event notification

CRAM Scan Status:2010211

CRC Error:

To simulate a CRC error, inject 3 or more errors at alternating positions within the same Qword. This triggers a CRC check error.

Locations:

     * ErrData[0]: Frame Address : 0, Quadword: 0, Bit position: 0, Row: 0

     * ErrData[1]: Frame Address : 0, Quadword: 0, Bit position: 2, Row: 0

     * ErrData[2]: Frame Address : 0, Quadword: 0, Bit position: 4, Row: 0

     * ErrData[3]: Frame Address : 0, Quadword: 0, Bit position: 6, Row: 0

Because a CRC error cannot be corrected, XilSEM issues the necessary reports and then enters the IDLE state, just as it does for an Uncorrectable error.

The corresponding injection data:

XSemCfrErrInjData ErrData[4] = {{0, 0, 0, 0},
                                {0, 0, 2, 0},
                                {0, 0, 4, 0},
                                {0, 0, 6, 0},
                               };

The log from the VCK190 is as follows:

CRAM scan is configured for immediate start

[XSem_CmdCfrReadFrameEcc] Success Reading Cfr Frame ECC Values of Row = 0x00  Frame = 0x0000000F  Over IPI

Received Segment 0 ECC Value = 0x00222A0A

Received Segment 1 ECC Value = 0x008A0200

[XSem_CfrApiStopScan] Success: Stop

Type: 0 Total Frames : 34111

Type: 1 Total Frames : 3520

Type: 2 Total Frames : 12800

Type: 3 Total Frames : 11

Type: 4 Total Frames : 5

Type: 5 Total Frames : 1

Type: 6 Total Frames : 0

Golden CRC for ROW_0 : 1BEB3B4E

[XSem_CfrApiStartScan] Success: Start

[XSem_CmdGetConfig] Success Reading CRAM and NPI configuration Over IPI

Received CRAM Configuration = 0x00000026

Received NPI Configuration = 0x00005012

[XSem_CfrApiStopScan] Success: Stop

starting UnCorrectable Error injection

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiErrNjct] Success: Inject

[XSem_CfrApiStartScan] Success: Start

Received CRAM CRC error event

-----------------------------------------------------

-----------------Print Report------------------------

-----------------------------------------------------

[SUCCESS] CRC error detected

[SUCCESS] No increase in Error Count

[SUCCESS] Received CRC error event notification

CRAM Scan Status:2010411

As the log shows, the injection was executed four times in a row and triggered an uncorrectable CRC error.
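The three CRAM tests follow one rule of thumb: one injected bit gives a correctable error, two bits in the same frame give an uncorrectable error, and three or more alternating bits in the same Qword give a CRC error. The sketch below encodes only that rule so a test script can state the expected event before running. It is a simplified model of this article's test cases, not a description of XilSEM's internal ECC/CRC logic. All names are hypothetical, and bit positions are assumed to be distinct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of an injection location; not the XilSEM type. */
typedef struct {
    uint32_t frame;
    uint32_t qword;
    uint32_t bit;
    uint32_t row;
} InjLoc;

typedef enum {
    OUTCOME_NONE,          /* nothing injected */
    OUTCOME_CORRECTABLE,   /* 1 bit */
    OUTCOME_UNCORRECTABLE, /* 2 bits, same frame and Qword */
    OUTCOME_CRC,           /* 3 or more bits, same frame and Qword */
    OUTCOME_UNSPECIFIED    /* pattern not covered by the article's tests */
} Outcome;

/* Expected event for a set of injections, per the article's rule of thumb.
 * Only patterns confined to one Qword of one frame are classified. */
Outcome expected_outcome(const InjLoc *locs, size_t count)
{
    if (count == 0) {
        return OUTCOME_NONE;
    }
    assert(locs != NULL);
    for (size_t i = 1; i < count; i++) {
        if (locs[i].row != locs[0].row || locs[i].frame != locs[0].frame ||
            locs[i].qword != locs[0].qword) {
            return OUTCOME_UNSPECIFIED;
        }
    }
    if (count == 1) {
        return OUTCOME_CORRECTABLE;
    }
    return (count == 2) ? OUTCOME_UNCORRECTABLE : OUTCOME_CRC;
}
```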

NPI Error:

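When the XilSEM library scans the NPI registers, it uses a detection strategy based on SHA (Secure Hash Algorithm). Specifically, the library periodically reads back the NPI register contents and computes their hash with SHA. Comparing the current hash with the expected or previously recorded hash reveals whether the register contents have changed abnormally (that is, an SEU). To run the NPI scan, the cryptographic acceleration function of the device's PMC must therefore be operating normally: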

https://adaptivesupport.amd.com/s/article/000033536
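The read-back-and-compare idea can be illustrated in a few lines. The sketch below is only a model of the technique: it uses the small FNV-1a hash as a stand-in for SHA so that it stays self-contained, and it works on an ordinary array rather than real NPI registers. `digest32` and `scan_pass_ok` are hypothetical names, and none of this is XilSEM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over the register words, byte by byte.
 * A stand-in for the SHA computation, chosen only for brevity. */
uint32_t digest32(const uint32_t *regs, size_t count)
{
    assert(regs != NULL);
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < count; i++) {
        for (int shift = 0; shift < 32; shift += 8) {
            hash ^= (regs[i] >> shift) & 0xFFu;
            hash *= 16777619u;
        }
    }
    return hash;
}

/* One scan pass: true when the snapshot still matches the golden digest. */
bool scan_pass_ok(const uint32_t *regs, size_t count, uint32_t golden)
{
    return digest32(regs, count) == golden;
}
```

The golden value is computed once from a known-good snapshot. Any later bit flip in the snapshot changes the digest, so the comparison fails and the scan can raise an error.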

The source file is located at:

https://github.com/Xilinx/embeddedsw/blob/xlnx_rel_v2025.1/lib/sw_services/xilsem/examples/xsem_npi_example.c

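As the source file notes, the demo injects an error into the Golden SHA of the first descriptor. In the test, the injected SHA error can be undone by running the error injection sequence again. A SHA error that occurs in the field, however, is uncorrectable. After detecting the error, XilSEM reports it and stops the NPI scan. The CRAM scan is not affected.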
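One way to see why a second injection undoes the first is to model the injection as a bit toggle. In the VCK190 log, the Golden SHA word reads 0x16533699 before injection and 0x16533698 after, which is consistent with flipping the lowest bit. The sketch below assumes exactly that. `NpiDescriptor` and `inject_golden_error` are hypothetical names, and the real firmware mechanism may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor holding one word of the golden hash. */
typedef struct {
    uint32_t golden_sha;
} NpiDescriptor;

/* Assumed injection model: toggle the lowest bit of the golden word.
 * XOR is its own inverse, so a second call restores the original value. */
void inject_golden_error(NpiDescriptor *desc)
{
    assert(desc != NULL);
    desc->golden_sha ^= 1u;
}
```

With this model, the first call corrupts the golden value so the next scan reports a mismatch, and the second call returns it to its original state.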

The log from the VCK190 is as follows:

-----------------------------------------

NPI Scan Fail Count for register 0: 0

NPI Scan Fail Count for register 1: 0

NPI Scan Fail Count for register 2: 0

NPI Scan Fail Count for register 3: 0

NPI Scan Fail Count for register 4: 0

NPI Scan Fail Count for register 5: 0

NPI Scan Fail Count for register 6: 0

NPI Scan Fail Count for register 7: 0

SHA mismatch Err details recorded as

ErrInfo[0]: 0

ErrInfo[1]: 0

NPI Scan Count = 5E85

HBCount = 266608

[main] Success: Get Golden SHA

[main] Success: Stop

[main] Success: Inject

[main] Success: Start

[main] Success: Get Golden SHA

-----------------------------------------------------

-----------------Print Report------------------------

-----------------------------------------------------

[SUCCESS] SHA comparison error detected

[SUCCESS] Scan counter not incrementing

[SUCCESS] Heartbeat counter not incrementing

[SUCCESS] Received CRC error event notification

----------------------------------------------------

Total Descriptor Count = 1

----------------------------------------------------

Descriptor information before injecting error:

Descriptor 1

  Type: Static

  Golden SHA: 0x16533699

----------------------------------------------------

Descriptor information after injecting error:

Descriptor 1

  Type: Static

  Golden SHA: 0x16533698

----------------------------------------------------

-------------- Test Report --------------

Failed Command Count : 0

NPI examples ran successfully

If the cryptographic accelerator is not enabled on your board, the log looks like this:

[ALERT] Received Cryptographic Accelerator Disabled event notification from XilSEM

[main] Success: Start

[main] ERROR: NPI Scan count not incrementing.

[main] Success: Get Golden SHA

[main] Success: Stop

[main] Success: Inject

[ALERT] Received Cryptographic Accelerator Disabled event notification from XilSEM

[main] Success: Start

[main] ERROR: Timeout occurred waiting for error.

-----------------------------------------------------

-----------------Print Report------------------------

-----------------------------------------------------

[SUCCESS] Scan counter not incrementing

[SUCCESS] Heartbeat counter not incrementing

[FAILURE] No CRC error event notification received

----------------------------------------------------

Total Descriptor Count = 1

----------------------------------------------------

Descriptor information before injecting error:

Descriptor 1

  Type: Static

  Golden SHA: 0x00000000

----------------------------------------------------

Descriptor information after injecting error:

Descriptor 1

  Type: Static

  Golden SHA: 0x00000000

----------------------------------------------------

-------------- Test Report --------------

Failed Command Count : 4

NPI examples Failed
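When many runs are compared, it helps to tally the report lines rather than read each log by eye. The sketch below counts lines by the prefixes visible in the logs above (`[SUCCESS]`, `[FAILURE]`, and `ERROR:`). `ReportTally` and `tally_report_line` are hypothetical helper names, not part of the XilSEM examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Running counts of the report markers seen in a captured log. */
typedef struct {
    int success;
    int failure;
    int error;
} ReportTally;

/* Classify one log line by its marker and update the tally.
 * Lines without a known marker are ignored. */
void tally_report_line(const char *line, ReportTally *tally)
{
    assert(line != NULL && tally != NULL);
    if (strncmp(line, "[SUCCESS]", 9) == 0) {
        tally->success++;
    } else if (strncmp(line, "[FAILURE]", 9) == 0) {
        tally->failure++;
    } else if (strstr(line, "ERROR:") != NULL) {
        tally->error++;
    }
}
```

Fed the log above, the tally would show two `[SUCCESS]` lines, one `[FAILURE]` line, and two `ERROR:` lines, which flags the run as failed without relying on the final summary line.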

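These are the four XilSEM error types that can be tested in a lab environment. The injection tests are a good way to get familiar with XilSEM's reports and behavior. Other errors, such as Fatal Error or Internal Error, have many possible causes and cannot be simulated. To prepare the system's response to them, see the section Listening for SEU Detection in UG643: https://docs.amd.com/r/en-US/oslib_rm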
