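Mozilla has a module called universalchardet that detects which character encoding a piece of text uses. It combines three methods: 1) the coding scheme method, 2) character distribution, and 3) 2-char sequence distribution. Its main flow is as follows:
HandleData(batch_of_text)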
{
    if (batch_of_text contains BOM)
        report UCS2;
    if ((inputState is PureAscii) || (inputState is EscAscii))
        if (batch_of_text contains 8-bits-byte)
            inputState = HighByte;
        else if ((inputState is PureAscii) && (batch_of_text contains Esc_Sequence))
            inputState = EscAscii;
    if (inputState is HighByte)
    {
        Remove Ascii character that is not neighboring to 8-bits byte
        For each prober in multibyte_probers
            Prober.HandleData(batch_of_text);
        For each prober in singlebyte_probers
            Prober.HandleData(batch_of_text);
    }
    else if (inputState is EscAscii)
    {
        For each prober in (ISO2022_XX or HZ)
            Prober.HandleData(batch_of_text);
    }
}
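
The control flow above can be sketched in Python to make the state transitions concrete. This is only an illustration under stated assumptions, not Mozilla's actual code: the names DetectorSketch, InputState and keep_high_byte_neighbourhood are hypothetical, the probers are whatever objects you pass in (anything with a handle_data method), the BOM check is limited to the two UTF-16 byte order marks, and an escape sequence is assumed to mean either an ESC byte (0x1B) or the HZ opener "~{".

```python
import codecs
from enum import Enum


class InputState(Enum):
    PURE_ASCII = 1   # only 7-bit bytes seen so far, no escape sequence
    ESC_ASCII = 2    # 7-bit bytes plus an escape sequence (ISO-2022-XX, HZ)
    HIGH_BYTE = 3    # at least one byte >= 0x80 seen


def keep_high_byte_neighbourhood(batch: bytes) -> bytes:
    """Drop ASCII bytes that are not directly next to an 8-bit byte."""
    out = bytearray()
    for i, b in enumerate(batch):
        prev_high = i > 0 and batch[i - 1] >= 0x80
        next_high = i + 1 < len(batch) and batch[i + 1] >= 0x80
        if b >= 0x80 or prev_high or next_high:
            out.append(b)
    return bytes(out)


class DetectorSketch:
    """Hypothetical skeleton of the HandleData flow; probers are supplied by the caller."""

    def __init__(self, multibyte_probers=(), singlebyte_probers=(), esc_probers=()):
        self.input_state = InputState.PURE_ASCII
        self.multibyte_probers = list(multibyte_probers)
        self.singlebyte_probers = list(singlebyte_probers)
        self.esc_probers = list(esc_probers)
        self.reported = None
        self._first_batch = True

    def handle_data(self, batch: bytes) -> None:
        # Assumption: a BOM is only meaningful at the very start of the stream.
        if self._first_batch:
            self._first_batch = False
            if batch.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
                self.reported = "UCS2"
                return

        # The state only ever moves "upwards": PureAscii -> EscAscii -> HighByte.
        if self.input_state in (InputState.PURE_ASCII, InputState.ESC_ASCII):
            if any(b >= 0x80 for b in batch):
                self.input_state = InputState.HIGH_BYTE
            elif self.input_state is InputState.PURE_ASCII and (
                b"\x1b" in batch or b"~{" in batch
            ):
                self.input_state = InputState.ESC_ASCII

        if self.input_state is InputState.HIGH_BYTE:
            filtered = keep_high_byte_neighbourhood(batch)
            for prober in self.multibyte_probers + self.singlebyte_probers:
                prober.handle_data(filtered)
        elif self.input_state is InputState.ESC_ASCII:
            for prober in self.esc_probers:
                prober.handle_data(batch)
```

In this sketch, once a batch contains a byte of 0x80 or above the detector stays in the high-byte state for the rest of the stream, so the escape-sequence probers are no longer consulted. That mirrors the pseudocode, where the state check only runs while inputState is PureAscii or EscAscii.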



