Fixing the "encode string too long" exception thrown by DataOutput.writeUTF() for overly long strings
The exceptions involved:
- Error: java.io.UTFDataFormatException : encode string too long 85643 bytes
- Error: java.io.EOFException
Cause of the UTFDataFormatException
Official interface Javadoc:
/**
* Writes two bytes of length information
* to the output stream, followed
* by the
* <a href="DataInput.html#modified-utf-8">modified UTF-8</a>
* representation
* of every character in the string <code>s</code>.
* If <code>s</code> is <code>null</code>,
* a <code>NullPointerException</code> is thrown.
* Each character in the string <code>s</code>
* is converted to a group of one, two, or
* three bytes, depending on the value of the
* character.<p>
* If a character <code>c</code>
* is in the range <code>\u0001</code> through
* <code>\u007f</code>, it is represented
* by one byte:
* <pre>(byte)c </pre> <p>
* If a character <code>c</code> is <code>\u0000</code>
* or is in the range <code>\u0080</code>
* through <code>\u07ff</code>, then it is
* represented by two bytes, to be written
* in the order shown: <pre>{@code
* (byte)(0xc0 | (0x1f & (c >> 6)))
* (byte)(0x80 | (0x3f & c))
* }</pre> <p> If a character
* <code>c</code> is in the range <code>\u0800</code>
* through <code>\uffff</code>, then it is
* represented by three bytes, to be written
* in the order shown: <pre>{@code
* (byte)(0xe0 | (0x0f & (c >> 12)))
* (byte)(0x80 | (0x3f & (c >> 6)))
* (byte)(0x80 | (0x3f & c))
* }</pre> <p> First,
* the total number of bytes needed to represent
* all the characters of <code>s</code> is
* calculated. If this number is larger than
* [Author's note: this is where the maximum length of 65535 is stated; exceeding it throws UTFDataFormatException]
* <code>65535</code>, then a <code>UTFDataFormatException</code>
* is thrown. Otherwise, this length is written
* to the output stream in exactly the manner
* of the <code>writeShort</code> method;
* after this, the one-, two-, or three-byte
* representation of each character in the
* string <code>s</code> is written.<p> The
* bytes written by this method may be read
* by the <code>readUTF</code> method of interface
* <code>DataInput</code> , which will then
* return a <code>String</code> equal to <code>s</code>.
*
* @param s the string value to be written.
* @throws IOException if an I/O error occurs.
*/
void writeUTF(String s) throws IOException;
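The byte-counting rule in the Javadoc above can be sketched as a small helper. This is only an illustration: the class name ModifiedUtf8Length and its methods are made up for this post and are not part of the JDK. It shows that what counts toward the 65535 limit is the encoded size in bytes, not String.length().

```java
import java.util.Arrays;

class ModifiedUtf8Length {
    /** Bytes writeUTF() needs for the characters of s, not counting the 2-byte length prefix. */
    static int encodedLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007f) {
                bytes += 1;   // U+0001..U+007F: one byte
            } else if (c <= 0x07ff) {
                bytes += 2;   // U+0000 and U+0080..U+07FF: two bytes
            } else {
                bytes += 3;   // U+0800..U+FFFF: three bytes
            }
        }
        return bytes;
    }

    /** True if the encoded form fits into the 16-bit length prefix. */
    static boolean fitsInWriteUTF(String s) {
        return encodedLength(s) <= 65535;
    }

    /** Hypothetical helper: builds a string of n copies of c. */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String ascii = repeat('a', 30000);
        String cjk = repeat('\u4e2d', 30000);
        // Same number of chars, very different encoded sizes.
        System.out.println(ascii.length() + " chars, " + encodedLength(ascii) + " bytes");
        System.out.println(cjk.length() + " chars, " + encodedLength(cjk) + " bytes");
        System.out.println("ascii fits: " + fitsInWriteUTF(ascii) + ", cjk fits: " + fitsInWriteUTF(cjk));
    }
}
```

A string of 30000 Chinese characters is well under 65535 chars yet needs 90000 bytes, so it would still be rejected.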
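Cause: the string being written exceeds the 65535 limit. Note that, per the Javadoc, the limit applies to the number of bytes in the modified UTF-8 encoding, not to the number of characters, so a string of non-ASCII text can hit the limit well before it reaches 65535 characters.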
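To see the limit in action, here is a minimal reproduction against an in-memory stream. The class WriteUtfLimitDemo and its helper are names invented for this sketch; only DataOutputStream.writeUTF() and the exception type come from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

class WriteUtfLimitDemo {
    /** True if writeUTF() accepts n copies of c, false if it throws UTFDataFormatException. */
    static boolean writeSucceeds(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(new String(chars));
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("unexpected I/O error on an in-memory stream", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeSucceeds('a', 65535));      // 65535 bytes: true
        System.out.println(writeSucceeds('a', 65536));      // 65536 bytes: false
        System.out.println(writeSucceeds('\u4e2d', 21845)); // 21845 * 3 = 65535 bytes: true
        System.out.println(writeSucceeds('\u4e2d', 21846)); // 65538 bytes: false
    }
}
```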
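Solution:
// Maximum number of chars passed to a single writeUTF() call.
// writeUTF() allows at most 65535 encoded bytes and a char needs at most 3 bytes,
// so 65535 / 3 chars always fit, whatever the content.
public static final int WRITE_READ_UTF_MAX_LENGTH = 65535 / 3;

@Override
public void write(DataOutput dataOutput) throws IOException {
    dataOutput.writeUTF(this.userId);
    dataOutput.writeUTF(this.statisDate);
    // Write dayResult in chunks of exactly WRITE_READ_UTF_MAX_LENGTH chars.
    // The last chunk is always shorter (possibly empty), which tells readFields() where to stop.
    int length = this.dayResult.length();
    int start = 0;
    while (length - start >= WRITE_READ_UTF_MAX_LENGTH) {
        dataOutput.writeUTF(this.dayResult.substring(start, start + WRITE_READ_UTF_MAX_LENGTH));
        start += WRITE_READ_UTF_MAX_LENGTH;
    }
    // Strings shorter than the chunk size are written in this single call
    dataOutput.writeUTF(this.dayResult.substring(start));
}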
Cause of the EOFException
Interface Javadoc:
/**
* Reads in a string that has been encoded using a
* <a href="#modified-utf-8">modified UTF-8</a>
* format.
* The general contract of {@code readUTF}
* is that it reads a representation of a Unicode
* character string encoded in modified
* UTF-8 format; this string of characters
* is then returned as a {@code String}.
* <p>
* First, two bytes are read and used to
* construct an unsigned 16-bit integer in
* exactly the manner of the {@code readUnsignedShort}
* method . This integer value is called the
* <i>UTF length</i> and specifies the number
* of additional bytes to be read. These bytes
* are then converted to characters by considering
* them in groups. The length of each group
* is computed from the value of the first
* byte of the group. The byte following a
* group, if any, is the first byte of the
* next group.
* <p>
* If the first byte of a group
* matches the bit pattern {@code 0xxxxxxx}
* (where {@code x} means "may be {@code 0}
* or {@code 1}"), then the group consists
* of just that byte. The byte is zero-extended
* to form a character.
* <p>
* If the first byte
* of a group matches the bit pattern {@code 110xxxxx},
* then the group consists of that byte {@code a}
* and a second byte {@code b}. If there
* is no byte {@code b} (because byte
* {@code a} was the last of the bytes
* to be read), or if byte {@code b} does
* not match the bit pattern {@code 10xxxxxx},
* then a {@code UTFDataFormatException}
* is thrown. Otherwise, the group is converted
* to the character:
* <pre>{@code (char)(((a & 0x1F) << 6) | (b & 0x3F))
* }</pre>
* If the first byte of a group
* matches the bit pattern {@code 1110xxxx},
* then the group consists of that byte {@code a}
* and two more bytes {@code b} and {@code c}.
* If there is no byte {@code c} (because
* byte {@code a} was one of the last
* two of the bytes to be read), or either
* byte {@code b} or byte {@code c}
* does not match the bit pattern {@code 10xxxxxx},
* then a {@code UTFDataFormatException}
* is thrown. Otherwise, the group is converted
* to the character:
* <pre>{@code
* (char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))
* }</pre>
* If the first byte of a group matches the
* pattern {@code 1111xxxx} or the pattern
* {@code 10xxxxxx}, then a {@code UTFDataFormatException}
* is thrown.
* <p>
* If end of file is encountered
* at any time during this entire process,
* then an {@code EOFException} is thrown.
* <p>
* After every group has been converted to
* a character by this process, the characters
* are gathered, in the same order in which
* their corresponding groups were read from
* the input stream, to form a {@code String},
* which is returned.
* <p>
* The {@code writeUTF}
* method of interface {@code DataOutput}
* may be used to write data that is suitable
* for reading by this method.
* @return a Unicode string.
* @exception EOFException if this stream reaches the end
* before reading all the bytes.
* @exception IOException if an I/O error occurs.
* @exception UTFDataFormatException if the bytes do not represent a
* valid modified UTF-8 encoding of a string.
*/
String readUTF() throws IOException;
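The group-by-group decoding described in this Javadoc can be sketched as follows. This is an illustrative re-implementation written for this post (the class ModifiedUtf8Decoder is hypothetical), not the JDK's actual code, and it works on the payload bytes only, without the two-byte length prefix.

```java
import java.io.UTFDataFormatException;

class ModifiedUtf8Decoder {
    /** Decodes modified UTF-8 payload bytes following the rules quoted above. */
    static String decode(byte[] bytes) throws UTFDataFormatException {
        StringBuilder sb = new StringBuilder(bytes.length);
        int i = 0;
        while (i < bytes.length) {
            int a = bytes[i] & 0xFF;
            if ((a & 0x80) == 0) {                    // 0xxxxxxx: one-byte group
                sb.append((char) a);
                i += 1;
            } else if ((a & 0xE0) == 0xC0) {          // 110xxxxx: two-byte group
                if (i + 1 >= bytes.length) {
                    throw new UTFDataFormatException("partial group at end of input");
                }
                int b = bytes[i + 1] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("malformed byte at index " + (i + 1));
                }
                sb.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if ((a & 0xF0) == 0xE0) {          // 1110xxxx: three-byte group
                if (i + 2 >= bytes.length) {
                    throw new UTFDataFormatException("partial group at end of input");
                }
                int b = bytes[i + 1] & 0xFF;
                int c = bytes[i + 2] & 0xFF;
                if ((b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("malformed group starting at index " + i);
                }
                sb.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            } else {                                  // 10xxxxxx or 1111xxxx as first byte
                throw new UTFDataFormatException("malformed first byte at index " + i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UTFDataFormatException {
        // 'a', U+00E9 and U+4E2D encoded as one-, two- and three-byte groups.
        byte[] payload = {0x61, (byte) 0xC3, (byte) 0xA9, (byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        System.out.println(decode(payload).equals("a\u00e9\u4e2d"));
    }
}
```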
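Analysis:
Once the data exceeds 65535, a single read returns at most 65535 of it. This has two consequences: first, the value read back is incomplete; second, if the value is JSON, the subsequent json2Obj call fails because part of the data is missing. The EOFException itself, as the Javadoc says, is thrown when the end of the stream is reached before readUTF has read everything it expects, which can happen once the reads no longer line up with what was written.
// The exception is thrown here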
Map<String, String> resultMap = JsonUtils.json2Obj(resultJson, Map.class);
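The two ways a reader can run off the end of the stream can be reproduced with in-memory streams. This is a sketch under my own assumptions: the class ReadUtfEofDemo and both helpers are invented for illustration, and the JSON-like sample string is arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

class ReadUtfEofDemo {
    private static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    /** Writes s, drops the last `dropped` bytes, and reports whether readUTF() throws EOFException. */
    static boolean truncatedReadHitsEof(String s, int dropped) throws IOException {
        byte[] all = encode(s);
        byte[] truncated = Arrays.copyOf(all, all.length - dropped);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated))) {
            in.readUTF();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    /** Writes one string but reads two, as a reader that expects one chunk too many would. */
    static boolean extraReadHitsEof(String s) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encode(s)))) {
            in.readUTF();   // matches the single write
            in.readUTF();   // nothing left to read
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"k\":\"v\"}";
        System.out.println(truncatedReadHitsEof(json, 0)); // complete data: false
        System.out.println(truncatedReadHitsEof(json, 3)); // missing bytes: true
        System.out.println(extraReadHitsEof(json));        // one read too many: true
    }
}
```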
Solution:
// In MapReduce, fields must be deserialized in the same order they were serialized.
// The oversized field is read last so that it cannot affect the other fields.
@Override
public void readFields(DataInput dataInput) throws IOException {
    this.userId = dataInput.readUTF();
    this.statisDate = dataInput.readUTF();
    String tempStr = dataInput.readUTF();
    StringBuilder sBuilder = new StringBuilder();
    // A chunk of WRITE_READ_UTF_MAX_LENGTH chars or more means another chunk follows, so keep reading
    while (tempStr.length() >= WRITE_READ_UTF_MAX_LENGTH) {
        sBuilder.append(tempStr);
        tempStr = dataInput.readUTF();
    }
    sBuilder.append(tempStr);
    this.dayResult = sBuilder.toString();
}
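To check that chunked writing and reading really pair up, a standalone round trip over in-memory streams is handy. The sketch below is not the Writable from this post: ChunkedUtfRoundTrip, writeChunked and readChunked are names made up for the test. It also assumes a chunk size of 65535 / 3 chars, chosen so that even a chunk made entirely of three-byte characters stays within writeUTF()'s byte limit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class ChunkedUtfRoundTrip {
    // Assumption: worst case is 3 bytes per char, so this many chars always fit in one writeUTF().
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    /** Writes full-size chunks followed by one shorter (possibly empty) final chunk. */
    static void writeChunked(DataOutput out, String s) throws IOException {
        int start = 0;
        while (s.length() - start >= MAX_CHARS_PER_CHUNK) {
            out.writeUTF(s.substring(start, start + MAX_CHARS_PER_CHUNK));
            start += MAX_CHARS_PER_CHUNK;
        }
        out.writeUTF(s.substring(start));
    }

    /** Reads chunks until one shorter than MAX_CHARS_PER_CHUNK arrives. */
    static String readChunked(DataInput in) throws IOException {
        StringBuilder sb = new StringBuilder();
        String chunk = in.readUTF();
        while (chunk.length() >= MAX_CHARS_PER_CHUNK) {
            sb.append(chunk);
            chunk = in.readUTF();
        }
        return sb.append(chunk).toString();
    }

    /** Round-trips `length` copies of c plus a trailing field, and checks both come back intact. */
    static boolean roundTrips(char c, int length) throws IOException {
        char[] chars = new char[length];
        Arrays.fill(chars, c);
        String original = new String(chars);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeChunked(out, original);
            out.writeUTF("next field"); // whatever follows must remain readable
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readChunked(in).equals(original) && in.readUTF().equals("next field");
        }
    }

    public static void main(String[] args) throws IOException {
        int[] lengths = {0, 1, MAX_CHARS_PER_CHUNK - 1, MAX_CHARS_PER_CHUNK, MAX_CHARS_PER_CHUNK + 1, 100000};
        for (int length : lengths) {
            System.out.println(length + " chars: " + roundTrips('\u4e2d', length));
        }
    }
}
```

The boundary lengths are the interesting cases: a string whose length is an exact multiple of the chunk size only reads back correctly because the writer appends an empty final chunk.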
Closing remarks
Lately I have been working on a program to prevent hair loss...