Fixing the "encode string too long" exception thrown by DataOutput.writeUTF() when the string is too long


一颗叁普参


CC 4.0 BY-SA license

Category: bug. Tags: DataOutput, EOFException, UTFDataFormatException, MapReduce, I/O streams

Article link: https://blog.youkuaiyun.com/lxy_mycnds/article/details/95967956


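This post looks at the writeUTF() method of DataOutput and analyzes two exceptions, UTFDataFormatException and EOFException. UTFDataFormatException is thrown because the string being written exceeds the 65535 limit. EOFException arises because, once the data exceeds 65535, it is not read back completely, which corrupts the data and makes the later JSON parsing fail. A fix is given for each.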



The exceptions encountered:

Error: java.io.UTFDataFormatException : encode string too long 85643 bytes
Error: java.io.EOFException

UTFDataFormatException: cause analysis

The official Javadoc on the interface:

/**
     * Writes two bytes of length information
     * to the output stream, followed
     * by the
     * <a href="DataInput.html#modified-utf-8">modified UTF-8</a>
     * representation
     * of  every character in the string <code>s</code>.
     * If <code>s</code> is <code>null</code>,
     * a <code>NullPointerException</code> is thrown.
     * Each character in the string <code>s</code>
     * is converted to a group of one, two, or
     * three bytes, depending on the value of the
     * character.<p>
     * If a character <code>c</code>
     * is in the range <code>&#92;u0001</code> through
     * <code>&#92;u007f</code>, it is represented
     * by one byte:
     * <pre>(byte)c </pre>  <p>
     * If a character <code>c</code> is <code>&#92;u0000</code>
     * or is in the range <code>&#92;u0080</code>
     * through <code>&#92;u07ff</code>, then it is
     * represented by two bytes, to be written
     * in the order shown: <pre>{@code
     * (byte)(0xc0 | (0x1f & (c >> 6)))
     * (byte)(0x80 | (0x3f & c))
     * }</pre> <p> If a character
     * <code>c</code> is in the range <code>&#92;u0800</code>
     * through <code>&#92;uffff</code>, then it is
     * represented by three bytes, to be written
     * in the order shown: <pre>{@code
     * (byte)(0xe0 | (0x0f & (c >> 12)))
     * (byte)(0x80 | (0x3f & (c >>  6)))
     * (byte)(0x80 | (0x3f & c))
     * }</pre>  <p> First,
     * the total number of bytes needed to represent
     * all the characters of <code>s</code> is
     * calculated. If this number is larger than
     * <code>65535</code>, then a <code>UTFDataFormatException</code>
     * is thrown.
     * [Author's note: this is where the maximum length of 65535 is stated; exceeding it throws UTFDataFormatException.]
     * Otherwise, this length is written
     * to the output stream in exactly the manner
     * of the <code>writeShort</code> method;
     * after this, the one-, two-, or three-byte
     * representation of each character in the
     * string <code>s</code> is written.<p>  The
     * bytes written by this method may be read
     * by the <code>readUTF</code> method of interface
     * <code>DataInput</code> , which will then
     * return a <code>String</code> equal to <code>s</code>.
     *
     * @param      s   the string value to be written.
     * @throws     IOException  if an I/O error occurs.
     */
    void writeUTF(String s) throws IOException;

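Cause: the string being written exceeds the 65535 limit. Note that the limit applies to the number of bytes in the modified UTF-8 encoding, not to the number of characters in the string.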
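To make the byte-versus-character distinction concrete, here is a minimal sketch that counts the encoded length using the one-, two-, and three-byte rules quoted in the Javadoc and then checks whether writeUTF accepts the string. The class name UtfLimitDemo and its helper methods are made up for this illustration; only DataOutputStream.writeUTF and the exception types come from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the original job.
class UtfLimitDemo {

    /** Number of bytes writeUTF needs for s, per the rules in the Javadoc. */
    static long modifiedUtf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007f) {
                bytes += 1;            // plain ASCII except the NUL character
            } else if (c <= 0x07ff) {
                bytes += 2;            // NUL and chars up to 0x07ff
            } else {
                bytes += 3;            // everything else, e.g. Chinese characters
            }
        }
        return bytes;
    }

    /** Returns true if writeUTF rejects s with UTFDataFormatException. */
    static boolean writeUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a string made of count copies of c. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String ascii = repeat('a', 65535);        // 65535 chars, 1 byte each
        String chinese = repeat('\u4e2d', 30000); // 30000 chars, 3 bytes each
        System.out.println(modifiedUtf8Length(ascii) + " " + writeUtfFails(ascii));
        System.out.println(modifiedUtf8Length(chinese) + " " + writeUtfFails(chinese));
    }
}
```

The ASCII string sits exactly at the limit and is accepted, while the 30000-character Chinese string needs 90000 bytes and is rejected even though its length() is well under 65535. Any splitting scheme therefore has to budget for up to three bytes per character.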

Solution:

// Chunk size in chars. writeUTF accepts at most 65535 encoded bytes and a char
// can take up to 3 bytes in modified UTF-8, so 65535 / 3 chars always fits.
public static final int WRITE_READ_UTF_MAX_LENGTH = 65535 / 3;

@Override
public void write(DataOutput dataOutput) throws IOException {
    dataOutput.writeUTF(this.userId);
    dataOutput.writeUTF(this.statisDate);
    // Split dayResult and write it in several pieces. Full-size chunks go first;
    // the last piece is always shorter than the chunk size (possibly empty),
    // which is how readFields() knows where the value ends.
    int length = this.dayResult.length();
    int start = 0;
    while (length - start >= WRITE_READ_UTF_MAX_LENGTH) {
        dataOutput.writeUTF(this.dayResult.substring(start, start + WRITE_READ_UTF_MAX_LENGTH));
        start += WRITE_READ_UTF_MAX_LENGTH;
    }
    dataOutput.writeUTF(this.dayResult.substring(start));
}

EOFException: cause analysis

Interface documentation:

 /**
     * Reads in a string that has been encoded using a
     * <a href="#modified-utf-8">modified UTF-8</a>
     * format.
     * The general contract of {@code readUTF}
     * is that it reads a representation of a Unicode
     * character string encoded in modified
     * UTF-8 format; this string of characters
     * is then returned as a {@code String}.
     * <p>
     * First, two bytes are read and used to
     * construct an unsigned 16-bit integer in
     * exactly the manner of the {@code readUnsignedShort}
     * method . This integer value is called the
     * <i>UTF length</i> and specifies the number
     * of additional bytes to be read. These bytes
     * are then converted to characters by considering
     * them in groups. The length of each group
     * is computed from the value of the first
     * byte of the group. The byte following a
     * group, if any, is the first byte of the
     * next group.
     * <p>
     * If the first byte of a group
     * matches the bit pattern {@code 0xxxxxxx}
     * (where {@code x} means "may be {@code 0}
     * or {@code 1}"), then the group consists
     * of just that byte. The byte is zero-extended
     * to form a character.
     * <p>
     * If the first byte
     * of a group matches the bit pattern {@code 110xxxxx},
     * then the group consists of that byte {@code a}
     * and a second byte {@code b}. If there
     * is no byte {@code b} (because byte
     * {@code a} was the last of the bytes
     * to be read), or if byte {@code b} does
     * not match the bit pattern {@code 10xxxxxx},
     * then a {@code UTFDataFormatException}
     * is thrown. Otherwise, the group is converted
     * to the character:
     * <pre>{@code (char)(((a & 0x1F) << 6) | (b & 0x3F))
     * }</pre>
     * If the first byte of a group
     * matches the bit pattern {@code 1110xxxx},
     * then the group consists of that byte {@code a}
     * and two more bytes {@code b} and {@code c}.
     * If there is no byte {@code c} (because
     * byte {@code a} was one of the last
     * two of the bytes to be read), or either
     * byte {@code b} or byte {@code c}
     * does not match the bit pattern {@code 10xxxxxx},
     * then a {@code UTFDataFormatException}
     * is thrown. Otherwise, the group is converted
     * to the character:
     * <pre>{@code
     * (char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))
     * }</pre>
     * If the first byte of a group matches the
     * pattern {@code 1111xxxx} or the pattern
     * {@code 10xxxxxx}, then a {@code UTFDataFormatException}
     * is thrown.
     * <p>
     * If end of file is encountered
     * at any time during this entire process,
     * then an {@code EOFException} is thrown.
     * <p>
     * After every group has been converted to
     * a character by this process, the characters
     * are gathered, in the same order in which
     * their corresponding groups were read from
     * the input stream, to form a {@code String},
     * which is returned.
     * <p>
     * The {@code writeUTF}
     * method of interface {@code DataOutput}
     * may be used to write data that is suitable
     * for reading by this method.
     * @return     a Unicode string.
     * @exception  EOFException            if this stream reaches the end
     *               before reading all the bytes.
     * @exception  IOException             if an I/O error occurs.
     * @exception  UTFDataFormatException  if the bytes do not represent a
     *               valid modified UTF-8 encoding of a string.
     */
    String readUTF() throws IOException;

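Analysis:
Once the data is longer than 65535, a single read returns only the first 65535 of it. This causes two problems. First, the data read back is incomplete. Second, if the data is JSON, the subsequent json2Obj call fails because part of the data is missing. The unread remainder also stays in the stream, so later reads fall out of step with what was written, which is a likely source of the EOFException.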

// The error is thrown here
Map<String, String> resultMap = JsonUtils.json2Obj(resultJson, Map.class);
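The EOFException rule quoted from the Javadoc can be seen in isolation with a small sketch: readUTF trusts the two-byte length prefix, so if fewer bytes follow than the prefix promises, the read runs off the end of the data. This does not reproduce the MapReduce job itself; the class TruncatedReadDemo and its helpers are hypothetical names, and the JSON literal is just sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the original job.
class TruncatedReadDemo {

    /** Encodes s with writeUTF and returns the raw bytes (2-byte length, then payload). */
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns true if readUTF reaches the end of data before the record is complete. */
    static boolean readHitsEof(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF();
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] full = encode("{\"k\":\"v\"}");
        // Drop the last 3 payload bytes; the length prefix still claims the full size.
        byte[] truncated = Arrays.copyOf(full, full.length - 3);
        System.out.println(readHitsEof(full));
        System.out.println(readHitsEof(truncated));
    }
}
```

The complete record reads back normally, while the truncated one ends in EOFException. The same thing happens whenever the reader's position in the stream no longer matches what the writer produced.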

Solution:

// In MapReduce, fields must be deserialized in the same order they were serialized.
// The over-long field is read last so that it cannot affect the other fields.
@Override
public void readFields(DataInput dataInput) throws IOException {
    this.userId = dataInput.readUTF();
    this.statisDate = dataInput.readUTF();
    String tempStr = dataInput.readUTF();
    StringBuilder sBuilder = new StringBuilder();
    // A chunk of the maximum length means more chunks follow, so keep reading
    while (tempStr.length() >= WRITE_READ_UTF_MAX_LENGTH) {
        sBuilder.append(tempStr);
        tempStr = dataInput.readUTF();
    }
    sBuilder.append(tempStr);
    this.dayResult = sBuilder.toString();
}
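The write/read pairing can be checked outside MapReduce with an in-memory round trip. The sketch below is an illustration under stated assumptions, not the job's actual code: the class ChunkedUtfDemo and the helpers writeChunked, readChunked, and roundTrip are hypothetical names, and the chunk size of 65535 / 3 chars is an assumption chosen so that a chunk of three-byte characters still fits in one writeUTF call. The writer always finishes with a chunk shorter than the maximum (possibly empty), so the reader's "keep going while the chunk is full" loop stops at the right place.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the original job.
class ChunkedUtfDemo {

    // Assumed chunk size: 3 bytes per char at most, so 21845 chars <= 65535 bytes.
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    /** Writes value as zero or more full chunks followed by one shorter chunk. */
    static void writeChunked(DataOutput out, String value) throws IOException {
        int start = 0;
        while (value.length() - start >= MAX_CHARS_PER_CHUNK) {
            out.writeUTF(value.substring(start, start + MAX_CHARS_PER_CHUNK));
            start += MAX_CHARS_PER_CHUNK;
        }
        out.writeUTF(value.substring(start)); // terminator chunk, may be empty
    }

    /** Reads chunks until one is shorter than the maximum, then joins them. */
    static String readChunked(DataInput in) throws IOException {
        StringBuilder builder = new StringBuilder();
        String chunk = in.readUTF();
        while (chunk.length() >= MAX_CHARS_PER_CHUNK) {
            builder.append(chunk);
            chunk = in.readUTF();
        }
        return builder.append(chunk).toString();
    }

    /** Writes all values back to back into memory, then reads them back in order. */
    static String[] roundTrip(String... values) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (String value : values) {
                writeChunked(out, value);
            }
            out.flush();
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            String[] result = new String[values.length];
            for (int i = 0; i < result.length; i++) {
                result[i] = readChunked(in);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a string made of count copies of c. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String[] values = {
            "",                                 // empty value
            repeat('a', MAX_CHARS_PER_CHUNK),   // exactly one full chunk
            repeat('\u4e2d', 100000),           // 300000 encoded bytes
            "tail"                              // must still line up afterwards
        };
        System.out.println(Arrays.equals(roundTrip(values), values));
    }
}
```

The cases worth exercising are the boundaries: an empty string, a value whose length is an exact multiple of the chunk size, and a field that follows the long value, since any mismatch between writer and reader shows up as a misaligned later field.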