Reading a File with Java NIO FileChannel and Fixing Garbled Chinese Text

Latest recommended article published 2023-04-18 09:06:23

License: CC 4.0 BY-SA

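This article shows how to read a file containing Chinese text with Java NIO's FileChannel and how to avoid the garbled output that can result. A worked example demonstrates how to decode UTF-8 encoded Chinese characters correctly, so that a read ending in the middle of a multi-byte character does not produce incomplete characters.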

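FileChannel, in the java.nio.channels package, is a channel connected to a file that makes reading and writing the file straightforward. It operates on ByteBuffer: it can read bytes from the file into a ByteBuffer, or write the bytes in a ByteBuffer to the file.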

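The file read here contains Chinese text (assume it is encoded as UTF-8). FileChannel reads raw bytes, and in UTF-8 most Chinese characters take 3 bytes (rarer ones take 4), so a single read can easily end in the middle of a character. Decoding each chunk on its own in the usual way, new String(byte[], "utf-8"), turns the split bytes into replacement characters, so it cannot avoid garbled output.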
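To make the problem concrete, here is a minimal sketch that cuts the UTF-8 bytes of a short Chinese string into fixed-size chunks and decodes each chunk on its own. The class name `NaiveChunkDecode`, the helper `decodePerChunk`, and the sample string are illustrative assumptions, not part of the original example. The string is written with Unicode escapes only so that the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NaiveChunkDecode {

    // Decode every fixed-size chunk independently, the way a per-read
    // new String(bytes, "utf-8") would.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            byte[] chunk = Arrays.copyOfRange(data, start, end);
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587\u6D4B\u8BD5"; // the four characters of "中文测试"
        byte[] data = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(data.length);                          // 12 (3 bytes per character)
        // 4-byte chunks cut through characters, so the result is garbled.
        System.out.println(decodePerChunk(data, 4).equals(text)); // false
        // 3-byte chunks happen to line up with character boundaries.
        System.out.println(decodePerChunk(data, 3).equals(text)); // true
    }
}
```

A chunk size that happens to align with character boundaries hides the problem, which is why it may only show up with certain files or buffer sizes.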

This article walks through a complete example of reading a Chinese text file with FileChannel without garbling the text.

1. Open the file channel

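// Locate the test file next to the compiled class
File file = new File(getClass().getResource("").getPath(), "中文测试.txt");
// Read-only mode ("r") is enough here; "rw" would needlessly require write access
RandomAccessFile raFile = new RandomAccessFile(file, "r");
FileChannel fChannel = raFile.getChannel();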

2. Read bytes from the file channel into a buffer

ByteBuffer bBuf = ByteBuffer.allocate(32); // small buffer, for testing only
int bytesRead = fChannel.read(bBuf); // returns -1 at end of file

3. Decode: convert the ByteBuffer to a CharBuffer as UTF-8

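CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
CharBuffer cBuf = CharBuffer.allocate(32);
bBuf.flip(); // switch bBuf from write mode to read mode
decoder.decode(bBuf, cBuf, false); // false: more input may follow this chunk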
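The third argument of `decode`, `endOfInput`, matters at this step. The sketch below decodes a chunk that ends one byte into a character and reports how many bytes the decoder left unconsumed. It assumes `endOfInput = false`, which tells the decoder that more bytes may follow. The names `PartialDecodeDemo` and `leftoverAfterDecode` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class PartialDecodeDemo {

    // Decode the first `count` bytes of `data` into `out` and return the
    // number of bytes the decoder left unconsumed.
    static int leftoverAfterDecode(byte[] data, int count, CharBuffer out) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(data, 0, count);
        // endOfInput = false: an incomplete trailing sequence is reported as
        // underflow and stays in the buffer instead of being treated as an error.
        CoderResult result = decoder.decode(in, out, false);
        if (!result.isUnderflow()) {
            throw new IllegalStateException("unexpected result: " + result);
        }
        return in.remaining();
    }

    public static void main(String[] args) {
        // Two characters, 3 bytes each in UTF-8.
        byte[] data = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        CharBuffer out = CharBuffer.allocate(8);

        // Feed only 4 bytes: one full character plus one byte of the next.
        int left = leftoverAfterDecode(data, 4, out);
        out.flip();
        System.out.println(out.length() + " char decoded, " + left + " byte left over");
    }
}
```

The leftover byte is exactly what the next steps have to carry over to the following read.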

4. Check whether any bytes are left undecoded after decode(); if so, save them for the next round

byte[] remainByte = null;
int leftNum = bBuf.remaining(); // bytes the decoder did not consume
if (leftNum > 0) { // save the undecoded bytes
	remainByte = new byte[leftNum];
	bBuf.get(remainByte, 0, leftNum);
}

5. Write the saved bytes back into the ByteBuffer, then do the next read
```
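bBuf.clear(); // back to write mode before refilling
if (remainByte != null) {
	bBuf.put(remainByte); // the leftover bytes go first
}
bytesRead = fChannel.read(bBuf); // newly read bytes are appended after them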
```
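Decoding uses the CharsetDecoder class. Because the file is UTF-8 encoded, the decoder comes from the UTF-8 Charset. On each call, CharsetDecoder decodes as many bytes as it can; bytes that cannot be decoded yet (an incomplete trailing character) must be saved and decoded together with the next read. Steps 4 and 5 above, which save those leftover bytes and write them back ahead of the next read, are therefore essential. Without them, every character that straddles a read boundary is lost, and the decoded text is missing many characters compared with the original file.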
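The save-and-write-back steps can also be expressed with `ByteBuffer.compact()`, which moves the unread bytes to the front of the buffer and leaves it ready for the next read. This is a swapped-in variant, not the approach used in this article's example. The sketch below assumes a buffer of at least 4 bytes (the longest UTF-8 sequence) and reads from an in-memory channel instead of a file so that it is self-contained. `CompactDecodeDemo` and `decodeAll` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class CompactDecodeDemo {

    // Read the channel in chunks of bufferSize bytes (>= 4) and decode as UTF-8,
    // carrying incomplete trailing bytes over to the next read with compact().
    static String decodeAll(ReadableByteChannel channel, int bufferSize) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CharBuffer chars = CharBuffer.allocate(bufferSize);
        StringBuilder sb = new StringBuilder();

        while (channel.read(bytes) != -1) {
            bytes.flip();                                   // write mode -> read mode
            CoderResult result = decoder.decode(bytes, chars, false);
            if (result.isError()) {
                result.throwException();                    // genuinely malformed input
            }
            chars.flip();
            sb.append(chars);
            chars.clear();
            bytes.compact();                                // keep leftover bytes at the front
        }

        // End of input: any bytes still pending form a truncated character.
        bytes.flip();
        CoderResult end = decoder.decode(bytes, chars, true);
        if (end.isError()) {
            end.throwException();
        }
        decoder.flush(chars);
        chars.flip();
        sb.append(chars);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4E2D\u6587\u6D4B\u8BD5 abc \u4E2D\u6587";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));
        System.out.println(decodeAll(channel, 5).equals(text)); // true
    }
}
```

With this variant the decoder reports malformed bytes as an exception rather than leaving them in the buffer, so invalid input fails fast instead of stalling the loop.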

The complete example code follows:

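import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FileChannelTest {

	public static void main(String[] args) throws IOException {
		FileChannelTest fcChannelTest = new FileChannelTest();
		fcChannelTest.readFile();
	}

	public void readFile() throws IOException {
		// The file is UTF-8 encoded, so decode it as UTF-8. Genuinely malformed
		// bytes are replaced rather than left in the buffer, where they would
		// stall the loop.
		CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
				.onMalformedInput(CodingErrorAction.REPLACE)
				.onUnmappableCharacter(CodingErrorAction.REPLACE);

		File file = new File(getClass().getResource("").getPath(), "中文测试.txt");
		// Read-only mode is enough; try-with-resources closes the file even on error
		try (RandomAccessFile raFile = new RandomAccessFile(file, "r");
				FileChannel fChannel = raFile.getChannel()) {

			ByteBuffer bBuf = ByteBuffer.allocate(32); // 32-byte buffer, for testing only
			CharBuffer cBuf = CharBuffer.allocate(32);

			int bytesRead = fChannel.read(bBuf); // read bytes from the channel into the buffer
			byte[] remainByte; // bytes decode() could not process yet; decoded with the next read
			while (bytesRead != -1) {

				bBuf.flip(); // switch bBuf from write mode to read mode
				// endOfInput = false: an incomplete trailing character stays in bBuf
				decoder.decode(bBuf, cBuf, false);
				cBuf.flip(); // switch cBuf from write mode to read mode

				remainByte = null;
				int leftNum = bBuf.remaining(); // number of undecoded bytes
				if (leftNum > 0) { // save the undecoded bytes
					remainByte = new byte[leftNum];
					bBuf.get(remainByte, 0, leftNum);
				}

				// Print the decoded characters
				System.out.print(cBuf.toString());

				bBuf.clear(); // switch bBuf back to write mode
				cBuf.clear(); // switch cBuf back to write mode
				if (remainByte != null) {
					bBuf.put(remainByte); // decoded together with the bytes of the next read
				}
				bytesRead = fChannel.read(bBuf);
			}

			// End of file: decode whatever is left and flush the decoder
			bBuf.flip();
			decoder.decode(bBuf, cBuf, true);
			decoder.flush(cBuf);
			cBuf.flip();
			System.out.print(cBuf.toString());
		}
	}
}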
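When working with the raw bytes is not itself a requirement, one possible alternative is to wrap the channel in a Reader and let the JDK carry incomplete characters across reads. The sketch below uses `Channels.newReader` with a UTF-8 decoder. It writes a temporary file so that it can run on its own. The names `ChannelReaderDemo` and `readUtf8` and the temporary file are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelReaderDemo {

    // Read the whole file as UTF-8 text through a Reader layered on a FileChannel.
    static String readUtf8(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
             BufferedReader reader = new BufferedReader(
                     // -1 asks for the implementation's default byte-buffer capacity
                     Channels.newReader(channel, StandardCharsets.UTF_8.newDecoder(), -1))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[32];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // the reader only hands back complete characters
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        try {
            String text = "\u4E2D\u6587\u6D4B\u8BD5 abc \u4E2D\u6587";
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8(tmp).equals(text)); // true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the decoder is passed in with its default settings, malformed bytes in the file surface as an exception from `read`. The manual loop in the example above remains useful when the bytes have to be inspected or forwarded as well.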