Reading large files efficiently for data processing: improving I/O performance by extending RandomAccessFile with a buffer

License: CC 4.0 BY-SA

Main text:

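At the time of writing, the most widely used J2SDK release is the 1.3 series. Developers on this version who need random file access have to use the RandomAccessFile class. Its I/O performance lags far behind comparable facilities in other common development languages and seriously hurts program efficiency.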

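Developers urgently need better performance. The sections below analyze the source code of RandomAccessFile and related file classes, identify the bottleneck, and then optimize it to build BufferedRandomAccessFile, a random file access class with a good performance-to-cost ratio.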

Before making any changes, run a baseline test: copy a 12 MB file one byte at a time (this exercises both reading and writing).

Read	Write	Elapsed time (seconds)
RandomAccessFile	RandomAccessFile	95.848
BufferedInputStream + DataInputStream	BufferedOutputStream + DataOutputStream	2.935
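
The baseline test above could be reproduced with something like the following sketch. It is not the original test program: the class name `ByteCopyBenchmark`, its method names, and the 256 KB default file size are assumptions chosen for illustration, and it uses try-with-resources, so it needs Java 7 or later rather than J2SDK 1.3. Absolute timings depend heavily on the JVM, the operating system, and the disk.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.util.Random;

class ByteCopyBenchmark {

    // Byte-by-byte copy through RandomAccessFile: no user-level buffer,
    // so every read()/write() goes straight to the underlying file.
    static long copyWithRandomAccessFile(File src, File dst) throws IOException {
        long copied = 0;
        try (RandomAccessFile in = new RandomAccessFile(src, "r");
             RandomAccessFile out = new RandomAccessFile(dst, "rw")) {
            out.setLength(0); // discard any previous content of dst
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
                copied++;
            }
        }
        return copied;
    }

    // Byte-by-byte copy through buffered streams: most calls only touch buf[].
    static long copyWithBufferedStreams(File src, File dst) throws IOException {
        long copied = 0;
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(src)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(dst)))) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default of 256 KB keeps the run short; pass 12582912 for 12 MB.
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 256 * 1024;

        File src = File.createTempFile("copy-src", ".bin");
        File dstRaf = File.createTempFile("copy-raf", ".bin");
        File dstBuf = File.createTempFile("copy-buf", ".bin");
        src.deleteOnExit();
        dstRaf.deleteOnExit();
        dstBuf.deleteOnExit();

        byte[] data = new byte[size];
        new Random(42).nextBytes(data);
        Files.write(src.toPath(), data);

        long t0 = System.nanoTime();
        copyWithRandomAccessFile(src, dstRaf);
        long t1 = System.nanoTime();
        copyWithBufferedStreams(src, dstBuf);
        long t2 = System.nanoTime();

        System.out.printf("RandomAccessFile: %.3f s%n", (t1 - t0) / 1e9);
        System.out.printf("Buffered streams: %.3f s%n", (t2 - t1) / 1e9);
    }
}
```

Both methods use the same loop, so any difference in elapsed time comes from the classes doing the I/O rather than from the copy logic.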

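The two differ by a factor of about 32 (95.848 s / 2.935 s ≈ 32.7); RandomAccessFile is far too slow. To find the cause, compare the key parts of the source code of both approaches.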

1.1. [RandomAccessFile]

public class RandomAccessFile implements DataOutput, DataInput {
    public final byte readByte() throws IOException {
        int ch = this.read();
        if (ch < 0)
            throw new EOFException();
        return (byte)(ch);
    }
    public native int read() throws IOException; 
    public final void writeByte(int v) throws IOException {
        write(v);
    }
    public native void write(int b) throws IOException; 
}

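As shown, RandomAccessFile calls a native method for every single byte it reads or writes, so each byte costs one I/O operation against the disk.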

1.2. [BufferedInputStream]

public class BufferedInputStream extends FilterInputStream {
    private static int defaultBufferSize = 2048; 
    protected byte buf[]; // the read buffer
    public BufferedInputStream(InputStream in, int size) {
        super(in);       
        if (size <= 0) {
            throw new IllegalArgumentException("Buffer size <= 0");
        }
        buf = new byte[size];
    }
    public synchronized int read() throws IOException {
        ensureOpen();
        if (pos >= count) {
            fill();
            if (pos >= count)
                return -1;
        }
        return buf[pos++] & 0xff; // read directly from buf[]
    }
    private void fill() throws IOException {
        if (markpos < 0)
            pos = 0;        /* no mark: throw away the buffer */
        else if (pos >= buf.length)  /* no room left in buffer */
            if (markpos > 0) {   /* can throw away early part of the buffer */
                int sz = pos - markpos;
                System.arraycopy(buf, markpos, buf, 0, sz);
                pos = sz;
                markpos = 0;
            } else if (buf.length >= marklimit) {
                markpos = -1;   /* buffer got too big, invalidate mark */
                pos = 0;    /* drop buffer contents */
            } else {        /* grow buffer */
                int nsz = pos * 2;
                if (nsz > marklimit)
                    nsz = marklimit;
                byte nbuf[] = new byte[nsz];
                System.arraycopy(buf, 0, nbuf, 0, pos);
                buf = nbuf;
            }
        count = pos;
        int n = in.read(buf, pos, buf.length - pos);
        if (n > 0)
            count = n + pos;
    }
}

1.3. [BufferedOutputStream]

public class BufferedOutputStream extends FilterOutputStream {
    protected byte buf[]; // the write buffer
    public BufferedOutputStream(OutputStream out, int size) {
        super(out);
        if (size <= 0) {
            throw new IllegalArgumentException("Buffer size <= 0");
        }
        buf = new byte[size];
    }
    public synchronized void write(int b) throws IOException {
        if (count >= buf.length) {
            flushBuffer();
        }
        buf[count++] = (byte)b; // write directly into buf[]
    }
    private void flushBuffer() throws IOException {
        if (count > 0) {
            out.write(buf, 0, count);
            count = 0;
        }
    }
}

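As shown, when BufferedInputStream or BufferedOutputStream reads or writes a byte, it operates directly on the in-memory buf[] whenever the data is already there (or, for writes, whenever the buffer still has room). Otherwise it first refills buf[] from the corresponding position on disk (or flushes buf[] to disk) and then operates on buf[] in memory. The vast majority of read and write operations therefore touch only buf[] in memory.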
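
One way to observe this is to count how many calls actually reach the underlying stream. The sketch below is only an illustration: `BufferCallCounter`, `CountingInputStream`, and `CountingOutputStream` are hypothetical helper names, not JDK classes, and in-memory byte array streams stand in for the disk so that the example is self-contained. It assumes Java 7 or later.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class BufferCallCounter {

    // Hypothetical helper: counts every read call that reaches the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        int calls;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            calls++;
            return in.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    // Hypothetical helper: counts every write call that reaches the wrapped stream.
    static class CountingOutputStream extends FilterOutputStream {
        int calls;
        CountingOutputStream(OutputStream out) { super(out); }
        @Override public void write(int b) throws IOException {
            calls++;
            out.write(b);
        }
        @Override public void write(byte[] b, int off, int len) throws IOException {
            calls++;
            out.write(b, off, len);
        }
    }

    // Reads totalBytes one byte at a time through a BufferedInputStream and
    // returns how many read calls hit the underlying stream.
    static int countUnderlyingReads(int totalBytes, int bufferSize) throws IOException {
        CountingInputStream counter =
            new CountingInputStream(new ByteArrayInputStream(new byte[totalBytes]));
        try (InputStream in = new BufferedInputStream(counter, bufferSize)) {
            while (in.read() != -1) {
                // each call is served from buf[] unless the buffer is exhausted
            }
        }
        return counter.calls;
    }

    // Writes totalBytes one byte at a time through a BufferedOutputStream and
    // returns how many write calls hit the underlying stream.
    static int countUnderlyingWrites(int totalBytes, int bufferSize) throws IOException {
        CountingOutputStream counter =
            new CountingOutputStream(new ByteArrayOutputStream());
        try (OutputStream out = new BufferedOutputStream(counter, bufferSize)) {
            for (int i = 0; i < totalBytes; i++) {
                out.write(i); // stored in buf[]; flushed only when buf[] is full
            }
        } // close() flushes whatever is left in buf[]
        return counter.calls;
    }

    public static void main(String[] args) throws IOException {
        int total = 10000;
        int bufferSize = 2048;
        System.out.println("byte-level calls made by the caller: " + total);
        System.out.println("reads reaching the source:  " + countUnderlyingReads(total, bufferSize));
        System.out.println("writes reaching the sink:   " + countUnderlyingWrites(total, bufferSize));
    }
}
```

With 10,000 single-byte calls and a 2048-byte buffer, the number of calls reaching the underlying stream should be on the order of total / bufferSize, which is the reduction in I/O operations that the analysis above describes.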

1.4. Summary

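Memory access times are on the order of nanoseconds (10^-9 s), while disk access times are on the order of milliseconds (10^-3 s), so for a single operation memory is about a million times faster than disk. In theory, even 10,000 memory operations (about 10^-5 s in total) cost far less than a single disk I/O. The buffered streams clearly gain their efficiency by adding accesses to an in-memory buffer in exchange for fewer disk I/O operations, at the price of some extra buffer-management overhead. In the test above, this made access roughly 32 times faster.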