Reading and Writing Large Files in C

http://blog.chinaunix.net/u1/33412/showart_397173.html

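Memory-mapped files resemble virtual memory: mapping a file reserves a region of address space and commits physical storage to it, except that the backing storage is an existing file on disk rather than the system page file. The file must be mapped before it can be used, after which it behaves as if the whole file had been loaded from disk into memory. Working on a disk file through a mapping therefore needs no explicit I/O calls and no application-side buffers; the system manages all file caching itself. Because the steps of loading file data into memory, writing it back from memory to the file, and freeing the memory blocks all disappear from the application, memory-mapped files are well suited to processing large amounts of data. In addition, real systems often need to share data between processes. With small amounts of data there are many flexible options, but when the shared data is very large, memory-mapped files are the tool to reach for. They are one of the most efficient ways to share data between processes on the same machine.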

The fastest way to write a file on Linux: mmap

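Addendum, 2009-12-29:
I apologize for my shallow understanding: the conclusions of the tests below are not entirely accurate.
Mapping a file only makes the operating system reserve a range of virtual memory; the kernel does not necessarily allocate physical memory at that point. Physical pages are allocated only when the region is actually read or written.
After data is written to the mapped region, it still sits in the kernel's physical memory and has not necessarily reached the disk. Unmapping does not necessarily write it to disk immediately either.
The correct approach is to call msync() before unmapping, so that the dirty pages held by the kernel are written to disk.
So compared with other ways of writing a file, mmap works directly on kernel memory, and writing into kernel memory is relatively fast. The speed at which the kernel then writes that memory to disk is much the same whichever method is used.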
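
The sequence the addendum recommends (write into the mapping, call msync(), then unmap) can be sketched in plain C as follows. This is a minimal sketch rather than the author's code: the helper name write_via_mmap is hypothetical, and ftruncate() is assumed as the way to size the file instead of lseek() plus a one-byte write().

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from data to path through a shared
 * mapping. Returns 0 on success and -1 on failure. */
int write_via_mmap(const char *path, const void *data, size_t len)
{
    if (len == 0)
        return -1;  /* mmap rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The file must be at least len bytes long before it is mapped. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, data, len);  /* dirties pages held by the kernel */

    /* MS_SYNC blocks until the dirty pages have been written out. */
    int rc = msync(p, len, MS_SYNC);
    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```

A caller would pass a file name and a buffer, for example write_via_mmap("mmap.dat", buf, buf_len), and treat a return value of -1 as failure. Any timing of this path should include the msync() call, since that is where the disk write actually happens.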

-----------------------------------------------------------------------
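A while ago I tested large-file write performance on Linux and found that calling write() directly was fastest, while the memory-mapped approach was very slow: a 14-fold difference. I could not make sense of it until today:
In fact, the fastest way to write a file on Linux really is a memory-mapped file; in my test, writing through a mapping was 59% faster than writing directly.
The earlier test failed because I kept using msync() to write the data. That function is quite slow and is best avoided.
After munmap(), the data is written back to the file. (As the addendum above explains, unmapping only leaves the dirty pages with the kernel; it does not guarantee that they have reached the disk.)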

Here is the test code:
//=============================================================
// Writing 200 MB of data took 0.688 s
/*
Large-write performance test, test_mmap.cpp: write a file through a memory mapping
*/
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

static int operator-(struct timeval& lsh, struct timeval& rsh)
{
    if (lsh.tv_sec==rsh.tv_sec)
    {
        return lsh.tv_usec - rsh.tv_usec;
    }
    else
    {
        return (lsh.tv_sec-rsh.tv_sec)*1000000 + (lsh.tv_usec - rsh.tv_usec);
    }
}

void test()
{
    struct timeval start;
    struct timeval end;
    const int DATA_LEN = 1024*1024*200;
    char* pData = new char[DATA_LEN]; // 200 MB
    memset(pData, 'a', DATA_LEN);
    gettimeofday(&start, NULL);
    // O_CREAT requires a third argument giving the file mode
    int fd = open("mmap.dat", O_RDWR | O_CREAT, 0644);
    if (fd<0)
    {
        perror("open");
        delete[] pData;
        return;
    }
    // Extend the file to DATA_LEN bytes; a mapping cannot grow the file
    if (lseek(fd, DATA_LEN-1, SEEK_SET)<0 || write(fd, "", 1)!=1)
    {
        perror("extend file");
        close(fd);
        delete[] pData;
        return;
    }
    void* p = mmap(NULL, DATA_LEN, PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED==p)
    {
        perror("mmap");
        close(fd);
        delete[] pData;
        return;
    }
    close(fd);
    fd = -1;
    //madvise(p, DATA_LEN, MADV_SEQUENTIAL|MADV_WILLNEED); // when writing, it was actually faster without madvise
    memcpy(p, pData, DATA_LEN);
    //if (-1==msync(p, DATA_LEN, MS_SYNC|MS_INVALIDATE)) // never use this function, it is extremely slow
    //{
    // perror("msync");
    //}
    if (-1==munmap(p, DATA_LEN))
    {
        perror("munmap");
    }
    p = NULL;
    //
    gettimeofday(&end, NULL);
    delete[] pData;
    pData = NULL;
    // Show the elapsed time
    struct tm stTime;
    localtime_r(&start.tv_sec, &stTime);
    char strTemp[40];
    strftime(strTemp, sizeof(strTemp)-1, "%Y-%m-%d %H:%M:%S", &stTime);
    // tv_usec has at most six digits and is not an int, so print it as a long
    printf("start=%s.%06ld\n", strTemp, (long)start.tv_usec);
    //
    localtime_r(&end.tv_sec, &stTime);
    strftime(strTemp, sizeof(strTemp)-1, "%Y-%m-%d %H:%M:%S", &stTime);
    printf("end =%s.%06ld\n", strTemp, (long)end.tv_usec);
    printf("spend=%d (%9.3fs)\n", end-start, (double)(end-start)/1000000.0f);
}

int main()
{
    test();
    return 0; // 0 reports success to the shell
}

/*
g++ -o test_mmap test_mmap.cpp
*/

http://hi.baidu.com/ah__fu/blog/item/476799d9313bfeee39012f84.html

Linux large-file write test series (5): writing through a memory-mapped file with mmap

2009-04-29: The test method in this post is flawed; please refer to the most recent post instead.

The content below is kept as a counter-example.
==============================================================

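    This time I tried writing the data through a Linux memory-mapped file.
// Writing 200 MB of data took 4694309 microseconds
/*
Large-write performance test, test_mmap.cpp: write a file through a memory mapping
*/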
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

static int operator-(struct timeval& lsh, struct timeval& rsh)
{
    if (lsh.tv_sec==rsh.tv_sec)
    {
        return lsh.tv_usec - rsh.tv_usec;
    }
    else
    {
        return (lsh.tv_sec-rsh.tv_sec)*1000000 + (lsh.tv_usec - rsh.tv_usec);
    }
}

void test()
{
    struct timeval start;
    struct timeval end;
    //struct timeval start1, end1;
    const int DATA_LEN = 1024*1024*200;
    char* pData = new char[DATA_LEN]; //200MB
    gettimeofday(&start, NULL);
    int fd = open("mmap.dat", O_RDWR | O_CREAT | O_TRUNC, 0644); // O_CREAT requires a mode argument
    if (fd<0)
    {
        printf("open error!\n");
        return;
    }
    lseek(fd, DATA_LEN, SEEK_SET);
    write(fd, "", 1);
    void* p = mmap(NULL, DATA_LEN, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED==p)
    {
        perror("mmap");
        return;
    }
    close(fd);
    fd = -1;
    //gettimeofday(&start1, NULL);
    memcpy(p, pData, DATA_LEN);
    //gettimeofday(&end1, NULL);
    if (-1==msync(p, DATA_LEN, MS_SYNC))
    {
        perror("msync");
    }
    if (-1==munmap(p, DATA_LEN))
    {
        perror("munmap");
    }
    p = NULL;
    //
    gettimeofday(&end, NULL);
    delete[] pData;
    pData = NULL;
    // Show the elapsed time
    struct tm stTime;
    localtime_r(&start.tv_sec, &stTime);
    char strTemp[40];
    strftime(strTemp, sizeof(strTemp)-1, "%Y-%m-%d %H:%M:%S", &stTime);
    // tv_usec has at most six digits and is not an int, so print it as a long
    printf("start=%s.%06ld\n", strTemp, (long)start.tv_usec);
    //
    localtime_r(&end.tv_sec, &stTime);
    strftime(strTemp, sizeof(strTemp)-1, "%Y-%m-%d %H:%M:%S", &stTime);
    printf("end =%s.%06ld\n", strTemp, (long)end.tv_usec);
    printf("spend=%d microseconds\n", end-start);
    //printf("copy spend=%d microseconds\n", end1-start1);
}

int main()
{
    test();
    return 0;
}

/*
g++ -o test_mmap test_mmap.cpp
*/

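    The test above shows that writing large amounts of data through a memory-mapped file does not improve performance. My impression is that memory-mapped files mainly offer convenience, and that they are only likely to improve performance for frequent reads and writes of small amounts of data.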

======================================================
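Addendum, 2009-04-12
    Why is writing through a memory-mapped file slow, and where does the time go? Testing shows that most of it is spent inside msync(). Evidently a mapped write uses a large block of memory as a buffer and only reaches the disk when msync() is called. That is why neither madvise() and similar calls nor mapping the file write-only in mmap() improves write performance.
    According to man msync, msync() writes only the data that has changed. My guess is that it spends much of its time checking which data has changed (and that it would be fast if only a little had changed). I searched for a function or option that tells msync() "don't check, write everything" and found none.
    So, if you need high-performance writes, calling open() and write() directly is the best choice.
======================================================
Addendum, 2009-04-28
    Unconvinced, I tried again to speed up msync(), still without success. No combination of options improved performance; only madvise() with the sequential-access hint helped a little.
    Today's results for writing 200 MB of data:
direct write: 1.099s
msync:        15.626s     write is about 14.22 times faster than msync. That is absurd!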
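
One caveat when reading these two numbers: msync() with MS_SYNC returns only after the dirty pages have been written out, whereas write() normally returns as soon as the data has been copied into the kernel's cache. A comparison on equal terms would have the write() path call fsync() before the clock stops. Below is a minimal sketch of such a path; the helper names write_all and write_file_synced are hypothetical and not part of the original tests.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are written.
 * write() may write fewer bytes than requested, so one call is not enough. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write a buffer to path and wait until it is on disk.
 * Returns 0 on success and -1 on failure. */
int write_file_synced(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, buf, len);
    if (rc == 0 && fsync(fd) != 0)  /* flush the cached data to disk */
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Timing write_file_synced() against a mapped write that ends with msync() measures the same amount of work on both sides, which is the situation the 2009-12-29 addendum describes.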

http://hi.baidu.com/ah__fu/blog/item/8fc8132491bb833b8644f9f5.html

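Today I wrote a program that reads a large file, and it failed with "no registers" and memory errors. It turned out that the array I had allocated was too large. I changed the program; the code below gives examples of four different ways to read a large file.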

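The test results, fastest first: mmap (12 us), read_bf_once (30143 us), read_bf_block (reading row by row: 39106 us), malloc (40145 us). Note that the mmap figure covers only setting up the mapping: read_bf_mmap never touches the mapped pages, so no file data is actually read in that time.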
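
To time an mmap read on the same footing as the fread variants, the mapped pages have to be touched, because the kernel loads a page only when it is first accessed. The following is a minimal sketch of one way to do that; the helper name sum_mapped_file is hypothetical, and summing the bytes is just an assumed stand-in for real processing.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map path read-only and add up every byte, which forces
 * each page of the file to be faulted in. Returns 0 on success and stores the
 * sum in *out; returns -1 on failure or if the file is empty. */
int sum_mapped_file(const char *path, unsigned long long *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) != 0 || sb.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)sb.st_size;

    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    unsigned long long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];  /* first access to each page triggers the actual read */

    munmap((void *)p, len);
    *out = sum;
    return 0;
}
```

Placing the two gettimeofday() calls around a call like this, rather than around mmap() alone, gives a number that can be compared with the fread timings.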

/*
* test_rw_binaryfile.c
*
* Created on: Jan 15, 2010
* Author: root
*/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/time.h>   /* gettimeofday, struct timeval */
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

#define MATRIXSIZE 3601
#define BLOCKSIZE (MATRIXSIZE*1)   /* floats read per fread call: one row */

float g_s[MATRIXSIZE][MATRIXSIZE]; /* global on purpose: as a local variable this array (about 52 MB) would overflow the stack */

int main(){
    int read_bf_block_malloc(char* file_name);
    int read_bf_block(char* file_name);
    int read_bf_once(char* file_name);
    int read_bf_mmap(char* file_name);
    // read_bf_block_malloc("N38W084.bin");
    read_bf_block("N38W084.bin");
    // read_bf_once("N38W084.bin");
    // read_bf_mmap("N38W084.bin");
    // read_bf_mmap("/etc/passwd");
    return 0;
}

/*
* read_bf_block_malloc(char* file_name) uses dynamically allocated memory instead of the global array.
* It runs slower than the others, perhaps because the separately allocated rows are not contiguous.
*/

int read_bf_block_malloc(char* file_name){
   FILE * stream = fopen(file_name, "rb"); /* b: binary file */
   if (stream == NULL) {
      printf("open file failed\n");
      return 1;
   }

   /**** Start: allocate memory with malloc ****/
   int row = MATRIXSIZE;
   int col = MATRIXSIZE;

   struct timeval read_start_time;
   gettimeofday(&read_start_time, NULL);

   float **p_array = (float **) malloc(row * sizeof(float*));
   if (p_array == NULL) {
      printf("malloc failed!\n");
      fclose(stream);
      return 1;
   }

   int k;
   for (k=0; k<row; k++) {
      p_array[k] = (float *) malloc(col*sizeof(float));
      if (p_array[k] == NULL) {
         printf("malloc failed!\n");
         while (k-- > 0) free(p_array[k]);
         free(p_array);
         fclose(stream);
         return 1;
      }
   }
   /**** End: allocate memory with malloc ****/

   /* Read one row per call; stop at the last row or on a short read.
      Looping on feof() would run one pass too many and write past the array. */
   int count = 0;
   while (count < row && fread(p_array[count],sizeof(float),BLOCKSIZE,stream) == BLOCKSIZE) {
      count++;
   }

   struct timeval read_end_time;
   gettimeofday(&read_end_time, NULL);
   /* Include tv_sec: the tv_usec difference alone is wrong once a second boundary is crossed */
   long spent = (long)(read_end_time.tv_sec-read_start_time.tv_sec)*1000000L + (long)(read_end_time.tv_usec-read_start_time.tv_usec);
   printf("Spent time for reading data from disk is %ld microseconds\n", spent);

   fclose(stream);

   for (k=0; k<row; k++) {
      free(p_array[k]); /* free every row before the row-pointer table */
   }
   free(p_array);

   return 0;
}

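/*
* read_bf_block(char* file_name) reads data from the file block by block.
*/

int read_bf_block(char* file_name)
{
   FILE * stream = fopen(file_name, "rb"); /* b: binary file */
   if (stream == NULL) {
      printf("open file failed\n");
      return 1;
   }

   float *p = &g_s[0][0]; /* g_s itself has type float (*)[MATRIXSIZE], not float* */
   size_t total = (size_t)MATRIXSIZE*MATRIXSIZE;
   size_t sum = 0; /* number of floats read so far */

   struct timeval read_start_time;
   gettimeofday(&read_start_time, NULL);

   /* Read one block per call; stop when the array is full or on a short read.
      Looping on feof() would run one pass too many and write past the array. */
   while (sum + BLOCKSIZE <= total && fread(p+sum,sizeof(float),BLOCKSIZE,stream) == BLOCKSIZE) {
      sum += BLOCKSIZE;
   }

   fclose(stream);

   struct timeval read_end_time;
   gettimeofday(&read_end_time, NULL);
   /* Include tv_sec: the tv_usec difference alone is wrong once a second boundary is crossed */
   long spent = (long)(read_end_time.tv_sec-read_start_time.tv_sec)*1000000L + (long)(read_end_time.tv_usec-read_start_time.tv_usec);
   printf("Spent time for reading data from disk is %ld microseconds\n", spent);

   return 0;
}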

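/*
* read_bf_once: read all the data in a single call.
*/
int read_bf_once(char* file_name)
{
   FILE * stream = fopen(file_name, "rb"); /* b: binary file */
   if (stream == NULL) {
      printf("open file failed\n");
      return 1;
   }

// float s[MATRIXSIZE][MATRIXSIZE]; // Must be defined outside the function, otherwise the stack overflows when MATRIXSIZE is large.

   struct timeval read_start_time;
   gettimeofday(&read_start_time, NULL);

   size_t size = fread(g_s,sizeof(float),(size_t)MATRIXSIZE*MATRIXSIZE,stream);
// size_t size = fread(s,sizeof(float),(size_t)MATRIXSIZE*MATRIXSIZE,stream);

   /* fread returns the number of items read; fewer than requested means EOF or an error */
   if (size != (size_t)MATRIXSIZE*MATRIXSIZE) {
      printf("short read: got %zu floats\n", size);
   }
   fclose(stream);

   struct timeval read_end_time;
   gettimeofday(&read_end_time, NULL);
   /* Include tv_sec: the tv_usec difference alone is wrong once a second boundary is crossed */
   long spent = (long)(read_end_time.tv_sec-read_start_time.tv_sec)*1000000L + (long)(read_end_time.tv_usec-read_start_time.tv_usec);
   printf("Spent time for reading data from disk is %ld microseconds\n", spent);

   return 0;
}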

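int read_bf_mmap(char* file_name)
{
   struct timeval read_start_time;
   gettimeofday(&read_start_time, NULL);

   int fd = open(file_name,O_RDONLY);
   if (fd < 0) {
      perror("open");
      return 1;
   }
   struct stat sb;
   if (fstat(fd,&sb) != 0) {
      perror("fstat");
      close(fd);
      return 1;
   }

   void * start;
   start = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

   if (start==MAP_FAILED){
      perror("mmap");
      close(fd);
      return 1;
   }

// Uncomment to print every element; accessing the mapping is what actually reads the file.
//int i=0;
//int j=0;
//for ( i=0; i<MATRIXSIZE; i++) {
// for ( j=0; j<MATRIXSIZE; j++) {
//printf ("s[%d][%d] = %3f\n", i,j,((float *)start)[i*MATRIXSIZE+j]);
//}
//}

   struct timeval read_end_time;
   gettimeofday(&read_end_time, NULL);
   /* Include tv_sec: the tv_usec difference alone is wrong once a second boundary is crossed */
   long spent = (long)(read_end_time.tv_sec-read_start_time.tv_sec)*1000000L + (long)(read_end_time.tv_usec-read_start_time.tv_usec);
   printf("Spent time for reading data from disk is %ld microseconds\n", spent);

   munmap(start,sb.st_size);
   close(fd);
   return 0;
}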

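The cause of the error:

The two-dimensional array was simply too large. Moving the definition of the large array outside the function body solved the problem.
My analysis: the memory available to a single function call is limited. The two-dimensional array defined inside the function was too large and exhausted the stack, hence the error.

Other users have summarized the following four approaches, which cover most cases: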

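Method 1:
In VC, increase the stack size on the Link tab of the Project Settings dialog (the default on Windows is 1 MB).

Method 2:
Local variables live on the stack; declaring the array global or static removes the stack limit. Global variables, static data, and constants live in the global data area; the code of all functions lives in the code area; and the local variables, function arguments, return values, and return addresses needed to run a function live in the stack area.

1. Static functions differ from ordinary functions in that a static function cannot be called by name from outside its own source file.

2. Static local variables differ from ordinary local variables in that a static local variable is initialized only once; on later calls it keeps the value left by the previous call.

3. Static global variables differ from ordinary global variables in that the scope of a static global variable is limited to its own source file.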

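In C++, memory is commonly described as five areas: the stack, the free store, the heap, global/static storage, and constant storage.
    Stack: storage for variables that the compiler allocates when needed and releases automatically when no longer needed, typically local variables and function parameters.
    Free store: memory blocks allocated with new. The compiler does not release them; the application controls that, and in general each new needs a matching delete. If the programmer never releases a block, the operating system reclaims it when the program exits.
    Heap: memory blocks allocated with malloc and its relatives. It is very similar to the free store, except that blocks are released with free.
    Global/static storage: global and static variables are allocated in the same region. In C, globals were further divided into initialized and uninitialized ones (initialized global and static variables in one region, uninitialized ones in an adjacent region; the uninitialized object storage can be accessed and manipulated through void*, and the system releases it when the program ends). C++ makes no such distinction; they share the same region.
    Constant storage: a special region that holds constants, which must not be modified (although there are many illegitimate ways to modify them anyway).
Source: DIY部落 (http://www.diybl.com/course/3_program/c++/cppjs/20091112/182134.html); that article also has a more detailed comparison of the trade-offs of these storage types.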

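Method 3:

int Array[90000]; // allocated on the stack (when declared inside a function)
int *pArray = new int[90000]; // allocated dynamically
delete[] pArray; // memory from new[] must be released with delete[]

Dynamically allocated memory can be much larger than the stack, so use the latter form.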

Method 4:
Use vector
#include <vector>

using namespace std;

int main()
{
    vector<int> A(90000);
    A[0] = 1;
    return 0;
}

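You can allocate dynamically with either new or vector.
With new you manage the memory yourself, whereas vector manages it automatically.
Prefer vector where possible; there is far less room for error.
If you have special memory-management requirements, though, managing it yourself is still the better choice.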
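
In plain C, the counterpart of new and vector is malloc or calloc. The sketch below is one assumed way to apply this to the matrix from the reading example: it allocates all rows as a single block, which also keeps the data contiguous, unlike the row-by-row allocation in read_bf_block_malloc. The names matrix_alloc and matrix_index are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate a rows x cols float matrix as one zeroed
 * block. Returns NULL if a dimension is zero, the total size would overflow,
 * or the allocation fails. Release the result with free(). */
float *matrix_alloc(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0 || rows > (size_t)-1 / sizeof(float) / cols)
        return NULL;
    return calloc(rows * cols, sizeof(float));
}

/* Row-major index of element (row, col) in a matrix with cols columns. */
static inline size_t matrix_index(size_t cols, size_t row, size_t col)
{
    return row * cols + col;
}
```

For the 3601 x 3601 case, float *m = matrix_alloc(3601, 3601) followed by m[matrix_index(3601, i, j)] plays the role of g_s[i][j], and a single fread into m can fill the whole matrix.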

http://www.blogjava.net/windonly/archive/2009/06/16/282602.html

#include <stdio.h>
// #define _LARGEFILE_SOURCE
// #define _LARGEFILE64_SOURCE
// #define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string.h>   // strerror
#include <fcntl.h>
#include <errno.h>

int main(int argc, char *argv[])
{
        off_t file_last_pos;
        if (argc < 2) {
                printf("usage: %s <file>\n", argv[0]);
                return 1;
        }
        // With -D_FILE_OFFSET_BITS=64, plain open() and off_t are 64-bit, so open64() is not needed
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                printf("can't open file [%s]\n", strerror(errno));
                return 1;
        } else {
                printf("file open success\n");
        }
        file_last_pos = lseek(fd, 0, SEEK_END);
        // off_t may be wider than int, so cast it and print with %lld
        printf("Size: %lld\n", (long long)file_last_pos);
        close(fd);
        return 0;
}

// These GCC flags matter: I originally hoped to solve this with #define in the source, but in the end only the command-line flags worked
gcc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 test.c -o test