[libxml2]_[C/C++]_[Reading large XML files efficiently]

License: CC 4.0 BY-SA

Tags: #libxml2 #xml #C++ #large-files #reading

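This article shows how to use the libxml2 library to parse large XML files efficiently without losing the whole parse when the file contains an error. Reading the file as a stream with xmlTextReader means the data read before an error has already been processed, so less work and memory are wasted.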

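Scenario

1. A fairly large XML file has to be read into memory and converted into application objects (for example, C++ objects). The usual approach is to load the whole file into a DOM tree and then walk that tree to build the C++ objects.

2. If this large XML file contains an error, loading fails and no DOM tree is produced. The work already spent on the nodes parsed so far is wasted, and none of the XML can be analyzed.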

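Notes

1. Building a DOM tree and then converting it into C++ objects keeps two copies of the same data in memory when only one is needed. Traversing a huge DOM tree is also expensive, and for a very large XML file it takes a long time.

2. Reading with libxml2's xmlTextReader avoids both problems. It is a streaming, pull-style reader rather than the callback-based SAX interface: it hands over one node at a time, so every node read before an error has already been delivered to your code.

3. Passing your own I/O callbacks (built on the system's I/O functions) to xmlReaderForIO makes it easy to handle special sources, such as files with restricted permissions or files on the network.

4. Note: when using libxml2, call xmlInitParser() once from the main thread of the main process before any other libxml2 call, and call xmlCleanupParser() from the main thread just before the process exits.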
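Note 3 depends on the callback shapes that xmlReaderForIO accepts: a read callback of the form int(void*, char*, int) and a close callback of the form int(void*). The following is a minimal sketch of such a pair for a source that is not a FILE*. MemorySource, memoryReadCallback and memoryCloseCallback are hypothetical names used only for illustration; the in-memory buffer stands in for whatever network or restricted-access stream you actually have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical source: a block of bytes standing in for a network or
// restricted-access stream. Not part of libxml2.
struct MemorySource {
    const char* data;    // bytes to serve
    std::size_t size;    // total number of bytes
    std::size_t offset;  // number of bytes already handed out
};

// Same shape as libxml2's xmlInputReadCallback: copy at most `len` bytes
// into `buffer` and return the number of bytes copied, 0 at end of input,
// or -1 on error.
int memoryReadCallback(void* context, char* buffer, int len)
{
    MemorySource* src = static_cast<MemorySource*>(context);
    if (src == nullptr || buffer == nullptr || len < 0)
        return -1;
    assert(src->offset <= src->size);  // invariant of this sketch

    std::size_t remaining = src->size - src->offset;
    std::size_t wanted = static_cast<std::size_t>(len);
    std::size_t count = remaining < wanted ? remaining : wanted;

    std::memcpy(buffer, src->data + src->offset, count);
    src->offset += count;
    return static_cast<int>(count);
}

// Same shape as libxml2's xmlInputCloseCallback: return 0 on success.
int memoryCloseCallback(void* context)
{
    (void)context;  // nothing to release: the caller owns the MemorySource
    return 0;
}
```

With libxml2 available, the pair would be passed as xmlReaderForIO(memoryReadCallback, memoryCloseCallback, &source, NULL, NULL, XML_PARSE_RECOVER). Returning short counts is fine: the reader simply calls the callback again until it gets 0.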

Example


#include <libxml/parser.h>
#include <libxml/tree.h>
#include <libxml/xmlreader.h>
#include <cstdio>

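// Read callback for xmlReaderForIO: returns the number of bytes read,
// 0 at end of file, or -1 on error.
int pXmlInputReadCallback (void * context, char * buffer, int len)
{
    FILE* file = (FILE*)context;
    // Note: request len items of 1 byte each so that fread's return value is the exact byte count.
    size_t bytesRead = fread(buffer, 1, len, file);
    if (bytesRead == 0 && ferror(file))
        return -1;
    return (int)bytesRead;
}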

int pXmlInputCloseCallback (void * context)
{
    FILE* file = (FILE*)context;
    return fclose(file);
}

/**
 * processNode:
 * @reader: the xmlReader
 *
 * Dump information about the current node
 */
static void
processNode(xmlTextReaderPtr reader) {
    const xmlChar *name, *value;

    name = xmlTextReaderConstName(reader);
    if (name == NULL)
        name = BAD_CAST "--";

    value = xmlTextReaderConstValue(reader);

    printf("%d %d %s %d %d",
        xmlTextReaderDepth(reader),
        xmlTextReaderNodeType(reader),
        name,
        xmlTextReaderIsEmptyElement(reader),
        xmlTextReaderHasValue(reader));
    if (value == NULL)
        printf("\n");
    else {
        if (xmlStrlen(value) > 40)
            printf(" %.40s...\n", value);
        else
            printf(" %s\n", value);
    }
}

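// http://xmlsoft.org/examples/reader1.c
void TestTreeRead()
{
    // On Windows, _wfopen(L"test.xml", L"rb") can be used instead for wide-character paths.
    FILE* file = fopen("test.xml", "rb");
    if(!file)
        return;

    // XML_PARSE_RECOVER asks the parser to keep going after errors where it can.
    xmlTextReaderPtr reader = xmlReaderForIO(pXmlInputReadCallback,pXmlInputCloseCallback,file,NULL,"UTF-8",XML_PARSE_RECOVER);
    if(!reader)
        return;

    int ret = xmlTextReaderRead(reader);

    // 1 = a node was read, 0 = end of document, -1 = parse error.
    while (ret == 1) {
        processNode(reader);
        ret = xmlTextReaderRead(reader);
    }
    if (ret != 0)
        fprintf(stderr, "test.xml: failed to parse; the nodes printed so far were read before the error\n");

    // Releases the reader and the input it owns; xmlTextReaderClose() alone does not free the reader.
    xmlFreeTextReader(reader);
}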

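Partial output (columns: depth, node type, name, is-empty-element flag, has-value flag, then the value if there is one; node type 1 is an element start, 3 is text, 14 is significant whitespace, 15 is an element end):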

0 1 ROOT 0 0
1 1 one 0 0
2 14 #text 0 1

2 1 node1 0 0
3 3 #text 0 1 content of node 1
2 15 node1 0 0
2 14 #text 0 1

2 1 node2 1 0
2 14 #text 0 1

2 1 node3 0 0
3 3 #text 0 1 this node has attributes
2 15 node3 0 0
2 14 #text 0 1

2 1 node4 0 0
3 3 #text 0 1 other way to create content (which is al...
2 15 node4 0 0
2 14 #text 0 1

1 15 one 0 0
1 1 one 0 0
2 14 #text 0 1

2 1 node1 0 0
3 3 #text 0 1 content of node 1
2 15 node1 0 0
2 14 #text 0 1

2 1 node2 1 0
2 14 #text 0 1

2 1 node3 0 0
3 3 #text 0 1 this node has attributes
2 15 node3 0 0
2 14 #text 0 1

2 1 node4 0 0
3 3 #text 0 1 other way to create content (which is al...
2 15 node4 0 0
2 14 #text 0 1

1 15 one 0 0
1 1 one 0 0
2 14 #text 0 1

2 1 node1 0 0
3 3 #text 0 1 content of node 1
2 15 node1 0 0
2 14 #text 0 1

2 1 node2 1 0
2 14 #text 0 1

2 1 node3 0 0
3 3 #text 0 1 this node has attributes
2 15 node3 0 0
2 14 #text 0 1

2 1 node4 0 0
3 3 #text 0 1 other way to create content (which is al...
2 15 node4 0 0
2 14 #text 0 1
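The node stream above can be turned into C++ objects as it arrives, with no intermediate DOM tree. The following is a minimal sketch of that idea, not the author's implementation: Record and RecordBuilder are hypothetical names, and the builder is written against plain (node type, name, value) triples so that it stays independent of libxml2. It assumes each record is an element whose text arrives as a single text node, and it ignores mixed content.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Node type codes as they appear in the second column of the output
// (they match the values of libxml2's xmlReaderTypes enum).
enum NodeType { kElement = 1, kText = 3, kEndElement = 15 };

// Hypothetical target object: one record per element that carried text.
struct Record {
    std::string name;
    std::string text;
};

class RecordBuilder {
public:
    // Feed one reader event. `value` may be null, as it can be when it
    // comes from xmlTextReaderConstValue().
    void onNode(int nodeType, const char* name, const char* value)
    {
        // A negative node type means the reader reported an error; the
        // caller is expected to stop before passing it on.
        assert(nodeType >= 0);

        if (nodeType == kElement) {
            current_ = name ? name : "";
        } else if (nodeType == kText && value && !current_.empty()) {
            records_.push_back(Record{current_, value});
        } else if (nodeType == kEndElement) {
            current_.clear();
        }
        // Other node types (for example 14, whitespace) are ignored.
    }

    // Records completed so far; valid even if the stream stopped early.
    const std::vector<Record>& records() const { return records_; }

private:
    std::string current_;          // name of the most recently opened element
    std::vector<Record> records_;  // everything collected so far
};
```

Inside the read loop, processNode would be replaced by a call such as builder.onNode(xmlTextReaderNodeType(reader), (const char*)xmlTextReaderConstName(reader), (const char*)xmlTextReaderConstValue(reader)). If xmlTextReaderRead later returns -1, builder.records() still holds every record that was completed before the error, and only one copy of the data was ever built.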