前文提到,编写copy to自定义函数可以将DuckDB 数据库中的数据导出成各种自己需要的格式,但为每种格式编写一个单独的程序很冗长乏味,现在有了一个正确的导出自定义csv实现,那么对它进行抽象, 编个通用模板,只要按照规定的格式添加处理新文件格式和后缀的实现代码就行了。DeepSeek擅长总结和归纳,所以合适干这个工作。
提示词如下
请以copy.cpp为基础,参考附件xlslib实现copy_xls.cpp,最好做成可扩充的,比如copy_pdf/copy_doc等,根据不同后缀自动选取相应的写入工具。
过程比我想象的要难,结局还是好的。通用模板可以正确导出传统xls格式了,假装也支持xlsx格式,其实还是xls格式改个后缀名。
考验扩展性的时刻到了,我找了一个读写docx格式的c++类,duckx的源代码,它很容易编译。解压、cmake .、make就行了,虽然报错,
[ 14%] Building CXX object CMakeFiles/duckx.dir/src/duckx.cpp.o
[ 28%] Building C object CMakeFiles/duckx.dir/thirdparty/zip/zip.c.o
[ 42%] Linking CXX static library libduckx.a
[ 42%] Built target duckx
[ 57%] Building CXX object test/CMakeFiles/unit_tests.dir/basic_tests.cpp.o
In file included from /par/DuckX/test/basic_tests.cpp:5:
/par/DuckX/test/doctest.h:4009:47: error: size of array 'altStackMem' is not an integral constant-expression
4009 | static char altStackMem[4 * SIGSTKSZ];
| ^
但libduckx.a成功生成了。用示例代码读写都没问题。
g++ readdocx.cpp -o readocx -I DuckX/include -I DuckX/thirdparty/zip -I DuckX/thirdparty/pugixml -lduckx -L DuckX
./readocx
This is a test
okay?
g++ sample2.cpp -o outputdocx -I DuckX/include -I DuckX/thirdparty/zip -I DuckX/thirdparty/pugixml -lduckx -L DuckX
./outputdocx
提示词如下
附件是正确的把Duckdb查询结果写入xls格式文件的程序,以及调用duckx库来写docx格式文件的程序,请把它们整合,为copy_files2.cpp增加Duckdb查询结果写入docx格式文件功能,仍保持对xls的支持。
输出的程序漏掉了原文件里的一个类,导致很多编译错误,把它从别处复制过来,再编译,就只有一个错误了。
copy_file3.cpp:298:28: error: base operand of '->' has non-pointer type 'duckdb::DOCXGlobalState'
298 | auto &doc = *gstate->document;
| ^~
我看另一个函数中也是这么调用的,却没有报错,
auto gstate = make_uniq<DOCXGlobalState>(file_path);
auto &doc = *gstate->document;
它很快意识到了问题,这个错误是因为在DOCXHandler::WriteData函数中,gstate是一个引用而非指针,所以应该使用点操作符(.)而不是箭头操作符(->)。
, 改好后还不忘科普:
关键是要理解在C++中,引用(&)和指针(*)访问成员的不同语法:
对于指针:ptr->member 等价于 (*ptr).member
对于引用或对象:直接使用 obj.member
修改后编译成功。测试没有报错,但2个输出的docx实际上还是xls格式。但所有该改的地方都改了。
g++ copy_file3.cpp -o copyfile2 -lduckdb -lxls -I /par/duck/src/include -I /par/xlslib/xlslib/src -I DuckX/include -I DuckX/thirdparty/zip -I DuckX/thirdparty/pugixml -lduckx -L DuckX
./copyfile2
把代码中涉及xls、xlsx字面量的默认值都取消,这下不会出来docx文件了。费了很久才找到原因。
原因1.duckx只能更新已有的文件,而不能创建新文件,这在网上已经有人指出来了,我白白测试了很久。而且duckx挑文件格式,我把它提供的示例文件用wps保存后就不能读写了。报错
*** Error in `./a2': double free or corruption (top): 0x0000000027d54920 ***
已放弃 (核心已转储)
原因2.DOCXGlobalState构造函数中有一行his->file_path = std::move(file_path); 和一行document = make_uniqduckx::Document(this->file_path);测试前面file_path还有值,后面就变空值了。DeepSeek解释,
您遇到的问题正是C++移动语义的典型陷阱!
将file_path的内容"移动"到this->file_path中
移动后file_path变为未定义状态(可能为空)
您观察到的现象:
std::cout << file_path 能输出正确值(这是未定义行为的侥幸表现)
但实际使用时file_path已失效(如test_copy中表现为空)
解决了这两个问题,才终于能把查询结果输出到docx文件,幸好我们用csv做一开始的概念验证,调通了数据库方面。否则现在还不知道错在哪里。
注意,如果你加载excel插件,duckdb本身就能生成.xlsx格式文件。
源代码
#include "duckdb.hpp"
#include "duckdb/common/file_system.hpp"
#include "duckdb/common/serializer/buffered_file_writer.hpp"
#include "duckdb/catalog/catalog_entry/copy_function_catalog_entry.hpp"
#include "duckdb/function/copy_function.hpp"
#include "duckdb/main/extension_util.hpp"
#include "xlslib.h"
#include "duckx.hpp"
#include <filesystem>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <iostream>
#include "duckdb/parser/parsed_data/create_copy_function_info.hpp"
#define DEBUG_LOG(msg) //std::cerr << "[DEBUG] " << msg << std::endl
namespace duckdb {
// ==============================================
// 1. 基础类型定义
// ==============================================
// 格式选项基类
struct FormatOptions {
virtual ~FormatOptions() = default;
bool header = true;
virtual unique_ptr<FormatOptions> Copy() const = 0;
};
// XLS格式选项
struct XLSWriteOptions : public FormatOptions {
string sheet_name = "Sheet1";
unique_ptr<FormatOptions> Copy() const override {
auto copy = make_uniq<XLSWriteOptions>();
copy->header = header;
copy->sheet_name = sheet_name;
return copy;
}
};
// DOCX格式选项
struct DOCXWriteOptions : public FormatOptions {
string title = "DuckDB Export";
string author = "DuckDB";
unique_ptr<FormatOptions> Copy() const override {
auto copy = make_uniq<DOCXWriteOptions>();
copy->header = header;
copy->title = title;
copy->author = author;
return copy;
}
};
// 全局状态基类
struct FormatGlobalState : public GlobalFunctionData {
virtual ~FormatGlobalState() = default;
string file_path;
mutex write_mutex;
};
// XLS全局状态
struct XLSGlobalState : public FormatGlobalState {
unique_ptr<xlslib_core::workbook> workbook;
xlslib_core::worksheet* worksheet = nullptr;
explicit XLSGlobalState(string file_path) {
this->file_path = std::move(file_path);
workbook = make_uniq<xlslib_core::workbook>();
}
~XLSGlobalState() {
if (workbook) {
workbook->Dump(file_path.c_str());
}
}
};
namespace fs = std::filesystem;
bool ensure_directory_exists(const fs::path& file_path) {
if (file_path.has_parent_path()) {
std::error_code ec;
fs::create_directories(file_path.parent_path(), ec);
if (ec) {
std::cerr << "无法创建目录 " << file_path.parent_path()
<< ": " << ec.message() << "\n";
return false;
}
}
return true;
}
void safe_copy(const fs::path& from, const fs::path& to) {
std::error_code ec;
// 检查源文件是否存在
if (!fs::exists(from)) {
std::cout << "错误: 源文件不存在\n";
return;
}
// 确保目标目录存在
if (!ensure_directory_exists(to)) {
return;
}
// 尝试复制
bool success = fs::copy_file(from, to, fs::copy_options::overwrite_existing, ec);
if (!success) {
std::cout << "复制失败: " << ec.message() << "\n";
}
}
// DOCX全局状态
struct DOCXGlobalState : public FormatGlobalState {
unique_ptr<duckx::Document> document;
explicit DOCXGlobalState(string file_path) {
this->file_path = std::move(file_path);
std::cout<<this->file_path<<std::endl;
//复制模板empty.docx 到this->file_path
const fs::path template_file = "empty.docx";
if (!fs::exists(template_file)) {
std::cerr << "错误: 请确保当前目录存在 empty.docx 文件\n";
return ;
}
safe_copy(template_file, this->file_path);
document = make_uniq<duckx::Document>(this->file_path);
document->open();
}
~DOCXGlobalState() {
DEBUG_LOG("~DOCXGlobalState开始");
if (document) {
document->save();
}
}
};
// 本地状态基类
struct FormatLocalState : public LocalFunctionData {
virtual ~FormatLocalState() = default;
};
// XLS本地状态
struct XLSLocalState : public FormatLocalState {};
// DOCX本地状态
struct DOCXLocalState : public FormatLocalState {};
// ==============================================
// 2. 格式处理器接口和实现
// ==============================================
class FormatHandler {
public:
virtual ~FormatHandler() = default;
virtual unique_ptr<FormatOptions> ParseOptions(
ClientContext &context,
const case_insensitive_map_t<vector<Value>> &options) = 0;
virtual unique_ptr<GlobalFunctionData> InitializeGlobal(
ClientContext &context,
const string &file_path,
const vector<string> &names,
const vector<LogicalType> &sql_types,
FormatOptions &options) = 0;
virtual unique_ptr<LocalFunctionData> InitializeLocal(
ExecutionContext &context,
const vector<LogicalType> &sql_types) = 0;
virtual void WriteData(
ExecutionContext &context,
GlobalFunctionData &gstate,
LocalFunctionData &lstate,
DataChunk &input,
const vector<string> &names,
FormatOptions &options) = 0;
};
// XLS处理器
class XLSHandler : public FormatHandler {
public:
unique_ptr<FormatOptions> ParseOptions(
ClientContext &context,
const case_insensitive_map_t<vector<Value>> &options) override {
auto xls_options = make_uniq<XLSWriteOptions>();
xls_options->header = true;
for (auto &option : options) {
if (CaseInsensitiveStringEquality()(option.first, "header") && !option.second.empty()) {
xls_options->header = option.second[0].CastAs(context, LogicalType::BOOLEAN).GetValue<bool>();
} else if (CaseInsensitiveStringEquality()(option.first, "sheet_name") && !option.second.empty()) {
xls_options->sheet_name = option.second[0].ToString();
}
}
return xls_options;
}
unique_ptr<GlobalFunctionData> InitializeGlobal(
ClientContext &context,
const string &file_path,
const vector<string> &names,
const vector<LogicalType> &sql_types,
FormatOptions &options_base) override {
auto &options = dynamic_cast<XLSWriteOptions&>(options_base);
auto gstate = make_uniq<XLSGlobalState>(file_path);
// 创建工作表
gstate->worksheet = gstate->workbook->sheet(options.sheet_name.c_str());
// 写入表头
if (options.header) {
for (idx_t col = 0; col < names.size(); ++col) {
gstate->worksheet->label(0, col, names[col].c_str());
}
}
return gstate;
}
unique_ptr<LocalFunctionData> InitializeLocal(
ExecutionContext &context,
const vector<LogicalType> &sql_types) override {
return make_uniq<XLSLocalState>();
}
void WriteData(
ExecutionContext &context,
GlobalFunctionData &gstate_base,
LocalFunctionData &lstate_base,
DataChunk &input,
const vector<string> &names,
FormatOptions &options_base) override {
DEBUG_LOG("call xls write");
auto &gstate = dynamic_cast<XLSGlobalState&>(gstate_base);
auto &options = dynamic_cast<XLSWriteOptions&>(options_base);
lock_guard<mutex> lock(gstate.write_mutex);
idx_t start_row = options.header ? 1 : 0;
for (idx_t row = 0; row < input.size(); row++) {
idx_t output_row = start_row + row;
for (idx_t col = 0; col < input.ColumnCount(); col++) {
auto val = input.GetValue(col, row);
if (val.IsNull()) {
gstate.worksheet->label(output_row, col, "NULL");
} else {
switch(input.data[col].GetType().id()) {
case LogicalTypeId::INTEGER:
case LogicalTypeId::BIGINT:
case LogicalTypeId::DECIMAL:
case LogicalTypeId::DOUBLE:
case LogicalTypeId::FLOAT:
gstate.worksheet->number(output_row, col, val.GetValue<double>());
break;
default:
gstate.worksheet->label(output_row, col, val.ToString().c_str());
}
}
}
}
DEBUG_LOG("call xls write finished");
}
};
// DOCX处理器
class DOCXHandler : public FormatHandler {
public:
unique_ptr<FormatOptions> ParseOptions(
ClientContext &context,
const case_insensitive_map_t<vector<Value>> &options) override {
auto docx_options = make_uniq<DOCXWriteOptions>();
docx_options->header = true;
for (auto &option : options) {
if (CaseInsensitiveStringEquality()(option.first, "header") && !option.second.empty()) {
docx_options->header = option.second[0].CastAs(context, LogicalType::BOOLEAN).GetValue<bool>();
} else if (CaseInsensitiveStringEquality()(option.first, "title") && !option.second.empty()) {
docx_options->title = option.second[0].ToString();
} else if (CaseInsensitiveStringEquality()(option.first, "author") && !option.second.empty()) {
docx_options->author = option.second[0].ToString();
}
}
return docx_options;
}
unique_ptr<GlobalFunctionData> InitializeGlobal(
ClientContext &context,
const string &file_path,
const vector<string> &names,
const vector<LogicalType> &sql_types,
FormatOptions &options_base) override {
auto &options = dynamic_cast<DOCXWriteOptions&>(options_base);
auto gstate = make_uniq<DOCXGlobalState>(file_path);
//清空原有内容
int i=0;
auto &doc = *gstate->document;
for (auto p = doc.paragraphs(); p.has_next(); p.next()) {
i++;
for (auto r = p.runs(); r.has_next(); r.next()) {
//std::cout << r.get_text() << std::endl;
r.set_text("");//执行清除
}
}
DEBUG_LOG("清空原有内容,行数: " << i);
// 添加标题
duckx::Paragraph title_p = doc.paragraphs().insert_paragraph_after(options.title);
title_p.add_run("", duckx::bold);//标题加粗
// 添加表头
if (options.header) {
duckx::Paragraph header_p = doc.paragraphs().insert_paragraph_after("");
for (idx_t col = 0; col < names.size(); ++col) {
if (col != 0) header_p.add_run("\t");
header_p.add_run(names[col], duckx::bold);//表头加粗
}
}
DEBUG_LOG("添加标题和表头 完成");
return gstate;
}
unique_ptr<LocalFunctionData> InitializeLocal(
ExecutionContext &context,
const vector<LogicalType> &sql_types) override {
return make_uniq<DOCXLocalState>();
}
void WriteData(
ExecutionContext &context,
GlobalFunctionData &gstate_base,
LocalFunctionData &lstate_base,
DataChunk &input,
const vector<string> &names,
FormatOptions &options_base) override {
auto &gstate = dynamic_cast<DOCXGlobalState&>(gstate_base);
lock_guard<mutex> lock(gstate.write_mutex);
//auto &doc = *gstate.document;
DEBUG_LOG("获取DOCX文档对象");
auto &doc = *gstate.document;
// 确保文档已打开
if (!doc.is_open()) {
DEBUG_LOG("重新打开DOCX文档");
doc.open();
}
DEBUG_LOG("开始写入数据,行数: " << input.size());
for (idx_t row = 0; row < input.size(); row++) {
duckx::Paragraph p = doc.paragraphs().insert_paragraph_after("");
for (idx_t col = 0; col < input.ColumnCount(); col++) {
if (col != 0) p.add_run("\t");
auto val = input.GetValue(col, row);
p.add_run(val.IsNull() ? "NULL" : val.ToString());
}
}
// 立即保存而不是等待析构
DEBUG_LOG("立即保存DOCX文档");
doc.save();
DEBUG_LOG("DOCX文档保存完成");
}
};
// ==============================================
// 3. 绑定数据结构
// ==============================================
struct MultiFormatBindData : public TableFunctionData {
vector<string> files;
vector<LogicalType> sql_types;
vector<string> names;
unique_ptr<FormatOptions> options;
FormatHandler* handler = nullptr;
MultiFormatBindData(string file_path,
vector<LogicalType> sql_types,
vector<string> names,
unique_ptr<FormatOptions> options,
FormatHandler* handler)
: files({std::move(file_path)}),
sql_types(std::move(sql_types)),
names(std::move(names)),
options(std::move(options)),
handler(handler) {}
unique_ptr<FunctionData> Copy() const override {
return make_uniq<MultiFormatBindData>(
files[0], sql_types, names,
options->Copy(), handler
);
}
bool Equals(const FunctionData &other) const override {
auto &other_bind = other.Cast<MultiFormatBindData>();
return files == other_bind.files &&
sql_types == other_bind.sql_types &&
names == other_bind.names;
}
};
// ==============================================
// 3. 主功能类实现
// ==============================================
class MultiFormatCopyFunction : public CopyFunction {
public:
MultiFormatCopyFunction() : CopyFunction("multiformat") {
copy_to_bind = Bind;
copy_to_initialize_global = InitializeGlobal;
copy_to_initialize_local = InitializeLocal;
copy_to_sink = Sink;
// 注册支持的格式处理器
format_handlers["xls"] = make_uniq<XLSHandler>();
format_handlers["xlsx"] = make_uniq<XLSHandler>();
format_handlers["docx"] = make_uniq<DOCXHandler>();
}
static unique_ptr<FunctionData> Bind(
ClientContext &context,
CopyFunctionBindInput &input,
const vector<string> &names,
const vector<LogicalType> &sql_types) {
// 1. 优先从options中获取format参数
string format;
auto it = input.info.options.find("format");
if (it != input.info.options.end() && !it->second.empty()) {
format = it->second[0].ToString();
}
// 2. 如果未指定format,从文件扩展名推断
if (format.empty()) {
size_t dot_pos = input.info.file_path.find_last_of('.');
if (dot_pos != string::npos) {
format = input.info.file_path.substr(dot_pos + 1);
}
}
//DEBUG_LOG(format<<"format") ;
// 3. 检查格式是否支持
auto &instance = GetInstance();
auto handler = instance.GetHandlerForFormat(format);
if (!handler) {
// 列出支持的格式以便错误消息更友好
vector<string> supported_formats;
for (const auto &entry : instance.format_handlers) {
supported_formats.push_back(entry.first);
}
throw BinderException("Unsupported format: " + format +
". Supported formats: " +
StringUtil::Join(supported_formats, ", "));
}
//DEBUG_LOG(StringUtil::Join(supported_formats, ", ")<<"supported_formats") ;
auto options = handler->ParseOptions(context, input.info.options);
return make_uniq<MultiFormatBindData>(
input.info.file_path,
sql_types,
names,
std::move(options),
handler
);
}
static unique_ptr<GlobalFunctionData> InitializeGlobal(
ClientContext &context,
FunctionData &bind_data,
const string &file_path) {
auto &data = bind_data.Cast<MultiFormatBindData>();
return data.handler->InitializeGlobal(
context,
file_path,
data.names,
data.sql_types,
*data.options
);
}
static unique_ptr<LocalFunctionData> InitializeLocal(
ExecutionContext &context,
FunctionData &bind_data) {
auto &data = bind_data.Cast<MultiFormatBindData>();
return data.handler->InitializeLocal(context, data.sql_types);
}
static void Sink(
ExecutionContext &context,
FunctionData &bind_data,
GlobalFunctionData &gstate,
LocalFunctionData &lstate,
DataChunk &input) {
auto &data = bind_data.Cast<MultiFormatBindData>();
data.handler->WriteData(
context,
gstate,
lstate,
input,
data.names,
*data.options
);
}
static MultiFormatCopyFunction &GetInstance() {
static MultiFormatCopyFunction instance;
return instance;
}
private:
unordered_map<string, unique_ptr<FormatHandler>> format_handlers;
FormatHandler* GetHandlerForFormat(const string &format) {
string lower_format = StringUtil::Lower(format);
auto it = format_handlers.find(lower_format);
return it != format_handlers.end() ? it->second.get() : nullptr;
}
};
// ==============================================
// 4. 注册函数
// ==============================================
void RegisterMultiFormatCopyFunction(DatabaseInstance &db) {
// 注册主函数
{
MultiFormatCopyFunction function;
ExtensionUtil::RegisterFunction(db, function);
}
// 为每个格式创建别名
Connection con(db);
con.BeginTransaction();
try {
auto &catalog = Catalog::GetSystemCatalog(db);
// 为xlsx创建别名
{
MultiFormatCopyFunction xlsx_func;
CreateCopyFunctionInfo xlsx_info(xlsx_func);
xlsx_info.name = "xlsx";
catalog.CreateCopyFunction(*con.context, xlsx_info);
}
// 为xls创建别名
{
MultiFormatCopyFunction xls_func;
CreateCopyFunctionInfo xls_info(xls_func);
xls_info.name = "xls";
catalog.CreateCopyFunction(*con.context, xls_info);
}
// 为docx创建别名
{
MultiFormatCopyFunction docx_func;
CreateCopyFunctionInfo docx_info(docx_func);
docx_info.name = "docx";
catalog.CreateCopyFunction(*con.context, docx_info);
}
con.Commit();
} catch (...) {
con.Rollback();
throw;
}
}
} // namespace duckdb
// ==============================================
// 5. 测试主函数
// ==============================================
using namespace duckdb;
int main() {
try {
std::cerr << "=== 开始测试 ===" << std::endl;
DuckDB db(nullptr);
Connection con(db);
std::cerr << "=== 数据库初始化完成 ===" << std::endl;
// 注册自定义函数
duckdb::RegisterMultiFormatCopyFunction(*db.instance);
std::cerr << "=== 函数注册调用完成 ===" << std::endl;
// 创建测试表
con.Query("CREATE TABLE test(id INTEGER, name VARCHAR, age INTEGER, score DOUBLE)");
con.Query("INSERT INTO test VALUES (1, 'Alice', 25, 95.5), (2, 'Bob', 30, 88.0), (3, 'Charlie', NULL, 76.5)");
std::cerr << "=== 测试表创建完成 ===" << std::endl;
// 测试XLS导出
auto result_xls = con.Query("COPY test TO 'test_output.xls' (FORMAT xls, HEADER true, SHEET_NAME 'Students')");
if(result_xls->HasError()) {
std::cerr << "XLS导出错误: " << result_xls->GetError() << std::endl;
} else {
std::cerr << "=== XLS导出成功 ===" << std::endl;
}
// 测试XLSX导出
auto result_xlsx = con.Query("COPY (from range(10)) TO 'test_output2.xlsx' (FORMAT xlsx)");
if(result_xlsx->HasError()) {
std::cerr << "XLSX导出错误: " << result_xlsx->GetError() << std::endl;
} else {
std::cerr << "=== XLSX导出成功 ===" << std::endl;
}
// 测试DOCX导出
auto result_docx = con.Query("COPY test TO '/par/test_output.docx' (FORMAT docx, TITLE 'Student Report', AUTHOR 'DuckDB')");
if(result_docx->HasError()) {
std::cerr << "DOCX导出错误: " << result_docx->GetError() << std::endl;
} else {
std::cerr << "=== DOCX导出成功 ===" << std::endl;
}
// 测试3: 通过扩展名自动识别格式
auto result3 = con.Query("COPY test TO 'auto_recognize.xls'");
if(result3->HasError()) {
std::cerr << "测试3错误: " << result3->GetError() << std::endl;
}
// 测试4: 通过扩展名自动识别docx格式
auto result4 = con.Query("COPY test TO 'auto_recognize.docx'");
if(result4->HasError()) {
std::cerr << "测试4错误: " << result4->GetError() << std::endl;
}
// 测试5: 不支持的格式
auto result5 = con.Query("COPY test TO 'unknown.foo'");
if(!result5->HasError()) {
std::cerr << "测试5应该失败但没有失败" << std::endl;
}
return 0;
} catch (const std::exception& e) {
std::cerr << "!!! 发生异常: " << e.what() << std::endl;
return 1;
}
}
编译命令行参数为:
g++ temp/duckx.cpp temp/pugixml.cpp temp/zip.c copy_file5.cpp -o copyfile2 -lduckdb -lxls -I /par/include -I /par/xlslib/xlslib/src -I DuckX/include -I DuckX/thirdparty/zip -I DuckX/thirdparty/pugixml -I xlslib/src -std=c++17
root@66d4e20ec1d7:/par# ./copyfile2
=== 开始测试 ===
=== 数据库初始化完成 ===
=== 函数注册调用完成 ===
=== 测试表创建完成 ===
=== XLS导出成功 ===
=== XLSX导出成功 ===
/par/test_output.docx
=== DOCX导出成功 ===
auto_recognize.docx
测试5应该失败但没有失败
其中temp目录存放从duckx源码包解压的源代码,包括第三方源代码和头文件。
对于未知后缀名,duckdb本身就会默认保存成一个csv格式,所以测试5也导出了数据。