[Data Digest] A One-Stop Guide to Processing the CASIA-HWDB Chinese Handwriting Datasets

Originally published on 2025-11-10 11:44:52

License: CC 4.0 BY-SA

Contents

1. Introduction to the CASIA datasets
2. Downloading the CASIA datasets
3. Processing CASIA-HWDB1.x
4. Processing CASIA-HWDB2.x

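1. Introduction to the CASIA datasets

The CASIA-OLHWDB (online) and CASIA-HWDB (offline) Chinese handwriting databases were built by the National Laboratory of Pattern Recognition (NLPR) at the Institute of Automation, Chinese Academy of Sciences (CASIA).
The samples were written by 1,020 writers using Anoto pens on special paper, so online pen trajectories and offline images were obtained at the same time. They cover both isolated characters and continuous text. The data were collected between 2007 and 2010, and segmentation and annotation were completed in 2010.

The full database consists of 12 subsets:

6 online and 6 offline;
in each group, 3 contain isolated characters (DB1.0-1.2) and 3 contain continuous text (DB2.0-2.2).

Scale at a glance (the same for the online and offline versions)

Isolated characters: about 3.9 million samples in 7,356 classes (7,185 Chinese characters + 171 symbols).
Continuous text: about 5,090 page images and 1.35 million character-level samples.

All data have been segmented and annotated at the character level, and a standard training/test split is provided.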

2. Downloading the CASIA datasets

Download page: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html

This article only processes the offline datasets, HWDB1.x and HWDB2.x.

HWDB1.x
HWDB2.x

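3. Processing CASIA-HWDB1.x

3.1 HWDB1.x basics

The offline handwritten character database HWDB1.x contains three isolated-character datasets. Together they comprise 1,020 files; each file (*.gnt) stores the grayscale character images written by one writer, concatenated one after another.

Statistics of the offline isolated-character datasets: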

Dataset	Writers	Total samples	Symbol samples	Chinese character samples / classes
HWDB1.0	420	1,680,258	71,122	1,609,136 / 3,866
HWDB1.1	300	1,172,907	51,158	1,121,749 / 3,755
HWDB1.2	300	1,041,970	50,981	990,989 / 3,319
Total	1,020	3,895,135	173,261	3,721,874 / 7,185

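Character sets:

HWDB1.0: 3,866 Chinese characters and 171 alphanumeric characters and symbols. Of the Chinese characters, 3,740 belong to the GB2312-80 level-1 set (3,755 characters in total).
HWDB1.1: the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
HWDB1.2: 3,319 Chinese characters and 171 alphanumeric characters and symbols. Its Chinese character set is completely disjoint from that of HWDB1.0.
HWDB1.0 + HWDB1.2 together contain 7,185 Chinese characters and cover all 6,763 Chinese characters of the GB2312 standard.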
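
The GB2312 figures quoted above (3,755 level-1 characters, 6,763 in total) can be cross-checked with Python's built-in gb2312 codec. The sketch below is not part of the original workflow; the helper name count_decodable is made up for illustration. It assumes the usual GB2312 layout: level-1 characters use lead bytes 0xB0-0xD7, level-2 characters use 0xD8-0xF7, and trail bytes run from 0xA1 to 0xFE.

```python
def count_decodable(lead_first: int, lead_last: int) -> int:
    """Count two-byte codes in a lead-byte range that the gb2312 codec can decode."""
    count = 0
    for lead in range(lead_first, lead_last + 1):
        for trail in range(0xA1, 0xFF):  # trail bytes 0xA1..0xFE
            try:
                bytes([lead, trail]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned code point, e.g. 0xD7FA..0xD7FE
            count += 1
    return count


level1 = count_decodable(0xB0, 0xD7)  # GB2312 level-1 Chinese characters
level2 = count_decodable(0xD8, 0xF7)  # GB2312 level-2 Chinese characters
print(level1, level2, level1 + level2)
```

On CPython this should report 3755, 3008 and 6763, matching the numbers in the list above.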

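3.2 HWDB1.x processing code

3.2.0 The *.gnt file format and how to read it

*.gnt record format:

Field	Type	Length	Example	Description
Sample size	unsigned int	4 B	0x0000100A	Number of bytes from the start of this field to the start of the next sample, including these 4 B (here 10 + 64×64 = 4,106)
Tag code (label)	char	2 B	0xb0a1 ("啊")	GB code, stored lead byte first (bytes b0 a1 for "啊"); the code below combines the two bytes with the first one as the high byte
Width	unsigned short	2 B	0x0040 = 64	Number of pixels in one image row
Height	unsigned short	2 B	0x0040 = 64	Number of image rows
Bitmap	char	Width×Height B	…	Gray values stored contiguously in row-major order, 0 = black, 255 = white

How to read a record:

Read 4 B first to get the sample size (little-endian); this gives the exact offset of the next sample.
Then read the 2 B GB code. The processing code treats the first byte as the high byte, which is equivalent to decoding the two raw bytes directly as GBK; reading them as a little-endian short would swap the bytes.
Next read the 2 B width and 2 B height (both little-endian); they determine the size of the bitmap that follows.
Finally read Width×Height bytes in one go and reshape them into an (H, W) NumPy array, which can be visualized directly.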
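
The reading steps can be tried on a synthetic record without downloading anything. This is a minimal sketch rather than the article's own reader: parse_gnt_record is a hypothetical helper, the record is built in memory, and the tag bytes are assumed to be in GB order (lead byte first), as the processing code in 3.2.1 treats them.

```python
import struct

import numpy as np

# 4 B sample size, 2 raw tag bytes, 2 B width, 2 B height (10 bytes in total)
HEADER = struct.Struct("<I2sHH")


def parse_gnt_record(buf: bytes, offset: int = 0):
    """Parse one record starting at offset; return (char, image, next_offset)."""
    sample_size, tag, width, height = HEADER.unpack_from(buf, offset)
    if sample_size != HEADER.size + width * height:
        raise ValueError("sample size does not match header + bitmap size")
    image = np.frombuffer(
        buf, dtype=np.uint8, count=width * height, offset=offset + HEADER.size
    ).reshape(height, width)
    # Assumption: a single-byte symbol may be padded with a NUL byte; strip it.
    char = tag.decode("gbk").rstrip("\x00")
    return char, image, offset + sample_size


# Build a tiny synthetic record: a 3x2 "image" labelled with the character 啊.
w, h = 3, 2
pixels = bytes(range(w * h))
record = (
    struct.pack("<I", HEADER.size + w * h)
    + "啊".encode("gbk")
    + struct.pack("<HH", w, h)
    + pixels
)

char, image, next_offset = parse_gnt_record(record)
print(char, image.shape, next_offset)
```

Because struct.unpack_from returns plain Python integers, width * height is computed without any fixed-width overflow.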

3.2.1 Collecting all characters and building the dictionary

import json
import os
import struct

import numpy as np
from tqdm import tqdm

ROOT_PATH = "CASIA_Handwriting/HWDB1_x"
DATA_PATH = os.path.join(ROOT_PATH, 'data')

def read_from_gnt_dir(gnt_dir=DATA_PATH):
    """Yield (image, tagcode) for every sample in every *.gnt file of gnt_dir."""
    def one_file(f):
        header_size = 10
        while True:
            header = f.read(header_size)
            if len(header) < header_size:   # end of file
                break
            # struct returns plain Python ints, so width*height cannot overflow
            # the way uint8 array elements do under NumPy 2.
            sample_size, tag, width, height = struct.unpack('<I2sHH', header)
            tagcode = int.from_bytes(tag, 'big')   # first byte is the high byte
            assert header_size + width*height == sample_size, 'Header + data size not equal sample_size!'
            image = np.frombuffer(f.read(width*height), dtype=np.uint8).reshape((height, width))
            yield image, tagcode

    for file_name in sorted(os.listdir(gnt_dir)):
        if file_name.endswith('.gnt'):
            file_path = os.path.join(gnt_dir, file_name)
            with open(file_path, 'rb') as f:
                for image, tagcode in one_file(f):
                    yield image, tagcode

# Build the character dictionary
char_set = set()
# total is only an approximate sample count for the progress bar
for _, tagcode in tqdm(read_from_gnt_dir(), total=4000000):
    tagcode_unicode = struct.pack('>H', tagcode).decode('gbk')
    char_set.add(tagcode_unicode)

# Sort by Unicode code point and number the characters from 0
char_dict = {ch: idx for idx, ch in enumerate(sorted(char_set))}

# Save as JSON
json_str = json.dumps(char_dict, ensure_ascii=False)
with open(f'{ROOT_PATH}/char_dict.json', 'w') as json_file:
    json_file.write(json_str)
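
To see what the resulting indices look like, the same sort-and-enumerate step can be run on a handful of characters. This is an illustrative sketch: build_char_dict and the five sample characters are made up for the example. It shows that ASCII and CJK punctuation sort before the Chinese characters while full-width forms such as "，" sort after them, which is presumably why the filter in 3.2.2 needs both a lower and an upper index bound.

```python
def build_char_dict(chars):
    """Map each distinct character to its rank in Unicode code point order."""
    return {ch: idx for idx, ch in enumerate(sorted(set(chars)))}


sample = ["啊", "!", "。", "一", "，"]
char_dict = build_char_dict(sample)

# Reverse mapping, handy for turning predicted indices back into characters.
index_to_char = {idx: ch for ch, idx in char_dict.items()}

print(char_dict)
# {'!': 0, '。': 1, '一': 2, '啊': 3, '，': 4}
```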

3.2.2 Reading single-character images and stitching them

Each output image holds 80 characters (8 rows × 10 columns).

import json
import os
import random
import struct

import cv2
import numpy as np
from tqdm import tqdm

ROOT_PATH = "CASIA_Handwriting/HWDB1_x"
DATA_PATH = os.path.join(ROOT_PATH, 'data')
LABEL_DIR  = os.path.join(ROOT_PATH, 'labels')
IMAGE_DIR  = os.path.join(ROOT_PATH, 'images')

os.makedirs(IMAGE_DIR, exist_ok=True)
os.makedirs(LABEL_DIR, exist_ok=True)

# Load the dictionary built in 3.2.1
with open(f'{ROOT_PATH}/char_dict.json', 'r') as f:
    char_dict = json.load(f)

def read_from_gnt(gnt_path):
    """Return [image, char, width, height] for every Chinese character in one *.gnt file."""
    image_infos = []
    with open(gnt_path, 'rb') as f:
        header_size = 10
        while True:
            header = f.read(header_size)
            if len(header) < header_size:   # end of file
                break
            # struct returns plain Python ints, so width*height cannot overflow
            # the way uint8 array elements do under NumPy 2.
            sample_size, tag, width, height = struct.unpack('<I2sHH', header)
            tagcode_unicode = tag.decode('gbk')   # the two tag bytes are in GB order
            assert header_size + width*height == sample_size, 'Header + data size not equal sample_size!'
            image = np.frombuffer(f.read(width*height), dtype=np.uint8).reshape((height, width)).copy()
            # Keep Chinese characters only. The bounds assume the sorted dictionary was built
            # from all of HWDB1.0-1.2 (7,356 classes); indices 169..7353 are then the 7,185 Chinese characters.
            if 169 <= char_dict[tagcode_unicode] <= 7353:
                image_infos.append([image, tagcode_unicode, width, height])
    return image_infos

def concat_images(image_infos, base_name):
    """Stitch groups of 80 characters into 8x10 images; return the number of images written."""
    random.shuffle(image_infos)
    total_num = len(image_infos) // 80   # leftover characters (fewer than 80) are dropped
    ratio = 0.8
    padding = 20    # gap between cells, in pixels
    for i in range(total_num):
        infos = image_infos[i*80:(i+1)*80]
        images = [info[0] for info in infos]
        labels = [info[1] for info in infos]
        widths = np.array([info[2] for info in infos])
        heights = np.array([info[3] for info in infos])
        # Cell size: 80% of the way from the smallest to the largest character
        width = int((widths.max() - widths.min()) * ratio + widths.min())
        height = int((heights.max() - heights.min()) * ratio + heights.min())
        # White canvas for 8 rows x 10 columns plus the gaps around them
        final_image = np.full((height*8 + 9*padding, width*10 + 11*padding), 255, dtype=np.uint8)
        for j in range(80):
            row, col = divmod(j, 10)
            top = padding*(row+1) + height*row
            left = padding*(col+1) + width*col
            # Resize each character to the cell size before pasting it
            final_image[top:top+height, left:left+width] = cv2.resize(images[j], (width, height))
        # One label line per image row, characters separated by spaces
        all_label = [' '.join(labels[j*10:(j+1)*10]) for j in range(8)]

        # Save image and labels
        img_path   = os.path.join(IMAGE_DIR, f'{base_name}_{i:03d}.jpg')
        label_path = os.path.join(LABEL_DIR, f'{base_name}_{i:03d}.txt')
        cv2.imwrite(img_path, final_image)
        with open(label_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(all_label))
    return total_num

files = sorted(name for name in os.listdir(DATA_PATH) if name.endswith('.gnt'))
file_paths = [os.path.join(DATA_PATH, file) for file in files]
print(f"Found {len(file_paths)} files")
total_count = 0
for file_path in tqdm(file_paths, total=len(file_paths)):
    try:
        image_infos = read_from_gnt(file_path)
        total_num = concat_images(image_infos, os.path.basename(file_path).split('.')[0])
        total_count += total_num
    except Exception as e:
        print(f"❌ Failed to process {file_path}: {e}")
        continue
print(f"Generated {total_count} stitched images")
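
The cell placement in concat_images is easy to get wrong by one padding step, so it can be worth checking in isolation. The sketch below is an illustration, not part of the original script: cell_box is a hypothetical helper that mirrors the slice arithmetic above, and the cell size of 64×72 pixels is an arbitrary choice. It paints every cell onto a counter canvas to confirm that the 80 cells fit on the canvas and never overlap.

```python
import numpy as np


def cell_box(j, cell_w, cell_h, padding=20, cols=10):
    """Return (top, left, bottom, right) of the j-th cell in row-major order."""
    row, col = divmod(j, cols)
    top = padding * (row + 1) + cell_h * row
    left = padding * (col + 1) + cell_w * col
    return top, left, top + cell_h, left + cell_w


cell_w, cell_h, padding, rows, cols = 64, 72, 20, 8, 10
canvas = np.zeros(
    (cell_h * rows + (rows + 1) * padding, cell_w * cols + (cols + 1) * padding),
    dtype=np.int32,
)
for j in range(rows * cols):
    top, left, bottom, right = cell_box(j, cell_w, cell_h, padding, cols)
    canvas[top:bottom, left:right] += 1  # count how often each pixel is covered

print(canvas.shape, int(canvas.max()), int(canvas.sum()))
# (756, 860) 1 368640
```

A maximum of 1 means no two cells overlap, and the sum equals 80 × 64 × 72, so no cell was clipped at the canvas border.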

3.3 Example of a stitched image

[Figure: example of a stitched 8×10 character image]

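4. Processing CASIA-HWDB2.x

4.1 HWDB2.x basics

The offline text datasets were written by the same writers as the isolated-character datasets. Each writer copied 5 pages of given text. Because of data loss, 1 writer (no. 371) and 4 further pages are missing, so the database has 9 pages fewer than the nominal 1,020 × 5 = 5,100. Each page is stored as one *.dgrl file, named "writer index_page number.dgrl".
Besides the grayscale image, each file also contains the ground-truth text-line segmentation and the class label (GB code) of every character.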

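Statistics of the offline handwritten text datasets:

Dataset	Writers	Pages	Text lines	Character samples / classes	Out-of-class samples
HWDB2.0	419	2,092	20,495	538,868 / 1,222	1,106
HWDB2.1	300	1,500	17,292	429,553 / 2,310	172
HWDB2.2	300	1,499	14,443	380,993 / 1,331	581
Total	1,019	5,091	52,230	1,349,414 / 2,703	1,859

"Out-of-class samples" are characters that do not belong to the 7,356 classes of HWDB1.0-1.2.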
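
As a quick consistency check, the per-dataset rows of the table can be summed and compared with the nominal page count of 1,020 writers × 5 pages. The snippet only re-uses numbers from the table; the variable names are illustrative. Class counts are left out because the three datasets share classes, so they do not add up.

```python
# (writers, pages, text lines, character samples, out-of-class samples)
rows = {
    "HWDB2.0": (419, 2092, 20495, 538868, 1106),
    "HWDB2.1": (300, 1500, 17292, 429553, 172),
    "HWDB2.2": (300, 1499, 14443, 380993, 581),
}

totals = tuple(sum(column) for column in zip(*rows.values()))
print(totals)
# (1019, 5091, 52230, 1349414, 1859)

nominal_pages = 1020 * 5
missing_pages = nominal_pages - totals[1]
print(missing_pages)
# 9  (5 pages of the missing writer + 4 individually lost pages)
```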

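4.2 HWDB2.x processing code

4.2.0 The *.dgrl file format

Field	Type	Length	Description / example
File header
Header size	int	4 B	36 + strlen(illustration)
Format code	char[8]	8 B	Fixed string "DGRL"
Illustration	char*	arbitrary	ASCII description terminated by '\0'
Code type	char[20]	20 B	"ASCII", "GB", etc.
Code length	short	2 B	Bytes per character code, typically 1 or 2 (2 for double-byte GB codes)
Bits per pixel	short	2 B	1 = binary, 8 = grayscale
Image record
Image height	int	4 B	Number of pixel rows in the page image
Image width	int	4 B	Number of pixel columns in the page image
Number of lines	int	4 B	Number of text lines on the page
Line records (repeated once per line)
Number of characters	int	4 B	Number of character codes that follow
Character codes	char[]	CodeLen×CharNum	CodeLen bytes per character; invalid characters are filled with 0xFF
Top-left coordinates	int×2	4 B+4 B	top, left (relative to the whole page)
Line height	int	4 B	Height H of this line's bitmap
Line width	int	4 B	Width W of this line's bitmap
Bitmap	BYTE[]	H × ((W+7)/8) or H×W	Binary images are packed 8 pixels per byte; grayscale images are stored one byte per pixel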
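
The file header can be decoded with struct alone. The following is a sketch under stated assumptions, not the article's reader: parse_dgrl_header is a hypothetical helper, all integers are assumed to be little-endian, and the synthetic header sets its size field to the real length of the bytes it builds. Because the illustration has variable length, the fixed 24-byte tail (code type, code length, bits per pixel) is located from the end of the header, which is also how the processing code in this section finds the code length.

```python
import struct


def parse_dgrl_header(buf: bytes):
    """Return (header_size, illustration, code_type, code_length, bits_per_pixel)."""
    (header_size,) = struct.unpack_from("<i", buf, 0)
    format_code = buf[4:12].rstrip(b"\x00").decode("ascii")
    if format_code != "DGRL":
        raise ValueError(f"unexpected format code: {format_code!r}")
    # Fixed-size tail: 20 B code type, 2 B code length, 2 B bits per pixel.
    code_type_raw, code_length, bits_per_pixel = struct.unpack_from(
        "<20shh", buf, header_size - 24
    )
    illustration = buf[12:header_size - 24].rstrip(b"\x00").decode("ascii", "replace")
    code_type = code_type_raw.rstrip(b"\x00").decode("ascii")
    return header_size, illustration, code_type, code_length, bits_per_pixel


# Synthetic header for demonstration only.
illustration = b"demo page\x00"
tail = struct.pack("<20shh", b"GB", 2, 8)
size = 4 + 8 + len(illustration) + len(tail)
header = struct.pack("<i8s", size, b"DGRL") + illustration + tail

print(parse_dgrl_header(header))
# (46, 'demo page', 'GB', 2, 8)
```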

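Notes:

The background pixel value is always 255; foreground strokes have gray levels 0-254.
Each page is stored line by line. Strokes of adjacent lines may overlap vertically, so when the full page is rebuilt the foreground pixels of overlapping lines have to be merged rather than simply pasted over each other. With a 255 background, merging means keeping the darker pixel (the per-pixel minimum); a literal bitwise OR would wipe out the strokes. The code in this section takes a simpler route and shifts an overlapping line down until it no longer overlaps.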
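
For readers who want to keep the original line positions instead of shifting lines, a per-pixel minimum is one way to merge overlapping bitmaps. This is a minimal sketch and an alternative to what the processing code does: paste_line is a hypothetical helper, and it assumes 8-bit grayscale bitmaps with a 255 background that lie fully inside the page.

```python
import numpy as np


def paste_line(page: np.ndarray, bitmap: np.ndarray, top: int, left: int) -> None:
    """Merge a line bitmap into page in place, keeping the darker pixel."""
    h, w = bitmap.shape
    region = page[top:top + h, left:left + w]  # a view into page
    np.minimum(region, bitmap, out=region)


page = np.full((6, 5), 255, dtype=np.uint8)

line_a = np.full((4, 5), 255, dtype=np.uint8)
line_a[3, 1] = 10  # a stroke pixel in the last row of line A

line_b = np.full((4, 5), 255, dtype=np.uint8)
line_b[0, 3] = 20  # a stroke pixel in the first row of line B

paste_line(page, line_a, top=0, left=0)  # covers page rows 0..3
paste_line(page, line_b, top=2, left=0)  # covers page rows 2..5, overlapping rows 2..3

print(page[3, 1], page[2, 3])
# 10 20
```

Both stroke pixels survive in the overlapping rows, whereas a plain assignment of line B would have overwritten the pixel from line A with background.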

4.2.1 Reading the line images and rebuilding the page

import os
import struct

import cv2
import numpy as np
from tqdm import tqdm

ROOT_PATH = 'CASIA_Handwriting/HWDB2_x'
DATA_PATH = os.path.join(ROOT_PATH, 'data')
LABEL_DIR  = os.path.join(ROOT_PATH, 'labels')
IMAGE_DIR  = os.path.join(ROOT_PATH, 'images')

os.makedirs(IMAGE_DIR, exist_ok=True)
os.makedirs(LABEL_DIR, exist_ok=True)

def read_from_dgrl(dgrl):
    """Rebuild the page image of one *.dgrl file and save it with its line labels."""
    if not os.path.exists(dgrl):
        print('DGRL not exist!')
        return

    base_name = os.path.basename(dgrl).replace('.dgrl', '')
    with open(dgrl, 'rb') as f:
        # 1. Header size (little-endian uint32)
        header_size = int(np.fromfile(f, dtype='<u4', count=1)[0])
        # 2. Read the rest of the header; the code length is the 2 B little-endian
        #    field just before the final 2 B bits-per-pixel field
        header = np.fromfile(f, dtype='uint8', count=header_size - 4)
        code_length = int(header[-4]) + (int(header[-3]) << 8)
        # 3. Page geometry, converted to Python ints so that later arithmetic
        #    (e.g. first_y - 300) cannot wrap around as unsigned NumPy scalars would
        height, width, line_num = (int(v) for v in np.fromfile(f, dtype='<u4', count=3))

        # 4. Read each line in turn: label / position / pixels
        full_img = np.full((height, width), 255, dtype=np.uint8)
        all_label = []

        last_y = 0
        first_y = 0
        for i in range(line_num):
            # 4.1 Number of characters in the line
            char_num = int(np.fromfile(f, dtype='<u4', count=1)[0])
            # 4.2 Character codes: code_length bytes each, combined low byte first,
            #     then decoded as GBK (undecodable bytes such as the 0xFF filler are ignored)
            raw = np.fromfile(f, dtype='uint8', count=code_length*char_num).tobytes()
            codes = [int.from_bytes(raw[k*code_length:(k+1)*code_length], 'little') for k in range(char_num)]
            label = ''.join(struct.pack('<I', c).decode('gbk', 'ignore')[0] for c in codes)
            # Drop NUL characters; otherwise they end up as invisible characters in the saved labels
            label = label.replace('\x00', '')
            all_label.append(label)
            # 4.3 Position and size of the line bitmap
            y, x, h, w = (int(v) for v in np.fromfile(f, dtype='<u4', count=4))
            if i == 0:
                first_y = y
            # 4.4 Bitmap (this reader assumes 8 bits per pixel)
            bitmap = np.fromfile(f, dtype='uint8', count=h*w).reshape(h, w)
            # Handle vertical overlap: push the line down below the previous one
            if y < last_y:
                y = last_y
            last_y = y + h
            if last_y > height:
                # Extend the canvas at the bottom
                full_img = np.pad(full_img, ((0, last_y + 300 - height), (0, 0)), 'constant', constant_values=255)
                height = last_y + 300
            full_img[y:y + h, x:x + w] = bitmap

    # Crop the page so that the top and bottom margins are at most 300 pixels
    full_img = full_img[max(first_y - 300, 0): min(last_y + 300, height), :]

    # 5. Save image and labels
    img_path   = os.path.join(IMAGE_DIR, f'{base_name}.jpg')
    label_path = os.path.join(LABEL_DIR, f'{base_name}.txt')
    cv2.imwrite(img_path, full_img)
    with open(label_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(all_label))

files = sorted(name for name in os.listdir(DATA_PATH) if name.endswith('.dgrl'))
file_paths = [os.path.join(DATA_PATH, file) for file in files]
for file_path in tqdm(file_paths, total=len(file_paths)):
    try:
        read_from_dgrl(file_path)
    except Exception as e:
        print(f"❌ Failed to process {file_path}: {e}")
        continue