Hands-On Text Extraction and Custom Entity Recognition with the AWS Data Science Project

Article link: https://blog.youkuaiyun.com/gitblog_00770/article/details/148578168


data-science-on-aws: AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. Project repository: https://gitcode.com/gh_mirrors/da/data-science-on-aws

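This article walks through building an end-to-end custom entity recognition system with AWS Textract and Comprehend. Using resume parsing as the scenario, it covers the full flow from document processing to model training.

Background

In today's enterprises, a great deal of valuable information lives in unstructured documents such as PDF resumes and scanned contracts. Manual processing is slow, and general-purpose OCR tools struggle to extract domain-specific entities. Combining AWS Textract and Comprehend offers a practical way to address this.

Textract is AWS's OCR service. Beyond recognizing text, it can also detect document structure such as tables and forms. Comprehend is a natural language processing service that identifies entities, sentiment, and more in text. Together, the two services can form the basis of a document information extraction system.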

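System Architecture Overview

The solution consists of the following key steps:

Run OCR on the documents with Textract
Annotate entities with SageMaker Ground Truth
Train a custom entity recognition model
Deploy the model for inference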

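Environment Setup

First, set up the required IAM role and permissions. Comprehend needs specific permissions to access the data in S3. The following example creates a role that the Comprehend service can assume: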

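import json

import boto3

iam = boto3.client("iam")

# Trust policy: allows the Comprehend service to assume this role
assume_role_policy_doc = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "comprehend.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

# Create the role; the response contains the role ARN under ['Role']['Arn']
iam_role_textract_comprehend = iam.create_role(
    RoleName='DSOAWS_Textract_Comprehend',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
    Description='DSOAWS Textract Comprehend Role'
)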
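
The trust policy above only lets Comprehend assume the role; the role still needs permissions on the S3 data itself. The following is a minimal sketch of one way to grant them with an inline policy. The helper names, the policy name `DSOAWS_ComprehendS3Access`, and the exact set of S3 actions are assumptions made for illustration, and the actual project may attach different permissions.

```python
import json


def build_s3_access_policy(bucket_name):
    """Return an IAM policy document granting read/write access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading input data and writing output is granted on the objects
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


def attach_s3_access(iam_client, role_name, bucket_name,
                     policy_name="DSOAWS_ComprehendS3Access"):
    """Attach the policy inline to the role (policy_name is a hypothetical name)."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_s3_access_policy(bucket_name)),
    )
```

With the IAM client and role from the setup code, this could be called as `attach_s3_access(iam, 'DSOAWS_Textract_Comprehend', bucket)`, where `bucket` is the S3 bucket that holds the training data.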

Data Processing Pipeline

1. Document Upload and OCR Processing

We upload the resume PDFs to S3 and then run OCR on them with Textract:

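import glob
import os

import boto3
from tqdm import tqdm

# Upload the PDF resumes to S3
# (`bucket` and `prefix` are assumed to be defined earlier)
s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
pdfResumeFileList = glob.glob("./resume_pdf/*.pdf")
prefix_resume_pdf = prefix + "/resume_pdf/"

for filePath in tqdm(pdfResumeFileList):
    file_name = os.path.basename(filePath)
    s3_bucket.Object(
        os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)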

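2. Starting the Textract Processing Job

Textract provides an asynchronous API for processing documents. The following code starts a processing job and writes out the recognized text: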

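import codecs

# StartDocumentTextDetection and getDocumentTextDetection are helper functions
# that are not defined in this article; they wrap the asynchronous Textract API.
job_id = StartDocumentTextDetection(bucket, file_obj)
response = getDocumentTextDetection(job_id)

# Write the Textract output to a text file, one detected line per row
with codecs.open(output_dir + text_output_name, "w", "utf-8") as output_file:
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            output_file.write(item["Text"] + '\n')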
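
The two helper functions called above are not shown in the article. As a rough sketch of what such helpers might look like, the code below wraps boto3's Textract calls `start_document_text_detection` and `get_document_text_detection`, polls until the job finishes, and follows `NextToken` to collect every result page. The function names, the polling interval, and the choice to treat any status other than `SUCCEEDED` as a failure are assumptions of this sketch, not the project's actual implementation.

```python
import time


def start_text_detection(textract_client, bucket, key):
    """Start an asynchronous Textract text-detection job and return its JobId."""
    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def get_text_detection_blocks(textract_client, job_id, poll_seconds=5):
    """Wait for the job to finish, then collect the blocks from every result page."""
    response = textract_client.get_document_text_detection(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        time.sleep(poll_seconds)
        response = textract_client.get_document_text_detection(JobId=job_id)

    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            f"Textract job {job_id} ended with status {response['JobStatus']}"
        )

    blocks = list(response.get("Blocks", []))
    # Large documents are returned in pages linked by NextToken
    while "NextToken" in response:
        response = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


def lines_from_blocks(blocks):
    """Keep only the text of LINE blocks, in the order Textract returned them."""
    return [block["Text"] for block in blocks if block["BlockType"] == "LINE"]
```

A client would be created with `boto3.client("textract")` and passed in. Taking the client as a parameter keeps the helpers easy to exercise with a stub instead of a live AWS account.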

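Entity Annotation and Model Training

1. Annotating with Ground Truth

Although the original dataset already includes some annotations, we re-annotate it with SageMaker Ground Truth to ensure data quality. The annotation focuses on identifying the "skills" entity in resumes.

2. Training the Custom Entity Recognition Model

Once the annotated data is ready, we can train a custom entity recognizer: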

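import boto3

comprehend_client = boto3.client("comprehend")
# ARN of the IAM role created during environment setup
iam_role_textract_comprehend_arn = iam_role_textract_comprehend['Role']['Arn']

# custom_recognizer_name, comprehend_input_doucuments and comprehend_input_entity_list
# are assumed to be defined earlier (recognizer name and S3 URIs of the training data).
comprehend_custom_recognizer_response = comprehend_client.create_entity_recognizer(
    RecognizerName = custom_recognizer_name,
    DataAccessRoleArn=iam_role_textract_comprehend_arn,
    InputDataConfig={
        'EntityTypes': [{'Type': 'SKILLS'}],
        'Documents': {'S3Uri': comprehend_input_doucuments},
        'EntityList': {'S3Uri': comprehend_input_entity_list}
    },
    LanguageCode='en'
)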
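
The `EntityList` S3 URI in the call above points to a CSV file whose header row is `Text,Type`, with one known entity per line. The article does not show how that file is produced, so the following is only a sketch of building one from a list of terms. The helper name, the sample terms, and the decision to drop blanks and exact duplicates are all assumptions for illustration.

```python
import csv
import io


def build_entity_list_csv(terms, entity_type="SKILLS"):
    """Render terms as an entity-list CSV with the header row Text,Type."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    seen = set()
    for term in terms:
        cleaned = term.strip()
        # Skip blanks and exact duplicates (a choice made by this sketch)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            writer.writerow([cleaned, entity_type])
    return buffer.getvalue()


# Hypothetical sample terms, for illustration only
print(build_entity_list_csv(["Python", "SQL", "Python", " Machine Learning "]))
```

The resulting string can be uploaded to S3 and its URI used as `comprehend_input_entity_list`. Using the `csv` module rather than string concatenation means terms that contain commas are quoted correctly.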

Model Evaluation and Deployment

Once training has finished, we can evaluate the model's performance:

# comprehend_model_response is assumed to be the result of
# comprehend_client.describe_entity_recognizer(EntityRecognizerArn=...)
recognizer_metadata = comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']
print('Number of documents trained:', recognizer_metadata['NumberOfTrainedDocuments'])
print('Precision:', recognizer_metadata['EvaluationMetrics']['Precision'])
print('Recall:', recognizer_metadata['EvaluationMetrics']['Recall'])
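
Training is asynchronous, so the metrics are only available after the recognizer reaches a trained state. The sketch below shows one possible way to wait for that by polling `describe_entity_recognizer`, and then to gather the headline metrics (including `F1Score`) in one place. The function names, the polling interval, and the handling of non-training statuses are assumptions of this sketch rather than the project's code.

```python
import time


def wait_for_recognizer(comprehend_client, recognizer_arn, poll_seconds=60):
    """Poll describe_entity_recognizer until training finishes, then return the response."""
    while True:
        response = comprehend_client.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )
        status = response["EntityRecognizerProperties"]["Status"]
        if status in ("TRAINED", "TRAINED_WITH_WARNING"):
            return response
        if status not in ("SUBMITTED", "TRAINING"):
            # Anything else (for example IN_ERROR or STOPPED) means training will not finish
            raise RuntimeError(f"Recognizer training stopped with status {status}")
        time.sleep(poll_seconds)


def summarize_metrics(response):
    """Pull the headline numbers out of a describe_entity_recognizer response."""
    metadata = response["EntityRecognizerProperties"]["RecognizerMetadata"]
    metrics = metadata["EvaluationMetrics"]
    return {
        "documents_trained": metadata["NumberOfTrainedDocuments"],
        "precision": metrics["Precision"],
        "recall": metrics["Recall"],
        "f1": metrics["F1Score"],
    }
```

The recognizer ARN is returned by the training call as `comprehend_custom_recognizer_response["EntityRecognizerArn"]`, so a typical use would be `summarize_metrics(wait_for_recognizer(comprehend_client, arn))`.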