Document Classification, Real-Time Endpoint Testing, and a PDF Batch Processing Solution

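Document processing and classification are demanding tasks that many businesses face. Cloud services and machine learning can handle most of this work efficiently, and adding human review for uncertain results improves accuracy further. This article walks through how to use Amazon's services to train a document classifier, create a real-time endpoint to test documents, and build a PDF batch processing solution.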

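1. Creating an Amazon Comprehend classification training job

Before training, we extracted the data and labels from the sample scanned documents in Amazon S3 in the earlier steps. The steps for creating the Comprehend classification training job are as follows:
1. Create the data retrieval function: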

import os

def data_retriever_from_path(path):
    # `names` is the list of class (folder) names, assumed to be defined in an earlier notebook cell
    mapping = {}
    for i in names:
        if os.path.isdir(path + i):
            mapping[i] = sorted(os.listdir(path + i))
    # label (class/target) list
    label_compre = []
    # text file data list
    text_compre = []
    # iterate over each class and its list of files
    for i, j in mapping.items():
        for k in j:
            # append the label for this file
            label_compre.append(i)
            # read the file, replace newlines with spaces, and append to the data list
            with open(path + i + "/" + k, encoding="utf-8") as f:
                text_compre.append(f.read().replace('\n', ' '))
    return label_compre, text_compre
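
To make the expected input concrete: the function assumes one sub-folder per class under `path`, each holding plain-text files. The following self-contained sketch builds such a layout in a temporary directory and collects labels and documents in the same way. The class names `payslip` and `bank_statement`, the sample text, and the helper names are illustrative assumptions, not taken from the original notebook.

```python
import tempfile
from pathlib import Path


def collect_labeled_text(root, class_names):
    """Return parallel lists of labels and documents, one entry per text file."""
    labels, documents = [], []
    for name in class_names:
        class_dir = Path(root) / name
        if not class_dir.is_dir():
            continue  # skip class names that have no folder
        for file_path in sorted(class_dir.iterdir()):
            labels.append(name)
            # Replace newlines so each document fits on a single CSV row
            documents.append(file_path.read_text(encoding="utf-8").replace("\n", " "))
    return labels, documents


def build_demo_tree(root):
    """Create a tiny two-class layout: <root>/<class>/<file>.txt (hypothetical data)."""
    samples = {
        "payslip": {"a.txt": "Gross pay\n1000"},
        "bank_statement": {"b.txt": "Opening balance\n250"},
    }
    for label, files in samples.items():
        class_dir = Path(root) / label
        class_dir.mkdir(parents=True)
        for filename, text in files.items():
            (class_dir / filename).write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    demo_labels, demo_docs = collect_labeled_text(
        tmp, ["payslip", "bank_statement", "missing"]
    )
```

Class names without a matching folder are skipped silently, which mirrors the `os.path.isdir` check in the function above.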

Call the data retrieval function:

import pandas as pd

# word_prefix is the local folder that holds the extracted text, set in an earlier cell
path = word_prefix + 'train/'
label_compre, text_compre = data_retriever_from_path(path)
# Build a two-column DataFrame: label first, document text second
data_compre = pd.DataFrame()
data_compre["label"] = label_compre
data_compre["document"] = text_compre
data_compre

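Save the DataFrame as CSV and upload it to S3:

import io

# s3 is a boto3 S3 client and data_bucket is the bucket name, both from earlier cells
csv_compre = io.StringIO()
# Write label,document rows without a header or index column
data_compre.to_csv(csv_compre, index=False, header=False)
key = 'comprehend_train_data.csv'
input_bucket = data_bucket
output_bucket = data_bucket
response2 = s3.put_object(
    Body=csv_compre.getvalue(),
    Bucket=input_bucket,
    Key=key)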
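
For reference, the training file produced here has two columns and no header row, with the label in the first column and the document text in the second. The sketch below reproduces that layout with the standard `csv` module instead of pandas; the function name and the sample rows are illustrative assumptions.

```python
import csv
import io


def to_comprehend_csv(labels, documents):
    """Serialize label/document pairs as header-less CSV, label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, document in zip(labels, documents):
        # Fields containing commas are quoted automatically by csv.writer
        writer.writerow([label, document])
    return buffer.getvalue()


csv_text = to_comprehend_csv(
    ["payslip", "bank_statement"],
    ["Gross pay 1000, net pay 800", "Opening balance 250"],
)
```

Quoting matters because document text often contains commas; without it, a row would split into more than two columns.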

Create the classification job in the Amazon Comprehend console:
- Open https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification
- Click "Train Classifier"
- Enter "doc-classifier" for "Model name" and "1" for "Version name"
- Select "Using Single-label mode" as the classifier mode
- Make sure the data format is CSV file
- Training data location: browse to the S3 bucket created earlier, or enter s3://doc-processing-bucket-MMDD/comprehend_train_data.csv
- Test dataset: keep the default "Autosplit"
- Output data: enter s3://doc-processing-bucket-MMDD
- Access permissions: select "Create an IAM Role" and enter "classifydoc" for "NameSuffix"
- Click "Train Classifier" to start training
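
The same job can also be started from code rather than the console. The following is a minimal sketch using the boto3 Comprehend `create_document_classifier` call; it is not part of the original walkthrough. The role ARN must belong to a role you have already created, the language code `en` is an assumption about the documents, and the helper names are hypothetical. The client is passed in so the request can be inspected without calling AWS.

```python
def build_classifier_request(bucket, role_arn, name="doc-classifier", version="1"):
    """Assemble the keyword arguments for create_document_classifier."""
    return {
        "DocumentClassifierName": name,
        "VersionName": version,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: English documents
        "Mode": "MULTI_CLASS",  # API value for single-label classification
        "InputDataConfig": {
            "DataFormat": "COMPREHEND_CSV",
            "S3Uri": f"s3://{bucket}/comprehend_train_data.csv",
        },
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}"},
    }


def train_classifier(comprehend_client, bucket, role_arn):
    """Start training and return the ARN of the new classifier."""
    request = build_classifier_request(bucket, role_arn)
    response = comprehend_client.create_document_classifier(**request)
    return response["DocumentClassifierArn"]


# Usage (requires AWS credentials):
#   import boto3
#   arn = train_classifier(boto3.client("comprehend"), data_bucket, role_arn)
example_request = build_classifier_request(
    "doc-processing-bucket-MMDD", "arn:aws:iam::123456789012:role/example"
)
```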

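2. Creating an Amazon Comprehend real-time endpoint and testing a sample document

Once training has finished, we can create a real-time endpoint and test a sample document:
1. Create the real-time endpoint:
- Open https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints
- Click "Create Endpoint"
- Endpoint name: enter "classify-doc"
- Custom model: select the "doc-classifier" model trained earlier
- Inference units: set to "1"
- Click "Create Endpoint". To avoid charges, delete this endpoint in the cleanup section.
2. Copy the endpoint ARN and use it in the Jupyter notebook: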

ENDPOINT_ARN='your endpoint arn paste here'
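
The console steps for creating the endpoint can also be scripted, which removes the need to copy the ARN by hand. The sketch below uses the boto3 Comprehend calls `create_endpoint` and `describe_endpoint` and polls until the endpoint is in service. It is an illustration, not the article's method; the function name, polling interval, and retry limit are assumptions.

```python
import time


def create_endpoint_and_wait(comprehend_client, model_arn, name="classify-doc",
                             inference_units=1, poll_seconds=30, max_polls=60):
    """Create a real-time endpoint and block until it is IN_SERVICE."""
    endpoint_arn = comprehend_client.create_endpoint(
        EndpointName=name,
        ModelArn=model_arn,
        DesiredInferenceUnits=inference_units,
    )["EndpointArn"]
    for _ in range(max_polls):
        properties = comprehend_client.describe_endpoint(EndpointArn=endpoint_arn)
        status = properties["EndpointProperties"]["Status"]
        if status == "IN_SERVICE":
            return endpoint_arn
        if status == "FAILED":
            raise RuntimeError(f"Endpoint {name} failed to start")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {name} was not ready after {max_polls} polls")


# Usage (requires AWS credentials and a trained model ARN):
#   import boto3
#   ENDPOINT_ARN = create_endpoint_and_wait(boto3.client("comprehend"), model_arn)
```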

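Select a sample test document and extract its text:

from IPython.display import Image, display

documentName = "paystubsample.png"
display(Image(filename=documentName))
# The notebook then extracts the document text into page_string (that step is not shown here)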
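
The classification call that follows expects the document text in `page_string`, but the extraction step is not shown in this article. One plausible way to build it, assuming Amazon Textract is used as elsewhere in this workflow, is to join the `LINE` blocks of a `detect_document_text` response. The helper name and the sample response below are illustrative assumptions.

```python
def lines_to_page_string(textract_response):
    """Join the text of all LINE blocks in a Textract response into one string."""
    lines = [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return " ".join(lines)


# Usage (requires AWS credentials):
#   import boto3
#   with open(documentName, "rb") as f:
#       textract_response = boto3.client("textract").detect_document_text(
#           Document={"Bytes": f.read()})
#   page_string = lines_to_page_string(textract_response)

# Hand-written response fragment with the same shape, for illustration only
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employee name"},
        {"BlockType": "WORD", "Text": "Employee"},
        {"BlockType": "LINE", "Text": "Net pay 800"},
    ]
}
sample_page_string = lines_to_page_string(sample_response)
```

Only `LINE` blocks are used because `WORD` blocks repeat the same text one token at a time.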

Call the Comprehend ClassifyDocument API to classify the document:

# comprehend is a boto3 Comprehend client created in an earlier cell
response = comprehend.classify_document(
    Text=page_string,
    EndpointArn=ENDPOINT_ARN
)
print(response)

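3. Setting up active learning with a human in the loop

To improve the model's accuracy further, we can bring in human review and set up an active learning workflow:
1. Create a private work team: follow the relevant documentation to create a private work team, then copy its ARN.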

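import uuid
import boto3
from sagemaker import get_execution_role

REGION = 'enter your region'
WORKTEAM_ARN = "enter your private workforce arn"
BUCKET = data_bucket
role = get_execution_role()
region = boto3.session.Session().region_name
prefix = "custom-classify" + str(uuid.uuid1())
# boto3 clients used by the following cells, named as those cells expect
sagemaker = boto3.client('sagemaker', REGION)
a2i_runtime_client = boto3.client('sagemaker-a2i-runtime', REGION)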

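Create the worker task template: define the HTML template using the pre-built classification template.

# Run the notebook cell that defines the HTML template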
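
The template cell itself is not reproduced in this article. To give a rough idea of what such a template looks like, here is a minimal sketch built on the `crowd-classifier` element from the Crowd HTML Elements library. The category names, header, and instruction text are placeholders, and the template in the actual notebook may differ.

```python
# Hypothetical worker task template; categories and wording are placeholders.
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
  <crowd-classifier
    name="category"
    categories="['payslip', 'bank_statement']"
    header="Select the class that best describes this document text"
  >
    <classification-target>
      {{ task.input.taskObject }}
    </classification-target>
    <full-instructions header="Classification instructions">
      <p>Read the text and choose the matching document class.</p>
    </full-instructions>
    <short-instructions>
      Choose the document class.
    </short-instructions>
  </crowd-classifier>
</crowd-form>
"""
```

The placeholder `{{ task.input.taskObject }}` is filled from the `taskObject` key of the JSON passed as the human loop input, so the key name in the template and in the input must match.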

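Create the task UI:

def create_task_ui():
    # sagemaker here is the boto3 SageMaker client, not the SageMaker Python SDK module;
    # taskUIName and template come from earlier notebook cells
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response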

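Call the function to create the task UI:

# Run the cell that calls the function, keeping the ARN for the workflow definition
humanTaskUiArn = create_task_ui()['HumanTaskUiArn']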

Define the human review workflow:

create_workflow_definition_response = sagemaker.create_flow_definition(
    FlowDefinitionName=flowDefinitionName,
    RoleArn=role,
    HumanLoopConfig={
        "WorkteamArn": WORKTEAM_ARN,
        "HumanTaskUiArn": humanTaskUiArn,
        "TaskCount": 1,
        "TaskDescription": "Read the instructions",
        "TaskTitle": "Classify the text"
    },
    OutputConfig={
        "S3OutputPath": "s3://" + BUCKET + "/output"
    }
)

flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn']
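
A flow definition is not usable immediately after it is created, so it helps to wait until its status is `Active` before starting human loops. The sketch below polls the boto3 SageMaker `describe_flow_definition` call; the function name, polling interval, and retry limit are assumptions.

```python
import time


def wait_for_flow_definition(sagemaker_client, flow_definition_name,
                             poll_seconds=2, max_polls=60):
    """Return True once the flow definition is Active, False if it never is."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_flow_definition(
            FlowDefinitionName=flow_definition_name
        )
        status = description["FlowDefinitionStatus"]
        if status == "Active":
            return True
        if status == "Failed":
            raise RuntimeError(f"Flow definition {flow_definition_name} failed")
        time.sleep(poll_seconds)
    return False


# Usage (requires AWS credentials):
#   wait_for_flow_definition(sagemaker, flowDefinitionName)
```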

Get the response from the Comprehend classifier endpoint and parse it:

response = comprehend.classify_document(
    Text=page_string,
    EndpointArn=ENDPOINT_ARN
)
print(response)
# Take the first returned class as the prediction
p = response['Classes'][0]['Name']
score = response['Classes'][0]['Score']
# Repackage only the fields that the human loop needs
response = {}
response['utterance'] = page_string
response['prediction'] = p
response['confidence'] = score
print(response)
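
The parsing above relies on the first entry in `Classes` being the best one. A slightly more defensive variant picks the class with the highest score explicitly. The helper name and the sample response are illustrative assumptions; only the `Classes`, `Name`, and `Score` keys come from the code above.

```python
def summarize_prediction(text, classify_response):
    """Return the highest-scoring class together with the classified text."""
    best = max(classify_response["Classes"], key=lambda c: c["Score"])
    return {
        "utterance": text,
        "prediction": best["Name"],
        "confidence": best["Score"],
    }


# Hand-written response with the same shape as a ClassifyDocument result
sample_classify_response = {
    "Classes": [
        {"Name": "bank_statement", "Score": 0.12},
        {"Name": "payslip", "Score": 0.88},
    ]
}
summary = summarize_prediction("Net pay 800", sample_classify_response)
```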

Set the confidence threshold and trigger the human loop:

import json

human_loops_started = []
CONFIDENCE_SCORE_THRESHOLD = .90
# Send predictions whose confidence is below the threshold to human review
if response['confidence'] < CONFIDENCE_SCORE_THRESHOLD:
    humanLoopName = str(uuid.uuid4())
    human_loop_input = {}
    human_loop_input['taskObject'] = response['utterance']
    start_loop_response = a2i_runtime_client.start_human_loop(
        HumanLoopName=humanLoopName,
        FlowDefinitionArn=flowDefinitionArn,
        HumanLoopInput={
            "InputContent": json.dumps(human_loop_input)
        }
    )
    print(human_loop_input)
    human_loops_started.append(humanLoopName)
    print(f'Score is less than the threshold of {CONFIDENCE_SCORE_THRESHOLD}')
    print(f'Starting human loop with name: {humanLoopName}  \n')
else:
    print('No human loop created. \n')

Get the private work team link and start labeling:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

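Complete the classification task and retrieve the results:

completed_human_loops = []
# Collect the loops that workers have finished
for name in human_loops_started:
    resp = a2i_runtime_client.describe_human_loop(HumanLoopName=name)
    if resp['HumanLoopStatus'] == 'Completed':
        completed_human_loops.append(resp)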

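Review the human review results:

import json
import pprint
import re

pp = pprint.PrettyPrinter(indent=2)
for resp in completed_human_loops:
    # The output URI is s3://<bucket>/<key>; split off the bucket prefix to get the key
    splitted_string = re.split('s3://' + data_bucket + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=data_bucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)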
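
To close the active learning loop, the labels chosen by reviewers can be turned into new training rows. The sketch below assumes the output JSON has a `humanAnswers` list whose `answerContent` holds the selection under the form field name, and an `inputContent` object with the original `taskObject`. The exact structure depends on the worker template, so the field name `category`, the helper name, and the sample output are all assumptions.

```python
def extract_training_rows(json_output, field="category"):
    """Return (label, text) pairs from a human loop output document."""
    text = json_output.get("inputContent", {}).get("taskObject", "")
    rows = []
    for answer in json_output.get("humanAnswers", []):
        selection = answer.get("answerContent", {}).get(field, {})
        label = selection.get("label")
        if label is not None:
            rows.append((label, text))
    return rows


# Hand-written output document, shaped like the assumed A2I output
sample_output = {
    "humanLoopName": "example-loop",
    "inputContent": {"taskObject": "Net pay 800"},
    "humanAnswers": [
        {"workerId": "w1", "answerContent": {"category": {"label": "payslip"}}},
    ],
}
new_rows = extract_training_rows(sample_output)
```

Rows collected this way can be appended to the training CSV before the classifier is retrained.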

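4. PDF batch processing solution

Next, we look at a solution for processing PDF documents in batches, a common requirement for businesses.

4.1 Technical requirements

AWS account access: https://aws.amazon.com/console/
Python code and sample datasets: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2016
Video of the code in action: https://bit.ly/3nobrCo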

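4.2 Introducing the PDF batch processing use case

During business registration, the accounting department generates PDF documents. The text lines detected in them need to be reviewed, especially lines that may have been misread because of poor document quality. We therefore add a batch processing component to the document processing solution:
- Use Amazon Textract's asynchronous document text detection API to extract the text
- Use Amazon A2I to set up a human workflow that reviews and corrects text detected with less than 95% confidence
- Use Amazon DynamoDB to store both the originally detected text and the corrected text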
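
The first two components can be sketched as follows: start an asynchronous Textract job, poll until it finishes, page through the results, and keep the lines below the 95% confidence mark. The boto3 calls `start_document_text_detection` and `get_document_text_detection` are real Textract operations; the function names, polling settings, and bucket and key values are assumptions, and this is not the code from the book's repository.

```python
import time


def collect_lines(textract_client, bucket, key, poll_seconds=5, max_polls=120):
    """Run an asynchronous text detection job and return all LINE blocks."""
    job_id = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )["JobId"]
    result = {"JobStatus": "IN_PROGRESS"}
    for _ in range(max_polls):
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Job {job_id} ended with status {result['JobStatus']}")
    lines = []
    while True:
        lines.extend(b for b in result.get("Blocks", []) if b["BlockType"] == "LINE")
        token = result.get("NextToken")
        if not token:
            return lines
        # Large documents are returned in pages linked by NextToken
        result = textract_client.get_document_text_detection(JobId=job_id, NextToken=token)


def low_confidence_lines(lines, threshold=95.0):
    """Keep the lines whose detection confidence (0-100) is below the threshold."""
    return [line for line in lines if line["Confidence"] < threshold]


# Usage (requires AWS credentials; bucket and key are placeholders):
#   import boto3
#   lines = collect_lines(boto3.client("textract"), "my-bucket", "forms/sample.pdf")
#   to_review = low_confidence_lines(lines)
```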

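4.3 Building the solution

We use an Amazon SageMaker Jupyter notebook to carry out the following tasks step by step:
1. Create a private labeling work team: use the Amazon SageMaker console to create a private labeling work team. For details, see https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private.html
2. Start the text detection job: inspect the sample registration forms in the cloned GitHub repository and start an asynchronous text detection job with Amazon Textract
3. Get the detection results and check the confidence: retrieve the results of the text detection job, select specific text lines, and check their detection confidence scores
4. Set up the human review loop: set up an Amazon A2I human review loop using a tabular task UI template and send the text lines from all documents to the human loop
5. Complete the review tasks: log in as a private worker, work through the assigned review tasks, and correct the low-confidence text lines in all documents
6. Upload the data to DynamoDB: upload the detected and corrected text lines to a DynamoDB table for downstream processing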
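
The final step, uploading to DynamoDB, might look like the sketch below. It uses the `batch_writer` helper of a boto3 DynamoDB `Table` resource. The table name, the attribute names, and a key schema of `document` plus `line_number` are assumptions for illustration. Confidence values are converted to `Decimal` because the boto3 resource API does not accept Python floats.

```python
from decimal import Decimal


def build_line_item(document_name, line_number, detected_text, confidence,
                    corrected_text=None):
    """Build one table item holding the detected and the corrected text."""
    return {
        "document": document_name,        # assumed partition key
        "line_number": line_number,       # assumed sort key
        "detected_text": detected_text,
        "confidence": Decimal(str(confidence)),
        # Fall back to the detected text when the reviewer changed nothing
        "corrected_text": detected_text if corrected_text is None else corrected_text,
    }


def upload_items(table, items):
    """Write items in batches; batch_writer groups and sends the writes."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)


# Usage (requires AWS credentials and an existing table; the name is a placeholder):
#   import boto3
#   table = boto3.resource("dynamodb").Table("document-lines")
#   upload_items(table, [build_line_item("form1.pdf", 1, "Nane: Jane", 82.5, "Name: Jane")])
example_item = build_line_item("form1.pdf", 1, "Nane: Jane", 82.5, "Name: Jane")
```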

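With these steps we can classify documents, test them against a real-time endpoint, and process PDFs in batches, while human review improves the accuracy of the results. In practice, the parameters and the workflow can be adjusted to fit specific requirements.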

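5. Detailed workflow analysis and summary

To make the overall document processing workflow easier to follow, the steps are summarized in the table below:

Step	Task	Main operations	Services involved
1	Create the Amazon Comprehend classification training job	Create the data retrieval function, call it to build the DataFrame, save it as CSV and upload it to S3, create the classification job in the console	Amazon Comprehend, Amazon S3
2	Create an Amazon Comprehend real-time endpoint and test a sample document	Create the real-time endpoint, copy the endpoint ARN, select a sample test document and extract its text, call the API to classify it	Amazon Comprehend
3	Set up active learning with a human in the loop	Create a private work team, create the worker task template, create the task UI, define the human review workflow, get and parse the endpoint response, set the threshold that triggers the human loop, get the link and start labeling, complete the task and retrieve the results, review the human review results	Amazon Comprehend, Amazon A2I, Amazon SageMaker
4	PDF batch processing solution	Create a private labeling work team, start the text detection job, get the detection results and check the confidence, set up the human review loop, complete the review tasks, upload the data to DynamoDB	Amazon Textract, Amazon A2I, Amazon DynamoDB, Amazon SageMaker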

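The whole workflow as a mermaid flowchart:

graph LR
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;

    A(Start):::process --> B(Create the Amazon Comprehend classification training job):::process
    B --> C(Create the Amazon Comprehend real-time endpoint and test a sample document):::process
    C --> D(Set up active learning with a human in the loop):::process
    D --> E(PDF batch processing solution):::process
    E --> F(End):::process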

6. Code explanation and notes

The workflow above involves several code snippets. The key ones are explained below:

Data retrieval function:

import os

def data_retriever_from_path(path):
    # `names` is the list of class (folder) names, assumed to be defined in an earlier notebook cell
    mapping = {}
    for i in names:
        if os.path.isdir(path + i):
            mapping[i] = sorted(os.listdir(path + i))
    # label (class/target) list
    label_compre = []
    # text file data list
    text_compre = []
    # iterate over each class and its list of files
    for i, j in mapping.items():
        for k in j:
            # append the label for this file
            label_compre.append(i)
            # read the file, replace newlines with spaces, and append to the data list
            with open(path + i + "/" + k, encoding="utf-8") as f:
                text_compre.append(f.read().replace('\n', ' '))
    return label_compre, text_compre

This function reads the files under the given path and stores each file's label and text content in the label_compre and text_compre lists respectively. The label is the name of the sub-folder that contains the file.

Set the confidence threshold and trigger the human loop:

import json

human_loops_started = []
CONFIDENCE_SCORE_THRESHOLD = .90
# Send predictions whose confidence is below the threshold to human review
if response['confidence'] < CONFIDENCE_SCORE_THRESHOLD:
    humanLoopName = str(uuid.uuid4())
    human_loop_input = {}
    human_loop_input['taskObject'] = response['utterance']
    start_loop_response = a2i_runtime_client.start_human_loop(
        HumanLoopName=humanLoopName,
        FlowDefinitionArn=flowDefinitionArn,
        HumanLoopInput={
            "InputContent": json.dumps(human_loop_input)
        }
    )
    print(human_loop_input)
    human_loops_started.append(humanLoopName)
    print(f'Score is less than the threshold of {CONFIDENCE_SCORE_THRESHOLD}')
    print(f'Starting human loop with name: {humanLoopName}  \n')
else:
    print('No human loop created. \n')

This code compares the model's confidence score with the threshold. When the score falls below the threshold, it starts a human loop so that a reviewer can check the prediction; otherwise the prediction is accepted and no loop is created.

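7. Practical applications and extensions

In practice, the workflow above can be extended and tuned for different business scenarios:

Add training data: for document classification, increasing the amount and variety of training data can improve the model's accuracy. Collect more document types, such as invoices and contracts, and include them in training.
Adjust the confidence threshold: set the threshold according to business needs and the tolerance for errors. For scenarios that demand very high accuracy, raise the threshold so that more predictions go to human review.
Automate the workflow: use tools such as Lambda functions to automate the whole document processing workflow, which reduces manual steps and speeds up processing. For example, when a new PDF is uploaded to S3, automatically trigger the text detection and classification jobs.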
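
As an illustration of the automation idea, the sketch below shows a Lambda handler that reacts to S3 upload events and starts a Textract text detection job for each PDF. The event layout follows the standard S3 notification format; the handler factory, the function names, and the sample event are assumptions, and the classification step is left out for brevity.

```python
from urllib.parse import unquote_plus


def extract_pdf_objects(event):
    """Return (bucket, key) pairs for every PDF referenced in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            objects.append((bucket, key))
    return objects


def make_handler(textract_client):
    """Build a Lambda handler bound to the given Textract client."""
    def lambda_handler(event, context):
        job_ids = []
        for bucket, key in extract_pdf_objects(event):
            response = textract_client.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            job_ids.append(response["JobId"])
        return {"started_jobs": job_ids}
    return lambda_handler


# In a deployed function (requires AWS credentials):
#   import boto3
#   lambda_handler = make_handler(boto3.client("textract"))

# Hand-written event fragment for illustration
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "forms/my+form.pdf"}}},
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "notes.txt"}}},
    ]
}
pdf_objects = extract_pdf_objects(sample_event)
```

Passing the client into `make_handler` keeps the handler easy to test locally with a stub in place of the real Textract client.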