NLP Tools: Using UIE in an Offline Environment
0. About UIE
UIE is a model Baidu open-sourced this year that can be applied to zero-shot extraction tasks. It is powerful and simple to use; while it may not usher NLP into a new era, it does dramatically lower the engineering barrier for basic NLP tasks, making it a very practical tool.
The official Git repository shows how to run predictions with UIE in just a few lines of code:
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = Taskflow('information_extraction', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint
[{'时间': [{'end': 6,
          'probability': 0.9857378532924486,
          'start': 0,
          'text': '2月8日上午'}],
  '赛事名称': [{'end': 23,
            'probability': 0.8503089953268272,
            'start': 6,
            'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  '选手': [{'end': 31,
          'probability': 0.8981548639781138,
          'start': 28,
          'text': '谷爱凌'}]}]
A single line of code creates a model ready for inference, which is very convenient when you are online. In some scenarios, however, we need to deploy in an offline environment, so this article explains how to use the UIE model without an internet connection.
Although I rarely use paddle myself, I vaguely remembered that paddle models can be created with a `from_pretrained` method, which made me suspect that paddle borrowed its code logic from the `transformers` library. Since I know the `transformers` code and logic inside out, I decided to try applying the `transformers` mental model to `paddle`.
The basic logic is this: the model first assumes the input string is a local path and tries to load from that path; if that fails, it treats the string as a model name, builds the corresponding URL, downloads the model into a local cache, and tags it with an md5 checksum.
- If you are not interested in the details, jump straight to Section 2.
- If you want the full walkthrough, read on in order.
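The lookup order just described can be sketched in a few lines (a simplified illustration, not paddlenlp's actual code; the URL base and file name here are made-up placeholders):

```python
import os

def resolve_pretrained(name_or_path, url_base="https://example.com/models"):
    """Mimic the lookup order: existing local directory first, then remote name."""
    if os.path.isdir(name_or_path):
        # The string is a local path that exists: load directly from it.
        return ("local", name_or_path)
    # Otherwise treat the string as a model name and build a download URL;
    # the real code would fetch the files, cache them, and record md5 tags.
    return ("remote", f"{url_base}/{name_or_path}/model_state.pdparams")
```

Passing a model name like `uie-base` falls through to the remote branch, while passing the path of a folder you prepared yourself takes the local branch, which is the behavior the offline trick relies on.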
1. Using UIE Offline in Detail
First, install paddlenlp:
pip install paddlenlp==2.3.4
From the official snippet we know that paddlenlp creates the model through a Taskflow. As the name suggests, this class wraps many kinds of models, UIE among them, so our job is to pull UIE out of it. That boils down to two steps: download the model files to local storage, and instantiate the model.
We open the `Taskflow` class in `taskflow.py` and find its constructor; whatever the class, the instance must be built there, so let's look at the constructor's logic:
def __init__(self, task, model=None, mode=None, device_id=0, **kwargs):
    assert task in TASKS, "The task name:{} is not in Taskflow list, please check your task name.".format(
        task)
    self.task = task
    if self.task in ["word_segmentation", "ner"]:
        tag = "modes"
        ind_tag = "mode"
        self.model = mode
    else:
        tag = "models"
        ind_tag = "model"
        self.model = model
    if self.model is not None:
        assert self.model in set(TASKS[task][tag].keys(
        )), "The {} name: {} is not in task:[{}]".format(tag, model, task)
    else:
        self.model = TASKS[task]['default'][ind_tag]
    # ... rest of the constructor omitted
So in the official example, the code takes the else branch of the second conditional and sets self.model with the help of the module-level `TASKS` dict. In the same .py file we can find `TASKS`, including the part that corresponds to UIE:
'information_extraction': {
    "models": {
        "uie-base": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-base"
        },
        "uie-medium": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-medium"
        },
        "uie-mini": {
            "task_class": UIETask,
            "hidden_size": 384,
            "task_flag": "information_extraction-uie-mini"
        },
        "uie-micro": {
            "task_class": UIETask,
            "hidden_size": 384,
            "task_flag": "information_extraction-uie-micro"
        },
        "uie-nano": {
            "task_class": UIETask,
            "hidden_size": 312,
            "task_flag": "information_extraction-uie-nano"
        },
        "uie-tiny": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-tiny"
        },
        "uie-medical-base": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-medical-base"
        },
    },
    "default": {
        "model": "uie-base"
    }
},
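The fallback to the default model can be illustrated with a tiny stand-in for this table (a simplified sketch, not the real `TASKS` dict):

```python
# Miniature stand-in for paddlenlp's TASKS table (uie-base entry only).
TASKS = {
    "information_extraction": {
        "models": {"uie-base": {"task_flag": "information_extraction-uie-base"}},
        "default": {"model": "uie-base"},
    }
}

def resolve_model(task, model=None):
    """Mirror the else-branch of Taskflow.__init__: fall back to the default."""
    entry = TASKS[task]
    if model is not None:
        assert model in entry["models"], f"unknown model {model}"
        return model
    return entry["default"]["model"]

print(resolve_model("information_extraction"))  # → uie-base
```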
Each model size is simply given its own parameters. Following this lead, we locate the `UIETask` class and jump to `information_extraction.py`, where the class attributes immediately reveal the download links for each UIE model:
class UIETask(Task):
    """
    Universal Information Extraction Task.
    Args:
        task(string): The name of task.
        model(string): The model name in the task.
        kwargs (dict, optional): Additional keyword arguments passed along to the specific task.
    """

    resource_files_names = {
        "model_state": "model_state.pdparams",
        "model_config": "model_config.json",
        "vocab_file": "vocab.txt",
        "special_tokens_map": "special_tokens_map.json",
        "tokenizer_config": "tokenizer_config.json"
    }
    # vocab.txt/special_tokens_map.json/tokenizer_config.json are common to the default model.
    resource_files_urls = {
        "uie-base": {
            "model_state": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base_v1.0/model_state.pdparams",
                "aeca0ed2ccf003f4e9c6160363327c9b"
            ],
            "model_config": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
                "a36c185bfc17a83b6cfef6f98b29c909"
            ],
            "vocab_file": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
                "1c1c1f4fd93c5bed3b4eebec4de976a8"
            ],
            "special_tokens_map": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
                "8b3fb1023167bb4ab9d70708eb05f6ec"
            ],
            "tokenizer_config": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
                "59acb0ce78e79180a2491dfd8382b28c"
            ]
        },
        # ... other models omitted
Using these links, we download all the relevant files and put them in a single folder named `uie`. This folder's path is what we will later pass to `from_pretrained`.
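If you'd rather script the downloads, a minimal standard-library sketch might look like this (the URLs mirror the `uie-base` entries above; the helper and its names are my own, not part of paddlenlp):

```python
import os
import urllib.request

BASE = "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction"
UIE_BASE_FILES = {
    "model_state.pdparams": f"{BASE}/uie_base_v1.0/model_state.pdparams",
    "model_config.json": f"{BASE}/uie_base/model_config.json",
    "vocab.txt": f"{BASE}/uie_base/vocab.txt",
    "special_tokens_map.json": f"{BASE}/uie_base/special_tokens_map.json",
    "tokenizer_config.json": f"{BASE}/uie_base/tokenizer_config.json",
}

def download_uie_files(dest_dir="uie", files=UIE_BASE_FILES):
    """Download every resource file for one UIE model into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for filename, url in files.items():
        target = os.path.join(dest_dir, filename)
        if not os.path.exists(target):  # skip anything already downloaded
            urllib.request.urlretrieve(url, target)
    return dest_dir
```

Run `download_uie_files()` once on a machine with internet access, then copy the resulting folder to the offline machine.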
Looking at UIETask's constructor, we find that it does not create the model there, so the model must be created in the parent class. Following that lead to the parent class `Task` in `task.py`, we go in a circle and end up back in the subclass: the model is built in the overridden `_construct_model` method. So, back in `UIETask`, we find `_construct_model` and hard-code the name passed to `from_pretrained` as our local path, i.e. the directory we just created containing all the model files:
def _construct_model(self, model):
    """
    Construct the inference model for the predictor.
    """
    # model_instance = UIE.from_pretrained(self._task_path)
    # Hard-code this path to the directory holding the downloaded model files
    model_instance = UIE.from_pretrained("/path/to/your/uie")
    self._model = model_instance
    self._model.eval()
With the files found locally, the model no longer tries to download anything.
That should have been the end of it, but running the code still raised an error. On closer inspection, it turns out paddle apparently never intended users to load local files: the `Task` class has a check mechanism that validates md5 checksums. Since we paid no attention to checksums when downloading, the check fails, paddlenlp concludes the files are missing, and it goes back to the internet to download them. The fix is simply to comment the check out.
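For reference, the check boils down to an md5 comparison along these lines (a sketch of the idea, not paddlenlp's actual implementation):

```python
import hashlib

def md5_matches(path, expected_md5):
    """Return True if the file's md5 hex digest equals the expected string."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            h.update(chunk)
    return h.hexdigest() == expected_md5
```

An alternative to commenting the check out is to verify each downloaded file against the md5 string listed next to its URL, so that the built-in check passes.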
In `UIETask`, comment out the check line:
def __init__(self, task, model, schema, **kwargs):
    super().__init__(task=task, model=model, **kwargs)
    self._schema_tree = None
    self.set_schema(schema)
    # This is the line to comment out:
    # self._check_task_files()
    self._construct_tokenizer()
    self._check_predictor_type()
    self._get_inference_model()
    self._usage = usage
    self._max_seq_len = self.kwargs[
        'max_seq_len'] if 'max_seq_len' in self.kwargs else 512
    self._batch_size = self.kwargs[
        'batch_size'] if 'batch_size' in self.kwargs else 64
    self._split_sentence = self.kwargs[
        'split_sentence'] if 'split_sentence' in self.kwargs else False
    self._position_prob = self.kwargs[
        'position_prob'] if 'position_prob' in self.kwargs else 0.5
    self._lazy_load = self.kwargs[
        'lazy_load'] if 'lazy_load' in self.kwargs else False
    self._num_workers = self.kwargs[
        'num_workers'] if 'num_workers' in self.kwargs else 0
After that, the original one-liner creates the model even offline:
ie = Taskflow('information_extraction', schema=schema)
2. Summary
The walkthrough above was wordy; if you skipped it, start here.
2.1 Download the files
Step 1: create a directory and download the model files into it. Download only the model you plan to use, but the file set must be complete: for example, to use the base model you need every file listed under `uie-base`, all placed in this one directory.
The download URLs are as follows:
resource_files_urls = {
"uie-base": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base_v1.0/model_state.pdparams",
"aeca0ed2ccf003f4e9c6160363327c9b"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
"a36c185bfc17a83b6cfef6f98b29c909"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
"uie-medium": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium_v1.0/model_state.pdparams",
"15874e4e76d05bc6de64cc69717f172e"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium/model_config.json",
"6f1ee399398d4f218450fbbf5f212b15"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
"uie-mini": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_mini_v1.0/model_state.pdparams",
"f7b493aae84be3c107a6b4ada660ce2e"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_mini/model_config.json",
"9229ce0a9d599de4602c97324747682f"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
"uie-micro": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_micro_v1.0/model_state.pdparams",
"80baf49c7f853ab31ac67802104f3f15"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_micro/model_config.json",
"07ef444420c3ab474f9270a1027f6da5"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
"uie-nano": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_nano_v1.0/model_state.pdparams",
"ba934463c5cd801f46571f2588543700"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_nano/model_config.json",
"e3a9842edf8329ccdd0cf6039cf0a8f8"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
# Rename to `uie-medium` and the name of `uie-tiny` will be deprecated in future.
"uie-tiny": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny_v0.1/model_state.pdparams",
"15874e4e76d05bc6de64cc69717f172e"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny/model_config.json",
"6f1ee399398d4f218450fbbf5f212b15"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
},
"uie-medical-base": {
"model_state": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medical_base_v0.1/model_state.pdparams",
"569b4bc1abf80eedcdad5a6e774d46bf"
],
"model_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
"a36c185bfc17a83b6cfef6f98b29c909"
],
"vocab_file": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
"1c1c1f4fd93c5bed3b4eebec4de976a8"
],
"special_tokens_map": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
"8b3fb1023167bb4ab9d70708eb05f6ec"
],
"tokenizer_config": [
"https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
"59acb0ce78e79180a2491dfd8382b28c"
]
}
}
2.2 Change the load path
We just downloaded the model files into a folder; note its path. Then find `information_extraction.py` in paddlenlp. The file usually lives under `site-packages/paddlenlp/taskflow/`; if you can't find it, just search for the file name.
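One quick way to locate the installed file is to ask Python itself. The helper below is my own, demonstrated with a stdlib module; on a machine with paddlenlp installed you would pass `paddlenlp.taskflow.information_extraction`:

```python
import importlib.util

def module_path(dotted_name):
    """Return the source file of an importable module, or None if not found."""
    spec = importlib.util.find_spec(dotted_name)
    return spec.origin if spec else None

# On a machine with paddlenlp installed:
# module_path("paddlenlp.taskflow.information_extraction")
print(module_path("json"))  # prints the path of the stdlib json package
```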
Around line 290, change the path passed to `from_pretrained` to the directory holding our model files:
def _construct_model(self, model):
    """
    Construct the inference model for the predictor.
    """
    # model_instance = UIE.from_pretrained(self._task_path)
    # Hard-code this path to the directory holding the downloaded model files
    model_instance = UIE.from_pretrained("/path/to/your/uie")
    self._model = model_instance
    self._model.eval()
2.3 Comment out the file check in the source
Finally, still in `information_extraction.py`, around line 248, comment out `self._check_task_files()`:
def __init__(self, task, model, schema, **kwargs):
    super().__init__(task=task, model=model, **kwargs)
    self._schema_tree = None
    self.set_schema(schema)
    # This is the line to comment out:
    # self._check_task_files()
    self._construct_tokenizer()
And with that, we're done.
If you have any questions, leave a comment. See you next time.