Transformers: Preprocessing data

Basic usage

Preprocessing is done with a tokenizer: it first splits the text into tokens (words or subwords) and then converts those tokens into numbers (IDs).
When pretraining or fine-tuning a model, load the tokenizer that matches the model with from_pretrained(), then call it on your text:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)
{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

input_ids: the numeric ID of each token
token_type_ids: when a pair of sentences is passed, marks which tokens belong to the first sentence and which to the second
attention_mask: marks which positions are padding; padding positions carry no information, so they are marked 0 and ignored by attention

input_ids can be decoded back into text.

tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

The tokenizer also accepts a list of sentences.

Through its parameters you can control how each sentence is handled:
1. Pad sentences that are too short (padding=True)
2. Truncate sentences that are too long (truncation=True)
3. Return tensors instead of Python lists (return_tensors='pt')

Processing multiple sentences at once

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}

Preprocessing the batch (padding, truncation, tensors)

batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)
{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
                      [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
                      [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

Parameter description (a combined sketch follows this list):
max_length: the target length used for padding and truncation; defaults to the maximum length the model accepts
padding: whether and how to pad, e.g. True / 'longest' (pad to the longest sequence in the batch) or 'max_length' (pad to max_length)
truncation: whether and how to truncate, e.g. True / 'longest_first', 'only_first', 'only_second'
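A short sketch combining these parameters; max_length=12 is an arbitrary value chosen for illustration:

batch = tokenizer(batch_sentences,
                  padding="max_length",   # pad every sequence up to max_length
                  truncation=True,        # cut sequences longer than max_length
                  max_length=12,          # illustrative value; defaults to the model's maximum
                  return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([3, 12])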

Sentence pairs

For tasks such as sentence similarity or question answering we need to feed two sentences at once, i.e. [CLS] Sequence A [SEP] Sequence B [SEP]:

encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer.decode(encoded_input["input_ids"])
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"

Lists of sentence pairs can also be passed directly:

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
               [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]

Pre-tokenized inputs

Pre-tokenized inputs are useful for tasks such as NER and POS tagging.
Pre-tokenized means the text has already been split into words, so we pass in a list of words instead of a whole sentence (note: in newer versions of transformers this argument is called is_split_into_words instead of is_pretokenized).

encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
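Pre-tokenized input also works for whole batches, where each element is a list of words; a brief sketch reusing the sentences from above:

batch_pretokenized = [["Hello", "I'm", "a", "single", "sentence"],
                      ["And", "another", "sentence"],
                      ["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_pretokenized, is_pretokenized=True, padding=True)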

Summary: the options above can be combined in a single call.

batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_pretokenized=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")
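Note that with is_pretokenized=True each batch element must itself be a list of words, not a plain string as defined earlier. A minimal sketch, splitting on whitespace purely for illustration, followed by feeding the result into a model:

pretokenized_firsts = [s.split() for s in batch_sentences]
pretokenized_seconds = [s.split() for s in batch_of_second_sentences]
batch = tokenizer(pretokenized_firsts,
                  pretokenized_seconds,
                  is_pretokenized=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")

# The returned dict can be unpacked straight into the model's forward call
from transformers import AutoModel
model = AutoModel.from_pretrained('bert-base-cased')
outputs = model(**batch)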

Reference:
https://huggingface.co/transformers/preprocessing.html
