1. Text preprocessing with Keras
2. One-hot encoding of multi-label data
sklearn.preprocessing.MultiLabelBinarizer(classes=None,
sparse_output=False)
The classes_ attribute: if the classes parameter is set, classes_ equals that parameter value; otherwise the label values are collected from the training data.
from sklearn.preprocessing import MultiLabelBinarizer as MLB
mlb = MLB()
onehot = mlb.fit_transform([(1, 2), (3, 4), (5,)])
print("onehot:", onehot)
print("labels:", mlb.classes_)
onehot: [[1 1 0 0 0]
[0 0 1 1 0]
[0 0 0 0 1]]
labels: [1 2 3 4 5]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
one_hot = mlb.fit_transform([['sci-fi', 'thriller'], ['comedy']]).toarray()
print("one_hot:", one_hot)
print("labels:", mlb.classes_)
one_hot: [[0 1 1]
[1 0 0]]
labels: ['comedy' 'sci-fi' 'thriller']
When the classes parameter is set, the classes_ attribute equals the classes parameter value:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=[2, 3, 4, 5, 6, 1])
onehot = mlb.fit_transform([(1, 2), (3, 4), (5,)])
label = mlb.classes_
print("onehot:", onehot)
print("label:", label)
onehot: [[1 0 0 0 0 1]
[0 1 1 0 0 0]
[0 0 0 1 0 0]]
label: [2 3 4 5 6 1]
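The mapping also works in reverse: MultiLabelBinarizer.inverse_transform (standard sklearn API) recovers the original label sets from a one-hot matrix. A minimal sketch using the same data as the first example:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform([(1, 2), (3, 4), (5,)])
# inverse_transform maps each row of the indicator matrix
# back to a tuple of the labels that were set in that row
recovered = mlb.inverse_transform(onehot)
print(recovered)  # [(1, 2), (3, 4), (5,)]
```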
3. Getting data batches with a batch iterator (batch_iter)
Method 1: combining TensorDataset and DataLoader
torch.utils.data.TensorDataset(data_tensor, target_tensor)
Parameters
data_tensor (Tensor) – contains the sample data
target_tensor (Tensor) – contains the sample targets (labels)
Indexing both tensors along the first dimension recovers the samples, so each [sample, label] pair corresponds one-to-one. (Recent PyTorch versions generalize the signature to TensorDataset(*tensors), accepting any number of tensors with matching first dimensions.)
torch.utils.data.DataLoader(dataset,
batch_size=1,
shuffle=False,
sampler=None,
num_workers=0,
collate_fn=<function default_collate>,
pin_memory=False,
drop_last=False)
Parameters
dataset (Dataset)
– the dataset to load from, i.e. the object returned by TensorDataset above
batch_size (int, optional)
– how many samples to load per batch (default: 1)
shuffle (bool, optional)
– set to True to reshuffle the data at every epoch (default: False)
sampler (Sampler, optional) – defines the strategy for drawing samples from the dataset. If specified, shuffle must be left at False (newer PyTorch versions raise an error if both are given).
num_workers (int, optional)
– how many subprocesses to use for data loading. 0 means the data is loaded in the main process (default: 0)
drop_last (bool, optional)
– set to True to drop the last incomplete batch when the dataset size is not divisible by the batch size. If False and the dataset size is not divisible by the batch size, the last batch is simply smaller (default: False)
Returns: an iterator that yields one batch at a time; all batches can be traversed with a for loop.
Example 3.1
from torch.utils.data import TensorDataset, DataLoader
import torch

x = torch.rand(4, 3)
y = torch.rand(4)
print("x:", x)
print("y:", y)
dataset = TensorDataset(x, y)
print("dataset[0]:", dataset[0])
batch_iter = DataLoader(dataset, shuffle=True, batch_size=2)
for x, y in batch_iter:
    print("x:", x, "\t", "y:", y)
Output
x: tensor([[0.1339, 0.5167, 0.4475],
[0.5751, 0.5181, 0.0108],
[0.2190, 0.0514, 0.3397],
[0.8105, 0.5136, 0.6912]])
y: tensor([0.5459, 0.3035, 0.1959, 0.3087])
dataset[0]: (tensor([0.1339, 0.5167, 0.4475]), tensor(0.5459))
x: tensor([[0.5751, 0.5181, 0.0108],
[0.1339, 0.5167, 0.4475]]) y: tensor([0.3035, 0.5459])
x: tensor([[0.8105, 0.5136, 0.6912],
[0.2190, 0.0514, 0.3397]]) y: tensor([0.3087, 0.1959])
Method 2: combining a custom Dataset class with DataLoader
torch.utils.data.Dataset
is an abstract class representing a dataset. A custom dataset class must inherit from it and implement the __len__ and __getitem__ methods, as in the MyDataset class defined below.
from torch.utils.data import DataLoader, Dataset
import torch

class MyDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return len(self.x)  # use self.x, not the global x
x = torch.rand(4,3)
y = torch.rand(4)
print("x:", x)
print("y:", y)
dataset = MyDataset(x, y)
print("dataset[0]:", dataset[0])
batch_iter = DataLoader(dataset, shuffle=True, batch_size=2)
for x, y in batch_iter:
    print("x:", x, "\t", "y:", y)
Output
x: tensor([[0.2112, 0.1261, 0.5459],
[0.4378, 0.0240, 0.8877],
[0.1814, 0.6185, 0.7762],
[0.2082, 0.6414, 0.4059]])
y: tensor([0.8861, 0.0685, 0.8854, 0.8199])
dataset[0]: (tensor([0.2112, 0.1261, 0.5459]), tensor(0.8861))
x: tensor([[0.2112, 0.1261, 0.5459],
[0.1814, 0.6185, 0.7762]]) y: tensor([0.8861, 0.8854])
x: tensor([[0.2082, 0.6414, 0.4059],
[0.4378, 0.0240, 0.8877]]) y: tensor([0.8199, 0.0685])
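The collate_fn parameter appears in the DataLoader signature above but was not described: it controls how a list of individual samples is merged into one batch. A common use is padding variable-length samples before stacking. A minimal sketch, where SeqDataset and pad_collate are hypothetical names for illustration:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SeqDataset(Dataset):
    """Toy dataset of variable-length 1-D tensors."""
    def __init__(self, seqs):
        self.seqs = seqs

    def __getitem__(self, index):
        return self.seqs[index]

    def __len__(self):
        return len(self.seqs)

def pad_collate(batch):
    # Pad every sequence in the batch with zeros to the length of the
    # longest one, then stack into a single (batch, max_len) tensor.
    max_len = max(seq.size(0) for seq in batch)
    padded = [torch.cat([seq, seq.new_zeros(max_len - seq.size(0))])
              for seq in batch]
    return torch.stack(padded)

seqs = [torch.tensor([1., 2.]), torch.tensor([3.]), torch.tensor([4., 5., 6.])]
loader = DataLoader(SeqDataset(seqs), batch_size=3, collate_fn=pad_collate)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([3, 3])
```

Without a custom collate_fn, the default collate function would raise an error here, since tensors of different lengths cannot be stacked directly.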