Keras 3.0 in Detail: Data loading: Text data loading

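This article explains how to use Keras's text_dataset_from_directory function to load the text files under a directory into a dataset for training neural networks, and covers its key parameters and example usage. The function itself yields raw strings; turning them into integer sequences is a separate vectorization step.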

text_dataset_from_directory

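Reads the text files in a directory and creates a tf.data.Dataset from them. The dataset yields batches of raw text strings (together with their labels, if any). It does not tokenize the text: converting the strings to integer sequences for a neural network is a separate step, for example with a TextVectorization layer.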

keras.utils.text_dataset_from_directory(
    directory,
    labels="inferred",
    label_mode="int",
    class_names=None,
    batch_size=32,
    max_length=None,
    shuffle=True,
    seed=None,
    validation_split=None,
    subset=None,
    follow_links=False,
)
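Parameters
  • directory: Path to the directory containing the text files. With labels="inferred", it should contain one subdirectory per class, each holding that class's .txt files.
  • labels="inferred": Defaults to "inferred", meaning labels are generated from the directory structure (the subdirectory names), not from the file names. Other options are None (no labels) or a list/tuple of integer labels with one entry per text file, ordered by the alphanumeric order of the file paths.
  • label_mode="int": Defaults to "int", meaning labels are encoded as integers. Other options are "categorical" (one-hot vectors), "binary" (0 or 1, for exactly two classes), and None (no labels).
  • class_names=None: Optional, and only valid when labels="inferred". An explicit list of class names that must match the subdirectory names; it controls the order of the classes, which otherwise is alphanumeric.
  • batch_size=32: Number of samples per batch. If None, the data is not batched.
  • max_length=None: Optional maximum length of a text string. Longer texts are truncated to max_length; if not specified, texts are not truncated.
  • shuffle=True: Whether to shuffle the data. If False, the data is sorted in alphanumeric order.
  • seed=None: Random seed for shuffling, used to make results reproducible.
  • validation_split=None: Optional float between 0 and 1 giving the fraction of the data to reserve for validation.
  • subset=None: Optional; which subset to return: "training", "validation", or "both" (returns a tuple of the training and validation datasets). Only used when validation_split is set.
  • follow_links=False: Whether to visit subdirectories pointed to by symbolic links.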
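To make labels="inferred" more concrete, the following is a minimal pure-Python sketch of how class names and integer labels can be derived from a directory layout. It is an illustration only, not Keras's implementation: infer_labels, make_demo_directory, and the pos/neg folder names are hypothetical names chosen for this example.

```python
import os
import tempfile


def infer_labels(directory):
    """Return (file_paths, labels, class_names) for a directory of class subfolders."""
    # Class names are the subdirectory names, in alphanumeric order.
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    file_paths, labels = [], []
    for index, name in enumerate(class_names):
        class_dir = os.path.join(directory, name)
        for file_name in sorted(os.listdir(class_dir)):
            # Only .txt files are considered, one sample per file.
            if file_name.endswith(".txt"):
                file_paths.append(os.path.join(class_dir, file_name))
                labels.append(index)
    return file_paths, labels, class_names


def make_demo_directory(root):
    """Create a hypothetical layout: root/neg/*.txt and root/pos/*.txt."""
    samples = {"pos": ["great", "loved it"], "neg": ["awful"]}
    for class_name, texts in samples.items():
        os.makedirs(os.path.join(root, class_name))
        for i, text in enumerate(texts):
            path = os.path.join(root, class_name, f"{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)


with tempfile.TemporaryDirectory() as root:
    make_demo_directory(root)
    paths, labels, class_names = infer_labels(root)

print(class_names)  # ['neg', 'pos']
print(labels)       # [0, 1, 1]
```

The class index follows the sorted subdirectory names, which is why passing class_names explicitly is the way to control the class order.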
Example 1
from tensorflow import keras


# Create the datasets with keras.utils.text_dataset_from_directory.
# Both calls read the same directory and share validation_split and seed,
# so the training and validation subsets do not overlap.
train_dataset = keras.utils.text_dataset_from_directory(
    'path/to/train/directory',
    label_mode='categorical',
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42,
)

val_dataset = keras.utils.text_dataset_from_directory(
    'path/to/train/directory',
    label_mode='categorical',
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42,
)


# Define a function that turns a raw-string dataset into integer sequences
def create_dataset(dataset, vectorizer):
    # Map each (text, label) batch to (integer sequences, label)
    return dataset.map(lambda text, label: (vectorizer(text), label))


# Encode the text with a TextVectorization layer (the legacy Tokenizer is
# deprecated in favor of it). output_sequence_length pads or truncates
# every sequence to the same length.
vectorizer = keras.layers.TextVectorization(max_tokens=None, output_sequence_length=100)
vectorizer.adapt(train_dataset.map(lambda text, label: text))

# Build the vectorized training and validation sets
train_data = create_dataset(train_dataset, vectorizer)
val_data = create_dataset(val_dataset, vectorizer)
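The label_mode options can be illustrated with a small pure-Python sketch that encodes the same integer labels in each mode. This is only an illustration of the encodings, not Keras code: encode_labels is a hypothetical helper, and Keras returns tensors, not Python lists.

```python
def encode_labels(labels, num_classes, label_mode):
    """Encode integer class indices the way each label_mode describes."""
    if label_mode == "int":
        # Class indices are kept as integers.
        return list(labels)
    if label_mode == "categorical":
        # One-hot vector per sample, with a 1.0 at the class index.
        return [
            [1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels
        ]
    if label_mode == "binary":
        # Binary mode only makes sense with exactly two classes.
        if num_classes != 2:
            raise ValueError("binary labels require exactly 2 classes")
        return [float(label) for label in labels]
    raise ValueError(f"unsupported label_mode: {label_mode!r}")


labels = [0, 1, 1]
int_labels = encode_labels(labels, 2, "int")
one_hot_labels = encode_labels(labels, 2, "categorical")
binary_labels = encode_labels(labels, 2, "binary")

print(int_labels)      # [0, 1, 1]
print(one_hot_labels)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(binary_labels)   # [0.0, 1.0, 1.0]
```

The choice matters for the loss function: integer labels pair with a sparse categorical loss, one-hot labels with a categorical loss, and binary labels with a binary loss.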
Example 2
from tensorflow.keras.utils import text_dataset_from_directory

# Set the parameters
directory = 'path/to/your/directory'  # Path to the directory containing the text data
batch_size = 32  # Number of samples per batch
max_length = 100  # Maximum length of each sample; longer texts are truncated
seed = 42  # Required when validation_split is combined with shuffle=True

# Create the datasets
train_dataset = text_dataset_from_directory(
    directory,
    batch_size=batch_size,
    max_length=max_length,
    shuffle=True,
    seed=seed,
    validation_split=0.2,
    subset='training',   # subset='training' loads only the training portion
)

val_dataset = text_dataset_from_directory(
    directory,
    batch_size=batch_size,
    max_length=max_length,
    shuffle=True,
    seed=seed,
    validation_split=0.2,
    subset='validation',   # subset='validation' loads only the validation portion
)
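Why the training and validation calls should use the same seed and validation_split can be seen from a simplified pure-Python sketch of a seeded shuffle followed by a tail split. This is an assumption-laden illustration, not Keras's actual code: split_samples is a hypothetical helper, and Keras uses its own shuffling internally.

```python
import random


def split_samples(samples, validation_split, subset, seed):
    """Shuffle with a fixed seed, then take the head or the tail of the list."""
    shuffled = list(samples)
    # The same seed gives the same order in both calls.
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    cut = len(shuffled) - num_val
    if subset == "training":
        return shuffled[:cut]
    if subset == "validation":
        return shuffled[cut:]
    raise ValueError(f"unsupported subset: {subset!r}")


# Hypothetical file names standing in for the text files on disk.
files = [f"doc_{i}.txt" for i in range(10)]
train_files = split_samples(files, 0.2, "training", seed=42)
val_files = split_samples(files, 0.2, "validation", seed=42)

print(len(train_files), len(val_files))  # 8 2
print(set(train_files) & set(val_files))  # set()
```

Because both calls shuffle identically, the head and tail partition the data cleanly. With different seeds the two calls would shuffle differently, so a file could land in both subsets.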
