sklearn.utils.class_weight 样本均衡

当我们的数据,有多个类别,每个类别的数据量有很大差距时,这是我们需要对每个类别的样本做一次均衡,这样会让每个类别的特征都在一定程度上被模型学习,下面给出一个简单的小例子,如下:

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

def class_balance():

    class_weight = 'balanced'
    list_sample_number = np.array([9, 1, 2])
    label = [2] * list_sample_number[0] + [6] * list_sample_number[1] + [4] * list_sample_number[2]
    print(label)
    classes = [2, 6, 4]

    # 样本均衡后的权重
    weight = compute_class_weight(class_weight, classes, label)
    print('weight = ', weight, 'type=', type(weight))   # [0.44444444 4.         2.        ]

    # 样本均衡后,把每个类别的权重与当前类别下样本数相乘
    weight_result = np.multiply(weight, list_sample_number)
    print('weight_result=', weight_result)   # [4. 4. 4.]

    # 可以看到三种类别的样本,被均衡后的个数都为4

    return

class_balance()

 

import pandas as pd from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.model_selection import train_test_split from keras.models import Sequential from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten from sklearn.metrics import accuracy_score from sklearn.metrics import confusion_matrix, classification_report from sklearn.metrics import roc_auc_score from sklearn.utils.class_weight import compute_class_weight # 读取数据 data = pd.read_csv('database.csv') # 数据预处理 X = data.iloc[:, :-1].values y = data.iloc[:, -1].values scaler = StandardScaler() X = scaler.fit_transform(X) # 特征选择 pca = PCA(n_components=10) X = pca.fit_transform(X) # 划分训练集和测试集 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train) # 构建CNN模型 model = Sequential() model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(10, 1))) model.add(MaxPooling1D(pool_size=2)) model.add(Flatten()) model.add(Dense(10, activation='relu')) model.add(Dense(1, activation='sigmoid')) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # 训练模型 X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1)) X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1)) model.fit(X_train, y_train,class_weight=class_weights,epochs=100, batch_size=64, validation_data=(X_test, y_test)) # 预测结果 y_pred = model.predict(X_test) #检验值 accuracy = accuracy_score(y_test, y_pred) auc = roc_auc_score(y_test, y_pred) print(auc) print("Accuracy:", accuracy) print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred)) print('Classification Report:\n', classification_report(y_test, y_pred))
06-01
(style_tune) C:\Users\28996\Desktop\AI\persona_contrastive_finetuning>python Contrastive_Training_LM.py INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk). trainable params: 1,572,864 || all params: 1,838,401,536 || trainable%: 0.0856 训练集样本示例: {'anchor_input_ids': [56568, 118919, 116122, 11319], 'positive_input_ids': [116122, 20412, 107340, 9370, 100357, 102323, 3837, 109202, 104078, 103975, 100675, 101940, 100912, 105054, 6313], 'negative_input_ids': [100323, 104307, 99245, 9370, 106059, 104060, 3837, 104530, 115604, 99329, 11319]} 验证集样本示例: {'anchor_input_ids': [56568, 118919, 116122, 11319], 'positive_input_ids': [116122, 20412, 107340, 9370, 100357, 102323, 3837, 109202, 104078, 103975, 100675, 101940, 100912, 105054, 6313], 'negative_input_ids': [100323, 104307, 99245, 9370, 106059, 104060, 3837, 104530, 115604, 99329, 11319]} Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. INFO:__main__:GPU内存使用: 已分配 2.93GB, 保留 4.13GB 可训练参数列表: - base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.18.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.18.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.18.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.18.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.19.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.19.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.19.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.19.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.20.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.20.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.20.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.20.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.21.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.22.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.22.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.22.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.22.self_attn.v_proj.lora_B.default.weight - base_model.model.model.layers.23.self_attn.q_proj.lora_A.default.weight - base_model.model.model.layers.23.self_attn.q_proj.lora_B.default.weight - base_model.model.model.layers.23.self_attn.v_proj.lora_A.default.weight - base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight 0%| | 0/3 [00:00<?, ?it/s]You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.00GB, 保留 4.21GB Could not estimate the number of tokens of the input, floating-point operations will not be computed Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.02GB, 保留 4.22GB 33%|████████████████████████████ | 1/3 [00:03<00:07, 3.66s/it]Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.01GB, 保留 4.25GB Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.02GB, 保留 4.26GB 67%|████████████████████████████████████████████████████████ | 2/3 [00:06<00:03, 3.36s/it]Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.01GB, 保留 4.25GB Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. INFO:__main__:GPU内存使用: 已分配 4.02GB, 保留 4.26GB {'train_runtime': 10.2272, 'train_samples_per_second': 0.587, 'train_steps_per_second': 0.293, 'train_loss': 1.0806043942769368, 'epoch': 3.0} 100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:10<00:00, 3.41s/it] Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 评估过程中发生错误: 'anchor_input_ids' Traceback (most recent call last): File "C:\Users\28996\Desktop\AI\persona_contrastive_finetuning\Contrastive_Training_LM.py", line 437, in <module> eval_results = trainer.evaluate() File "C:\Users\28996\Desktop\AI\persona_contrastive_finetuning\Contrastive_Training_LM.py", line 299, in evaluate return super().evaluate( File "C:\Users\28996\miniconda3\envs\style_tune\lib\site-packages\transformers\trainer.py", line 4076, in evaluate output = eval_loop( File "C:\Users\28996\miniconda3\envs\style_tune\lib\site-packages\transformers\trainer.py", line 4270, in evaluation_loop losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) File "C:\Users\28996\miniconda3\envs\style_tune\lib\site-packages\transformers\trainer.py", line 4486, in prediction_step loss, outputs = self.compute_loss(model, inputs, return_outputs=True) File "C:\Users\28996\Desktop\AI\persona_contrastive_finetuning\Contrastive_Training_LM.py", line 224, in compute_loss anchor_ids = inputs["anchor_input_ids"] KeyError: 'anchor_input_ids'
最新发布
07-21
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值