Keras Deep Training 7: Constant val_acc

This post looks at a problem encountered when training models with Keras: acc and val_acc stay constant over 300 epochs. It analyzes the likely causes and offers tuning suggestions. It also discusses how the choice of batch_size affects training and how to resolve a common TensorBoard error.

Keras: acc and val_acc are constant over 300 epochs, is this normal?

https://stats.stackexchange.com/questions/259418/keras-acc-and-val-acc-are-constant-over-300-epochs-is-this-normal

It seems that your model is not able to make sensible adjustments to your weights. The log loss is decreasing a tiny bit, but then gets stuck. It is just randomly guessing.

I think the root of the problem is that you have sparse positive inputs, positive initial weights and a ReLU activation. I suspect that this combination does not lead to nonzero weight adjustments (however, I do not have any literature background on this).

There are a few things that you could try:

Change the initialization to normal.
Use sigmoid layers everywhere.
Normalize your input, e.g. use StandardScaler from scikit-learn.
Increase the initial learning rate and/or choose a different optimizer.
For debugging purposes, decrease the size of the hidden layer or even remove it.
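
Putting those suggestions together, a minimal sketch could look like the following. The data, layer sizes, and learning rate here are placeholders (not from the original question); it only illustrates normal initialization, sigmoid activations, input normalization, and a larger initial learning rate:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical data; substitute your own features and binary labels.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

# Normalize the input, e.g. with StandardScaler from scikit-learn.
x_train = StandardScaler().fit_transform(x_train)

model = keras.Sequential([
    keras.Input(shape=(x_train.shape[1],)),
    # Normal (Gaussian) initialization and sigmoid activations instead of ReLU.
    layers.Dense(16, activation="sigmoid", kernel_initializer="random_normal"),
    layers.Dense(1, activation="sigmoid"),
])

# Raise the initial learning rate and/or switch the optimizer.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-2),
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```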

Oscillating loss curve:

Possible cause 1: the training batch_size is too small

  1. When the dataset is large enough, it is reasonable to reduce batch_size, since a dataset that is too large will not fit in memory. But shrinking it blindly can prevent convergence; with batch_size = 1 you are doing online learning.

  2. The choice of batch first of all determines the descent direction. If the dataset is fairly small, you can simply use the full dataset (full batch). This has two advantages:

    1) The full-dataset direction better represents the sample population and locates the extremum more reliably.

    2) Because gradient magnitudes differ hugely across weights, choosing a single global learning rate is difficult; with the full dataset, per-weight update rules (e.g. Rprop, which uses only the sign of each gradient) become practical.

  3. Increasing the batchsize has three benefits:

    1) Memory utilization improves, and large matrix multiplications parallelize more efficiently.

    2) Fewer iterations are needed to complete one epoch (a full pass over the dataset), so the same amount of data is processed faster.

    3) Within a certain range, the larger the batchsize, the more accurate the descent direction it determines, and the smaller the training oscillation.

  4. Drawbacks of increasing it blindly:

    1) When the dataset is too large, it no longer fits in memory.

    2) Once the batchsize grows past a certain point, the descent direction it determines essentially stops changing.

    Summary:

    1) When the batch is too small and there are many classes, the loss may oscillate and fail to converge, especially when the network is complex.
    
    2) As the batchsize increases, the same amount of data is processed faster.
    
    3) As the batchsize increases, more and more epochs are needed to reach the same accuracy.
    
    4) Because these two effects pull against each other, there is some Batch_Size at which the wall-clock training time is optimal.
    
    5) Too large a batchsize makes the network prone to converging to poor local optima; too small a batch also has problems, such as very slow training and difficulty converging.
    
    6) The specific batch size to use also depends on the number of samples in the training set; see the comparison sketch below.
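
To see the trade-off concretely, one simple experiment is to train the same model with several batch sizes and compare wall-clock time and final validation loss. A minimal sketch, assuming a hypothetical `build_model()` helper that returns a freshly compiled Keras model and already-prepared `x_train`/`y_train`:

```python
import time

# build_model() and (x_train, y_train) are placeholders, not defined here.
for batch_size in [16, 64, 256, 1024]:
    model = build_model()
    start = time.time()
    history = model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=batch_size,
                        validation_split=0.2,
                        verbose=0)
    print(f"batch_size={batch_size:5d}  "
          f"time={time.time() - start:6.1f}s  "
          f"final val_loss={history.history['val_loss'][-1]:.4f}")
```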
    

Possible cause 2: incorrect data input

1. Incorrect data input includes data whose format does not match what the network expects, so that during training the network learns from data you did not intend; this can show up as an oscillating loss curve.

Fix: check the input data format and the input data paths (a quick sanity check is sketched below).
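
A quick way to do that check is to print the shapes, dtypes, and value ranges of what actually reaches the model before training. A minimal sketch (the variable names are placeholders):

```python
import numpy as np

# x_train, y_train and model stand in for whatever is actually passed to fit().
print("x_train:", x_train.shape, x_train.dtype, x_train.min(), x_train.max())
print("y_train:", y_train.shape, y_train.dtype, np.unique(y_train)[:10])

# The model's expected input shape should match the data (batch dim excluded).
print("model expects:", model.input_shape)
assert model.input_shape[1:] == x_train.shape[1:], "input shape mismatch"
```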

Possible cause 3: paths in the training script are misconfigured

1. When the path to train.bin or the path to the model parameters in the script is configured incorrectly, training produces a wrong model.

Fix: check that the script configuration is correct.

TensorBoard error: TensorBoard attempted to bind to port 6006, but it was already in use
https://blog.youkuaiyun.com/weixin_35654926/article/details/75577515

Understanding TensorBoard
https://blog.youkuaiyun.com/u010099080/article/details/77426577
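
A minimal sketch of wiring TensorBoard into Keras training via the built-in callback (the log directory name is just an example, and `model`/`x_train`/`y_train` are assumed to be defined as in the sketches above), plus a note on the port conflict mentioned above:

```python
from tensorflow import keras

# Write event files under ./logs so TensorBoard can read them.
tensorboard_cb = keras.callbacks.TensorBoard(log_dir="./logs", histogram_freq=1)

model.fit(x_train, y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=[tensorboard_cb])

# If port 6006 is already in use, either stop the old TensorBoard process or
# start a new one on a different port, e.g.:
#   tensorboard --logdir ./logs --port 6007
```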
