How do I rename a data file

I saw a post on a forum saying that a rename data file command had failed. My first thought was that the syntax was wrong, since this command is rarely used, but after googling it I found the syntax was fine; the problem was more likely the order of operations, because at the time of the rename the datafile must be offline, or the database must be in the mount state.

Thinking it over, I had already written up a rename article before, though that blog post was titled "moving data files". The only difference between that one and today's is that here the rename also changes the path. Also, this one is in English, so treat it as English practice.

How to move Oracle data files:

http://blog.youkuaiyun.com/tianlesoftware/archive/2009/11/30/4899172.aspx

The following content is quoted from the web:

Datafiles can be moved or renamed using one of two methods: alter database or alter tablespace.

The main difference between them is that alter tablespace applies only to datafiles that do not belong to the SYSTEM tablespace or to tablespaces containing active rollback segments or temporary segments, but the procedure can be performed while the instance is still running. The alter database method works for any datafile, but the instance must be shut down.
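
Before picking a method, it helps to see which datafiles belong to which tablespaces. A minimal check, assuming a connection with DBA privileges (DBA_DATA_FILES is the standard dictionary view; the prompt style here just mirrors the examples below):

      SVRMGR> -- list each datafile with its tablespace and status
      SVRMGR> select tablespace_name, file_name, status from dba_data_files;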

1. The alter database method:

1. Shut down the instance.

2. Rename and/or move the datafile using operating system commands.

3. Mount the database and use alter database to rename the file within the database. A fully qualified filename, in the syntax of your operating system, is required.

4. Start the instance.

For example, to rename a file called 'data01.dbf' to 'data04.dbf' and move it to a new location at the same time (at this point in the example the instance has already been shut down):

      SVRMGR> connect sys/oracle as sysdba

      SVRMGR> startup mount U1;

      SVRMGR> alter database rename file '/u01/oracle/U1/data01.dbf' TO '/u02/oracle/U1/data04.dbf';

      SVRMGR> alter database open;

Notice the single quotes around the fully qualified filenames, and remember that the file must already have been moved at the operating system level and exist at the destination path. The instance is now open using the new location and name for the datafile.
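
As a quick sanity check (not part of the original procedure), V$DATAFILE reflects the controlfile's record of each datafile path, so the renamed file should now show the /u02 location:

      SVRMGR> select name, status from v$datafile;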

2. The alter tablespace method:

This method has the advantage that it doesn't require shutting down the instance, but it only works with non-SYSTEM tablespaces. Further, it can't be used for tablespaces that contain active rollback segments or temporary segments.

1. Take the tablespace offline.

2. Rename and/or move the datafile using operating system commands.

3. Use the alter tablespace command to rename the file in the database.

4. Bring the tablespace back online.

    SVRMGR> connect sys/oracle as sysdba

    SVRMGR> alter tablespace app_data offline;

    SVRMGR> alter tablespace app_data rename datafile '/u01/oracle/U1/data01.dbf' TO '/u02/oracle/U1/data04.dbf';

    SVRMGR> alter tablespace app_data online;

The tablespace will be back online using the new name and/or location of the datafile.
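
Here too, a quick sanity check (again assuming DBA privileges): DBA_DATA_FILES should now report the new path for the tablespace's file; note that the tablespace name is stored in uppercase:

      SVRMGR> select file_name, status from dba_data_files where tablespace_name = 'APP_DATA';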

Both of these methods can also be performed from Oracle Enterprise Manager.



------------------------------------------------------------------------------
QQ: 492913789
Email: ahdba@qq.com
Blog: http://www.cndba.cn/dave
Online resources: http://tianlesoftware.download.youkuaiyun.com
Related videos: http://blog.youkuaiyun.com/tianlesoftware/archive/2009/11/27/4886500.aspx
DBA1 group: 62697716 (full); DBA2 group: 62697977 (full)
DBA3 group: 63306533;      Chat group: 40132017
