X264 Encoding Suggestions

This page collects best practices for the x264 encoder, covering general tips, commandline suggestions, VBV encoding, Blu-ray encoding, encoder latency and QuickTime compatibility, to help you choose suitable encoding parameters for different scenarios.


General Tips

  • Don't use x264 as a replacement for a denoising filter. Ensure that your input looks like what you want as output (ignoring the loss in quality from compression). Doing anything else causes x264 to run less efficiently.
  • If you aren't seeing 100% CPU usage while encoding, the problem may lie with your input:
    • Raw YUV input can easily be constrained by your I/O system's speed. The size of a raw frame in megabytes is: width * height * bitdepth (12 for YV12) / 8 (bits -> bytes) / 1024 (bytes -> kbytes) / 1024 (kbytes -> mbytes). See the sketch after this list for a worked calculation.
    • Avisynth scripts run in a single thread. Some filters can spawn extra threads for themselves but this doesn't give huge gains.
    • ffmpeg (and therefore ffms2) is only supported as a single-threaded decoder in x264.
  • The Flash h.264 decoder (and others) didn't support weighted P-frame prediction (--weightp) until version 10.1, which still hasn't reached wide deployment (>90% of users) at the time of writing. Set --weightp 0 when encoding for Flash if this is still the case.
  • Apple's QuickTime Player has various decoding issues, especially related to reference frame count. See the QuickTime encoding suggestions if you want to maintain QuickTime compatibility, and you should since they make up a large portion of computer users.
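
To put the raw-input point above in perspective, here is a minimal Python sketch that computes the raw frame size and the sustained read rate needed to keep x264 fed (the 1080p25 figures are purely illustrative):

def raw_frame_mbytes(width, height, bits_per_pixel=12):
    """Size of one raw frame in MB; 12 bits/pixel for 4:2:0 YV12."""
    return width * height * bits_per_pixel / 8 / 1024 / 1024

def required_mbytes_per_sec(width, height, fps, bits_per_pixel=12):
    """Disk throughput needed to feed the encoder in real time."""
    return raw_frame_mbytes(width, height, bits_per_pixel) * fps

# Illustrative numbers: 1080p at 25fps, YV12 input.
print(f"frame size: {raw_frame_mbytes(1920, 1080):.2f} MB")              # ~2.97 MB
print(f"throughput: {required_mbytes_per_sec(1920, 1080, 25):.0f} MB/s")  # ~74 MB/s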

Commandline Suggestions

You can divide x264 settings into three groups:

1. Those you should touch directly

Things like --bitrate, --keyint, --slices and so forth. They depend on the requirements of your decoder and thus are ignored by x264's preset system.

2. Those you should touch via the preset system

Before the preset system existed people wrote giant commandlines specifying every option by hand. There were arguments over whether --subme 9 with --me hex was better or worse than --subme 7 with --me tesa. Crazy! Do your part to bury these memories by using --preset, --tune and --profile.
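
For example, a sane modern commandline can be as short as the following (the tune and rate-control values are illustrative, not recommendations):

x264 --preset slow --tune film --profile high --crf 20 -o output.file input.file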

3. Those you should avoid remembering exist

--qcomp, --scenecut, and so on. Settings that probably should be built-in constants by now. Their use in contemporary commandlines serves no purpose other than to cause pain for anyone reading them when you ask for help. Seriously, people-tweaking-obscure-parameters-because-of-a-hazy-belief-it-helps, knock it off.

VBV Encoding

The VBV (Video Buffering Verifier) system in x264 is used to constrain the output bitrate so it is suitable for streaming in a bandwidth-constrained environment.

The h.264 VBV model is based around the idea of a "VBV buffer" on the decoder side. As the h.264 stream is downloaded by the client it's stored in the VBV buffer. A frame must be fully downloaded into the VBV buffer before it can be decoded.
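
To make the model concrete, here is a toy Python sketch of that decoder-side buffer. It is only an approximation of the real h.264 model, and the frame sizes and rates below are made up:

def vbv_underflows(frame_bits, maxrate_kbit, bufsize_kbit, fps, init=0.9):
    """True if some frame isn't fully in the buffer by its decode time."""
    fill = bufsize_kbit * 1000 * init        # bits buffered before playback
    per_frame = maxrate_kbit * 1000 / fps    # bits arriving per frame slot
    for bits in frame_bits:
        if bits > fill:
            return True                      # buffer underflow: playback stalls
        fill = min(fill - bits + per_frame, bufsize_kbit * 1000)
    return False

# A stream of small frames with one oversized keyframe in the middle.
frames = [40_000] * 24 + [600_000] + [40_000] * 24
print(vbv_underflows(frames, maxrate_kbit=1000, bufsize_kbit=500, fps=25))  # True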

x264 has three options that control the VBV buffer:

  1. The buffer's size is specified in kbit with --vbv-bufsize,
  2. The rate at which the buffer fills is specified in kbit/sec by --vbv-maxrate,
  3. The proportion of the buffer that must be filled before playback can start is specified by --vbv-init.

When using VBV, you always need to set the first two options and should never need to set the last one.

Examples

Example 1

Scenario: Encoding a movie to a file on your hard drive to play at a later date.

Suggestion: Do not specify any VBV settings. Your hard drive has a fast enough transfer rate that there's no need to specify any VBV constraints.

Example 2

Scenario: Encoding a video that's suitable for muxing to a Blu-ray compatible file.

Suggestion: The Blu-ray specification specifies a 40 Mbit/sec maximum data rate (for the video) and a 30 Mbit buffer. So, specify --bluray-compat --vbv-maxrate 40000 --vbv-bufsize 30000.

Note: x264 needs more options to output a fully compliant Blu-ray file, but these options are all you need for VBV compliance.

Example 3

Scenario: Streaming a video via Flash on a website, like YouTube.

Suggestion: You want the video to start very quickly, so there can't be more than (say) 0.5 seconds of buffering. You set a minimum connection speed requirement for viewers of 512kbit/sec. Assume that 90% of that bandwidth will be usable by your site, and that 96kbit/sec will be used by audio, which leaves 364kbit/sec for x264. So, specify --vbv-maxrate 364 --vbv-bufsize 182.
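
The same arithmetic as a short Python sketch (all the inputs are the assumptions from the example above):

conn_kbit = 512      # minimum viewer connection speed, kbit/sec
usable = 0.90        # fraction of that bandwidth assumed usable
audio_kbit = 96      # bandwidth reserved for the audio stream
startup_sec = 0.5    # maximum acceptable buffering before playback

maxrate = int(conn_kbit * usable - audio_kbit)  # 364 kbit/sec for video
bufsize = int(maxrate * startup_sec)            # 182 kbit buffer
print(f"--vbv-maxrate {maxrate} --vbv-bufsize {bufsize}")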

Note: There are more factors than just the video VBV settings when looking at the start delay for online streaming. The above example is for a perfect world where these factors don't exist.

Blu-ray Encoding

Someone else (kierank) has done all the hard work of documenting how to create Blu-ray compliant files with x264. The short answer for 1080p encoding is as follows:

x264 --bluray-compat --bitrate X --preset veryslow --weightp 0 --bframes 3 --nal-hrd vbr --vbv-maxrate 40000 \
    --vbv-bufsize 30000 --level 4.1 --keyint X --b-pyramid strict --slices 4 --fake-interlaced \
    --aud --colorprim "bt709" --transfer "bt709" --colormatrix "bt709" --sar 1:1 \
    --pass 1 -o output.file input.file
x264 --bluray-compat --bitrate X --preset veryslow --weightp 0 --bframes 3 --nal-hrd vbr --vbv-maxrate 40000 \
    --vbv-bufsize 30000 --level 4.1 --keyint X --b-pyramid strict --slices 4 --fake-interlaced \
    --aud --colorprim "bt709" --transfer "bt709" --colormatrix "bt709" --sar 1:1 \
    --pass 2 -o output.file input.file

Set --keyint to your framerate (one keyframe per second). Use a faster preset if desired. Add a --tune parameter if desired. --weightp is disabled because some hardware decoders have problems with it, not because it's mandated by the Blu-ray spec.

Encoder Latency

Depending on the encoder settings you configure, x264 will keep an internal buffer of frames to improve quality. When you're just encoding a file offline this is irrelevant, but if you're working in a broadcast or streaming environment this delay can be unacceptable.

You can calculate the number of frames x264 buffers (the encoder latency) using the following pseudocode. If you use the libx264 API, see also x264_encoder_delayed_frames.

delay = 0

if b-adapt=2 and not (nopass or pass2):
    delay = MAX(bframes,3)*4
else:
    delay = bframes

if mb-tree or vbv:
    delay = MAX(delay,rc_lookahead)

if not sliced-threads:
    delay = delay + threads - 1

delay = delay + sync-lookahead

if vbv and (vfr-input or abr or pass1):
    delay = delay + 1

Reducing x264's latency is possible, but reduces quality. If you want no latency at all, set --tune zerolatency. If you can tolerate even a little latency (i.e. under 1 second), it is well worth tuning the options to allow it. Here is a series of steps you can follow to incrementally reduce latency; stop when your latency is low enough (a sample commandline follows the list):

  1. Start with defaults
  2. Kill sync-lookahead
  3. Drop rc-lookahead to no less than ~10
  4. Drop threads to a lower value (e.g. 6 instead of 12)
  5. Use sliced threads
  6. Disable rc-lookahead
  7. Disable b-frames
  8. Now you're at --tune zerolatency
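
As an illustration only, an encode that has applied steps 2 through 5 might look like the following (the bitrate and VBV numbers are placeholders, not recommendations):

x264 --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 500 --rc-lookahead 10 \
    --sync-lookahead 0 --sliced-threads --threads 6 -o output.file input.file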

QuickTime-compatible Encoding

There is a lot of very outdated misinformation surrounding how to encode h.264 videos that play perfectly in QuickTime. For instance, that "QT supports only 1 B-Frame" and "QT doesn't support pyramidal B-Frames", both of which are false as of QuickTime Player 7.7. There also used to be a qpmin/DCT bug in early versions of QuickTime Player where the player couldn't deal with a low quantizer (below 4) combined with 8x8 DCT transforms; however, that has been fixed as of QuickTime Player X.

QuickTime Player X compatibility

The Apple h.264 decoder component released with the launch of QuickTime Player X is a pretty good decoder and has only one big remaining bug. But first, you might wonder about all those people stuck on older versions of QuickTime Player, refusing to upgrade to QTX. That is no problem at all: the actual decoder used systemwide by OS X (including by QuickTime) is a separate component from the player, and will have been updated through Software Update even if the user never downloaded the player update itself. So, any time we refer to QuickTime Player X, assume that anyone on 10.6 Snow Leopard or higher has that bug-fixed Apple h.264 decoder component, even if they don't specifically have QTX.

There is now just a single remaining major Apple h.264 decoder flaw:

  • A high number of reference frames (15 or more, to be precise) combined with a high number of B-Frames produces blocky, distorted video output. This is an issue you will encounter with the "veryslow" preset, since it uses up to 16 reference frames.

Solution:

  • Limit the number of reference frames to between 1 and 14. Look up the acceptable number of reference frames for your desired h.264 level and video resolution, and never exceed that value or 14, whichever is lower. However, there is rarely any point in exceeding 5 reference frames anyway, and 4 frames is the maximum allowed for 1080p content in the most common levels, so a good choice is to simply pick 4 reference frames for all your encodes regardless of video resolution: --ref 4 (for Level 5+ encodes, you may want to refer to the level specifications and pick a higher number, as long as you don't exceed 14).

QuickTime Player 7.7 compatibility

Any operating system that can install QuickTime Player X will have the updated QTX decoder. That sadly means only OS X 10.6 Snow Leopard and higher. So, if you want 10.5 Leopard users to be able to play your videos, you'll have to read this section as well, since they are stuck on QuickTime Player 7.7. Note that the number of active Leopard users is in the 10% range and falling, so you may decide not to bother. Then again, it's easy enough to add compatibility with their outdated Apple h.264 decoder at a very tiny penalty to quality, so you will probably find it worth it.

The decoder flaw:

  • A low quantizer (below 4) combined with 8x8 DCT Transforms produces garbled decode results in QuickTime Player 7.7 or earlier.

Solution:

  • Prevent the encoder from using quantizers below 4 with --qpmin 4. You will take an extremely minor quality loss, but far less than you would have if you had disabled 8x8 DCT instead.

Conclusion

Forget everything you may have read elsewhere on the internet. It is extremely outdated and hasn't been true for many years.

The simplest advice for QuickTime compatibility: Combine both solutions to get the most compatible encodes, fully playable by anyone on 10.5 Leopard or above, giving you 99% of the Mac userbase.

Just encode all your videos with --ref 4 --qpmin 4 and be safe in the knowledge that it will work for all Mac users that matter!
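
Putting both fixes together, a QuickTime-safe encode might look like this sketch (the preset and CRF are illustrative; only --ref 4 and --qpmin 4 matter for compatibility):

x264 --preset slow --crf 20 --ref 4 --qpmin 4 -o output.file input.file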

Original article: http://mewiki.project357.com/wiki/X264_Encoding_Suggestions
