Data Mining in Practice: Feature Discovery and Feature Extraction Techniques Explained
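I. Core Concepts
1.1 Feature Discovery
Feature discovery is a core part of exploratory data analysis. Potentially useful features are identified through:
- Statistical distribution analysis: skewness, kurtosis, standard deviation
- Data visualization: box plots, heatmaps, scatter-plot matrices
- Domain knowledge: BMI in medical data, purchase frequency in e-commerce
- Association rule mining: using the Apriori algorithm to find feature combinations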
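As a rough illustration of the statistical distribution analysis mentioned in the list, the sketch below computes standard deviation, skewness, and kurtosis with pandas and flags strongly skewed columns. The data is synthetic, and the column names and the skewness threshold of 1.0 are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns): one right-skewed, one roughly symmetric
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=1000),
    "score": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# Per-column distribution summary
summary = pd.DataFrame({
    "std": demo.std(),
    "skew": demo.skew(),
    "kurtosis": demo.kurt(),  # excess kurtosis: close to 0 for a normal distribution
})
print(summary)

# Columns whose absolute skewness exceeds the (assumed) threshold are
# candidates for a log-style transform
skewed_cols = summary.index[summary["skew"].abs() > 1.0].tolist()
print("Strongly skewed columns:", skewed_cols)
```

Columns flagged this way are natural candidates for the nonlinear transformations covered under feature extraction.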
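1.2 Feature Extraction
The key step of converting raw data into a format that machine learning models can consume:
- Numeric processing: standardization, normalization, nonlinear transformations
- Categorical encoding: One-Hot, Target Encoding, Embedding
- Dimensionality reduction: PCA, t-SNE, Autoencoder
- Time-series features: sliding-window statistics, Fourier transform
- Text vectorization: TF-IDF, Word2Vec, BERT
- Image features: HOG, SIFT, CNN feature maps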
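Several of the steps listed above can be chained in a single scikit-learn pipeline. The following is a minimal sketch, assuming a small synthetic table with hypothetical columns `age`, `income`, and `channel`; it standardizes the numeric columns, one-hot encodes the categorical column, and compresses the result with PCA.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic toy table (column names are illustrative assumptions)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "income": rng.lognormal(10, 0.5, size=200),
    "channel": rng.choice(["web", "app", "store"], size=200),
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),                   # standardization
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),   # one-hot encoding
    ],
    sparse_threshold=0.0,  # always return a dense array so PCA can consume it
)

# Preprocessing followed by dimensionality reduction to 3 components
pipeline = Pipeline([("prep", preprocess), ("pca", PCA(n_components=3))])
X_reduced = pipeline.fit_transform(toy)
print(X_reduced.shape)
```

Wrapping the steps in a pipeline means the scaler, encoder, and PCA are all fitted on training data only and then reused unchanged on new data.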
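II. Structured Data in Practice
2.1 Data Preparation
Using the open-source credit card fraud detection dataset:
import pandas as pd
from sklearn.datasets import fetch_openml
# Download the dataset from OpenML (requires network access on the first call)
data = fetch_openml('creditcard', version=1, as_frame=True)
df = data.data.copy()  # already a DataFrame because as_frame=True
# The target is returned as categorical strings; cast to integers (1 = fraud)
df['Class'] = data.target.astype(int)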
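2.2 Feature Discovery Workflow
- Outlier detection:
# Flag amounts above the upper Tukey fence, Q3 + 1.5 * IQR
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)
IQR = Q3 - Q1
df['Amount_Outlier'] = (df['Amount'] > Q3 + 1.5 * IQR).astype(int)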
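- Interaction feature generation:
# Multiplicative interaction between two anonymized features
df['V1_V2_Interaction'] = df['V1'] * df['V2']
# The small constant avoids division by zero for zero-amount transactions
df['Time_Amount_Ratio'] = df['Time'] / (df['Amount'] + 1e-5)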
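- Time-series features:
from tsfresh import extract_features
# column_id must identify a series, not a timestamp; the hourly buckets used here are an assumption
df['Hour'] = (df['Time'] // 3600).astype(int)
time_features = extract_features(df[['Hour', 'Time', 'Amount']], column_id='Hour', column_sort='Time')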
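For the sliding-window statistics listed among the time-series features, a lighter-weight alternative to tsfresh is pandas `rolling`. The sketch below is only an illustration on synthetic data: the frame `tx`, the window of 10 transactions, and the derived column names are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic transactions ordered by time (values are made up for illustration)
rng = np.random.default_rng(1)
tx = pd.DataFrame({
    "Time": np.sort(rng.uniform(0, 3600, size=500)),  # seconds, ascending
    "Amount": rng.exponential(scale=50.0, size=500),
})

# Rolling window over the current and previous 9 transactions
window = tx["Amount"].rolling(window=10, min_periods=1)
tx["Amount_roll_mean"] = window.mean()
tx["Amount_roll_std"] = window.std()   # NaN on the first row (a single observation)
tx["Amount_roll_max"] = window.max()

# How unusual is the current amount relative to its recent history?
tx["Amount_vs_roll_mean"] = tx["Amount"] / (tx["Amount_roll_mean"] + 1e-5)
print(tx.head())
```

Note that each window includes the current row. When the feature must only see past transactions, shifting the rolled series by one row (`.shift(1)`) is a common adjustment.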
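2.3 Feature Extraction Techniques
- Nonlinear transformations:
import numpy as np
df['Amount_log'] = np.log1p(df['Amount'])  # log(1 + x) compresses the long right tail
df['V3_square'] = df['V3'] ** 2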
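- Binning:
from sklearn.preprocessing import KBinsDiscretizer
# Up to five equal-frequency bins, encoded as ordinal integer codes
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
# fit_transform returns a 2-D array; flatten it before assigning to a column
df['Amount_bin'] = est.fit_transform(df[['Amount']]).ravel().astype(int)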
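- Automated feature generation:
from featuretools import dfs
# `es` is assumed to be a featuretools EntitySet, built beforehand, that contains a 'transactions' dataframe
# featuretools 1.0+ names this argument target_dataframe_name (formerly target_entity)
feature_matrix, feature_defs = dfs(entityset=es,
                                   target_dataframe_name='transactions',
                                   agg_primitives=['sum', 'mean', 'count'],
                                   trans_primitives=['hour', 'is_weekend'])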
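III. Unstructured Data Processing
3.1 Text Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
# TF-IDF over unigrams and bigrams; `texts` is a list of raw document strings
tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
X_tfidf = tfidf.fit_transform(texts)
# Word vectors; `sentences` is a list of tokenized documents (lists of tokens)
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=3)
word_vectors = w2v_model.wv
# BERT features; `text` is a string or a list of strings
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = bert_model(**inputs)
# Mean-pool token embeddings into one vector per input (padding tokens are included in the mean)
embeddings = outputs.last_hidden_state.mean(dim=1)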
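3.2 Image Feature Extraction
import numpy as np
from keras.applications import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.preprocessing import image
# Convolutional base only (no classification head), pretrained on ImageNet
model = ResNet50(weights='imagenet', include_top=False)
img = image.load_img('image.jpg', target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)  # add the batch dimension
x = preprocess_input(x)        # apply the normalization ResNet50 was trained with
features = model.predict(x)
# Visualize the first 16 feature maps
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
for i in range(16):
    plt.subplot(4,4,i+1)
    plt.imshow(features[0,:,:,i], cmap='viridis')
    plt.axis('off')
plt.show()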
IV. Feature Evaluation and Selection
4.1 Feature Importance Analysis
from sklearn.ensemble import RandomForestClassif
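The source listing is cut off at this point. As a rough illustration of what a tree-based importance analysis can look like, and not a reconstruction of the original code, the sketch below fits a random forest on synthetic data and ranks features by impurity-based importance. The dataset, the feature names `f0` to `f7`, and all hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances sum to 1 across features
importances = (
    pd.Series(clf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features, so permutation importance (`sklearn.inspection.permutation_importance`) is often used as a cross-check.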
