Study link
https://www.kaggle.com/learn/intermediate-machine-learning
3.Missing Values
1.drop columns
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
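A minimal, self-contained sketch of the drop-columns approach; the toy frames below are stand-ins for the real Kaggle `X_train`/`X_valid`:

```python
import numpy as np
import pandas as pd

# Toy training/validation frames (illustrative data only)
X_train = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})
X_valid = pd.DataFrame({"a": [7.0, np.nan, 9.0], "b": [1.0, 2.0, 3.0]})

# Names of columns with at least one missing value in the training data
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop those columns from BOTH training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
```

Note that the columns to drop are chosen from the training data only, then the same columns are dropped from validation data, so both frames keep identical schemas.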
2.impute (fill in missing values)
e.g. mean imputation, median imputation
# Imputation: fit the imputer on training data, then transform both sets
# (requires: from sklearn.impute import SimpleImputer)
my_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
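A runnable sketch of the median-imputation snippet above (the frame contents are made up); note that `fit_transform` is called on training data only, and that `SimpleImputer` strips column names, which are restored afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frames with missing entries
X_train = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [2.0, np.nan, 4.0]})
X_valid = pd.DataFrame({"a": [np.nan, 5.0], "b": [6.0, np.nan]})

# Fit medians on training data, apply to both sets
my_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
```

The training medians (a: 2.0, b: 3.0) fill the holes in both frames, so no information from the validation set leaks into the fitted imputer.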
4.Categorical Variables
1.ordinal var/nominal var
ordinal variables; one-hot encoding (generally not used when a column has more than 15 different values)
label encoding: converts categories to integers with an order/magnitude
one-hot: adds one new column per distinct category
# Apply one-hot encoder to each column with categorical data
# (requires: from sklearn.preprocessing import OneHotEncoder;
#  in scikit-learn >= 1.2 the argument is sparse_output=False, not sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
One-hot encoding can also be obtained with pandas:
pd.get_dummies(df, columns=cols_to_transform)
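A quick illustration of `pd.get_dummies` (the frame and column names are made up); encoded columns are appended after the untouched ones:

```python
import pandas as pd

# Toy frame with one categorical column
df = pd.DataFrame({"color": ["red", "blue", "red"], "x": [1, 2, 3]})

# One new column per distinct category of "color"
dummies = pd.get_dummies(df, columns=["color"])
print(list(dummies.columns))  # ['x', 'color_blue', 'color_red']
```

Unlike `OneHotEncoder`, `get_dummies` has no fit/transform split, so applying it separately to training and validation data can produce mismatched columns if a category is absent from one of them.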
11.XGBOOST
Basic usage of xgboost, e.g. n_estimators, n_jobs, eval_set
Other
1.tf-idf
term-frequency * inverse-document-frequency
tf is term frequency; idf is inverse document frequency
note:
a word's tf differs from document to document, but its idf is the same for all documents
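The note above can be checked numerically. A minimal pure-Python sketch using the plain (unsmoothed) log formula; library implementations such as scikit-learn's `TfidfVectorizer` apply extra smoothing and normalization:

```python
import math

# Three toy "documents" as token lists
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]

def tf(term, doc):
    # term frequency: count of term in doc / length of doc
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(N / number of docs containing term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

# tf of "apple" differs across documents...
tf0 = tf("apple", docs[0])  # 2/3
tf2 = tf("apple", docs[2])  # 1/3
# ...but its idf is a single corpus-level value: log(3/2)
idf_apple = idf("apple", docs)
```

The tf-idf weight of "apple" in document 0 is then `tf0 * idf_apple`, and only the tf factor changes per document.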
2.Doc2vec
positional relationships between words (word2vec) + which document each word comes from (doc2vec)
https://zhuanlan.zhihu.com/p/36886191
https://github.com/nfmcclure/tensorflow_cookbook/tree/master/07_Natural_Language_Processing/07_Sentiment_Analysis_With_Doc2Vec
Training the document embedding:
1.concatenate it to the end of the word embedding (the document vector and word vectors need not have the same length)
2.append an embedding row per document id to the weight matrix (requires the document vector length == word embedding length; used for smaller datasets)
Two embedding matrices are learned:
# Define Embeddings (TensorFlow 1.x API; in TF 2.x use tf.random.uniform):
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
doc_embeddings = tf.Variable(tf.random_uniform([len(texts), doc_embedding_size], -1.0, 1.0))
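Method 1 above (concatenation) can be sketched framework-free in NumPy; all sizes and ids below are illustrative, chosen so the document vector length (4) differs from the word vector length (8):

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary_size, embedding_size = 50, 8
n_docs, doc_embedding_size = 10, 4  # doc vector length may differ from word vector length

# Two embedding matrices, as in the TensorFlow snippet above
embeddings = rng.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))
doc_embeddings = rng.uniform(-1.0, 1.0, (n_docs, doc_embedding_size))

# Look up a context window of word vectors, flatten them, then
# concatenate the document's own vector onto the end as the model input
window = [3, 17, 42]  # word ids in the context window (illustrative)
doc_id = 5            # id of the document the window came from

word_vecs = embeddings[window].reshape(-1)  # 3 words * 8 dims = 24 values
model_input = np.concatenate([word_vecs, doc_embeddings[doc_id]])  # 24 + 4 = 28
```

During training, gradients flow back into both matrices, so each document id accumulates its own vector alongside the shared word vectors.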