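Overview
In machine learning work, visualization helps us monitor the state of model training and evaluate model performance. It also helps us understand the data and the algorithm, and improve the model. Good visualizations are a big plus in a paper, too. Below I summarize the visualizations I use most often when doing machine learning. This post will be updated continually. My knowledge is still limited, so additions and corrections are welcome!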
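Dimensionality Reduction Visualization
- Use PCA, LDA, or SVD matrix factorization to project high-dimensional data down to 2 dimensions, which makes it easy to plot the data distribution and get a feel for the data's characteristics.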
For those of us who focus on the algorithms rather than on implementing visualizations, visualization is only an auxiliary analysis tool, and knowing how to use it is enough. I find the scikit-learn website a very good learning resource, with many visualization templates. The code below comes from an example on the scikit-learn website: dimensionality-reduction visualization of the Iris dataset.

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained for each components
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')

plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')

plt.show()

Output:
[Figure: PCA of IRIS dataset]
[Figure: LDA of IRIS dataset]
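The bullet above lists SVD alongside PCA, and the two are closely related. As a minimal sketch (the variable names here are my own, not from the scikit-learn example), the 2-D PCA projection of the Iris data can be reproduced by centering the data and taking a plain SVD with NumPy. The two results should agree up to the sign of each component:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# PCA projection as computed by scikit-learn
X_pca = PCA(n_components=2).fit_transform(X)

# The same projection by hand: center the data, then take the SVD
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_svd = U[:, :2] * S[:2]  # equivalent to X_centered @ Vt[:2].T

# Singular vectors are only defined up to sign, so compare absolute values
same = np.allclose(np.abs(X_pca), np.abs(X_svd), atol=1e-6)
print("PCA matches centered SVD (up to sign):", same)
```

This is why PCA and SVD can be used almost interchangeably for this kind of plot. The practical difference is that PCA centers the data first, while a plain SVD does not.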


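Recently I was working on text classification. To compare visually how well two text representations perform, the bag-of-words model and the TF-IDF model, I borrowed the dimensionality-reduction visualization above. The code is as follows: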
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
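def cv(data):
    # Bag of words: learn the vocabulary from `data` and return the count matrix
    count_vectorizer = CountVectorizer()
    emb = count_vectorizer.fit_transform(data)
    return emb, count_vectorizer

def tfidf(data):
    # TF-IDF: learn vocabulary and IDF weights from `data`, return the weighted matrix
    tfidf_vectorizer = TfidfVectorizer()
    train = tfidf_vectorizer.fit_transform(data)
    return train, tfidf_vectorizer
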
X = df_news["content"]
y = df_news["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train_counts, count_vectorizer = cv(X_train)
X_test_counts = count_vectorizer.transform(X_test)
X_train_tfidf, tfidf_vectorizer = tfidf(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
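
The listing above stops at these imports. As a rough sketch of how the comparison might continue (this is not the author's code), each sparse matrix can be reduced to 2 dimensions with `TruncatedSVD` and the two results plotted side by side. The toy corpus and labels stand in for `df_news`, and the helper names `reduce_to_2d` and `plot_2d` and the output filename are assumptions made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the figure with plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for df_news["content"] and df_news["label"]
docs = [
    "the team won the football match",
    "the striker scored two goals in the match",
    "the coach praised the team after the game",
    "the stock market fell as investors sold shares",
    "the bank raised interest rates again",
    "investors bought shares after the market rallied",
]
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = sports, 1 = finance

def reduce_to_2d(matrix, random_state=0):
    """Project a sparse document-term matrix onto its top 2 singular directions."""
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    return svd.fit_transform(matrix)

def plot_2d(ax, points, labels, title):
    """Scatter the 2-D points on `ax`, one color per class."""
    for cls, color in zip(np.unique(labels), ["navy", "darkorange"]):
        mask = labels == cls
        ax.scatter(points[mask, 0], points[mask, 1],
                   color=color, alpha=0.8, label=str(cls))
    ax.set_title(title)
    ax.legend(loc="best")

counts_2d = reduce_to_2d(CountVectorizer().fit_transform(docs))
tfidf_2d = reduce_to_2d(TfidfVectorizer().fit_transform(docs))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_2d(axes[0], counts_2d, labels, "Bag of words (TruncatedSVD)")
plot_2d(axes[1], tfidf_2d, labels, "TF-IDF (TruncatedSVD)")
fig.savefig("bow_vs_tfidf.png")
```

`TruncatedSVD` is used here rather than `PCA` because it works directly on the sparse matrices that `CountVectorizer` and `TfidfVectorizer` return. With real data, the classes that separate more cleanly in the 2-D plot suggest which representation may be easier for a classifier to work with, though a 2-D projection is only a rough indication.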
