Beginner Spam Detection

License: CC 4.0 BY-SA

Original article: https://medium.com/@rohan.jayesh18/beginner-spam-detection-63ad2b8a91ca


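Summary: a spam detection project implemented with machine learning in Python. The author built it after completing the "Applied Text Mining in Python" course on Coursera. The post explains each step of the code and provides a GitHub link to the full code.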

This is a small Python project that uses machine learning to detect whether a given text is spam or ham (not spam). I built it after completing the "Applied Text Mining in Python" course by the University of Michigan on Coursera; the link to the course is given at the end of the blog. Here, I'll try to explain each step in my code, and if you want the whole code, the GitHub link is also available at the end.

I used my local machine for this project; its specification is listed below. It's a very small project, though, so you won't need much computing power for it.

Specification

Name: Acer Predator Helios 300 (2019)

Graphics Card: NVIDIA GeForce GTX 1660 Ti

Processor Name: Intel Core i7-9750H

RAM: 16 GB

Importing the Libraries

I used Pandas and NumPy for data manipulation, matplotlib for plotting, and sklearn for preprocessing, model creation and model evaluation. I'll explain what each library does as I use it in the code below.

P.S. %matplotlib notebook is a Jupyter Notebook magic command.

Analysing the Data

Calling df.head(), where df is the Pandas DataFrame into which I loaded the data from the CSV, shows that there are only two columns: text, containing the text to classify, and target, the label that tells whether the text is spam or not.
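As a minimal sketch of this step: the CSV itself is not included in the post, so the snippet below builds a tiny stand-in DataFrame with the same two columns. The example rows and the encoding of target (1 for spam, 0 for ham) are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; the real file and its rows are not shown
# in the post. In the project this would be loaded with pd.read_csv(...).
df = pd.DataFrame({
    "text": [
        "Are we still meeting for lunch today?",
        "WINNER!! Claim your free prize now, call 09061701461",
        "I'll call you later tonight",
        "URGENT! Your mobile number has won 2000 pounds",
    ],
    # Assumed encoding: 1 = spam, 0 = ham
    "target": [0, 1, 0, 1],
})

# head() shows the first rows, which is enough to see the two columns
print(df.head())
print(df.shape)
```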

Next, I plotted how many data points in the dataset are spam and how many are ham. The plot shows that the dataset is imbalanced, since the number of spam texts is considerably smaller than the number of ham texts.
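One way to produce such a plot is sketched below. The toy labels, the 1 = spam encoding and the choice of a bar chart are assumptions; the original figure is not reproduced here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical labels: 5 ham (0) and 1 spam (1), just to mimic an imbalance
df = pd.DataFrame({"target": [0, 0, 0, 0, 0, 1]})

# Count how many rows fall into each class, ordered as ham (0), spam (1)
counts = df["target"].value_counts().sort_index()

# Draw one bar per class
fig, ax = plt.subplots()
ax.bar(["ham", "spam"], counts.values)
ax.set_ylabel("Number of texts")
ax.set_title("Class balance")
```

In a notebook the figure is displayed automatically; in a plain script you would call plt.show() or fig.savefig(...) afterwards.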

I have not done any preprocessing of the data, such as dealing with the class imbalance or removing stop words (words like 'the', 'a', etc.) and symbols (like '.', ';', etc.). You can do those things and may get better results.

Feature Engineering

I split the data in an 80%/20% ratio into training and test sets, respectively. Then I initialized a count vectorizer. You can tweak its parameters to get even better results, as long as you take care not to overfit your model.
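The split and the vectorizer could look roughly like the sketch below. The toy data, random_state=0 and the ngram_range value are assumptions for illustration; the parameters actually used in the project are not shown in this copy of the post.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same two columns as the real dataset
df = pd.DataFrame({
    "text": [
        "see you at lunch", "call me later", "free prize call now",
        "are you coming home", "win cash now", "good night",
        "urgent claim your prize", "thanks for the help",
        "meeting moved to friday", "free entry text now",
    ],
    "target": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})

# 80% training, 20% test; random_state is an assumption to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=0
)

# Assumed parameters: single words and word pairs as features
vect = CountVectorizer(ngram_range=(1, 2))

# Usual practice: learn the vocabulary from the training texts only,
# then reuse it to transform the test texts
X_train_vec = vect.fit_transform(X_train)
X_test_vec = vect.transform(X_test)

print(X_train_vec.shape, X_test_vec.shape)
```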

After initialising the vectorizer, I fitted and transformed the training and test data with it and added a few more features: the total length of the text, the number of numeric characters in the text, and the number of words in the text.

The add_feature() function is something I copied from the course. It's a convenience function that does what its name says: it adds features to the existing (vectorized) data.
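A helper along these lines can be written with scipy.sparse. This is a sketch of the idea, not necessarily the exact code from the course or the project, and the two example texts are made up.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def add_feature(X, feature_to_add):
    """Return sparse matrix X with the given feature column(s) appended on the right."""
    # Each feature becomes one row of the new matrix, so transpose to get columns
    return hstack([X, csr_matrix(feature_to_add).T], format="csr")


# Hypothetical example texts
texts = pd.Series(["Call 0800 now", "See you soon"])
X = CountVectorizer().fit_transform(texts)

# The three extra features described in the post
length = texts.str.len().to_numpy()              # total length of the text
digits = texts.str.count(r"\d").to_numpy()       # number of numeric characters
words = texts.str.split().str.len().to_numpy()   # number of words

X_plus = add_feature(X, [length, digits, words])
print(X.shape, X_plus.shape)
```

The same three features have to be appended to both the training and the test matrix, so that both end up with the same number of columns.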

Model Creation

Before creating my own model, I used a dummy classifier to get a baseline performance. I fitted it on the training data and made predictions on the test data.
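A baseline of this kind can be sketched with sklearn's DummyClassifier. The strategy "most_frequent" and the toy arrays are assumptions; the post does not say which strategy was used.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: the dummy classifier ignores the features entirely,
# so placeholder columns of zeros are enough here
X_train = np.zeros((8, 1))
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_test = np.zeros((4, 1))
y_test = np.array([0, 0, 0, 1])

# Assumed strategy: always predict the most frequent training class (ham)
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

print(dummy_pred)
print(dummy.score(X_test, y_test))  # accuracy of the baseline
```

On an imbalanced dataset such a baseline can reach a high accuracy while never flagging a single spam message, which is why it is a useful reference point.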

Next, I chose Logistic Regression as my model with the hyperparameters C=100 and max_iter=1000, fitted it on the training data and made predictions on the test data. You can try SVC or any other classifier and see how the results change.
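With the hyperparameters mentioned above, the model step might look like the sketch below. The toy texts and the default CountVectorizer are assumptions, and the extra hand-made features are left out to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training and test texts (1 = spam, 0 = ham)
train_texts = [
    "free prize call now", "win cash now", "urgent claim your prize",
    "see you at lunch", "call me later tonight", "thanks for the help",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free cash", "see you later"]

# Vectorize: fit on the training texts, reuse the vocabulary for the test texts
vect = CountVectorizer()
X_train_vec = vect.fit_transform(train_texts)
X_test_vec = vect.transform(test_texts)

# Hyperparameters as stated in the post
model = LogisticRegression(C=100, max_iter=1000)
model.fit(X_train_vec, train_labels)

pred = model.predict(X_test_vec)
print(pred)
```

A large C means weak regularisation, so the model fits the training data closely; that is worth keeping in mind when checking for overfitting.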

Model Evaluation

I used a confusion matrix to evaluate both models, Dummy and Logistic Regression. We can see that Logistic Regression did a lot better than our Dummy model. It's still not perfect, though: in my view, detecting spam is a precision-oriented problem, so there's still room for improvement.
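The evaluation can be sketched as follows. The label and prediction lists are invented to show how the matrices are read; they are not the results of the project.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
dummy_pred = [0, 0, 0, 0, 0, 0, 0, 0]   # a baseline that always says "ham"
model_pred = [0, 0, 0, 1, 1, 1, 0, 0]   # a model that makes two mistakes

# Rows are true classes, columns are predicted classes:
# [[true ham -> ham, true ham -> spam],
#  [true spam -> ham, true spam -> spam]]
dummy_cm = confusion_matrix(y_test, dummy_pred)
model_cm = confusion_matrix(y_test, model_pred)
print(dummy_cm)
print(model_cm)

# Precision: of all texts flagged as spam, how many really are spam
model_precision = precision_score(y_test, model_pred)
print(model_precision)
```

Precision matters here because a false positive means a legitimate message ends up marked as spam.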

Result

That's it for the project, and yes, it is a small one with nothing too complicated. Right now it's more of a beginner-level project, but it has the potential to become much more complicated and advanced. Detecting spam is a pretty common problem and this was a simple take on it; you can add a lot more preprocessing using libraries like nltk and possibly get even better results.