Blocking junk spider UA crawling on IIS 6/7+, Nginx, and Apache to reduce server load (and how to deny a specific UserAgent on IIS 7.5)

The site was slow and CPU/server load was high because unidentified spiders kept crawling it. After writing rules to block them, the load dropped. This post collects methods for blocking unidentified spider UAs under IIS, Nginx, and Apache, plus a list of the major spiders' names and a list of common junk UAs.


Recently the site has been extremely slow, CPU usage very high, and overall server load very high. Opening the logs showed a lot of unidentified spiders constantly crawling the site, and experience said that was the problem. I wrote rules for my situation and blocked them, and afterwards the load came back down. Below is a summary of how to block unidentified spider UAs under IIS, Nginx, and Apache.

Note: adjust the UA list to your own situation by removing or adding entries. The rules I provide include some rarely seen spider UAs that you will almost never need; if your site is special and needs particular spiders to crawl it, go through the rules carefully and simply remove those UAs from the pattern.

Tested OK on IIS 7.5

Block UAs matching the specified keywords and return a 403:

<rule name="NoUserAgent" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="|特征1|特征2|特征3" />
</conditions>
<action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You did not present a User-Agent header which is required for this site" />
</rule>
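For reference, this <rule> element does not stand on its own: it has to sit inside the <rewrite>/<rules> section of the site's web.config, the same layout used by the full example in section 2 below. A minimal wrapper sketch:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
<rewrite>
<rules>
<!-- paste the "NoUserAgent" rule from above here -->
</rules>
</rewrite>
</system.webServer>
</configuration>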

For example, to block only requests with an empty UA:

<add input="{HTTP_USER_AGENT}" pattern="|^$|特征2|特征3" />

For example, to block specific UAs plus empty UAs:

<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />

Block specific spiders (and/or IP addresses)

<rewrite>  
<rules>  
<rule name="Block Some Ip Adresses OR Bots" stopProcessing="true">  
<match url="(.*)" />  
<conditions logicalGrouping="MatchAny">  
<add input="{HTTP_USER_AGENT}" pattern="蜘蛛名称" ignoreCase="true" /> <!-- 来禁止特定蜘蛛 -->  
<add input="{HTTP_USER_AGENT}" pattern="^$" /> <!-- 禁止空 UA 访问 -->  
<add input="{REMOTE_ADDR}" pattern="单独IP或使用正则表达的IP地址" />  
</conditions>  
<!--  你也可以使用<action type="AbortRequest" />来直接代替下面这段代码  -->  
<action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />  
</rule>  
</rules>  
</rewrite>  
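Once the rule is in place, a quick command-line spot-check can confirm it works (example.com is a hypothetical domain, and MJ12bot stands in for whichever spider name you actually configured above):

curl -I -A "MJ12bot" http://example.com/
curl -I -A "" http://example.com/
curl -I -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" http://example.com/

The first two requests (a blocked spider UA and an empty/absent UA) should come back as 403 Forbidden, while the normal browser UA should still get a 200.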

Block all requests except one specified file (with negate="true", anything other than robotssss.txt gets a 403; the full web.config example in section 2 builds on this same skeleton):

<rule name="Block spider">  
      <match url="(^robotssss.txt$)" ignoreCase="false" negate="true" /> <!-- 禁止浏览某文件 -->  
      <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />  
</rule>






1. Nginx: to block junk spiders, put the following into your nginx configuration (inside the relevant server block).
# Block crawling by tools such as Scrapy

if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
# Block the listed UAs as well as requests with an empty UA
if ($http_user_agent ~ "opensiteexplorer|MauiBot|FeedDemon|SemrushBot|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|semrushbot|alphaseobot|semrush|Feedly|UniversalFeedParser|webmeup-crawler|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
return 403;
}
# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
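These if blocks are only valid inside a server or location context (not directly under http), so a minimal sketch of where they go looks like this (example.com and the web root are assumptions):

server {
listen 80;
server_name example.com;  # hypothetical domain

# the UA / request-method filters from above go here
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}

location / {
root /var/www/html;  # assumed web root
index index.html;
}
}

After editing, run nginx -t to check the syntax and nginx -s reload to apply the change.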



2. IIS 7/IIS 8/IIS 10 and later: create a web.config file in the site root and add the following (the <rewrite> rules require the IIS URL Rewrite module to be installed):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
<rewrite>
<rules>
<rule name="Block spider">
<match url="(^robots.txt$)" ignoreCase="false" negate="true" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$"
ignoreCase="true" />
</conditions>
<action type="AbortRequest" />
</rule>
</rules>
</rewrite>
</system.webServer>
</configuration>



3. Apache: add the following rules to your .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine On
#Block spider
RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
RewriteRule !(^robots\.txt$) - [F]
</IfModule>
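If mod_rewrite is not available on your host, a roughly equivalent sketch using mod_setenvif (shortened bot list for illustration, and assuming Apache 2.4's Require directives are permitted in .htaccess) looks like this:

<IfModule mod_setenvif.c>
# tag requests whose User-Agent matches junk spiders (shortened list; extend as needed)
SetEnvIfNoCase User-Agent "MJ12bot|AhrefsBot|SemrushBot|Bytespider" bad_bot
# Apache 2.4: deny tagged requests, allow everything else
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>
</IfModule>

Note that unlike the mod_rewrite rule above, this sketch does not exempt robots.txt.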



Note: by default the rules above block a set of unidentified spiders; to block additional spiders, just add them to the pattern in the same way.

Appendix: names of the major spiders:

Google spider: googlebot

Baidu spider: baiduspider

Baidu mobile spider: baiduboxapp

Yahoo spider: slurp

Alexa spider: ia_archiver

MSN spider: msnbot

Bing spider: bingbot

AltaVista spider: scooter

Lycos spider: lycos_spider_(t-rex)

AllTheWeb spider: fast-webcrawler

Inktomi spider: slurp

Youdao spiders: YodaoBot and OutfoxBot

Retu (热土) spider: Adminrtspider

Sogou spider: sogou spider

SOSO spider: sosospider

360 Search spider: 360spider




Common junk UAs seen around the web

Content scraping:
FeedDemon
Java
Jullo
Feedly
UniversalFeedParser

SQL injection:
BOT/0.1 (BOT for JCE)
CrawlDaddy

Useless crawlers:
EasouSpider
Swiftbot
YandexBot
AhrefsBot
jikeSpider
MJ12bot
YYSpider
oBot

CC (HTTP flood) attack tools:
ApacheBench
WinHttp

TCP attack:
HttpClient

Scanners:
Microsoft URL Control
ZmEu (phpMyAdmin scanner)
jaunty



