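Recently my server's load stayed stubbornly high even though the site's traffic was not that heavy. At first I blamed cache preloading, but stopping the preload entirely made little difference.
Then it occurred to me to check the nginx logs and see whether any traffic was actually coming in.
I was in for a shock: a swarm of spiders and crawlers was hitting the site nonstop...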
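For readers who want to repeat that log check, here is a minimal sketch that tallies requests per User-Agent. It assumes nginx's default "combined" log format (the format of the log lines quoted at the end of this post); the names `LOG_LINE`, `top_user_agents`, and `SAMPLE_LINES` are made up for illustration.

```python
import re
from collections import Counter

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def top_user_agents(lines, n=10):
    """Return the n most frequent User-Agent strings with their request counts."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:  # skip lines that do not look like combined-format entries
            counts[match.group(2)] += 1
    return counts.most_common(n)


# Three sample lines in the same shape as the log excerpt in this post.
SAMPLE_LINES = [
    '144.76.19.149 - - [29/Jun/2025:12:44:08 +0800] "GET /feed/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"',
    '144.76.19.149 - - [29/Jun/2025:12:44:16 +0800] "GET /about/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"',
    '121.237.36.29 - - [29/Jun/2025:12:45:03 +0800] "GET / HTTP/1.1" 200 26607 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; cs-CZ) AppleWebKit/525.28.3 (KHTML, like Gecko) Version/3.2.3 Safari/525.29"',
]

for agent, hits in top_user_agents(SAMPLE_LINES):
    print(f"{hits:6d}  {agent}")
```

On a real server you would pass an open file object instead of the sample list, since the function only needs an iterable of lines. The log path depends on your `access_log` setting.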
Since they were crawlers, wouldn't adding a robots.txt take care of it?
So I listed the crawlers I had seen:
User-agent: SemrushBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: DataForSeoBot
Disallow: /
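As a quick sanity check that rules like these parse the way they are meant to, the following sketch feeds a shortened copy of the file to Python's standard `urllib.robotparser`. The helper name `is_allowed` and the `ROBOTS_TXT` constant are illustrative. Keep in mind that this only shows what a well-behaved crawler would conclude; nothing forces a crawler to read or obey robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above (only three of the listed crawlers).
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: PetalBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="/"):
    """Return True if a compliant crawler with this name may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "AhrefsBot", "/about/"))  # has its own Disallow group
print(is_allowed(ROBOTS_TXT, "Googlebot", "/about/"))  # no group mentions it
```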
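Little did I expect that most of these crawlers do not play by the rules: they ignored robots.txt and kept crawling anyway.
Since robots.txt did not help, I blocked them in the nginx configuration instead.
The configuration looks like this: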
http {
    # Other http settings...

    # Map the User-Agent to a flag (must be placed in the http block).
    # Regexes are tested in the order they appear, so the allowlist comes first.
    map $http_user_agent $blocked_user_agent {
        default 0;
        # Search engine crawlers that are allowed
        "~*(Googlebot|bingbot|Yandex|BaiduSpider|Sogou|360Spider|ByteDanceBot)" 0;
        # Crawlers that are blocked
        "~*(SemrushBot|DotBot|MegaIndex\.ru|MauiBot|AhrefsBot|MJ12bot|BLEXBot|PetalBot|DataForSeoBot|Amazonbot|GPTBot|Bytespider|meta-externalagent|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|heritrix|EasouSpider|Ezooms)" 1;
        # Empty User-Agent or suspicious UAs
        "~*^$" 1;
        "~*(libwww-perl|curl|wget|python|nikto|scan|java|winhttp|clshttp|loader)" 1;
    }

    server {
        listen 80;
        server_name example.com;
        # Other server settings...

        # Reject flagged User-Agents (placed in the server block)
        if ($blocked_user_agent = 1) {
            return 403;
        }

        # Other location settings...
        location / {
            # ...
        }
    }
}
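A few things to note:
- Make sure the map directive is inside the http block
- Make sure the if directive is inside a server or location block
Then test the configuration and reload it: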
nginx -t
nginx -s reload
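To get a feel for how the map classifies a given User-Agent, the sketch below mimics its logic in Python. It rests on a few assumptions: the pattern lists are shortened copies of the ones above, Python's `re` module stands in for nginx's PCRE (close enough for these simple alternations), and the names `RULES` and `classify` are invented for this example. As in nginx, the first matching regex in configuration order wins and `default` applies when nothing matches.

```python
import re

# (pattern, value) pairs in the same order as the map block; lists shortened.
RULES = [
    (re.compile(r"(Googlebot|bingbot|Yandex|BaiduSpider|Sogou|360Spider|ByteDanceBot)", re.I), 0),
    (re.compile(r"(SemrushBot|DotBot|AhrefsBot|MJ12bot|BLEXBot|PetalBot|GPTBot|Bytespider)", re.I), 1),
    (re.compile(r"^$", re.I), 1),  # empty User-Agent
    (re.compile(r"(libwww-perl|curl|wget|python|nikto|scan|java|winhttp|clshttp|loader)", re.I), 1),
]
DEFAULT = 0


def classify(user_agent):
    """Return 1 if the request would be rejected with 403, else 0."""
    for pattern, value in RULES:
        if pattern.search(user_agent):
            return value  # first matching regex wins
    return DEFAULT


samples = [
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "curl/8.5.0",
    "",
    "python-requests/2.31 Googlebot",  # spoofed name still hits the allowlist first
]
for ua in samples:
    print(classify(ua), repr(ua))
```

One consequence of putting the allowlist first is visible in the last sample: any client whose User-Agent merely contains an allowed name is let through, so this filter only stops crawlers that identify themselves honestly.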
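Alternative (if you cannot modify the http block)
If you cannot modify the http block in the main configuration file, you can use plain if statements in place of map (not recommended, since it performs worse). Note that this version has no allowlist for search engine crawlers: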
server {
    listen 80;
    server_name example.com;

    # Match the User-Agent with if directly in the server block
    if ($http_user_agent ~* "SemrushBot|DotBot|MegaIndex\.ru|MauiBot|AhrefsBot|MJ12bot|BLEXBot|PetalBot|DataForSeoBot|Amazonbot|GPTBot|Bytespider|meta-externalagent|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|heritrix|EasouSpider|Ezooms") {
        return 403;
    }

    # Empty User-Agent or suspicious UAs
    if ($http_user_agent ~* "^$|libwww-perl|curl|wget|python|nikto|scan|java|winhttp|clshttp|loader") {
        return 403;
    }

    # Other settings...
}
After that, the nginx log fills up with 403 responses, and the server load comes back down:
144.76.19.149 - - [29/Jun/2025:12:44:08 +0800] "GET /feed/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:16 +0800] "GET /about/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:23 +0800] "GET /page/2/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:29 +0800] "GET /page/3/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:42 +0800] "GET /page/4/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:48 +0800] "GET /page/9/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:55 +0800] "GET /tag/AI/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:00 +0800] "GET /2024/09/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
121.237.36.29 - - [29/Jun/2025:12:45:03 +0800] "GET / HTTP/1.1" 200 26607 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; cs-CZ) AppleWebKit/525.28.3 (KHTML, like Gecko) Version/3.2.3 Safari/525.29"
144.76.19.149 - - [29/Jun/2025:12:45:06 +0800] "GET /2024/10/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:12 +0800] "GET /2024/11/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:17 +0800] "GET /2024/12/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:23 +0800] "GET /2025/01/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:30 +0800] "GET /2025/02/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:35 +0800] "GET /2025/03/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:42 +0800] "GET /2025/05/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:48 +0800] "GET /2025/06/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:54 +0800] "GET /tag/DDD/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:59 +0800] "GET /tag/MFA/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:11 +0800] "GET /tag/SQL/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:15 +0800] "GET /tag/TCP/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:20 +0800] "GET /tag/VPN/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
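To confirm from a log like this that a particular crawler is now being rejected, a small sketch such as the following counts response status codes for every request whose User-Agent contains a given token. It again assumes the combined log format; `status_counts` and `SAMPLE` are illustrative names.

```python
import re
from collections import Counter

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
STATUS_AND_AGENT = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def status_counts(lines, agent_token):
    """Count status codes for requests whose User-Agent contains agent_token."""
    token = agent_token.lower()
    counts = Counter()
    for line in lines:
        match = STATUS_AND_AGENT.search(line)
        if match and token in match.group(2).lower():
            counts[match.group(1)] += 1
    return counts


# Three lines taken from the excerpt above.
SAMPLE = [
    '144.76.19.149 - - [29/Jun/2025:12:45:00 +0800] "GET /2024/09/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"',
    '121.237.36.29 - - [29/Jun/2025:12:45:03 +0800] "GET / HTTP/1.1" 200 26607 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; cs-CZ) AppleWebKit/525.28.3 (KHTML, like Gecko) Version/3.2.3 Safari/525.29"',
    '144.76.19.149 - - [29/Jun/2025:12:45:06 +0800] "GET /2024/10/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"',
]

print(dict(status_counts(SAMPLE, "BLEXBot")))
```

If a blocked crawler still shows 200 responses after the reload, its User-Agent probably does not match any of the patterns in the map.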
Unless otherwise noted, all articles are original to 李锋镝的博客; reprints must credit this article with a link to it.