My Server Was Nearly Taken Down by a Pack of Junk Crawlers

June 29, 2025

Recently I noticed the server load staying stubbornly high even though the site's actual traffic was nowhere near that level. At first I assumed the cache preloading was to blame, but disabling preloading entirely made no real difference.

Then it occurred to me to check the nginx logs and see what traffic was actually coming in.

What I found was startling: a swarm of spider bots crawling the site non-stop…

Since they were crawlers, wouldn't adding a robots.txt be enough?

So I added entries for every crawler I spotted in the logs:

User-agent: SemrushBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: DataForSeoBot
Disallow: /
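A quick sanity check that the file is actually being served (example.com here stands in for your own domain):

curl -s https://example.com/robots.txt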

Who would have guessed that most of these crawlers don't play by the rules at all: they ignored robots.txt and kept crawling anyway.

Since robots.txt wasn't working, the next step was to block them in the nginx configuration.

The configuration looks like this:

http {
    # other http configuration...

    # Define the User-Agent map (this goes in the http block)
    map $http_user_agent $blocked_user_agent {
        default 0;

        # Search engine crawlers that are allowed
        "~*(Googlebot|bingbot|Yandex|BaiduSpider|Sogou|360Spider|ByteDanceBot)" 0;

        # Crawlers to block
        "~*(SemrushBot|DotBot|MegaIndex\.ru|MauiBot|AhrefsBot|MJ12bot|BLEXBot|PetalBot|DataForSeoBot|Amazonbot|GPTBot|Bytespider|meta-externalagent|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|heritrix|EasouSpider|Ezooms)" 1;

        # Empty or otherwise suspicious User-Agents
        "~*^$" 1;
        "~*(libwww-perl|curl|wget|python|nikto|scan|java|winhttp|clshttp|loader)" 1;
    }

    server {
        listen 80;
        server_name example.com;

        # other server configuration...

        # Block the flagged User-Agents (this goes in the server block)
        if ($blocked_user_agent = 1) {
            return 403;
        }

        # other location configuration...
        location / {
            # ...
        }
    }
}
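Before turning the block on, it can help to see how requests are being classified. A minimal sketch (the "bots" format name and log path are my own choices, not part of the config above): add an extra access log in the http block that records the map result.

# Optional debugging aid inside the http block: log how each request's
# User-Agent was classified by the map ("bots" is an arbitrary format name).
log_format bots '$remote_addr "$http_user_agent" blocked=$blocked_user_agent';
access_log /var/log/nginx/bots.log bots;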

A couple of things to note:

  • the map directive must be placed in the http block
  • the if directive must be placed in the server or location block

Then test and reload the configuration:

nginx -t

nginx -s reload
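To verify the rules actually work, send a request with a blocked User-Agent and confirm it gets a 403 (a rough check; example.com stands in for your real server_name):

# should print 403 (User-Agent is on the block list)
curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" http://example.com/

# should print 200 (a normal browser User-Agent)
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" http://example.com/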

An alternative (if you can't modify the http block)

If you can't touch the http block in the main configuration file, you can use plain if statements instead of map (not recommended; performance is worse):

server {
    listen 80;
    server_name example.com;

    # Match User-Agents directly with if, right in the server block
    if ($http_user_agent ~* "SemrushBot|DotBot|MegaIndex\.ru|MauiBot|AhrefsBot|MJ12bot|BLEXBot|PetalBot|DataForSeoBot|Amazonbot|GPTBot|Bytespider|meta-externalagent|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|heritrix|EasouSpider|Ezooms") {
        return 403;
    }

    # Empty or otherwise suspicious User-Agents
    if ($http_user_agent ~* "^$|libwww-perl|curl|wget|python|nikto|scan|java|winhttp|clshttp|loader") {
        return 403;
    }

    # other configuration...
}

After that, the nginx log fills up with 403s and the server load comes back down:

144.76.19.149 - - [29/Jun/2025:12:44:08 +0800] "GET /feed/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:16 +0800] "GET /about/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:23 +0800] "GET /page/2/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:29 +0800] "GET /page/3/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:42 +0800] "GET /page/4/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:48 +0800] "GET /page/9/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:44:55 +0800] "GET /tag/AI/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:00 +0800] "GET /2024/09/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
121.237.36.29 - - [29/Jun/2025:12:45:03 +0800] "GET / HTTP/1.1" 200 26607 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; cs-CZ) AppleWebKit/525.28.3 (KHTML, like Gecko) Version/3.2.3 Safari/525.29"
144.76.19.149 - - [29/Jun/2025:12:45:06 +0800] "GET /2024/10/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:12 +0800] "GET /2024/11/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:17 +0800] "GET /2024/12/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:23 +0800] "GET /2025/01/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:30 +0800] "GET /2025/02/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:35 +0800] "GET /2025/03/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:42 +0800] "GET /2025/05/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:48 +0800] "GET /2025/06/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:54 +0800] "GET /tag/DDD/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:45:59 +0800] "GET /tag/MFA/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:11 +0800] "GET /tag/SQL/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:15 +0800] "GET /tag/TCP/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
144.76.19.149 - - [29/Jun/2025:12:46:20 +0800] "GET /tag/VPN/ HTTP/1.1" 403 146 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)"
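For a quick picture of who is being turned away, a one-liner like this tallies 403 responses by User-Agent (assuming the default combined log format and the usual access log path, which may differ on your setup):

# count 403 responses per User-Agent, most frequent first
awk -F'"' '$3 ~ /^ 403 / {print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head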
Unless otherwise noted, articles on 李锋镝的博客 are original; reposts must link back to the original article.

Original post: https://www.lifengdi.com/article/yun-wei/4484
