How to keep your Scrapy crawler from getting banned


According to the official Scrapy documentation at http://doc.scrapy.org/en/master/topics/practices.html#avoiding-getting-banned, the main strategies for keeping Scrapy from getting banned are:

Rotate the user agent dynamically
Disable cookies
Set a download delay
Use Google cache
Use a pool of IP addresses (Tor project, VPNs, and proxy IPs)
Use Crawlera

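Google cache is affected by the network situation in mainland China (you know what I mean), and Crawlera's distributed downloading deserves a dedicated article of its own next time. So this article focuses on four approaches: randomly rotating the user agent, disabling cookies, setting a download delay, and using proxy IPs. Now, on to the main topic: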

1. Create middlewares.py

In Scrapy, switching proxy IPs and user agents is controlled through DOWNLOADER_MIDDLEWARES. Let's create the middlewares.py file.


    [root@bogon cnblogs]# vi cnblogs/middlewares.py

    import base64
    import logging
    import random

    # Relative import so the module resolves inside the project package
    from .settings import PROXIES

    logger = logging.getLogger(__name__)


    class RandomUserAgent(object):
        """Randomly rotate user agents based on a list of predefined ones"""

        def __init__(self, agents):
            self.agents = agents

        @classmethod
        def from_crawler(cls, crawler):
            # Read the USER_AGENTS list from settings.py
            return cls(crawler.settings.getlist('USER_AGENTS'))

        def process_request(self, request, spider):
            # setdefault leaves a User-Agent that is already set untouched
            request.headers.setdefault('User-Agent', random.choice(self.agents))


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass'] is not None:
                # b64encode takes bytes and, unlike encodestring, adds no trailing newline
                encoded_user_pass = base64.b64encode(proxy['user_pass'].encode('utf-8'))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
                logger.debug("ProxyMiddleware have pass: %s", proxy['ip_port'])
            else:
                logger.debug("ProxyMiddleware no pass: %s", proxy['ip_port'])

The RandomUserAgent class picks a user agent dynamically; the list of user agents, USER_AGENTS, is configured in settings.py.
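The setdefault call in process_request is worth a closer look: it only fills in the header when the request does not already carry one. The sketch below illustrates that behavior with a plain dict standing in for request.headers. The function pick_user_agent and the agent strings are hypothetical names made up for this illustration; they are not part of Scrapy.

```python
import random

# Hypothetical agent strings, for illustration only.
AGENTS = ["agent-a/1.0", "agent-b/2.0", "agent-c/3.0"]


def pick_user_agent(headers, agents, rng=random):
    """Mimic RandomUserAgent.process_request on a plain dict of headers."""
    # setdefault only writes the value when the key is absent.
    headers.setdefault("User-Agent", rng.choice(agents))
    return headers


# A request with no User-Agent gets a random one from the list.
fresh = pick_user_agent({}, AGENTS)

# A request that already has a User-Agent is left untouched.
preset = pick_user_agent({"User-Agent": "custom/0.1"}, AGENTS)
```

In practice this means the random user agent only takes effect if nothing earlier in the chain has set the header, so the middleware should probably run early.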

The ProxyMiddleware class switches proxies; the proxy list, PROXIES, is also configured in settings.py.
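Neither class does anything until it is listed in DOWNLOADER_MIDDLEWARES. The fragment below is a sketch of what that entry in settings.py might look like. It assumes the project package is named cnblogs (as the shell prompt suggests), and the priority numbers are illustrative choices rather than values confirmed by this article.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Assumed module path: <project package>.middlewares.<class name>.
    # Lower numbers have their process_request called earlier.
    "cnblogs.middlewares.RandomUserAgent": 1,
    "cnblogs.middlewares.ProxyMiddleware": 100,
}
```

Giving RandomUserAgent a low number lets it set the User-Agent header before other downloader middlewares see the request.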

2. Edit settings.py to configure USER_AGENTS and PROXIES

a): Add USER_AGENTS
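USER_AGENTS is a plain Python list of strings in settings.py. A short sketch follows; the strings are examples chosen for illustration and are not the list used in this article, and a real list would usually be longer.

```python
# settings.py (fragment)
# Example user agent strings; substitute real, current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
```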

b): Add the proxy IP setting PROXIES

Proxy IPs can be found by searching online; the ones used for this article came from http://www.xici.net.co/.
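Judging from how ProxyMiddleware reads it, each entry in PROXIES is a dict with an ip_port key and a user_pass key. The sketch below shows that shape with placeholder addresses from the reserved documentation ranges; the addresses and credentials are invented and will not work as real proxies.

```python
# settings.py (fragment)
# Placeholder addresses (reserved documentation ranges), not real proxies.
PROXIES = [
    # Proxy without authentication: the middleware checks for None.
    {"ip_port": "192.0.2.10:8080", "user_pass": None},
    # Proxy with HTTP Basic credentials in "username:password" form.
    {"ip_port": "203.0.113.25:3128", "user_pass": "user:secret"},
]
```

Because the middleware tests user_pass with "is not None", an empty string would still be treated as credentials, so None is the safer marker for an unauthenticated proxy.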

c): Disable cookies


    COOKIES_ENABLED = False

d): Set a download delay


    DOWNLOAD_DELAY = 3
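With Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting, which is enabled by default, the actual wait is not a fixed 3 seconds but a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only illustrates that range; randomized_delay is a hypothetical helper written for this example, not Scrapy's own code.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, the value set in settings.py


def randomized_delay(base_delay, rng=random):
    """Return a wait time drawn uniformly from 0.5x to 1.5x the base delay."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# Five sample waits, each somewhere between 1.5 and 4.5 seconds.
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]
```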

Save settings.py
