Scraping the w3school Course List with Scrapy


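As always, start by creating a Scrapy project. This assumes you have already installed Scrapy, of course. Installation is least painful on Linux: one command and it is done in an instant, which feels great.

Back to business: create the Scrapy project.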


    scrapy startproject w3school

Afterwards you can take a look at the structure of the project. Remember last time, when we scraped the weather? We used the tree command.


    tree w3school

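You will see a tree-shaped directory listing.

Now we start writing the crawler's files. Only 4 files actually need editing: items.py, pipelines.py, settings.py and the spider file, spider.py. This is just the most basic level, since I am still a newbie myself, haha.

The items file is a container. An Item container is the data structure that holds the data extracted from a web page in structured form, similar to a Python dictionary. The way I understand it, you write one Field() for each thing you want to extract. That understanding is still pretty basic, but I firmly believe you grow fastest through practice. At the start it is fine not to understand everything in depth; after you have used it enough times and studied further, it will naturally make sense.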


    import scrapy


    class W3SchoolItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
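To make the "similar to a dictionary" idea concrete, here is a rough plain-Python sketch. It does not use Scrapy at all: FieldCheckedItem and CourseItem are hypothetical stand-ins that only imitate one behaviour of an Item, namely that you may only assign the keys you declared.

```python
class FieldCheckedItem(dict):
    """Hypothetical stand-in for an Item: a dict that only accepts declared keys."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class CourseItem(FieldCheckedItem):
    # The same three names that W3SchoolItem declares with Field().
    fields = ("title", "link", "desc")


item = CourseItem()
item["title"] = ["HTML 系列教程"]
item["link"] = ["/h.asp"]
print(dict(item))

try:
    item["author"] = ["nobody"]  # not declared, so the assignment is rejected
except KeyError as exc:
    print("rejected:", exc)
```

The point is only that declaring the fields up front catches typos in key names early; the real Item class does more than this.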


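After that, we can write either the spider file or the pipelines file; the order does not matter. The pipelines file mainly takes care of deduplicating and dropping data, validating the data in each item, and saving the resulting items. In other words, all scraped data ends up being passed to the pipelines file. The spider file has a parse() function that ends with return items, which returns a list. The codecs module used in the pipeline is there specifically for encoding conversion.

Meaning of the process_item parameters:
item (Item object) – the Item object returned by the parse method
spider (BaseSpider object) – the spider that scraped this Item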


    # -*- coding: utf-8 -*-

    import json
    import codecs

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


    class W3SchoolPipeline(object):
        def __init__(self):
            # Create W3School_data.json when the pipeline is initialized.
            self.file = codecs.open('W3School_data.json', 'wb', encoding='utf-8')

        def process_item(self, item, spider):
            # Serialize the item as one JSON object per line.
            line = json.dumps(dict(item)) + '\n'
            print(line)
            # Python 2: turn the \uXXXX escapes back into readable characters.
            self.file.write(line.decode("unicode_escape"))
            return item
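The pipeline here only saves items. The deduplicating, dropping and validating role mentioned earlier could look roughly like the sketch below. It is plain Python so it runs without Scrapy: the DropItem class is a local stand-in for the exception Scrapy provides in scrapy.exceptions, and DuplicateLinkPipeline and the sample items are made up for illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DuplicateLinkPipeline(object):
    """Hypothetical pipeline: drop items with no link or with a link already seen."""

    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        key = tuple(item.get("link", []))
        if not key:
            raise DropItem("item has no link: %r" % (item,))
        if key in self.seen_links:
            raise DropItem("duplicate link: %r" % (key,))
        self.seen_links.add(key)
        return item


pipeline = DuplicateLinkPipeline()
samples = [
    {"title": ["HTML"], "link": ["/h.asp"]},
    {"title": ["HTML again"], "link": ["/h.asp"]},  # duplicate link
    {"title": ["No link"], "link": []},             # fails validation
    {"title": ["XML"], "link": ["/x.asp"]},
]

kept = []
for sample in samples:
    try:
        kept.append(pipeline.process_item(sample, spider=None))
    except DropItem as exc:
        print("dropped:", exc)

print([item["title"][0] for item in kept])
```

Returning the item passes it on to the next pipeline, while raising the exception stops it there, which is the same contract process_item follows in a real project.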


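Once the pipelines file is written, you must add one line to the settings file. The Item data returned from the spider's parse method is processed in turn by the Pipeline classes registered in ITEM_PIPELINES.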


    ITEM_PIPELINES = {'w3school.pipelines.W3SchoolPipeline': 300}
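The number 300 is the pipeline's position in the queue: when several pipelines are registered, items pass through them from the lowest value to the highest (values are customarily picked from 0 to 1000). A small sketch of that ordering, where ValidatePipeline is a hypothetical second pipeline that does not exist in this project:

```python
# Hypothetical settings: ValidatePipeline is an assumed extra pipeline,
# added here only to show how the numbers decide the order.
ITEM_PIPELINES = {
    "w3school.pipelines.W3SchoolPipeline": 300,
    "w3school.pipelines.ValidatePipeline": 100,
}

# Lower values run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values an item would be validated first and written to the file second.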


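Finally, we write the spider file, which extracts the information we want from the page source. Scrapy officially recommends XPath for extraction, as it is convenient and fast. Three attributes must be set in the spider file. The first is name, which works as the spider's codename: you can write many spiders, but you can only call on one of them at a time, and name is what you use when calling it. The second is allowed_domains: Scrapy has an offsite mechanism that controls whether other domains get crawled, so w3school.com.cn has to be added to allowed_domains. The last is start_urls, the list of URLs the spider starts crawling from. The first pages fetched will therefore be among these, and subsequent URLs are extracted from the data retrieved from the initial URLs.

Scrapy also provides a log facility. I do not know everything it can do, but it should capture Scrapy's logs. Its msg() function records a message and shows it in the terminal window, a bit like MessageBox in C++.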

Scrapy provides 5 logging levels:
scrapy.log.CRITICAL – critical errors
scrapy.log.ERROR – regular errors
scrapy.log.WARNING – warning messages
scrapy.log.INFO – informational messages
scrapy.log.DEBUG – debugging messages
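The same five level names exist in Python's standard logging module, so the way a level threshold filters messages can be tried out without Scrapy. The following is only a sketch using the standard library: the logger name w3school_demo and the ListHandler helper are made up for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects records in a list so they can be inspected afterwards."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.lines = []

    def emit(self, record):
        self.lines.append("%s: %s" % (record.levelname, record.getMessage()))


logger = logging.getLogger("w3school_demo")  # hypothetical logger name
logger.propagate = False
logger.setLevel(logging.INFO)  # anything below INFO is discarded
handler = ListHandler()
logger.addHandler(handler)

logger.debug("selector matched 7 nodes")  # filtered out by the INFO threshold
logger.info("Appending item...")
logger.warning("item has an empty title")
logger.error("request failed")
logger.critical("cannot open output file")

print(handler.lines)
```

Only the DEBUG message is missing from the result, because the threshold was set to INFO.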


    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from scrapy import log
    from w3school.items import W3SchoolItem


    class W3schoolSpider(Spider):
        name = 'w3school'
        allowed_domains = ["w3school.com.cn"]
        start_urls = ['http://www.w3school.com.cn/']

        def parse(self, response):
            sel = Selector(response)
            # Find the div with id="navfirst", then the <li> entries of its first <ul>.
            sites = sel.xpath('//div[@id="navfirst"]/ul[1]/li')
            items = []

            for site in sites:
                item = W3SchoolItem()  # each item behaves like a dictionary

                title = site.xpath('a/text()').extract()  # text between <a> and </a>
                link = site.xpath('a/@href').extract()    # href attribute of <a>
                desc = site.xpath('a/@title').extract()   # title attribute of <a>

                item['title'] = [t.encode('utf-8') for t in title]
                item['link'] = [l.encode('utf-8') for l in link]
                item['desc'] = [d.encode('utf-8') for d in desc]

                items.append(item)
                log.msg("Appending item...", level=log.INFO)

            log.msg("Append done.", level=log.INFO)
            return items
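If you want to see what the XPath expression selects without running a crawl, the same path can be tried on a small piece of markup. This is a sketch under assumptions: the SAMPLE markup is invented to imitate the navigation block, and it uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector. ElementTree only understands a subset of XPath, so the text and attributes are read with .text and .get() rather than text() and @href.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that imitates the navigation block of the page.
SAMPLE = """
<html><body>
  <div id="navfirst">
    <ul>
      <li><a href="/h.asp" title="HTML 系列教程">HTML 系列教程</a></li>
      <li><a href="/x.asp" title="XML 系列教程">XML 系列教程</a></li>
    </ul>
  </div>
  <div id="navsecond">
    <ul><li><a href="/other.asp" title="skip">skip</a></li></ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

rows = []
# Same shape as the spider's path: the <li> entries of the first <ul> in div#navfirst.
for li in root.findall(".//div[@id='navfirst']/ul[1]/li"):
    a = li.find("a")
    rows.append({
        "title": a.text,         # text between <a> and </a>
        "link": a.get("href"),   # href attribute
        "desc": a.get("title"),  # title attribute
    })

for row in rows:
    print(row)
```

The entry under navsecond is not picked up, which shows how the id predicate narrows the selection to one block of the page.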


Finally, we can check the fruits of our labor, haha. Go into the w3school directory first, then run the command below and you are done.


    scrapy crawl w3school


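Afterwards, a file named W3School_data.json appears in the folder. You can scrape other things too, just by changing the XPath expressions. The contents of the file look like this: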


    {"title": ["HTML 系列教程"], "link": ["/h.asp"], "desc": ["HTML 系列教程"]}
    {"title": ["浏览器脚本"], "link": ["/b.asp"], "desc": ["浏览器脚本教程"]}
    {"title": ["服务器脚本"], "link": ["/s.asp"], "desc": ["服务器脚本教程"]}
    {"title": ["ASP.NET 教程"], "link": ["/d.asp"], "desc": ["ASP.NET 教程"]}
    {"title": ["XML 系列教程"], "link": ["/x.asp"], "desc": ["XML 系列教程"]}
    {"title": ["Web Services 系列教程"], "link": ["/ws.asp"], "desc": ["Web Services 系列教程"]}
    {"title": ["建站手册"], "link": ["/w.asp"], "desc": ["建站手册"]}
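Each line of the output is a separate JSON object, and the links are relative to the site root. As a rough sketch of how the file could be used afterwards (Python 3, standard library only), the snippet below parses two lines copied from the output and joins each link with the spider's start URL. The BASE_URL and courses names are just for this example, and a real script would read the lines from W3School_data.json instead of a list.

```python
import json
from urllib.parse import urljoin

BASE_URL = "http://www.w3school.com.cn/"  # the spider's start URL

# Two lines copied from the output; a real script would read them from the file.
lines = [
    '{"title": ["HTML 系列教程"], "link": ["/h.asp"], "desc": ["HTML 系列教程"]}',
    '{"title": ["建站手册"], "link": ["/w.asp"], "desc": ["建站手册"]}',
]

courses = []
for line in lines:
    record = json.loads(line)  # one JSON object per line
    absolute_url = urljoin(BASE_URL, record["link"][0])
    courses.append((record["title"][0], absolute_url))

for title, url in courses:
    print(title, url)
```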
