Building a Movie-Scraping Crawler in Python


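Approach:

Scraping every movie from a movie site breaks down into four steps:

Start from a single URL and collect all of the site's movie categories.
Find out how many pages of movies each category has.
Build the URL of every page in each category from the pattern its category URL follows.
Parse the HTML of each page and extract the movie details with regular expressions.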

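Preparation:

Install Python (I am on a Mac, where the default version is Python 2.7.1).

Install MongoDB: download the latest version from the official site and start it. If the server is reachable from the public internet, enable password authentication or bind it to 127.0.0.1; otherwise anyone can connect to it without credentials.
Install the BeautifulSoup and pymongo modules.
Install a Python editor; I personally like Sublime Text 2.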

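Writing the code:

This walkthrough uses Tencent Video as the example; for other video sites, only the regular expressions need to change.

Getting all video categories from the site's listing URL

The URL that lists all of Tencent's videos is: http://v.qq.com/list/1_-1_-1_-1_1_0_0_20_0_-1_0.html

First we import the urllib2 package and wrap reading a URL's HTML in a function. The code is as follows.

Import the required modules and define the global variables: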


# -*- coding: utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup
import string
import pymongo

NUM = 0          # global: number of movies scraped
m_type = u''     # global: movie category
m_site = u'qq'   # global: movie site

The gethtml function takes a URL and returns that URL's HTML content:

# Fetch the content of the page at the given URL
def gethtml(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    html = response.read()
    return html
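
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, a minimal sketch of the same helper might look like the following. The timeout value, the User-Agent string, and the UTF-8 fallback are assumptions added for illustration and are not part of the original code.

```python
import urllib.request

# Assumed defaults for illustration; tune them for the target site.
DEFAULT_TIMEOUT = 10  # seconds
USER_AGENT = "Mozilla/5.0 (compatible; movie-crawler-sketch)"


def gethtml(url, timeout=DEFAULT_TIMEOUT):
    """Fetch a URL and return its body decoded to text."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Unlike the Python 2 version, this sketch returns text rather than raw bytes, so the result can be passed straight to a parser or a regular expression over str.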

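Next, look at the page source for this URL. The movie category information sits inside the <ul class="clearfix _group" gname="mi_type" gtype="1"> tag, and each movie category entry has the following format:

<a _hot="tag.sub" class="_gtag _hotkey" href="http://v.qq.com/list/1_0_-1_-1_1_0_0_20_0_-1_0.html" title="动作" tvalue="0">动作</a>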

OK, now we write a gettags function that stores every movie category and its URL in a dictionary. The code is as follows:

# Get the movie categories from the category listing page
def gettags(html):
    global m_type
    tags_url = {}                   # category name -> category URL; defined up front so the return always works
    soup = BeautifulSoup(html)      # filter out the category markup
    #print soup
    #<ul class="clearfix _group" gname="mi_type" gtype="1">
    tags_all = soup.find_all('ul', {'class' : 'clearfix _group' , 'gname' : 'mi_type'})

    # This isolates the HTML that holds the category information. A regular
    # expression then filters it; (.+?) captures each field we care about.

    #<a _hot="tag.sub" class="_gtag _hotkey" href="http://v.qq.com/list/1_0_-1_-1_1_0_0_20_0_-1_0.html" title="动作" tvalue="0">动作</a>
    re_tags = r'<a _hot=\"tag\.sub\" class=\"_gtag _hotkey\" href=\"(.+?)\" title=\"(.+?)\" tvalue=\"(.+?)\">.+?</a>'
    p = re.compile(re_tags, re.DOTALL)

    tags = p.findall(str(tags_all[0]))
    if tags:
        #print tags
        for tag in tags:
            tag_url = tag[0].decode('utf-8')
            #print tag_url
            m_type = tag[1].decode('utf-8')
            tags_url[m_type] = tag_url
    else:
        print "Not Found"
    return tags_url
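
Matching attributes with a regular expression depends on their exact order and quoting. As an alternative sketch, the standard-library html.parser module (Python 3) can collect the same category links by attribute name. The class name CategoryParser and the helper parse_categories are hypothetical, and the sample markup is modelled on the anchor shown in the comments above, so treat this as an illustration rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class CategoryParser(HTMLParser):
    """Collect {title: href} for <a> tags inside <ul gname="mi_type">."""

    def __init__(self):
        super().__init__()
        self.in_group = False   # True while inside the category <ul>
        self.categories = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("gname") == "mi_type":
            self.in_group = True
        elif tag == "a" and self.in_group:
            if "href" in attrs and "title" in attrs:
                self.categories[attrs["title"]] = attrs["href"]

    def handle_endtag(self, tag):
        # Assumes the category <ul> contains no nested <ul>.
        if tag == "ul":
            self.in_group = False


def parse_categories(html):
    parser = CategoryParser()
    parser.feed(html)
    return parser.categories


sample = (
    '<ul class="clearfix _group" gname="mi_type" gtype="1">'
    '<li><a _hot="tag.sub" class="_gtag _hotkey" '
    'href="http://v.qq.com/list/1_0_-1_-1_1_0_0_20_0_-1_0.html" '
    'title="动作" tvalue="0">动作</a></li>'
    '</ul>'
    '<a href="http://example.com/other" title="ignored">x</a>'
)
print(parse_categories(sample))
```

Links outside the category list are skipped because the parser only records anchors while it is inside the matching ul element.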

Next, loop over the categories to get the number of pages of movies in each one:

for url in tag_urls.items():
    # url is a (category name, category URL) pair
    print str(url[1]).encode('utf-8') #,url[0]
    maxpage = int(get_pages(str(url[1]).encode('utf-8')))
    print maxpage

The code that finds how many pages of movies a category has:

# Get the number of pages in a category
def get_pages(tag_url):
    tag_html = gethtml(tag_url)
    #div class="paginator
    soup = BeautifulSoup(tag_html)      # filter out the pagination markup
    #print soup
    #<div class="mod_pagenav" id="pager">
    div_page = soup.find_all('div', {'class' : 'mod_pagenav', 'id' : 'pager'})
    #print div_page #len(div_page), div_page[0]

    #<a class="c_txt6" href="http://v.qq.com/list/1_2_-1_-1_1_0_24_20_0_-1_0.html" title="25"><span>25</span></a>
    re_pages = r'<a class=.+?><span>(.+?)</span></a>'
    p = re.compile(re_pages, re.DOTALL)
    pages = p.findall(str(div_page[0]))
    #print pages
    if len(pages) > 1:
        # the second-to-last match is used as the last page number
        # (the final match is presumably the "next page" link)
        return pages[-2]
    else:
        return 1
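
The pager logic can be checked without touching the network. The sketch below (Python 3) applies the same regular expression to a pager fragment and takes the largest numeric label instead of relying on position. The function name max_page and the sample markup, including the "下一页" link, are assumptions modelled on the anchor in the comment above.

```python
import re

# Same pattern as get_pages: capture the text inside <a ...><span>...</span></a>.
PAGE_RE = re.compile(r'<a class=.+?><span>(.+?)</span></a>', re.DOTALL)


def max_page(pager_html):
    """Return the highest page number shown in the pager, or 1 if none."""
    labels = PAGE_RE.findall(pager_html)
    # Keep only labels made of decimal digits; skip "next page" style links.
    numbers = [int(label) for label in labels if label.strip().isdecimal()]
    return max(numbers) if numbers else 1


sample_pager = (
    '<div class="mod_pagenav" id="pager">'
    '<a class="c_txt6" href="#" title="2"><span>2</span></a>'
    '<a class="c_txt6" href="#" title="25"><span>25</span></a>'
    '<a class="c_txt6" href="#" title="下一页"><span>下一页</span></a>'
    '</div>'
)
print(max_page(sample_pager))
```

Returning an int in every case also removes the need for the caller to convert the result.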

Then, within each category, generate the URL of every page from the pattern its URL follows:

for url in tag_urls.items():
    # Note: getmovie() stores the global m_type, which gettags() leaves at the
    # last category it parsed; set m_type = url[0] here to record each movie
    # under its own category.
    print str(url[1]).encode('utf-8') #,url[0]
    maxpage = int(get_pages(str(url[1]).encode('utf-8')))
    print maxpage

    for x in range(0, maxpage):
        #http://v.qq.com/list/1_0_-1_-1_1_0_0_20_0_-1_0.html
        # strip the page-dependent tail, then append it again with page index x
        m_url = str(url[1]).replace('0_20_0_-1_0.html', '')
        movie_url = "%s%d_20_0_-1_0.html" % (m_url, x)
        print movie_url
        movie_html = gethtml(movie_url.encode('utf-8'))
        #print movie_html
        getmovielist(movie_html)
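
The URL rewriting in the loop can be isolated into a small pure function, which makes the pattern easier to test. This is a sketch in Python 3; the name build_page_urls is hypothetical, and it assumes, as the loop above does, that every category URL ends in 0_20_0_-1_0.html and that pages are indexed from 0.

```python
PAGE_SUFFIX = "0_20_0_-1_0.html"   # tail of a category URL for page index 0


def build_page_urls(category_url, maxpage):
    """Return the URLs of pages 0..maxpage-1 for one category."""
    if not category_url.endswith(PAGE_SUFFIX):
        raise ValueError("unexpected category URL: %s" % category_url)
    base = category_url[:-len(PAGE_SUFFIX)]
    return ["%s%d_20_0_-1_0.html" % (base, page) for page in range(maxpage)]


urls = build_page_urls("http://v.qq.com/list/1_0_-1_-1_1_0_0_20_0_-1_0.html", 3)
for u in urls:
    print(u)
```

Checking the suffix with endswith is stricter than the replace call in the loop: a URL that does not follow the expected pattern raises an error instead of being requested unchanged.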

The getmovielist function takes the HTML returned for each page and isolates the HTML blocks that contain the movie information:

def getmovielist(html):
    soup = BeautifulSoup(html)

    #<ul class="mod_list_pic_130">
    divs = soup.find_all('ul', {'class' : 'mod_list_pic_130'})
    #print divs
    for div_html in divs:
        # flatten each block to a single line before handing it to the regex
        div_html = str(div_html).replace('\n', '')
        #print div_html
        getmovie(div_html)

Each extracted HTML block is passed to getmovie, which separates out the individual movie details and stores them in the database:

def getmovie(html):
    global NUM        # number of movies
    global m_type     # movie category
    global m_site     # site the movie is on

    #<h6 class="caption"> <a href="http://www.tudou.com/albumcover/Z7eF_40EL4I.html" target="_blank" title="徒步旅行队">徒步旅行队</a> </h6> <ul class="info"> <li class="desc">法国卖座喜剧片</li> <li class="cast"> </li> </ul> </div> <div class="ext ext_last"> <div class="ext_txt"> <h3 class="ext_title">徒步旅行队</h3> <div class="ext_info"> <span class="ext_area">地区: 法国</span> <span class="ext_cast">导演: </span> <span class="ext_date">年代: 2009</span> <span class="ext_type">类型: 喜剧</span> </div> <p class="ext_intro">理查德·达奇拥有一家小的旅游公司，主要经营法国游客到非洲大草原的旅游服务。六个法国游客决定参加理查德·达奇组织的到非洲的一...</p>

    re_movie = r'<li><a class=\"mod_poster_130\" href=\"(.+?)\" target=\"_blank\" title=\"(.+?)\"><img.+?</li>'
    p = re.compile(re_movie, re.DOTALL)
    movies = p.findall(html)
    if movies:
        # pymongo.Connection was removed in pymongo 3.0; MongoClient replaces it
        conn = pymongo.MongoClient('localhost', 27017)
        movie_db = conn.dianying
        playlinks = movie_db.playlinks
        #print movies
        for movie in movies:
            #print movie
            NUM += 1    # count each movie once
            print "%s : %d" % ("=" * 70, NUM)
            values = dict(
                movie_title = movie[1],
                movie_url   = movie[0],
                movie_site  = m_site,
                movie_type  = m_type
                )
            print values
            playlinks.insert(values)
            print "_" * 70
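
To try the movie regular expression on its own, the extraction can be separated from the database write. The sketch below (Python 3) returns plain dictionaries that could later be passed to MongoDB. The function name extract_movies, the sample markup, and the example.com URLs are hypothetical; only the pattern and the field names come from getmovie above.

```python
import re

# Same pattern as getmovie: capture the link and the title of each poster item.
MOVIE_RE = re.compile(
    r'<li><a class="mod_poster_130" href="(.+?)" target="_blank" '
    r'title="(.+?)"><img.+?</li>',
    re.DOTALL,
)


def extract_movies(block_html, site, category):
    """Return one record per movie found in a list block."""
    return [
        {
            "movie_title": title,
            "movie_url": url,
            "movie_site": site,
            "movie_type": category,
        }
        for url, title in MOVIE_RE.findall(block_html)
    ]


sample_block = (
    '<ul class="mod_list_pic_130">'
    '<li><a class="mod_poster_130" href="http://example.com/a.html" '
    'target="_blank" title="Movie A"><img src="a.jpg"/></a></li>'
    '<li><a class="mod_poster_130" href="http://example.com/b.html" '
    'target="_blank" title="Movie B"><img src="b.jpg"/></a></li>'
    '</ul>'
)
for record in extract_movies(sample_block, "qq", "动作"):
    print(record["movie_title"], record["movie_url"])
```

Passing the site and category as arguments, rather than reading globals, keeps the function easy to test and makes it explicit which category each record belongs to.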

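To sum up: a crawler works by studying the regular structure of a site's pages, isolating the HTML blocks that hold the information of interest, and then using regular expressions to pull the wanted fields out of those blocks.

The final step is a search or display front end. It simply queries the stored records from the database and presents them on the web, and you can build it in PHP, Python, Ruby, Node.js, or whichever language you prefer.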
