
Python Crawler Pragmatic Series II

By 白熊花田(http://blog.csdn.net/whiterbear)

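Notes:

In the previous post, we learned how to download a web page and run some simple analysis on it.

Goal for this post:

Download the page of one company listed on Ganji (赶集网), save it to a text file, then extract the useful company information from that page and store it in Excel. (Note: this installment is harder than the previous one.)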

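Downloading the page:

Reuse the download code from the previous post and set the initial url to "http://bj.ganji.com/fuwu_dian/354461215x/" (at the time of writing, this is the first company in the first column on Ganji). Running it produces file.txt, a text file of about 65 KB that holds the company's page.

Code: omitted.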
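Since the download code is omitted here, the following is only a minimal sketch of what such a step might look like. It is written for Python 3 with the standard library, whereas the series itself uses Python 2, and the function name download and its parameters are assumptions made for illustration rather than the code from the previous post.

```python
import urllib.request


def download(url, path="file.txt", timeout=10):
    """Fetch url and save the raw response bytes to path.

    Returns the number of bytes written. The name and signature are
    hypothetical; they are not taken from the previous post.
    """
    # Read the whole response body as bytes (no decoding at this stage)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()

    # Write the bytes unchanged so the parser can decide the encoding later
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Example call (not executed here, it needs network access):
# download("http://bj.ganji.com/fuwu_dian/354461215x/")
```

Saving the raw bytes instead of decoded text keeps the download step free of encoding decisions, which are left to the parsing step.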

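Analyzing the page:

This time the goal is to extract the information in the "contact the shop owner" (联系店主) module of the page at the url above: company name, service features, services offered, and so on, eight items in total (the working-hours item is skipped). As shown in the figure below:

Because the page is fairly complex, matching the whole page with regular expressions alone is hard (my skills are limited, and I gave up that approach after finding only about half of the data). So we turn to a fancier tool, BeautifulSoup. To learn it, see these posts: "Parsing HTML with BeautifulSoup" and "Searching HTML with Soup".

BeautifulSoup parses the whole page into a document tree. We can then reach each element by following the structure of the HTML tree, which is much easier than using regular expressions alone.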
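To make the tree walk concrete, here is a small sketch on a hand-written HTML fragment. The fragment is an assumption that only imitates the nesting of the real page (the ids wrapper and dzcontactus and the class con), and the company data in it is made up. It uses Python 3 and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the nesting of the target page
html = """
<html><body>
<div id="wrapper">
  <div class="clearfix">
    <div id="dzcontactus">
      <div class="con">
        <ul>
          <li><h1>Example Co.</h1></li>
          <li><p>Fast and reliable</p></li>
          <li><a href="#">Moving</a><a href="#">Cleaning</a></li>
        </ul>
      </div>
    </div>
  </div>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk down the tree one level at a time, narrowing the search scope
wrapper = soup.find(id="wrapper")
contact = wrapper.find(id="dzcontactus")
con = contact.find(attrs={"class": "con"})
items = con.find("ul").find_all("li")

# Each li holds one field; pull the text out of the tag that wraps it
company_name = items[0].find("h1").get_text(strip=True)
feature = items[1].find("p").get_text(strip=True)
services = [a.get_text(strip=True) for a in items[2].find_all("a")]

print(company_name, feature, services)
```

Searching by id first and only then by class and tag keeps each lookup local, so unrelated li elements elsewhere on the page do not interfere.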

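To store the extracted information in Excel we use xlwt (an extension library for writing Excel files). To learn about reading and writing Excel, see this post: "Reading and writing Excel with Python".

Code: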


# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
import xlwt
import sys

# Python 2 only: make utf-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')


def analysis():
    '''
    Parse the page source and extract the company information.
    '''
    # Open the file, read it into lines, then close the file object
    f = open("file.txt", 'r')
    lines = f.readlines()
    f.close()

    # Build a BeautifulSoup parse tree and walk it in this order:
    # soup-->body-->(div with id wrapper)-->(div with class clearfix)
    # -->(div with id dzcontactus)-->(div with class con)-->ul-->(each li under ul)
    soup = BeautifulSoup(''.join(lines))
    body = soup.body  # body2 = soup.find('body')
    wrapper = soup.find(id="wrapper")
    clearfix = wrapper.find_all(attrs={'class': 'clearfix'})[6]
    dzcontactus = clearfix.find(id="dzcontactus")
    con = dzcontactus.find(attrs={'class': 'con'})
    ul = con.find('ul')
    li = ul.find_all('li')

    # Holds all the information for one company. A dict allows access by key;
    # a list would also work.
    record = {}

    # Company name
    companyName = li[1].find('h1').contents[0]
    record['companyName'] = companyName

    # Service features
    serviceFeature = li[2].find('p').contents[0]
    record['serviceFeature'] = serviceFeature

    # Services offered
    serviceProvider = []
    serviceProviderResultSet = li[3].find_all('a')
    for service in serviceProviderResultSet:
        serviceProvider.append(service.contents[0])
    record['serviceProvider'] = serviceProvider

    # Service area
    serviceScope = []
    serviceScopeResultSet = li[4].find_all('a')
    for scope in serviceScopeResultSet:
        serviceScope.append(scope.contents[0])
    record['serviceScope'] = serviceScope

    # Contact person
    contacts = li[5].find('p').contents[0]
    contacts = str(contacts).strip().encode("utf-8")
    record['contacts'] = contacts

    # Business address: strip the HTML tags and keep the text
    addressResultSet = li[6].find('p')
    re_h = re.compile('</?\w+[^>]*>')  # matches an HTML tag
    address = re_h.sub('', str(addressResultSet))
    record['address'] = address.encode("utf-8")

    # Business QQ number: the first run of 5 to 10 digits in the li
    qqNumResultSet = li[8]
    qq_regex = '(\d{5,10})'
    qqNum = re.search(qq_regex, str(qqNumResultSet))
    qqNum = qqNum.group()
    record['qqNum'] = qqNum

    # Contact phone
    phoneNum = li[9].find('p').contents[0]
    phoneNum = int(phoneNum)
    record['phoneNum'] = phoneNum

    # Company website
    companySite = li[10].find('a').contents[0]
    record['companySite'] = companySite

    return record


def writeToExcel(record):
    wb = xlwt.Workbook()
    ws = wb.add_sheet('CompanyInfoSheet')

    # Write the company name
    companyName = record['companyName']
    ws.write(0, 0, companyName)

    # Write the service features
    serviceFeature = record['serviceFeature']
    ws.write(0, 1, serviceFeature)

    # Write the service area
    serviceScope = ','.join(record['serviceScope'])
    ws.write(0, 2, serviceScope)

    # Write the contact person
    contacts = record['contacts']
    ws.write(0, 3, contacts.decode("utf-8"))

    # Write the business address
    address = record['address']
    ws.write(0, 4, address.decode("utf-8"))

    # Write the QQ number
    qqNum = record['qqNum']
    ws.write(0, 5, qqNum)

    # Write the contact phone
    phoneNum = record['phoneNum']
    phoneNum = str(phoneNum).encode("utf-8")
    ws.write(0, 6, phoneNum.decode("utf-8"))

    # Write the website
    companySite = record['companySite']
    ws.write(0, 7, companySite)
    wb.save('xinrui.xls')


if __name__ == '__main__':
    writeToExcel(analysis())

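Screenshot of the resulting Excel file:

Reflections:

I ran into a lot of problems along the way, and encoding was the worst of them. I kept getting "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 108: ordinal not in range(128)". I tried quite a few suggested fixes and none of them worked. In the end I had to fall back on the string methods encode and decode, which finally got rid of the problem with storing Chinese text.

I hear that Python 3.x separates unicode str from bytes, and that the default encoding is no longer ascii (it seems about time to move to 3).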
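As a rough illustration of that str/bytes split, the sketch below reproduces the same kind of error in Python 3 by decoding UTF-8 bytes as ascii on purpose. The sample text is an assumption chosen because its first UTF-8 byte is 0xe5, the byte named in the error above; it is not the data from the scraped page.

```python
# str holds Unicode code points; bytes holds the encoded form
text = "北京"
data = text.encode("utf-8")  # explicit encode: str -> bytes

# Decoding non-ascii bytes with the ascii codec fails, which is what
# Python 2 did implicitly whenever it mixed str and unicode
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the codec that was used for encoding round-trips cleanly
roundtrip = data.decode("utf-8")

print(hex(data[0]))
print(error)
print(roundtrip == text)
```

In Python 3 the conversion never happens implicitly, so a mismatch shows up at the explicit encode or decode call instead of somewhere deep inside a write.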

To be continued.
