The Python urllib Library

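urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt parsing module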


urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

decode("utf-8"): converts the bytes returned by read() into a string, using UTF-8 encoding


import urllib.request

# Open the page, then decode the body from UTF-8 bytes into text
html = urllib.request.urlopen("http://www.baidu.com/")
print(html.read().decode("utf-8"))
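read() returns raw bytes, which is why the example decodes before printing. The sketch below shows the same two steps without touching the network. It assumes a `data:` URL (which urlopen accepts) as a stand-in for a real page; the URL and its sample text are illustrative and not part of the original example.

```python
import urllib.request

# A data: URL stands in for a real web page so the sketch runs offline.
# The percent-encoded payload is the UTF-8 encoding of "你好" (assumed sample text).
url = "data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD"

with urllib.request.urlopen(url) as response:
    raw = response.read()      # bytes, exactly as received

text = raw.decode("utf-8")     # bytes -> str

print(type(raw).__name__, type(text).__name__)
print(text)
```

Decoding with the wrong encoding either raises UnicodeDecodeError or produces garbled text, so the encoding passed to decode() should match the one the page actually uses.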

Sending a POST request, using http://httpbin.org (a site for testing HTTP requests):


import urllib.parse
import urllib.request

# urlencode() builds the form string; encoding="utf-8" sets the encoding used to turn it into bytes
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
# Passing data makes urlopen() send a POST instead of a GET
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())

Result:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "113.105.12.153, 113.105.12.153", \n  "url": "https://httpbin.org/post"\n}\n'
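The Content-Length of 10 in the result is simply the length of the encoded form body. As a small offline sketch, the snippet below builds the same payload and checks which HTTP method urllib would choose. Nothing is sent; the variable names `get_req` and `post_req` are illustrative.

```python
import urllib.parse
import urllib.request

form = {"word": "hello"}

# urlencode() returns a str; encode() turns it into the bytes that urlopen() expects
data = urllib.parse.urlencode(form).encode("utf-8")

# Building a Request does not send anything, so it can be inspected offline
get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=data)

print(data)                    # b'word=hello'
print(len(data))               # 10
print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
```

The presence of data is what switches the default method from GET to POST.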

Setting a timeout


import socket
import urllib.error
import urllib.request

try:
    # 0.1 seconds is deliberately too short, so the request times out
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    # The underlying cause of the failure is stored in e.reason
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")

Result:

TIME OUT
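A timeout does not always arrive wrapped in URLError: one that happens while the body is being read can surface as a bare socket.timeout. The sketch below is one way to treat both cases alike. `is_timeout` and `fetch` are hypothetical helper names, and the demonstration uses hand-built exceptions rather than a real request.

```python
import socket
import urllib.error
import urllib.request

def is_timeout(exc):
    """Return True for a bare socket.timeout or a URLError caused by one."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout)

def fetch(url, timeout):
    """Return the body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as e:
        if is_timeout(e):
            return None
        raise  # any other failure is passed on to the caller

# Offline demonstration with constructed exceptions
print(is_timeout(socket.timeout("timed out")))
print(is_timeout(urllib.error.URLError(socket.timeout("timed out"))))
print(is_timeout(urllib.error.URLError("connection refused")))
```

Calling fetch("http://httpbin.org/get", timeout=0.1) would then return None on a timeout instead of raising.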

Response

Response type


import urllib.request

html = urllib.request.urlopen("http://www.baidu.com/")
# Printing the response object itself shows its type
print(html)

Result:

<http.client.HTTPResponse object at 0x0000024B3B676080>

Status code and response headers


import urllib.request

html = urllib.request.urlopen("http://www.baidu.com/")
print(html.status)               # HTTP status code
print(html.getheaders())         # all headers as a list of (name, value) tuples
print(html.getheader('server'))  # one header; the name is matched case-insensitively

Result:

200
[('Bdpagetype', '1'), ('Bdqid', '0x8afac3a8000c1dba'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+8ec69d29edd1ec53e9faabc8051e2fd7'), ('Date', 'Sun, 17 Mar 2019 07:12:33 GMT'), ('Expires', 'Sun, 17 Mar 2019 07:11:47 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=5F61E86C65F2F415AE669543617A67B2:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=5F61E86C65F2F415AE669543617A67B2; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1552806753; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'delPer=0; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1438_21118_28558_28607_28584_26350_28604_28606; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1
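The same headers are also available as a message object through the response's headers attribute, which is handy for finding out which charset to decode with. The sketch below is a minimal illustration of that idea. `read_text` is a hypothetical helper, and a `data:` URL stands in for a real server so the code runs offline.

```python
import urllib.request

def read_text(response, default="utf-8"):
    """Decode the body with the charset named in Content-Type, else `default`."""
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset)

# Assumed sample: a data: URL that declares its own charset
url = "data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD"

with urllib.request.urlopen(url) as response:
    content_type = response.headers.get("content-type")  # lookup ignores case
    text = read_text(response)

print(content_type)
print(text)
```

Falling back to a default matters because many servers, like the Baidu response above, send a Content-Type without any charset parameter.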

read() returns the content of the response body:

html.read()
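One detail worth knowing is that read() consumes the body, so a second call returns nothing. A short offline sketch, again assuming a `data:` URL in place of a real page:

```python
import urllib.request

url = "data:text/plain,hello"   # assumed stand-in for a real page

with urllib.request.urlopen(url) as response:
    first = response.read()     # the whole body
    second = response.read()    # nothing is left to read

print(first)    # b'hello'
print(second)   # b''
```

If the body is needed more than once, store the result of the first read() in a variable and reuse it.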

Request

request.Request(url=url, data=data, headers=headers, method="POST")
url: the address to request
data: the form data to submit
headers: the request headers
method: the HTTP method to use


from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36',
    'Host': 'httpbin.org'
}
# Form fields to submit
form = {
    'name': 'Germey'
}
# URL-encode the form and convert it to bytes
data = bytes(parse.urlencode(form), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Result:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36"
  },
  "json": null,
  "origin": "113.105.12.153, 113.105.12.153",
  "url": "https://httpbin.org/post"
}
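A Request object can be inspected before it is sent, which helps when checking what will go over the wire. The sketch below builds a similar request offline. The shortened User-Agent string and the Referer value are placeholders, not values from the original example.

```python
from urllib import parse, request

headers = {
    "User-Agent": "Mozilla/5.0 (example)",   # placeholder value
    "Host": "httpbin.org",
}
data = parse.urlencode({"name": "Germey"}).encode("utf-8")

req = request.Request("http://httpbin.org/post", data=data,
                      headers=headers, method="POST")

# Headers can also be added after the object is created
req.add_header("Referer", "http://httpbin.org/")

print(req.get_method())               # POST
print(req.full_url)                   # http://httpbin.org/post
print(req.data, len(req.data))        # b'name=Germey' 11
# Request stores header names with only the first letter capitalized
print(req.get_header("User-agent"))
print(req.has_header("Referer"))
```

The 11 bytes of data match the Content-Length reported by httpbin in the result above.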

Handler

Proxies

Method 1:


import urllib.request

# Take care: the key is just the scheme name ('http' or 'https'), with no punctuation after it
proxy_handler = urllib.request.ProxyHandler({
    'https': '219.131.240.200:9797'
})
opener = urllib.request.build_opener(proxy_handler, urllib.request.HTTPHandler)
response = opener.open("https://httpbin.org/get")
print(response.read())

Result:

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "219.131.240.200, 219.131.240.200", \n  "url": "https://httpbin.org/get"\n}\n'

Method 2:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'https': '219.131.240.200:9797'
})
opener = urllib.request.build_opener(proxy_handler)
# install_opener() makes this opener the default used by urlopen()
urllib.request.install_opener(opener)
response = urllib.request.urlopen("https://httpbin.org/get")
print(response.read())

Result:

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "219.131.240.200, 219.131.240.200", \n  "url": "https://httpbin.org/get"\n}\n'

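Note: each proxy entry applies only to its own scheme. The 'http' entry is used only for URLs that start with http://, and the 'https' entry only for URLs that start with https://.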
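The per-scheme behaviour can be seen on the handler itself: ProxyHandler gets one `<scheme>_open` method for each key in the dictionary it was given. The sketch below checks this offline. The proxy address is the one used in the examples above and may no longer be reachable; no request is actually sent.

```python
import urllib.request

PROXY = "219.131.240.200:9797"   # address from the examples above; may no longer work

https_only = urllib.request.ProxyHandler({"https": PROXY})
both = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})

# One <scheme>_open method exists per key in the dictionary
print(hasattr(https_only, "http_open"), hasattr(https_only, "https_open"))
print(hasattr(both, "http_open"), hasattr(both, "https_open"))

# An opener built from `both` would proxy http:// and https:// URLs alike
opener = urllib.request.build_opener(both)
```

To proxy both kinds of URL, list both schemes in the dictionary, as `both` does here.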

Cookies

Cookies can preserve login session information.
Import http.cookiejar, the library that handles cookies.


import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()  # note the capitalization
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# The jar now holds the cookies set by the server
for item in cookie:
    print(item.name + '*' + item.value)

Result:

BAIDUID*C31837787335FED26959A1D8CCE1030F:FG=1
BIDUPSID*C31837787335FED26959A1D8CCE1030F
H_PS_PSSID*1450_21085_28557_28608_28584_26350_28603_28606
PSTM*1552816088
delPer*0
BDSVRTM*0
BD_HOME*0
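To see how the jar gets filled without contacting a server, the sketch below feeds it a hand-made response. `FakeResponse`, the example.com URL, and the cookie values are all assumptions for illustration; extract_cookies() is the call that HTTPCookieProcessor makes on each real response.

```python
import email.message
import http.cookiejar
import urllib.request

class FakeResponse:
    """Hypothetical stand-in for a server response; the jar only needs info()."""
    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            self._headers["Set-Cookie"] = value   # repeated assignment appends a header

    def info(self):
        return self._headers

jar = http.cookiejar.CookieJar()
req = urllib.request.Request("http://www.example.com/")
res = FakeResponse(["sid=abc123; Path=/", "theme=dark; Path=/"])

# Parse the Set-Cookie headers and store the cookies the policy accepts
jar.extract_cookies(res, req)

cookies = {item.name: item.value for item in jar}
print(cookies)
```

Each item in the jar is a Cookie object, so attributes such as item.domain and item.path are available alongside name and value.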

Saving cookies to a text file
Method 1:


import http.cookiejar, urllib.request

# Location and name of the saved file; the default location is the project directory
filename = 'C:/Users/hanson/Desktop/1/cookie.txt'
# MozillaCookieJar is a CookieJar subclass, used here because it provides save()
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Also write session cookies and cookies that have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

Result:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com TRUE    /   FALSE   3700363349  BAIDUID D3E2F4A0A280B33C6E7C5558F8A6DB34:FG=1
.baidu.com TRUE    /   FALSE   3700363349  BIDUPSID    D3E2F4A0A280B33C6E7C5558F8A6DB34
.baidu.com TRUE    /   FALSE       H_PS_PSSID  28629_1444_21119_28558_28607_28584_28603_28626_28605
.baidu.com TRUE    /   FALSE   3700363349  PSTM    1552879705
.baidu.com TRUE    /   FALSE       delPer  0
www.baidu.com FALSE   /   FALSE       BDSVRTM 0
www.baidu.com FALSE   /   FALSE       BD_HOME 0

Method 2:

import http.cookiejar, urllib.request

filename = 'C:/Users/hanson/Desktop/1/cookie1.txt'
# Same as Method 1, with MozillaCookieJar replaced by LWPCookieJar
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Result:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="AFA15173D5BB3D6F2CA1645B51A149C4:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: BIDUPSID=AFA15173D5BB3D6F2CA1645B51A149C4; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1438_21113_28558_28607_28584_28604_28625_28606; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1552880127; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Opening a URL with saved cookies
Load the cookies with the same CookieJar class that was used to save them.


import http.cookiejar, urllib.request

# Use the same CookieJar class that wrote the file
cookie = http.cookiejar.LWPCookieJar()
cookie.load('C:/Users/hanson/Desktop/1/cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
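The rule about matching classes can be checked offline with a save and load round trip. In the sketch below, `make_cookie` is a hypothetical helper that builds a cookie by hand, the temporary file path is an assumption, and the cookie name and value are borrowed from the output shown earlier.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper, for illustration only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), "cookie1.txt")

# Save in LWP format
saved = http.cookiejar.LWPCookieJar(path)
saved.set_cookie(make_cookie("BD_HOME", "0", "www.baidu.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Same class: the cookie comes back
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
names = [item.name for item in loaded]

# Different class: the file header does not match, so load() refuses the file
try:
    http.cookiejar.MozillaCookieJar().load(path)
    mismatch_rejected = False
except http.cookiejar.LoadError:
    mismatch_rejected = True

print(names, mismatch_rejected)
```

ignore_discard=True matters on both sides here, because a session cookie like this one is otherwise skipped when saving and loading.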

Exception handling:

Parent class: URLError
Subclass: HTTPError


try:
    ...  # send the request
except urllib.error.URLError as e:
    ...  # handle the failure
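Because HTTPError is a subclass of URLError, the more specific handler has to come first; otherwise the URLError clause would catch everything. A minimal sketch follows. `fetch_status` is a hypothetical helper, and the made-up URL scheme is used only to trigger a URLError without any network access.

```python
import urllib.error
import urllib.request

def fetch_status(url):
    """Return a short description of how the request went."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return "ok %d" % response.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        return "http error %d" % e.code
    except urllib.error.URLError as e:
        # No usable response at all: bad scheme, DNS failure, refused connection
        return "url error: %s" % e.reason

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(fetch_status("nosuchscheme://example"))
```

HTTPError also carries the response headers in e.headers, which can help when diagnosing why a server rejected a request.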

URL parsing
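As a brief sketch of what urllib.parse offers, the snippet below splits a URL into parts, reads its query string, encodes a dictionary as a query, and resolves a relative link. The URLs and values are assumed examples built on the httpbin.org addresses used earlier.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Split a URL into its components
parts = urlparse("http://httpbin.org/get?word=hello#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Turn a query string into a dict of lists
query = parse_qs(parts.query)
print(query)       # {'word': ['hello']}

# Turn a dict into a query string; non-ASCII text is percent-encoded as UTF-8
encoded = urlencode({"name": "Germey", "word": "你好"})
print(encoded)     # name=Germey&word=%E4%BD%A0%E5%A5%BD

# Resolve a relative link against a base URL
joined = urljoin("http://httpbin.org/get", "post")
print(joined)      # http://httpbin.org/post
```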
