Life is short; let Python be your song.

Simple crawler architecture

Simple crawler architecture: run flow
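The run flow can be sketched as a single loop that ties the components together. This is a minimal sketch, not the original project's scheduler: the function name `crawl`, the callables `download` and `parse`, and the in-memory `fake_site` are all assumptions made for illustration.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=1000):
    """Crawl breadth-first from root_url and return the collected data items.

    download(url) -> page content, or None on failure   (hypothetical callable)
    parse(url, content) -> (new_urls, data)             (hypothetical callable)
    """
    pending = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}             # every URL ever queued, to avoid repeats
    results = []

    while pending and len(results) < max_pages:
        url = pending.popleft()
        content = download(url)
        if content is None:       # download failed: skip this page
            continue
        new_urls, data = parse(url, content)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
        if data is not None:
            results.append(data)
    return results


# Usage with an in-memory "site" instead of real HTTP requests.
# Each entry maps a URL to (links found on the page, data extracted from it).
fake_site = {
    "/a": (["/b", "/c"], "page A"),
    "/b": (["/a", "/c"], "page B"),
    "/c": ([], "page C"),
}
pages = crawl(
    "/a",
    download=lambda url: fake_site.get(url),
    parse=lambda url, content: content,
)
print(pages)
```

Passing the downloader and parser in as arguments keeps the loop independent of how pages are fetched or parsed, so each part can be tested on its own.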

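URL manager and how to implement it

URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled.
Implementation options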
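One straightforward option is to keep both collections in memory as Python sets. The sketch below assumes that approach; the class name `UrlManager` and its method names are hypothetical and are not taken from the original text.

```python
class UrlManager:
    """Track URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or has been seen before."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs; None or an empty collection is ignored."""
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["/item/Python", "/item/Python", "/item/Guido"])
first = manager.get_new_url()
manager.add_new_url(first)      # already crawled, so it is not queued again
print(len(manager.new_urls), len(manager.old_urls))
```

Sets give constant-time membership checks and drop duplicates automatically, but `set.pop()` returns an arbitrary element, so this sketch does not guarantee any crawl order.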

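Web page downloader and the urllib2 module

Web page downloader: a tool that downloads the page behind a URL from the internet to the local machine.
Web page downloaders available in Python

urllib2: Python's official basic module
requests: a more powerful third-party package

Three ways to download a page with urllib2

The code below targets Python 2.

Simplest method

url -> urllib2.urlopen(url)

Code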

import urllib2

# Send the request directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print response.getcode()

# Read the page content
cont = response.read()

Method 2: add data and an HTTP header

urllib2 method 2

Code

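import urllib
import urllib2

# URL to request (example value)
url = 'http://www.baidu.com'

# Create a Request object
request = urllib2.Request(url)

# Add data: add_data() takes a single URL-encoded string
# and turns the request into a POST
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the response
response = urllib2.urlopen(request)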

Method 3: add handlers for special scenarios

urllib2 method 3

Code

import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Fetch the page with the cookie-aware urllib2
response = urllib2.urlopen("http://www.baidu.com/")

Python 3 code

Python version: 3.7

# coding:utf8

from urllib import request
import http.cookiejar

url = "http://www.baidu.com"

print('Method 1')
response1 = request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print('Method 2')
req = request.Request(url)
req.add_header('user-agent', 'Mozilla/5.0')
response2 = request.urlopen(req)
print(response2.getcode())
print(len(response2.read()))

print('Method 3')
cj = http.cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))
request.install_opener(opener)
response3 = request.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())

Note: in Python 3, use urllib.request in place of urllib2.
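The snippets above assume every request succeeds. A crawler that visits many pages usually wants a downloader that reports failure instead of crashing. The following is a minimal sketch under assumptions: the function name `download`, the 10-second default timeout, and the choice to return `None` on failure are illustrative and not part of the original code.

```python
from urllib import request


def download(url, timeout=10):
    """Return the page body as bytes, or None if the download fails."""
    if url is None:
        return None
    # Send a browser-like User-Agent, as in the second method above.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with request.urlopen(req, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Calling `download("http://www.baidu.com")` would return the raw page bytes when the request succeeds. Returning `None` rather than raising lets the calling loop skip a bad page and move on to the next URL.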

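Web page parser and the third-party Beautiful Soup module

Web page parser: a tool that extracts valuable data from web pages.
Web page parsers available in Python
Structured parsing: the DOM (Document Object Model) tree

Installing Beautiful Soup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Install: pip install beautifulsoup4

Alternatively: pip install bs4 (installs bs4)

Beautiful Soup syntax

Beautiful Soup syntax, e.g.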

<a href='123.html' class='article_link'> Python </a>

Node name: a; node attribute: href='123.html'; node attribute: class='article_link'; node content: Python
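To make the three parts of a node concrete without installing anything, the sketch below pulls the name, attributes, and content out of the same tag using the standard library's `html.parser` rather than Beautiful Soup. The class name `NodeInspector` is hypothetical, and attaching text to the most recently opened tag is a simplification that only holds for flat markup like this example.

```python
from html.parser import HTMLParser


class NodeInspector(HTMLParser):
    """Record the name, attributes and text of each element encountered."""

    def __init__(self):
        super().__init__()
        self.nodes = []          # one dict per opened tag

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.nodes.append({"name": tag, "attrs": dict(attrs), "text": ""})

    def handle_data(self, data):
        if self.nodes:           # simplification: text goes to the last opened tag
            self.nodes[-1]["text"] += data


inspector = NodeInspector()
inspector.feed("<a href='123.html' class='article_link'> Python </a>")
node = inspector.nodes[0]
print(node["name"])             # a
print(node["attrs"]["href"])    # 123.html
print(node["attrs"]["class"])   # article_link
print(node["text"].strip())     # Python
```

Beautiful Soup exposes the same three pieces of information through a friendlier interface and builds a full tree, which is why the rest of this post uses it.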

Code

Creating a BeautifulSoup object

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the HTML document
soup = BeautifulSoup(
    html_doc,              # the HTML document
    'html.parser',         # the HTML parser to use
    from_encoding='utf8'   # document encoding (ignored if html_doc is already a str)
)

Searching for nodes (find_all, find)

import re

# Method: find_all(name, attrs, string)

# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link matches the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes with class abc and text Python
soup.find_all('div', class_='abc', string='Python')

Accessing node information

# Given the node: <a href='1.html'>Python</a>

# Get the tag name of the node that was found
node.name

# Get the href attribute of the a node
node['href']

# Get the link text of the a node
node.get_text()

Python 3 test

# coding:utf8
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the link for lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

Result:

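Crawling data from 1000 Baidu Baike pages

Analysis

Define the target -> analyze the target (URL format, data format, page encoding) -> write the code -> run the crawler

Target: the pages of entries related to the Baidu Baike Python entry, collecting each page's title and summary

Start page: http://baike.baidu.com/item/Python/

URL format:

Entry page URL: /item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6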
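Because entry links are relative paths of the form `/item/...`, the crawler has to both recognize them and join them with the page URL before queuing them. The sketch below shows one way to do that with the standard library only (a plain regular expression instead of Beautiful Soup). The pattern, the function name `extract_item_urls`, and the sample HTML are assumptions for illustration, and the real page markup may differ.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: an href that starts with /item/ and runs up to the closing quote.
ITEM_LINK = re.compile(r'href="(/item/[^"\s]+)"')


def extract_item_urls(page_url, html_text):
    """Return the set of absolute URLs for all /item/... links in html_text."""
    return {urljoin(page_url, path) for path in ITEM_LINK.findall(html_text)}


# Hypothetical page fragment: two entry links and one unrelated link.
sample_html = '''
<a href="/item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6">free software</a>
<a href="/item/Guido">Guido</a>
<a href="/help/about">about</a>
'''
urls = extract_item_urls("http://baike.baidu.com/item/Python/", sample_html)
for url in sorted(urls):
    print(url)
```

`urljoin` resolves a path that starts with `/` against the site root, so the result keeps the scheme and host of the page the link was found on.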

Data format:

Title: ***

Summary: ***

Page encoding: UTF-8

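Common pitfalls and fixes

The URLs change over time, so the regular expression or the start URL may need updating to match the current URLs.
If only one record gets crawled, note that some pages have no summary; add a None check: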

summary_node = soup.find('div', class_='lemma-summary')
# Add a None check (this fragment sits inside the parser's data-extraction function)
if summary_node is None:
    return

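If outputer.html is empty, check the outputer code for mistakes. This usually means data is empty; stepping through with breakpoints helps locate the problem. In my case, the _get_new_data function in html_parser had lemma-summary misspelled as lemmasummary.
If outputer.html shows garbled characters, check the encoding. In Python 3, pass the encoding to open().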

        fout = open('output.html', 'w', encoding='utf-8')

        fout.write("<html>")
        fout.write("<body>")
        fout.write("<table>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>%s</td>" % data['summary'])
            fout.write("</tr>")
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()
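As a variation on the output step, the sketch below wraps the same table-writing logic in a standalone function. It is an illustration under assumptions, not the project's outputer: the function name `write_html` and its parameters are hypothetical, and the `with` block, the `html.escape` calls, and the `<meta charset>` tag are additions. Escaping keeps characters such as `<` in a crawled title from breaking the table, and the charset declaration tells the browser which encoding to use when it opens the file.

```python
import html


def write_html(datas, path="output.html"):
    """Write crawled records (dicts with url, title, summary) to an HTML table."""
    # Passing encoding to open() avoids garbled characters in Python 3.
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("<html><head><meta charset='utf-8'></head><body><table>")
        for data in datas:
            fout.write("<tr>")
            for key in ("url", "title", "summary"):
                # Escape &, <, > and quotes so cell text cannot break the markup.
                fout.write("<td>%s</td>" % html.escape(data[key]))
            fout.write("</tr>")
        fout.write("</table></body></html>")
```

The `with` block closes the file even if a record is missing a key and the loop raises, which the explicit `fout.close()` version does not guarantee.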
# Results
- In testing, crawling 1000 pages took 12 minutes (fairly slow), and every page was crawled successfully.
- Console
- ![Crawler data](https://github.com/hubojing/BlogImages/blob/master/Python%E5%BC%80%E5%8F%91%E7%AE%80%E5%8D%95%E7%88%AC%E8%99%AB%E2%80%94%E2%80%94%E7%88%AC%E8%99%AB%E7%BB%93%E6%9E%9C.png?raw=true)
- Output HTML
- ![html](https://github.com/hubojing/BlogImages/blob/master/Python%E5%BC%80%E5%8F%91%E7%AE%80%E5%8D%95%E7%88%AC%E8%99%AB%E2%80%94%E2%80%94html.png?raw=true)

# Download the full code
https://github.com/hubojing/PythonSpider

---

Thanks to the 慕课 (MOOC) platform.