Life is short, let Python be the song.
Simple Crawler Architecture
Simple crawler architecture: the run flow
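The run-flow diagram isn't reproduced here, but the loop it describes is: the scheduler takes a URL from the URL manager, the downloader fetches the page, the parser extracts new URLs and data, and the outputer collects the data until no new URLs remain. Below is a minimal, self-contained sketch of that loop, using plain sets as the URL manager, urllib.request as the downloader, and BeautifulSoup (introduced later in this post) as the parser; the names and limits are illustrative, not the course's exact code.

```python
from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def craw(root_url, max_pages=10):
    new_urls, old_urls = {root_url}, set()        # URL manager: two sets
    results = []                                  # outputer: collected data
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()                      # take one URL to crawl
        old_urls.add(url)
        html = request.urlopen(url).read()        # downloader
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):   # parser: collect new URLs
            link = urljoin(url, a['href'])
            if link not in old_urls:
                new_urls.add(link)
        title = soup.title.string if soup.title else ''
        results.append((url, title))              # parser: collect new data
    return results
```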
The URL Manager and How to Implement It
URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled.
Implementation approaches
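The implementation options aren't shown here, but the simplest approach keeps both collections in memory as Python sets. A minimal sketch, assuming an illustrative class name UrlManager (not necessarily the name used in the downloadable code):

```python
class UrlManager(object):
    """Keeps the to-crawl and already-crawled URL sets in memory."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only accept a URL that is in neither set yet
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the to-crawl set to the crawled set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```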
The Web Page Downloader and the urllib2 Module
urllib2: Python's built-in official module
requests: a more powerful third-party package
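For comparison, here is a hedged sketch of the same kind of download done with requests (requests itself is not covered further in this post; the URL and header below are just examples):

```python
import requests

# Fetch a page with requests instead of urllib2
response = requests.get('http://www.baidu.com',
                        headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)   # HTTP status code, 200 on success
print(len(response.text))     # decoded page content
```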
Three ways to download a page with urllib2
The code below targets Python 2.
Method 1: the simplest way
url -> urllib2.urlopen(url)
Code
```python
import urllib2

# Request the page directly with urlopen
response = urllib2.urlopen('http://www.baidu.com')

# Check the status code (200 means the request succeeded)
print response.getcode()

# Read the page content
cont = response.read()
```
Method 2: add data and an HTTP header
Code
```python
import urllib2

# Wrap the URL in a Request object so data and headers can be attached
request = urllib2.Request(url)

# Add form data
request.add_data('a', '1')

# Add an HTTP header to masquerade as a browser
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request
response = urllib2.urlopen(request)
```
Method 3: add handlers for special scenarios
Code
```python
import urllib2, cookielib

# Create a container for cookies
cj = cookielib.CookieJar()

# Build an opener that handles cookies and install it globally
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

response = urllib2.urlopen("http://www.baidu.com/")
```
Python 3 code
Python version: 3.7
```python
from urllib import request
import http.cookiejar

url = "http://www.baidu.com"

print('Method 1')                      # plain urlopen
response1 = request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print('Method 2')                      # Request object with a header
req = request.Request(url)
req.add_header('user-agent', 'Mozilla/5.0')
response2 = request.urlopen(req)
print(response2.getcode())
print(len(response2.read()))

print('Method 3')                      # opener with cookie handling
cj = http.cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))
request.install_opener(opener)
response3 = request.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())
```
Note: in Python 3, urllib2 is replaced by urllib.request.
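If one script has to run under both Python 2 and Python 3, a common workaround (an assumption here, not part of the course code) is to alias the import:

```python
try:
    from urllib import request as urllib2   # Python 3: urllib.request
except ImportError:
    import urllib2                           # Python 2: the original module
```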
The Web Page Parser and the BeautifulSoup Third-Party Module
Installing Beautiful Soup
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Install: pip install beautifulsoup4
or: pip install bs4 (the bs4 package simply pulls in beautifulsoup4)
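A quick way to confirm the installation worked (bs4 exposes the installed version):

```python
import bs4
print(bs4.__version__)   # prints something like 4.x.y when installed correctly
```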
Beautiful Soup Syntax
e.g.
```html
<a href='123.html' class='article_link'> Python </a>
```
Node name: a
Node attribute: href='123.html'
Node attribute: class='article_link'
Node content: Python
Code
Create a BeautifulSoup object
```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML document
soup = BeautifulSoup(
    html_doc,               # the HTML document string
    'html.parser',          # the HTML parser to use
    from_encoding='utf8'    # encoding of the document (for byte input)
)
```
Search for nodes (find_all, find)
```python
import re

# Find all nodes with tag name 'a'
soup.find_all('a')

# Find all 'a' nodes whose href is /view/123.htm
soup.find_all('a', href='/view/123.htm')

# Find all 'a' nodes whose href matches a regular expression
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all 'div' nodes with class 'abc' and text 'Python'
soup.find_all('div', class_='abc', string='Python')
```
Access node information
```python
# Tag name of the node
node.name

# Value of the node's href attribute
node['href']

# Text content of the node
node.get_text()
```
Testing under Python 3
```python
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the link for lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the title paragraph')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())
```
Result:
Crawling Data from 1000 Baidu Baike Pages
Analysis
Determine the target -> analyze the target (URL format, data format, page encoding) -> write the code -> run the crawler
Target: the Baidu Baike "Python" entry and the entry pages it links to; extract each page's title and summary.
Entry page: http://baike.baidu.com/item/Python/
URL format:
Entry page URLs look like /item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6
Data format (a parsing sketch based on these snippets follows the list):
Title:
```html
<dd class="lemmaWgt-lemmaTitle-title">
    <h1>***</h1>
</dd>
```
Summary:
```html
<div class="lemma-summary" label-module="lemmaSummary">***</div>
```
Page encoding: UTF-8
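Based on the title and summary markup above, here is a minimal sketch of the data-extraction step. The function name echoes the _get_new_data method mentioned in the pitfalls below; everything else (argument names, the usage lines) is illustrative:

```python
from urllib import request
from bs4 import BeautifulSoup

def get_new_data(page_url, soup):
    """Extract the entry title and summary from a Baidu Baike page."""
    data = {'url': page_url}

    # Title: <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
    title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
    data['title'] = title_node.get_text()

    # Summary: <div class="lemma-summary" label-module="lemmaSummary">...</div>
    summary_node = soup.find('div', class_='lemma-summary')
    data['summary'] = summary_node.get_text() if summary_node else ''

    return data

# Usage sketch: download the entry page (UTF-8 encoded) and parse it
url = 'http://baike.baidu.com/item/Python/'
html = request.urlopen(url).read().decode('utf-8')
print(get_new_data(url, BeautifulSoup(html, 'html.parser')))
```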
Common pitfalls and how to fix them
The URL scheme changes from time to time; when reusing this code, adjust the regular expression or the root entry URL to match the current site (see the link-extraction sketch below).
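For reference, link extraction in this kind of crawler usually comes down to one regular expression over the hrefs, so that pattern is the thing to update. A sketch assuming the /item/ URL format shown above (the function name is illustrative):

```python
import re
from urllib.parse import urljoin

def get_new_urls(page_url, soup):
    """Collect entry links that match the current Baike URL scheme."""
    new_urls = set()
    # Adjust this pattern whenever Baidu Baike changes its entry URL format
    for link in soup.find_all('a', href=re.compile(r'/item/')):
        new_urls.add(urljoin(page_url, link['href']))
    return new_urls
```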
If only a single record gets crawled, note that some entry pages have no summary; add:
```python
summary_node = soup.find('div', class_='lemma-summary')
# Some entry pages have no summary; skip them instead of crashing
if summary_node is None:
    return
```
If output.html comes out empty, check the outputer code for mistakes. This usually means data is empty; set a breakpoint and step through to locate the problem. In my case the bug was in the _get_new_data function of html_parser, where lemma-summary had been typed as lemmasummary.
If output.html contains garbled characters, check the encoding. In Python 3, pass the encoding to open():
```python
# Specify the encoding when opening the output file (Python 3)
fout = open('output.html', 'w', encoding='utf-8')
fout.write("<html>")
fout.write("<body>")
fout.write("<table>")

for data in self.datas:
    fout.write("<tr>")
    fout.write("<td>%s</td>" % data['url'])
    fout.write("<td>%s</td>" % data['title'])
    fout.write("<td>%s</td>" % data['summary'])
    fout.write("</tr>")

fout.write("</table>")
fout.write("</body>")
fout.write("</html>")
fout.close()
```
Results
In testing, crawling 1000 pages took about 12 minutes (fairly slow), and all pages were crawled successfully.
Console output
The output HTML
Full code download:
https://github.com/hubojing/PythonSpider
Thanks to the imooc platform.