Python Notes: A Checklist of Technologies for Crawler Development

Standard-library and third-party modules involved

Downloading data - urllib / requests / aiohttp / httpx.
Parsing data - re / lxml / beautifulsoup4 / pyquery.
Caching and persistence - mysqlclient / sqlalchemy / peewee / redis / pymongo.
Generating digests (signatures) - hashlib.
Serialization and compression - pickle / json / zlib.
Scheduling - multiprocessing / threading / concurrent.futures.
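As a rough illustration of how the standard-library entries in this list can fit together, here is a minimal sketch: a thread pool schedules the downloads, hashlib turns each URL into a fixed-length cache key, and pickle plus zlib pack the page before it is stored. The function names (fake_fetch, cache_key, pack, unpack, crawl) and the example.com URLs are assumptions made for this sketch, and fake_fetch only stands in for a real download so that nothing touches the network.

```python
import hashlib
import pickle
import zlib
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a real download (hypothetical; no network access)."""
    return f'<html><title>{url}</title></html>'


def cache_key(url):
    """Hex digest of the URL, usable as a fixed-length cache key."""
    return hashlib.sha256(url.encode('utf-8')).hexdigest()


def pack(page):
    """Serialize with pickle, then compress with zlib."""
    return zlib.compress(pickle.dumps(page))


def unpack(blob):
    """Reverse of pack(): decompress, then deserialize."""
    return pickle.loads(zlib.decompress(blob))


def crawl(urls, max_workers=4):
    """Fetch the URLs in a thread pool and return {cache_key: packed_page}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fake_fetch, urls))
    return {cache_key(url): pack(page) for url, page in zip(urls, pages)}


store = crawl(['https://example.com/a', 'https://example.com/b'])
for key, blob in store.items():
    print(key[:12], len(blob), unpack(blob))
```

In a real crawler the dictionary would typically be replaced by one of the stores listed above (for example redis), and pickle should only be used to load data the crawler wrote itself.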

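Fetching pages with requests

In the previous lesson's code we used the third-party library requests to fetch pages; below we look at how to use requests in more detail.

GET and POST requests.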

import requests

resp = requests.get('http://www.baidu.com/index.html')
print(resp.status_code)
print(resp.headers)
print(resp.cookies)
print(resp.content.decode('utf-8'))

resp = requests.post('http://httpbin.org/post', data={'name': 'Hao', 'age': 40})
print(resp.text)
data = resp.json()
print(type(data))

URL parameters and request headers.

resp = requests.get(
    url='https://movie.douban.com/top250',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4103.97 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;'
                  'q=0.9,image/webp,image/apng,*/*;'
                  'q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
)
print(resp.status_code)
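The request above sets headers only. As a hedged sketch of the URL-parameter half, the snippet below builds a request with requests.Request(...).prepare() so that nothing is actually sent and the final URL can be inspected. The start paging parameter and its value are assumptions for illustration, and the shortened User-Agent is a placeholder.

```python
import requests

# Build (but do not send) a GET request so the final URL can be inspected.
req = requests.Request(
    method='GET',
    url='https://movie.douban.com/top250',
    params={'start': 25},                   # assumed paging parameter
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder value
)
prepared = req.prepare()

# requests encodes the params dict into the query string.
print(prepared.url)
print(prepared.headers['User-Agent'])
```

Passing the same params argument to requests.get sends the request to that URL.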

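A more complex POST request (file upload).

# Open the file in a with block so it is closed once the upload finishes.
with open('data.xlsx', 'rb') as file:
    resp = requests.post(
        url='http://httpbin.org/post',
        files={'file': file}
    )
print(resp.text)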

Working with cookies.

cookies = {'key1': 'value1', 'key2': 'value2'}
resp = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(resp.text)

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
resp = requests.get('http://httpbin.org/cookies', cookies=jar)
print(resp.text)
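The cookie jar differs from the plain dictionary in that each cookie carries a domain and a path, and only the cookies matching the request URL are sent. A small sketch, assuming the same jar as above, makes this visible without sending anything by preparing the request and reading its Cookie header:

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Prepare the request without sending it, then inspect the Cookie header.
prepared = requests.Request(
    'GET', 'http://httpbin.org/cookies', cookies=jar
).prepare()
cookie_header = prepared.headers.get('Cookie', '')

# Only the cookie whose path matches /cookies should appear here.
print(cookie_header)
```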

Setting a proxy server.

requests.get('https://www.taobao.com', proxies={
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
})

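Note: for everything else about the requests library, reading its official documentation yourself is still strongly recommended.

Setting a request timeout.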

requests.get('https://github.com', timeout=10)
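When the timeout expires, requests raises requests.exceptions.Timeout, so a crawler usually wants to catch it rather than crash. The helper below is one possible sketch of retrying on timeout. The name get_with_retry and its retries, delay and get parameters are assumptions made for this example; get is only an injection point so the helper can be tried without network access.

```python
import time

import requests


def get_with_retry(url, retries=3, timeout=10, delay=1.0, get=requests.get):
    """Call get(url, timeout=timeout), retrying when the request times out.

    `get` is a hypothetical injection point so the helper can be
    exercised without network access; it defaults to requests.get.
    """
    for attempt in range(1, retries + 1):
        try:
            return get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            if attempt == retries:
                raise  # out of attempts: let the caller handle it
            time.sleep(delay)  # brief pause before the next attempt
```

A call such as get_with_retry('https://github.com', timeout=(3.05, 10)) also works, since requests accepts a (connect, read) tuple as the timeout.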

Page parsing

Comparison of parsing approaches

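Parsing approach	Module	Speed	Difficulty	Notes
Regular expressions	re	Fast	Hard	Common regular expressions; online regex tester
XPath	lxml	Fast	Moderate	Requires installing C dependencies; the only parser that supports XML
CSS selectors	bs4 / pyquery	Varies	Easy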

Note: the parsers available to BeautifulSoup include html.parser from the Python standard library, lxml's HTML parser, lxml's XML parser, and html5lib.
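To make the comparison a little more concrete, here is a minimal sketch that pulls the same links out of a snippet twice: once with a regular expression, and once with the standard library's html.parser, the module BeautifulSoup can use as a backend. The HTML snippet, the pattern, and the LinkParser class are made up for this illustration.

```python
import re
from html.parser import HTMLParser

HTML = '''
<ul>
  <li><a href="/movie/1">The Shawshank Redemption</a></li>
  <li><a href="/movie/2">Farewell My Concubine</a></li>
</ul>
'''

# Regex approach: compact, but tied to the exact shape of the markup.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)">([^<]+)</a>')


def links_by_regex(html):
    """Return (href, text) pairs found by the pattern."""
    return LINK_PATTERN.findall(html)


class LinkParser(HTMLParser):
    """Collect (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data))
            self._href = None

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None  # reset even if the link had no text


def links_by_parser(html):
    """Return (href, text) pairs found by the HTML parser."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


print(links_by_regex(HTML))
print(links_by_parser(HTML))
```

The regex version is shorter, but it stops matching as soon as the markup gains an extra attribute or different quoting, which is part of why the table rates it as hard to use.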

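🎨 Most of this article is reposted from another author's article; if anything is unclear, see the original via the article's author link.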
