Python爬虫小练习-个人在线分享

当前位置：个人在线分享 > Python爬虫小练习

文章介绍

有疑问？请点击复制链接咨询！

爬虫的本质

爬虫的本质就是通过程序模拟正常人向网站发送请求获取信息。

关于爬虫的一些闲聊

按照我们的常识来说，我们不可能在1秒钟访问这个网站100次，请求100次数据，所以过多的请求很有可能会被网站认为你在使用脚本进行爬虫，可能会封你IP，或者说当你爬虫不修改UA头时，python会默认告诉网站自己是一个爬虫脚本，相当于明牌告诉别人自己是来爬你网站。而且要是网站有WAF，也可能设置有策略，不让你爬，当然要是你手法够硬，也是可以绕的。当然作者水平有限，并且各位在爬虫的时候要清楚哪些东西可以爬，哪些东西不能爬，不要触及了法律的红线。年纪轻轻就吃上了国家饭.QAQ

爬取某小说网站（URL已码）

import requests
import os
import parsel
end=input("你想爬取多少章？（阿拉伯数字输入，大于2）:")
end= int(end)
print("正在爬取请稍等...")
#爬取第一页的内容
print("==============正在爬取第1章==============")
url = f'xxxxxxx'
response = requests.get(url=url)
response.encoding = response.apparent_encoding
html = response.text
# 解析html
selector = parsel.Selector(html)
# 获取文章标题
title = selector.css('.content h1::text').get()
#print(title)
# 获取小标题内容
content_1 = selector.css('#chaptercontent::text').get()
#print(content_1)
# passage = ''.join(selector.xpath('//div[@id="chaptercontent"]//text()').getall()).strip()
passage = ''.join(selector.css('#chaptercontent').xpath('./text()').getall()[:-4])
passage = passage.replace('　　', '
')
#print(passage)
filename = 'xxxxxx\'
if not os.path.exists(filename):
os.mkdir(filename)
with open(filename +title+ '.txt', mode='wb') as f:
f.write(passage.encode('utf-8'))
#后续内容
for page in range(2,end+1):
print(f"==============正在爬取第{page}章==============")
url = f'xxxxxxxxxxxxxxxxxxx'
response=requests.get(url=url)
response.encoding = response.apparent_encoding
html=response.text
# 解析html
selector=parsel.Selector(html)
# 获取文章标题
title=selector.css('.content h1::text').get()
#print(title)
# 获取小标题内容
#content_1 = selector.css('#chaptercontent::text').get()
#print(content_1)
#passage = ''.join(selector.xpath('//div[@id="chaptercontent"]//text()').getall()).strip()
passage = ''.join(selector.css('#chaptercontent').xpath('./text()').getall()[:-4])
passage=passage.replace('　　','
')
#print(passage)
with open(filename + title+ '.txt', mode='wb') as f:
f.write(passage.encode('utf-8'))
print("爬取完成，已保存在同目录下")

代码思路

总之，我觉得无论是爬那个网站，思路上都大体差不多。

获取网站的html源码
用html解析器解析（我这里用的是parsel模块）
分析网页的结构，用解析器提炼出你想获取的东西
先获取单个，然后根据相同结构的页面用循环来实现翻页获取大量资源
如果爬取的资源过多，建议设置一个sleep函数，不要让服务器的负担过大，给网站的管理者造成负担。

ps：注意编码格式，很多时候打印不出东西都是因为格式的问题。本人还是学生，也是初学者，代码写的比较草率，旨在记录学习。如有错误欢迎指出改正。

python 开发语言爬虫

本站无任何商业行为
个人在线分享 » Python爬虫小练习

admin 钻石

分享到：

E-->