This article is about 1,496 characters; reading it takes roughly 4 minutes.
These are notes and takeaways from my process of learning Python web scraping, organized for my own reference and, I hope, useful for discussion with others.
import requests
from bs4 import BeautifulSoup
PATH = "/Users/Ezy/Documents/Test/freebuf/"
url = "https://www.freebuf.com/"
r = requests.get(url)
# BeautifulSoup 4 requires an explicit parser; "lxml" is a common choice
soup = BeautifulSoup(r.content, "lxml")
# find the first 'img' tag in the page source and read its 'src' attribute (the image URL)
img_url = soup.find('img')['src']
# extract the file name from the URL; the '+1' skips the '/' so it does not end up in the name
img_name = img_url[img_url.rfind('/')+1:]
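As a side note, the rfind slice above simply takes everything after the last '/'. The standard library can do the same job via urllib.parse and os.path.basename; a small offline illustration (the URL here is made up for demonstration):

```python
import os
from urllib.parse import urlparse

# hypothetical image URL, only for illustration
img_url = "https://example.com/images/logo.png"

# the rfind approach used in the article
name_rfind = img_url[img_url.rfind('/') + 1:]

# a stdlib alternative: parse the URL, then take the path's base name
name_stdlib = os.path.basename(urlparse(img_url).path)

print(name_rfind)   # → logo.png
print(name_stdlib)  # → logo.png
```

The stdlib version has the advantage of ignoring any query string (e.g. "?v=2") that may follow the file name.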
r = requests.get(img_url)
con = r.content
# Note: in Python 3 the file must be opened in binary mode ('wb'),
# not in text mode ('w') as was common in Python 2
o = open(PATH + img_name, 'wb')
o.write(con)
o.close()
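The 'wb' point can be checked without any network access: r.content is a bytes object, and bytes must go through a binary-mode file handle. A minimal sketch using a temporary file (the fake bytes stand in for real image data):

```python
import os
import tempfile

# fake image bytes standing in for r.content (any bytes behave the same)
data = b"\x89PNG\r\nfake-image-bytes"

path = os.path.join(tempfile.gettempdir(), "demo_image.bin")

# binary mode: accepts bytes directly; 'w' (text mode) would raise TypeError
with open(path, "wb") as f:
    f.write(data)

# read back in binary mode and verify the round trip is byte-exact
with open(path, "rb") as f:
    assert f.read() == data

os.remove(path)
```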
The result is:
Next, wrap the logic into two functions, getimg(img_url) and main(url):
def getimg(img_url):
    img_name = img_url[img_url.rfind('/')+1:]
    # local path for the downloaded image
    file = PATH + img_name
    r = requests.get(img_url)
    con = r.content
    o = open(file, 'wb')
    o.write(con)
    o.close()
    return file
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    imgs = soup.find_all('img')
    for img in imgs:
        img_url = img['src']
        print(img_url)
        # download the image and point the tag at the local copy
        img['src'] = getimg(img_url)
    # str(soup) is text, so open the output file in text mode with an encoding
    o = open("D:/MyProject/image/test.html", 'w', encoding='utf-8')
    o.write(str(soup))
    o.close()

if __name__ == "__main__":
    url = "https://www.freebuf.com/"
    main(url)
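The src-rewriting step inside main() can be illustrated offline on a tiny HTML snippet (the tags and local path below are hypothetical, and no network is involved):

```python
from bs4 import BeautifulSoup

# hypothetical page with two image tags
html = ('<html><body>'
        '<img src="https://example.com/a.png">'
        '<img src="https://example.com/b.png">'
        '</body></html>')

soup = BeautifulSoup(html, "html.parser")

# point every src at a local file, mirroring what main() does with getimg()
for img in soup.find_all('img'):
    remote = img['src']
    img['src'] = "/local/" + remote[remote.rfind('/') + 1:]

print(soup)  # the rewritten page now references /local/a.png and /local/b.png
```

Writing str(soup) back to disk then produces a page whose images load from the local copies instead of the remote server.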
The result is shown in the figure:
Reposted from: http://bbazi.baihongyu.com/