Web Crawler: Scraping News Content and Images into a Database

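I. Requirements

1. Crawl the news items on the news home page, parse out the title, body, news category, and image of each item, and store them in the database.

2. Only crawl news items that contain an image; one image per item is enough.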

II. Code

The following is example code for crawling Xinhuanet (新华网).

import os
import re

import requests as rq
from bs4 import BeautifulSoup

from gbase import GBASE_DB
from conf import IMGPATH, LOCALPATH, PICSIZE

# Maps the sub-channel name found in the URL to a numeric news type.
xinhua_dict = {
    'politics': 1,  # current affairs
    'culture': 2,   # culture
    'health': 3,    # health
    'fortune': 4,   # finance
    'world': 5,     # international
}


def classify_news(s, news_list):
    '''Return the first entry of news_list that occurs in s, or None.'''
    for li in news_list:
        if li in s:
            return li
    return None


def get_xinhua_news(url):
    '''Crawl the title, body, category, and image of one Xinhuanet article.'''
    newsWeb = rq.get(url, timeout=10)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'html.parser')

    # Title
    title_element = soup.find('span', class_='title')
    title = title_element.get_text(strip=True)

    # Category, derived from the sub-channel path in the URL
    news_type = xinhua_dict[classify_news(url, xinhua_dict.keys())]

    # Body: join the paragraphs, collapse blank lines, escape quotes and newlines
    content_element = soup.find('div', id='detail')
    paragraphs = content_element.find_all('p')
    content = '\n'.join(paragraph.get_text(strip=True) for paragraph in paragraphs)
    content = re.sub('\n+', '\n', content).replace('"', '\\"').replace('\n', '\\n')

    # Image: keep the first one whose file size (in bytes) exceeds PICSIZE
    jpg_element = soup.find_all('img')
    jpg_pattern = re.compile(r'src="([^"]*1n\.(jpg|jpeg))"')
    j_list = jpg_pattern.findall(str(jpg_element))
    return_path = None
    for j in j_list:
        jpg_path = os.path.basename(j[0])
        jpg_url = url[:url.find('c_')] + jpg_path
        picture = rq.get(jpg_url, timeout=10)
        if picture.status_code == 200 and len(picture.content) > PICSIZE:
            with open(LOCALPATH + jpg_path, "wb") as f:
                f.write(picture.content)
            return_path = IMGPATH + jpg_path
            break

    # No qualifying image: raise so that main() skips this article.
    if return_path is None:
        raise ValueError('no qualifying image found')
    return title, news_type, content, return_path


def main():
    db = GBASE_DB()
    newsUrl = 'http://www.xinhuanet.com/'
    newsWeb = rq.get(newsUrl, timeout=10)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'lxml')

    # Collect the news URLs from the home page
    link_list = []
    li_elements = soup.find_all('li')
    for li_element in li_elements:
        a_element = li_element.find('a')
        if a_element:
            url = a_element.get('href')
            # href may be missing, in which case get() returns None
            if (url and url.startswith("http://www.news.cn/")
                    and url.endswith(".htm") and 'c_' in url
                    and classify_news(url, xinhua_dict.keys()) is not None):
                link_list.append(url)

    # Parse the news URLs one by one
    for link in link_list:
        try:
            title, news_type, content, jpg_path = get_xinhua_news(link)
            # Note: values are interpolated directly into the SQL string,
            # so a single quote in the title or body will break the statement.
            sql = '''insert into table_name(title,type,content,image) values ('{}', {},'{}','{}')'''.format(title, news_type, content, jpg_path)
            db.execute_sql(sql)
            print('(success) crawled Xinhuanet:', title)
        except Exception as e:
            print('crawl failed:', link, ' :', e)


if __name__ == '__main__':
    main()

First, crawl the Xinhuanet home page and collect the news links on it that belong to a known channel, storing them in the link_list list.
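The link filter can be isolated into a small pure function, which makes it easy to check without hitting the network. The following is only a minimal sketch: the names is_news_link and filter_links are hypothetical, the de-duplication step is an addition not present in the original code, and the URLs in the usage lines are made-up examples.

```python
CATEGORIES = ('politics', 'culture', 'health', 'fortune', 'world')


def is_news_link(href, categories=CATEGORIES):
    """Return True if href looks like an article URL in a known channel."""
    if not href:  # <a> tags without an href attribute yield None
        return False
    return (
        href.startswith('http://www.news.cn/')
        and href.endswith('.htm')
        and 'c_' in href
        and any(cat in href for cat in categories)
    )


def filter_links(hrefs):
    """Keep qualifying links, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for href in hrefs:
        if is_news_link(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Hypothetical sample input, for illustration only.
sample = [
    'http://www.news.cn/politics/2023-08/01/c_1.htm',
    None,
    'http://www.news.cn/politics/2023-08/01/c_1.htm',
    'http://www.news.cn/sports/c_2.htm',
]
print(filter_links(sample))
```

Dropping duplicates here avoids crawling the same article twice when the home page links to it from several places.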

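Next, visit each news link in turn and parse the title and body, cleaning up whitespace and special characters along the way. The category is derived from the sub-channel path in the URL. The crawler then downloads the first image whose size exceeds a threshold, which avoids picking up small images on the page such as QR codes. Note that the code compares the downloaded file size in bytes against PICSIZE, not the pixel dimensions. The image is saved to a folder on the server's local disk. If no image qualifies, get_xinhua_news raises an error, which main catches, and that news link is skipped.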
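The cleaning and image-selection steps can be sketched separately from the HTTP calls. This is an illustrative sketch under stated assumptions, not the code above: clean_text, pick_first_large_image, and NoImageError are hypothetical names, and the image candidates are passed in as already-downloaded byte strings.

```python
import re


class NoImageError(Exception):
    """Raised when an article has no image above the size threshold."""


def clean_text(paragraphs):
    """Join paragraph strings, collapsing whitespace and dropping empty ones."""
    lines = (re.sub(r'\s+', ' ', p).strip() for p in paragraphs)
    return '\n'.join(line for line in lines if line)


def pick_first_large_image(candidates, min_bytes):
    """Return the name of the first image whose payload exceeds min_bytes.

    candidates is an iterable of (name, payload_bytes) pairs.
    """
    for name, payload in candidates:
        if len(payload) > min_bytes:
            return name
    raise NoImageError('no image larger than %d bytes' % min_bytes)


# Hypothetical data: a tiny "QR code" and a larger "photo".
images = [('qr.jpg', b'x' * 10), ('photo.jpg', b'x' * 100)]
print(pick_first_large_image(images, 50))
print(clean_text(['  first   paragraph ', '', 'second paragraph']))
```

Raising a dedicated exception makes the "skip this article" path explicit, so the caller can tell a missing image apart from a network or parsing failure.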

Finally, store the result in the database.
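Building the INSERT statement with string formatting breaks as soon as a title or body contains a quote character. One common alternative is a parameterized query. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since the interface of the GBASE_DB wrapper is not shown here; the table name news, its columns, and the save_news helper are assumptions for illustration.

```python
import sqlite3


def save_news(conn, title, news_type, content, image_path):
    """Insert one news row using placeholders instead of string formatting."""
    conn.execute(
        'INSERT INTO news (title, type, content, image) VALUES (?, ?, ?, ?)',
        (title, news_type, content, image_path),
    )
    conn.commit()


# In-memory database, used only for this illustration.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE news (title TEXT, type INTEGER, content TEXT, image TEXT)'
)

# Quotes and newlines are stored as-is, with no manual escaping.
save_news(conn, 'It\'s a "test"', 1, 'line one\nline two', '/img/a.jpg')
print(conn.execute('SELECT title, type, image FROM news').fetchall())
```

With placeholders the driver handles quoting, so the manual escaping of quotes and newlines in the body text would no longer be needed. Whether the GBase driver in use supports the same placeholder style would need to be checked against its documentation.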