Scraping Jokes from Qiushibaike

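First, determine the URL and fetch the page source. The page URL is http://www.qiushibaike.com/8hr/page/4, where the final number (4 here) is the page number; passing in a different value retrieves the jokes on that page.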
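The page-number pattern described above can be captured in a small helper. This is a minimal sketch: the names `BASE_URL` and `build_page_url` are hypothetical, the trailing slash follows the form used in the request further down, and the assumption that pages start at 1 is not confirmed by the article.

```python
# Hypothetical helper: builds the listing URL for a given page number.
BASE_URL = 'http://www.qiushibaike.com/8hr/page/'


def build_page_url(page):
    """Return the listing URL for a page number (assumed to start at 1)."""
    if page < 1:
        raise ValueError('page must be >= 1')
    return BASE_URL + str(page) + '/'


# Print the URLs of the first three pages.
for page in range(1, 4):
    print(build_page_url(page))
```

Keeping the URL construction in one place means only the page number changes when looping over several pages.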

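Let's start with a first version that simply prints the page source, using the most basic way of fetching a page, and see whether it succeeds.

Simulate sending the request in the Composer's Raw tab: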

  GET http://www.qiushibaike.com/8hr/page/2/ HTTP/1.1
  Host: www.qiushibaike.com
  User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
  Accept-Language: zh-CN,zh;q=0.8
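The raw request above can also be assembled in Python without sending it, which makes it easy to check exactly which headers would go out. The sketch below uses the standard library's `urllib.request` for this inspection step; that choice is an assumption made for illustration (the article's own script uses `requests`), and `build_request` is a hypothetical helper name.

```python
from urllib.request import Request

URL = 'http://www.qiushibaike.com/8hr/page/2/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}


def build_request(url, headers):
    """Build (but do not send) a GET request carrying the given headers."""
    return Request(url, headers=headers, method='GET')


req = build_request(URL, HEADERS)
print(req.get_method(), req.full_url)
# urllib normalizes header names to the form 'User-agent', 'Accept-language'.
for name, value in req.header_items():
    print(name + ': ' + value)
```

Nothing touches the network here; the request object is only constructed and printed.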

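With the User-Agent and Accept-Language headers removed, the request returns an error.

This is most likely a header check on the server side, so let's add the headers and try again: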

  # -*- coding: utf-8 -*-
  import requests
  from lxml import etree

  page = 2
  url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
  # Browser-like headers; the request fails without them
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
      'Accept-Language': 'zh-CN,zh;q=0.8',
  }

  try:
      response = requests.get(url, headers=headers)
      resHtml = response.text
      html = etree.HTML(resHtml)
      # Each joke lives in a div whose id contains "qiushi_tag"
      result = html.xpath('//div[contains(@id,"qiushi_tag")]')
      for site in result:
          # print(etree.tostring(site, encoding='utf-8'))
          # Author's avatar URL and username
          imgUrl = site.xpath('./div/a/img/@src')[0]
          username = site.xpath('./div/a/@title')[0]
          # username = site.xpath('.//h2')[0].text
          # Joke text
          content = site.xpath('.//div[@class="content"]')[0].text.strip()
          # The first <i> holds the vote count, the second the comment count
          vote = site.xpath('.//i')[0].text
          # print(site.xpath('.//*[@class="number"]')[0].text)
          comments = site.xpath('.//i')[1].text
          print(imgUrl, username, content, vote, comments)
  except Exception as e:
      print(e)
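The loop above indexes every XPath result with `[0]`, and the single `try`/`except` wraps the whole loop, so one entry with a missing node (for example, no avatar image) ends the run with an `IndexError`. A possible way to soften this is sketched below; `first` and `clean_text` are hypothetical helpers, not part of lxml, and they are demonstrated on plain lists and strings so the snippet runs on its own.

```python
def first(nodes, default=''):
    """Return the first XPath match, or `default` when nothing matched."""
    return nodes[0] if nodes else default


def clean_text(text, default=''):
    """Strip surrounding whitespace; tolerate None (an element's .text
    is None when it has no leading text)."""
    return text.strip() if text else default


# Stand-ins for XPath results: a list of matches and an empty list.
print(first(['avatar.jpg', 'other.jpg']))
print(first([], default='(no avatar)'))
print(clean_text('\n  some joke text  \n'))
print(clean_text(None, default='(empty)'))
```

Inside the loop, a lookup such as `site.xpath('./div/a/img/@src')[0]` could then be written as `first(site.xpath('./div/a/img/@src'))`, so an entry with a missing field still prints with a placeholder instead of aborting the page.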

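That's it. Give it a try: running the script prints every joke on the page, each with the author's avatar URL, username, joke text, vote count, and comment count. Pretty satisfying, isn't it?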
