Requests Basics, with the 药品监督管理局 (Drug Administration) as an Example

Requests

Requests: the only Non-GMO HTTP library for Python, safe for human consumption.

urllib2

urllib2 is a module that ships with Python (part of the standard library).

Setting a custom 'Connection': 'keep-alive' header tells the server not to close the connection after the exchange — a so-called persistent (long) connection. This is one workaround for urllib2's lack of keep-alive support; the other is to use Requests.
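If you are stuck with the standard library, the header can be attached by hand. A minimal Python 3 sketch using urllib.request, the successor of urllib2 (the target URL is just a placeholder):

```python
import urllib.request

# Attach a keep-alive header to a standard-library request by hand.
req = urllib.request.Request(
    'http://app1.sfda.gov.cn/',
    headers={'Connection': 'keep-alive'},
)
print(req.get_header('Connection'))  # keep-alive
```

Note that attaching the header only expresses the wish; unlike Requests, urllib2/urllib.request does not pool and reuse the underlying connection.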

Installing Requests

Advantages:

Requests inherits all of urllib2's features. It supports persistent HTTP connections and connection pooling, cookie-based session persistence, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.

Drawbacks:

requests is not bundled with Python; it must be installed separately, with easy_install or pip install.

Used directly it is synchronous (no async calls), and the automatic detection of the response encoding can make it slow.

```shell
pip install requests
```

Documentation:

http://cn.python-requests.org/zh_CN/latest/index.html

http://www.python-requests.org/en/master/#

Usage:

```python
requests.get(url, params={'key1': 'value1'}, headers={'User-Agent': 'Mozilla/5.0'})
requests.post(url, data={'key1': 'value1'}, headers={'content-type': 'application/json'})
```

(Query parameters for a GET go in `params`, which builds the query string; `headers` is a dict.)
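When you pass `params` or form `data`, requests URL-encodes the mapping for you. A small Python 3 sketch of the equivalent encoding using the standard library's `urlencode` (requests performs this step internally with its own helpers):

```python
from urllib.parse import urlencode

# The query string requests appends for params={'key1': 'value1', 'page': 2},
# or sends as the body of a form-encoded POST for data={...}.
query = urlencode({'key1': 'value1', 'page': 2})
print(query)  # key1=value1&page=2
```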

Example: the 药品监督管理局 (Drug Administration) site

http://app1.sfda.gov.cn/

Scrape all product records under the category 国产药品商品名 (domestic drug trade names, 6994 entries).

Product list page: http://app1.sfda.gov.cn/datasearch/face3/base.jsp?tableId=32&tableName=TABLE32&title=%B9%FA%B2%FA%D2%A9%C6%B7%C9%CC%C6%B7%C3%FB&bcId=124356639813072873644420336632

Product detail page: http://app1.sfda.gov.cn/datasearch/face3/content.jsp?tableId=32&tableName=TABLE32&tableView=%B9%FA%B2%FA%D2%A9%C6%B7%C9%CC%C6%B7%C3%FB&Id=211315
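Note the `%B9%FA…` escapes in both URLs: the `tableView`/`title` value is percent-encoded GB2312, not UTF-8. A Python 3 sketch of decoding and re-encoding it:

```python
from urllib.parse import quote, unquote

encoded = '%B9%FA%B2%FA%D2%A9%C6%B7%C9%CC%C6%B7%C3%FB'

# Decode with the site's legacy GB2312 encoding, not the UTF-8 default.
title = unquote(encoded, encoding='gb2312')
print(title)  # 国产药品商品名

# Round trip: re-encoding the Chinese title reproduces the URL form.
print(quote(title, encoding='gb2312'))
```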

Source code

```python
# -*- coding: utf-8 -*-
import re
import urllib

import requests

curstart = 2
# Form data for the search endpoint. The site's form repeats 'State': '1'
# after every field; duplicate dict keys collapse, so one entry suffices.
values = {
    'tableId': '32',
    'State': '1',
    'bcId': '124356639813072873644420336632',
    'tableName': 'TABLE32',
    'viewtitleName': 'COLUMN302',
    'viewsubTitleName': 'COLUMN299,COLUMN303',
    'curstart': str(curstart),
    'tableView': urllib.quote("国产药品商品名"),
}
post_headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}
url = "http://app1.sfda.gov.cn/datasearch/face3/search.jsp"

response = requests.post(url, data=values, headers=post_headers)
resHtml = response.text
print response.status_code
# print resHtml

# Pull the relative detail-page URLs out of the callbackC(...) JavaScript calls.
Urls = re.findall(r'callbackC,\'(.*?)\',null', resHtml)
for url in Urls:
    # Pitfall: the site serves legacy GB2312, so re-encode before use.
    print url.encode('gb2312')
```

Run it and take a look at the output. (Figure 1: sample run output)
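The `re.findall` pattern pulls each relative detail-page URL out of the `callbackC` JavaScript calls in the result list. A self-contained sketch against a hypothetical fragment of that markup (the sample HTML is assumed, not taken from the live site):

```python
import re

# A made-up fragment of the search-result page: each row invokes a JS
# callback whose second argument is the relative detail-page URL.
sample = ("<a href=\"javascript:void(0)\" onclick=\"commitForECMA(callbackC,"
          "'content.jsp?tableId=32&Id=211315',null)\">...</a>")

urls = re.findall(r"callbackC,'(.*?)',null", sample)
print(urls)  # ['content.jsp?tableId=32&Id=211315']
```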

  • Summary

    • Spoof the User-Agent as Chrome to fool the web server.
    • urlencode converts a dict or a sequence of tuples into a URL query string.
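One subtlety with the second point: the form data above listed 'State': '1' many times, but duplicate keys collapse in a dict, whereas a sequence of tuples preserves repeated fields and their order. A Python 3 sketch (urllib.parse.urlencode; Python 2 exposes the same function as urllib.urlencode):

```python
from urllib.parse import urlencode

# A dict collapses duplicate keys before encoding...
print(urlencode({'tableId': '32', 'State': '1'}))  # tableId=32&State=1

# ...while a sequence of tuples keeps every repeated field, in order.
pairs = [('tableId', '32'), ('State', '1'), ('State', '1')]
print(urlencode(pairs))  # tableId=32&State=1&State=1
```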


Exercises

  • Scrape the product detail pages.
  • Scrape the entire dataset end to end.

Detail page

```python
# -*- coding: utf-8 -*-
import json
import re

import requests
from lxml import etree

url = 'http://app1.sfda.gov.cn/datasearch/face3/content.jsp?tableId=32&tableName=TABLE32&tableView=%B9%FA%B2%FA%D2%A9%C6%B7%C9%CC%C6%B7%C3%FB&Id=211315'
get_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Connection': 'keep-alive',
}
item = {}

response = requests.get(url, headers=get_headers)
resHtml = response.text
print response.encoding

html = etree.HTML(resHtml)
for site in html.xpath('//tr')[1:]:
    # Each data row holds exactly two cells: field name and field value.
    if len(site.xpath('./td')) != 2:
        continue
    name = site.xpath('./td')[0].text
    if not name:
        continue
    # The value cell may contain nested markup, so serialize it and strip the tags.
    value = re.sub('<.*?>', '', etree.tostring(site.xpath('./td')[1], encoding='utf-8'))
    item[name.encode('utf-8')] = value

json.dump(item, open('sfda.json', 'w'), ensure_ascii=False)
```
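The tag-stripping `re.sub` in the loop is a quick-and-dirty way to flatten a serialized cell to its text. A standalone Python 3 illustration (the sample cell is made up):

```python
import re

# A value cell may wrap its text in nested tags; the non-greedy pattern
# removes each tag while keeping the text between them.
cell = '<td><a href="#">阿莫西林</a> <br/>胶囊</td>'
text = re.sub('<.*?>', '', cell)
print(text)  # 阿莫西林 胶囊
```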

Complete project

```python
# -*- coding: utf-8 -*-
import json
import re
import urllib

import requests
from lxml import etree


def ParseDetail(url):
    get_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Connection': 'keep-alive',
    }
    item = {}
    response = requests.get(url, headers=get_headers)
    resHtml = response.text
    print response.encoding
    html = etree.HTML(resHtml)
    for site in html.xpath('//tr')[1:]:
        # Each data row holds exactly two cells: field name and field value.
        if len(site.xpath('./td')) != 2:
            continue
        name = site.xpath('./td')[0].text
        if not name:
            continue
        # The value cell may contain nested markup, so serialize it and strip the tags.
        value = re.sub('<.*?>', '', etree.tostring(site.xpath('./td')[1], encoding='utf-8'))
        item[name.encode('utf-8').strip()] = value.strip()
    # Append one JSON object per line.
    fp = open('sfda.json', 'a')
    line = json.dumps(item, ensure_ascii=False)
    fp.write(line + '\n')
    fp.close()


def main():
    curstart = 2
    # Duplicate 'State': '1' entries from the site's form are collapsed here.
    values = {
        'tableId': '32',
        'State': '1',
        'bcId': '124356639813072873644420336632',
        'tableName': 'TABLE32',
        'viewtitleName': 'COLUMN302',
        'viewsubTitleName': 'COLUMN299,COLUMN303',
        'curstart': str(curstart),
        'tableView': urllib.quote("国产药品商品名"),
    }
    post_headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }
    url = "http://app1.sfda.gov.cn/datasearch/face3/search.jsp"
    response = requests.post(url, data=values, headers=post_headers)
    resHtml = response.text
    print response.status_code
    # print resHtml
    Urls = re.findall(r'callbackC,\'(.*?)\',null', resHtml)
    for url in Urls:
        # Pitfall: rebuild the tableView parameter, then GB2312-encode the whole URL.
        url = re.sub('tableView=.*?&', 'tableView=' + urllib.quote("国产药品商品名") + "&", url)
        ParseDetail('http://app1.sfda.gov.cn/datasearch/face3/' + url.encode('gb2312'))


if __name__ == '__main__':
    main()
```
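For the exercise of scraping every page, the only moving part in the form data is `curstart`. A hedged Python 3 sketch of a payload builder (field names copied from the form above; `build_payload` and the page range are illustrative assumptions, and the real paging step size would need to be checked against the site):

```python
from urllib.parse import quote

def build_payload(page):
    """Form data for one page of search results; only curstart varies."""
    return {
        'tableId': '32',
        'State': '1',
        'bcId': '124356639813072873644420336632',
        'tableName': 'TABLE32',
        'viewtitleName': 'COLUMN302',
        'viewsubTitleName': 'COLUMN299,COLUMN303',
        'curstart': str(page),
        # GB2312-encoded, as the site expects (see the pitfall above).
        'tableView': quote('国产药品商品名', encoding='gb2312'),
    }

payloads = [build_payload(page) for page in range(1, 4)]
print([p['curstart'] for p in payloads])  # ['1', '2', '3']
```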