Logging

Scrapy provides logging through Python's standard logging module.

Log levels

Scrapy supports five logging levels:

  • CRITICAL - critical errors
  • ERROR - regular errors
  • WARNING - warning messages
  • INFO - informational messages
  • DEBUG - debugging messages

By default, Python's logging module only shows messages at WARNING level or above on the console; the severity order is CRITICAL > ERROR > WARNING > INFO > DEBUG, so the default threshold is WARNING. Scrapy itself, however, defaults to the DEBUG level (see LOG_LEVEL below).
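As a quick illustration of that default threshold, here is a minimal sketch using only the standard library: the DEBUG and INFO calls are dropped, while WARNING and above reach the console.

    import logging

    logging.debug("not shown")    # below the default WARNING threshold
    logging.info("not shown")     # below the default WARNING threshold
    logging.warning("shown")      # WARNING and above reach the console
    logging.error("shown")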

How to set the log level

You can set the log level with the command-line option --loglevel/-L or with the LOG_LEVEL setting.

  • scrapy crawl tencent_crawl -L INFO

  • Or edit the configuration file settings.py and add:

LOG_LEVEL='INFO'


Adding log messages in a Spider

Scrapy provides a logger on every Spider instance, which can be accessed like this:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://scrapinghub.com']

        def parse(self, response):
            self.logger.info('Parse function called on %s', response.url)

That logger is created using the Spider's name, but you can use any custom logger you want. For example:

    import logging
    import scrapy

    logger = logging.getLogger('zhangsan')

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://scrapinghub.com']

        def parse(self, response):
            logger.info('Parse function called on %s', response.url)

Logging settings

The following settings can be used to configure logging:

LOG_ENABLED

  • Default: True
  • Whether to enable logging.

LOG_ENCODING

  • Default: 'utf-8'
  • The encoding to use for logging.

LOG_FILE

  • Default: None
  • File name to use for logging output.

LOG_LEVEL

  • Default: 'DEBUG'
  • Minimum level to log.

LOG_STDOUT

  • Default: False
  • If True, all standard output (and error) of the process will be redirected to the log. For example, a print('hello') will show up in the Scrapy log.
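Put together, a settings.py fragment using these options might look like the sketch below; the file name and level are placeholder values, not part of the original example.

    # settings.py -- illustrative values only
    LOG_ENABLED = True        # logging is on (the default)
    LOG_ENCODING = 'utf-8'    # encoding of the log output
    LOG_FILE = 'scrapy.log'   # write to a file instead of the console
    LOG_LEVEL = 'INFO'        # record INFO and above only
    LOG_STDOUT = False        # set True to capture print output in the log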

Example 1

Add log messages to tencent_crawl.py as follows:

    '''
    Add log messages
    '''
    print('print', response.url)
    self.logger.info('info on %s', response.url)
    self.logger.warning('WARNING on %s', response.url)
    self.logger.debug('info on %s', response.url)
    self.logger.error('info on %s', response.url)

The full version:

    # -*- coding:utf-8 -*-
    import scrapy
    from tutorial.items import RecruitItem
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RecruitSpider(CrawlSpider):
        name = "tencent_crawl"
        allowed_domains = ["hr.tencent.com"]
        start_urls = [
            "http://hr.tencent.com/position.php?&start=0#a"
        ]

        # Extract links matching 'http://hr.tencent.com/position.php?&start=\d+'
        page_lx = LinkExtractor(allow=(r'start=\d+'))

        rules = [
            # Parse matched pages with parseContent and follow their links
            # (without a callback, follow would default to True)
            Rule(page_lx, callback='parseContent', follow=True)
        ]

        def parseContent(self, response):
            #print("print settings: %s" % self.settings['LOG_FILE'])
            '''
            Add log messages
            '''
            print('print', response.url)
            self.logger.info('info on %s', response.url)
            self.logger.warning('WARNING on %s', response.url)
            self.logger.debug('info on %s', response.url)
            self.logger.error('info on %s', response.url)

            for sel in response.xpath('//*[@class="even"]'):
                name = sel.xpath('./td[1]/a/text()').extract()[0]
                detailLink = sel.xpath('./td[1]/a/@href').extract()[0]
                catalog = None
                if sel.xpath('./td[2]/text()'):
                    catalog = sel.xpath('./td[2]/text()').extract()[0]
                recruitNumber = sel.xpath('./td[3]/text()').extract()[0]
                workLocation = sel.xpath('./td[4]/text()').extract()[0]
                publishTime = sel.xpath('./td[5]/text()').extract()[0]
                #print(name, detailLink, catalog, recruitNumber, workLocation, publishTime)

                item = RecruitItem()
                item['name'] = name
                item['detailLink'] = detailLink
                if catalog:
                    item['catalog'] = catalog
                item['recruitNumber'] = recruitNumber
                item['workLocation'] = workLocation
                item['publishTime'] = publishTime
                yield item
  • Add the following to the settings file:

    LOG_FILE='ten.log'
    LOG_LEVEL='INFO'

Then run: scrapy crawl tencent_crawl

  • Or pass the options on the command line:

    scrapy crawl tencent_crawl --logfile 'ten.log' -L INFO

The output is as follows:

    print http://hr.tencent.com/position.php?start=10
    print http://hr.tencent.com/position.php?start=1340
    print http://hr.tencent.com/position.php?start=0
    print http://hr.tencent.com/position.php?start=1320
    print http://hr.tencent.com/position.php?start=1310
    print http://hr.tencent.com/position.php?start=1300
    print http://hr.tencent.com/position.php?start=1290
    print http://hr.tencent.com/position.php?start=1260

In ten.log you can see that only messages at INFO level and above were recorded:

    2016-08-15 23:10:57 [tencent_crawl] INFO: info on http://hr.tencent.com/position.php?start=70
    2016-08-15 23:10:57 [tencent_crawl] WARNING: WARNING on http://hr.tencent.com/position.php?start=70
    2016-08-15 23:10:57 [tencent_crawl] ERROR: info on http://hr.tencent.com/position.php?start=70
    2016-08-15 23:10:57 [tencent_crawl] INFO: info on http://hr.tencent.com/position.php?start=1320
    2016-08-15 23:10:57 [tencent_crawl] WARNING: WARNING on http://hr.tencent.com/position.php?start=1320
    2016-08-15 23:10:57 [tencent_crawl] ERROR: info on http://hr.tencent.com/position.php?start=1320

Example 2

Add log messages to tencent_spider.py as follows:

    logger = logging.getLogger('zhangsan')

    '''
    Add log messages
    '''
    print('print', response.url)
    logger.info('info on %s', response.url)
    logger.warning('WARNING on %s', response.url)
    logger.debug('info on %s', response.url)
    logger.error('info on %s', response.url)

The full version:

    import scrapy
    from tutorial.items import RecruitItem
    import re
    import logging

    logger = logging.getLogger('zhangsan')

    class RecruitSpider(scrapy.spiders.Spider):
        name = "tencent"
        allowed_domains = ["hr.tencent.com"]
        start_urls = [
            "http://hr.tencent.com/position.php?&start=0#a"
        ]

        def parse(self, response):
            #logger.info('spider tencent Parse function called on %s', response.url)
            '''
            Add log messages
            '''
            print('print', response.url)
            logger.info('info on %s', response.url)
            logger.warning('WARNING on %s', response.url)
            logger.debug('info on %s', response.url)
            logger.error('info on %s', response.url)

            for sel in response.xpath('//*[@class="even"]'):
                name = sel.xpath('./td[1]/a/text()').extract()[0]
                detailLink = sel.xpath('./td[1]/a/@href').extract()[0]
                catalog = None
                if sel.xpath('./td[2]/text()'):
                    catalog = sel.xpath('./td[2]/text()').extract()[0]
                recruitNumber = sel.xpath('./td[3]/text()').extract()[0]
                workLocation = sel.xpath('./td[4]/text()').extract()[0]
                publishTime = sel.xpath('./td[5]/text()').extract()[0]
                #print(name, detailLink, catalog, recruitNumber, workLocation, publishTime)

                item = RecruitItem()
                item['name'] = name
                item['detailLink'] = detailLink
                if catalog:
                    item['catalog'] = catalog
                item['recruitNumber'] = recruitNumber
                item['workLocation'] = workLocation
                item['publishTime'] = publishTime
                yield item

            nextFlag = response.xpath('//*[@id="next"]/@href')[0].extract()
            if 'start' in nextFlag:
                curpage = re.search(r'(\d+)', response.url).group(1)
                page = int(curpage) + 10
                url = re.sub(r'\d+', str(page), response.url)
                print(url)
                yield scrapy.Request(url, callback=self.parse)
  • Add the following to the settings file:

    LOG_FILE='tencent.log'
    LOG_LEVEL='WARNING'

Then run: scrapy crawl tencent

  • Or pass the options on the command line:

    scrapy crawl tencent --logfile 'tencent.log' -L WARNING

Output:

    print http://hr.tencent.com/position.php?&start=0
    http://hr.tencent.com/position.php?&start=10
    print http://hr.tencent.com/position.php?&start=10
    http://hr.tencent.com/position.php?&start=20
    print http://hr.tencent.com/position.php?&start=20
    http://hr.tencent.com/position.php?&start=30

In tencent.log, since LOG_LEVEL is WARNING, only messages at WARNING level and above were recorded:

    2016-08-15 23:22:59 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=0
    2016-08-15 23:22:59 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=0
    2016-08-15 23:22:59 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=10
    2016-08-15 23:22:59 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=10

Trying out LOG_STDOUT

In settings.py:

    LOG_FILE='tencent.log'
    LOG_STDOUT=True
    LOG_LEVEL='INFO'

Then run:

    scrapy crawl tencent

Console output: empty (everything written to standard output has been redirected into the log).

The tencent.log file contains:

    2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=110
    2016-08-15 23:28:32 [stdout] INFO: print
    2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=110
    2016-08-15 23:28:32 [zhangsan] INFO: info on http://hr.tencent.com/position.php?&start=110
    2016-08-15 23:28:32 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=110
    2016-08-15 23:28:32 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=110
    2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=120
    2016-08-15 23:28:33 [stdout] INFO: print
    2016-08-15 23:28:33 [stdout] INFO: http://hr.tencent.com/position.php?&start=120
    2016-08-15 23:28:33 [zhangsan] INFO: info on http://hr.tencent.com/position.php?&start=120
    2016-08-15 23:28:33 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=120
    2016-08-15 23:28:33 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=120

Using Logging with Scrapy

    # coding:utf-8
    ######################
    ## Using the logging module
    ######################
    import logging
    '''
    1. logging.CRITICAL - for critical errors (highest severity)
    2. logging.ERROR    - for regular errors
    3. logging.WARNING  - for warning messages
    4. logging.INFO     - for informational messages
    5. logging.DEBUG    - for debugging messages (lowest severity)
    '''
    logging.warning("This is a warning")
    logging.log(logging.WARNING, "This is a warning")

    # Get the root logger
    logger = logging.getLogger()
    logger.warning("This is a warning message")

    # Name the logger that emits the message
    logger = logging.getLogger('SimilarFace')
    logger.warning("This is a warning")

    # Using the logger inside a spider
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://scrapinghub.com']

        def parse(self, response):
            # Option 1: the spider's built-in logger
            self.logger.info('Parse function called on %s', response.url)
            # Option 2: a custom module-level logger
            logger.info('Parse function called on %s', response.url)

    '''
    Logging settings:
      • LOG_FILE
      • LOG_ENABLED
      • LOG_ENCODING
      • LOG_LEVEL
      • LOG_FORMAT
      • LOG_DATEFORMAT
      • LOG_STDOUT
    Command-line options:
      --logfile FILE       Overrides LOG_FILE
      --loglevel/-L LEVEL  Overrides LOG_LEVEL
      --nolog              Sets LOG_ENABLED to False
    '''
    import logging
    from scrapy.utils.log import configure_logging

    # Disable Scrapy's root handler and configure logging ourselves
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='log.txt',
        format='%(levelname)s: %(levelname)s: %(message)s',
        level=logging.INFO
    )
    # The log file is opened in append mode on each run
    logging.info('Written to the log file')
    logger = logging.getLogger('SimilarFace')
    logger.warning("This also goes to the log file")