Python 2.1 Scrapy Logs

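The Scrapy log discussed here is the statistics summary that Scrapy prints when a spider closes. Whenever a spider finishes its work or is shut down for some other reason, Scrapy records a set of statistics that show what happened over the whole run. These statistics can be used to diagnose problems, tune spider performance, or evaluate how well the crawl worked.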

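Specifically, the line "Dumping Scrapy stats" is followed by a dictionary of key-value pairs, each describing one aspect of the run. Some common entries and their meanings are listed below: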

1 Main statistics explained

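downloader/request_bytes:
- Total number of bytes in all requests the spider sent.
downloader/request_count:
- Total number of requests the spider sent.
downloader/request_method_count/GET:
- Number of HTTP GET requests the spider sent.
downloader/response_bytes:
- Total number of bytes in all responses the spider received.
downloader/response_count:
- Total number of responses the spider received.
downloader/response_status_count/:
- Number of responses grouped by HTTP status code. For example, downloader/response_status_count/200 is the number of successful responses with status code 200.
elapsed_time_seconds:
- Time from spider start to spider finish, in seconds.
finish_reason:
- Why the spider stopped. finished means it completed normally; any other value (such as shutdown, for example after Ctrl-C) means the spider was stopped early, either because of an error or for some other reason.
finish_time:
- Time at which the spider finished, stored as a datetime value in UTC.
log_count/:
- Number of log records at each level. For example, log_count/INFO is the number of INFO-level records.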
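response_received_count:
- Number of responses that actually reached the engine after downloader middleware processing. Responses consumed by a middleware, such as redirects that are followed automatically, are generally not counted here, so this value can be lower than downloader/response_count.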
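robotstxt/:
- Statistics about robots.txt requests, such as robotstxt/request_count.
scheduler/:
- Counts of scheduler queue operations. For example, scheduler/dequeued is the number of requests taken out of the queue.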
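
Because every key starts with a component name followed by slash-separated detail, the stats are easy to group by component. The following is a minimal sketch, assuming the stats are available as a plain dict. The helper name group_by_component and the "general" bucket are illustrative choices and are not part of Scrapy; the sample values are copied from the example log in this article.

```python
from collections import defaultdict


def group_by_component(stats):
    """Group stat keys by the text before the first slash.

    Keys without a slash (e.g. 'finish_reason') go under 'general'.
    This helper is hypothetical; Scrapy does not provide it.
    """
    grouped = defaultdict(dict)
    for key, value in stats.items():
        component, sep, rest = key.partition("/")
        if sep:
            grouped[component][rest] = value
        else:
            grouped["general"][key] = value
    return dict(grouped)


# A few entries copied from the example log in this article.
sample_stats = {
    "downloader/request_count": 4,
    "downloader/response_status_count/200": 1,
    "scheduler/dequeued": 2,
    "finish_reason": "finished",
}

grouped = group_by_component(sample_stats)
print(grouped["downloader"])
```

Inside a running spider, the same kind of dict can usually be obtained from the stats collector, for example through crawler.stats.get_stats().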

2 Example walkthrough

2024-10-18 10:58:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 43064,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 20.943191,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 10, 18, 2, 58, 40, 765782),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 10, 18, 2, 58, 19, 822591)}

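downloader/request_count: 4 requests were sent.
downloader/response_count: 4 responses were received.
downloader/response_status_count/200: 1 response had status code 200.
downloader/response_status_count/302: 2 responses had status code 302 (redirect).
downloader/response_status_count/404: 1 response had status code 404 (not found). The robotstxt/response_status_count/404 entry suggests this was the robots.txt request.
elapsed_time_seconds: the spider ran for about 20.9 seconds.
finish_reason: the spider finished normally.
log_count/INFO: 10 INFO-level log records were written.
log_count/WARNING: 1 WARNING-level log record was written.
scheduler/dequeued: 2 requests were taken out of the scheduler.
scheduler/enqueued: 2 requests were put into the scheduler.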
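
Two relationships in this example can be checked with a few lines of Python. The sketch below copies the relevant values from the log above into a dict (the variable names are only for illustration). It checks that the per-status counts add up to downloader/response_count and that elapsed_time_seconds matches finish_time minus start_time.

```python
import datetime

# Values copied from the example log above.
stats = {
    "downloader/response_count": 4,
    "downloader/response_status_count/200": 1,
    "downloader/response_status_count/302": 2,
    "downloader/response_status_count/404": 1,
    "elapsed_time_seconds": 20.943191,
    "finish_time": datetime.datetime(2024, 10, 18, 2, 58, 40, 765782),
    "start_time": datetime.datetime(2024, 10, 18, 2, 58, 19, 822591),
}

# Collect the per-status counters into {status_code: count}.
prefix = "downloader/response_status_count/"
status_counts = {
    key[len(prefix):]: value
    for key, value in stats.items()
    if key.startswith(prefix)
}
total_by_status = sum(status_counts.values())

# Recompute the run time from the two timestamps.
elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()

print(status_counts)
print(total_by_status == stats["downloader/response_count"])
print(abs(elapsed - stats["elapsed_time_seconds"]) < 1e-6)
```

Note also that the log line is stamped 10:58:40 while finish_time shows 02:58:40. The log prefix uses local time, whereas the stats timestamps are in UTC.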

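These statistics give you an overall picture of how the crawl went, which you can use for debugging or optimization. If some values look abnormal (for example a large number of 404 responses or an unusually long run time), check the spider code or the state of the target site.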
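
The kind of check described above can be automated. The following is a minimal sketch, assuming the stats are available as a plain dict. The function name find_warnings and both thresholds are hypothetical example values, not Scrapy defaults, and should be adjusted to the crawl in question.

```python
def find_warnings(stats, max_404_ratio=0.2, max_seconds=3600):
    """Return human-readable warnings for a Scrapy stats dict.

    The thresholds are arbitrary example values, not Scrapy defaults.
    """
    warnings = []

    # Share of all downloaded responses that were 404s.
    responses = stats.get("downloader/response_count", 0)
    not_found = stats.get("downloader/response_status_count/404", 0)
    if responses and not_found / responses > max_404_ratio:
        warnings.append(f"high 404 ratio: {not_found}/{responses}")

    # Unusually long run time.
    if stats.get("elapsed_time_seconds", 0) > max_seconds:
        warnings.append("crawl exceeded the expected run time")

    # Anything other than 'finished' means the spider stopped early.
    reason = stats.get("finish_reason")
    if reason != "finished":
        warnings.append(f"unexpected finish_reason: {reason!r}")

    return warnings


# Values copied from the example log in this article.
example = {
    "downloader/response_count": 4,
    "downloader/response_status_count/404": 1,
    "elapsed_time_seconds": 20.943191,
    "finish_reason": "finished",
}
print(find_warnings(example))
```

With only four responses, a single 404 already exceeds the example threshold, so a ratio-based check like this is more meaningful on larger crawls.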