Building a distributed crawler with scrapy-redis

The scrapy-redis architecture

In scrapy-redis, the scheduler, the duplicate filter and the item pipeline all talk to a shared Redis server, so any number of Scrapy workers on different machines can pull requests from, and write results into, the same queues.

1. Install scrapy-redis

pip install scrapy-redis
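scrapy-redis pulls in the redis-py client as a dependency. Before touching the Scrapy settings it is worth confirming that a Redis server is actually reachable; a quick sanity check (not from the original post), assuming Redis runs locally on the default port 6379:

import redis

# Raises a ConnectionError if no Redis server answers on 127.0.0.1:6379.
r = redis.Redis(host='127.0.0.1', port=6379)
print(r.ping())   # prints True when the connection works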

2. Enable the scrapy-redis components in settings.py

Note: Chinese comments inside settings.py can trigger encoding errors, so keep the comments in English.
# Enable scheduling and storing of the request queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Don't clean up the redis queues, so crawls can be paused and resumed.
SCHEDULER_PERSIST = True
# Queue class used to order the URLs to crawl; the default orders by priority.
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Optional: first-in-first-out ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
# Optional: last-in-first-out ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'
# Only meaningful with SpiderQueue or SpiderStack: maximum idle time before the spider closes.
SCHEDULER_IDLE_BEFORE_CLOSE = 10
# Enable RedisPipeline so scraped items are stored in redis.
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Redis connection parameters
# REDIS_PASS was a password setting I patched into scrapy-redis myself; recent versions
# support password-protected Redis natively through REDIS_PARAMS or REDIS_URL, as below.
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Custom redis client parameters (i.e.: socket timeout, etc.)
REDIS_PARAMS = {}
#REDIS_URL = 'redis://user:pass@hostname:9001'
#REDIS_PARAMS['password'] = 'itcast.cn'
LOG_LEVEL = 'DEBUG'

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
# Note: to share the dedup set between workers, scrapy-redis also ships a
# redis-backed filter: DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# The class used to detect and filter duplicate requests.

# The default (RFPDupeFilter) filters based on the request fingerprint using the
# scrapy.utils.request.request_fingerprint function. In order to change the way
# duplicates are checked you can subclass RFPDupeFilter and override its
# request_fingerprint method. This method should accept a scrapy Request object
# and return its fingerprint (a string).

# By default, RFPDupeFilter only logs the first duplicate request. Setting
# DUPEFILTER_DEBUG to True will make it log all duplicate requests.
DUPEFILTER_DEBUG = True
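To make the comment above concrete, here is a minimal sketch (not part of the original settings) of a custom dupefilter that ignores query strings when fingerprinting; the module path example/dupefilters.py and the class name are assumptions.

# example/dupefilters.py -- hypothetical module; point DUPEFILTER_CLASS at it to use it
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class QueryStrippingDupeFilter(RFPDupeFilter):
    """Treat two requests as duplicates even if only their query strings differ."""

    def request_fingerprint(self, request):
        # Fingerprint a copy of the request with the query string removed.
        stripped = request.replace(url=request.url.split('?')[0])
        return request_fingerprint(stripped)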

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch',
}
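The settings above only wire the scheduler, dupefilter and pipeline into Redis; the spider itself reads its start URLs from a Redis list. A minimal sketch of such a spider (the spider name and the myspider:start_urls key are assumptions, not from the original post):

# spiders/my_spider.py -- minimal scrapy-redis spider sketch
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider'
    # The spider blocks on this Redis list and turns every pushed URL into a request:
    #   redis-cli lpush myspider:start_urls http://example.com/
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # Yield items as usual; RedisPipeline serializes them into Redis.
        yield {'url': response.url, 'title': response.css('title::text').get()}

Start one or more copies of this spider on different machines, then push URLs into the shared list with redis-cli; every idle worker picks up work from the same queue.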

The proxy middleware

The downloader middleware below pops a proxy address from a Redis list for each request and pushes it back only after a successful response, so dead proxies gradually fall out of the pool.

import base64
import redis


class ProxyMiddleware(object):
    def __init__(self, settings):
        self.queue = 'Proxy:queue'
        # Connect to the Redis database that holds the proxy pool.
        # If Redis requires a password, pass it along, e.g.
        # password=settings.get('REDIS_PARAMS')['password']
        self.r = redis.Redis(host=settings.get('REDIS_HOST'),
                             port=settings.get('REDIS_PORT'), db=1)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        proxy = {}
        # Block until a proxy is available, then pop it from the queue.
        source, data = self.r.blpop(self.queue)
        proxy['ip_port'] = data.decode('utf-8')
        # This pool stores no credentials; set user_pass if your proxies need auth.
        proxy['user_pass'] = None

        if proxy['user_pass'] is not None:
            # Authenticated proxy: set the proxy and a Proxy-Authorization header.
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # proxy['user_pass'] is expected in the form "USERNAME:PASSWORD"
            encoded_user_pass = base64.b64encode(proxy['user_pass'].encode('utf-8')).decode('utf-8')
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("********ProxyMiddleware have pass*****" + proxy['ip_port'])
        else:
            # Anonymous proxy: only request.meta['proxy'] is needed.
            print(request.url, proxy['ip_port'])
            request.meta['proxy'] = "http://%s" % proxy['ip_port']

    def process_response(self, request, response, spider):
        """
        Check response.status and decide whether the proxy is still usable
        or should be dropped from rotation.
        """
        print("-------%s %s %s------" % (request.meta["proxy"], response.status, request.url))
        # Only recycle the proxy on a normal 200 response; any other status is
        # treated as a bad proxy, and its address is simply not pushed back.
        if response.status == 200:
            print('rpush', request.meta["proxy"])
            self.r.rpush(self.queue, request.meta["proxy"].replace('http://', ''))
        return response

    def process_exception(self, request, exception, spider):
        """
        Handle connection errors caused by a dead proxy: pop a fresh proxy
        and retry the request with it.
        """
        proxy = {}
        source, data = self.r.blpop(self.queue)
        proxy['ip_port'] = data.decode('utf-8')
        proxy['user_pass'] = None

        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        new_request = request.copy()
        new_request.dont_filter = True
        return new_request
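The post does not show how the middleware is activated. A short sketch, assuming the class lives in example/middlewares.py (adjust the path and priority to your project), plus seeding the proxy pool with example addresses:

# settings.py -- register the middleware (the module path is an assumption)
DOWNLOADER_MIDDLEWARES = {
    'example.middlewares.ProxyMiddleware': 543,
}

# Seed the proxy pool in redis db 1 (the db the middleware connects to), e.g.:
#   redis-cli -n 1 rpush Proxy:queue 1.2.3.4:8080 5.6.7.8:3128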
