Commit 1be0258

committed
Fix some errors; add dedicated spiders and middleware.
1 parent 33a5eee commit 1be0258


42 files changed: +560 −329 lines

.gitignore

mode 100644 → 100755, +2 −2

@@ -176,5 +176,5 @@ pyvenv.cfg
 pip-selfcheck.json
 
 build
-.vscode
-.idea
+.vscode/
+.idea/

README.md

mode 100644 → 100755, +28 −173

@@ -1,175 +1,30 @@
 # Scrapy+
 
-A Scrapy extension toolkit. For detailed usage and configuration, see the book 《虫术——Python绝技》.
-
-## Filters
-
-### Redis dedupe filter `scrapy_plus.dupefilters.RedisDupeFilter`
-
-Stores previously visited URLs in a Redis `Set`.
-
-**Usage**
-
-Add the following to your `settings` file:
-
-```py
-# Override the default dupe filter
-DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisDupeFilter'
-REDIS_PORT = 6379        # Redis server port
-REDIS_HOST = '127.0.0.1' # Redis server address
-REDIS_DB = 0             # Redis database
-```
-
-**Defaults**
-
-```py
-REDIS_PORT = 6379        # Redis server port
-REDIS_HOST = '127.0.0.1' # Redis server address
-REDIS_DB = 0             # Redis database
-```
-
-### Redis Bloom dedupe filter `scrapy_plus.dupefilters.RedisBloomDupeFilter`
-
-Deduplicates URLs with a Bloom filter backed by Redis.
-
-**Usage**
-
-Add the following to your `settings` file:
-
-```py
-# Override the default dupe filter
-DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisBloomDupeFilter'
-REDIS_PORT = 6379        # Redis server port
-REDIS_HOST = '127.0.0.1' # Redis server address
-REDIS_DB = 0             # Redis database
-```
-
-**Defaults**
-
-```
-REDIS_PORT = 6379        # Redis server port
-REDIS_HOST = '127.0.0.1' # Redis server address
-REDIS_DB = 0             # Redis database
-BLOOMFILTER_REDIS_KEY = 'bloomfilter' # dedupe key name
-BLOOMFILTER_BLOCK_NUMBER = 1          # number of blocks
-```
-
-## Middlewares
-
-### Auto-login middleware `scrapy_plus.middlewares.LoginMiddleWare`
-
-```py
-LOGIN_URL = 'site login URL'
-LOGIN_USR = 'username'
-LOGIN_PWD = 'password'
-LOGIN_USR_ELE = 'name attribute of the username input element'
-LOGIN_PWD_ELE = 'name attribute of the password input element'
-DOWNLOADER_MIDDLEWARES = {
-    'scrapyplus.middlewares.LoginMiddleWare': 330
-}
-```
-
-### Chrome browser-emulation middleware `scrapy_plus.middlewares.ChromeMiddleware`
-
-Headless Chrome emulation middleware. Lets the spider fetch target URLs through Chrome, which handles JavaScript-heavy pages.
-
-```py
-SELENIUM_TIMEOUT = 30            # page-load timeout in seconds
-CHROMEDRIVER = "/path/to/chrome" # path to the Chrome driver
-DOWNLOADER_MIDDLEWARES = {
-    'scrapyplus.middlewares.ChromeMiddleware': 800
-}
-
-```
-
-
-### Splash `scrapy_plus.middlewares.SplashSpiderMiddleware`
-
-Splash middleware; forwards requests to a Splash service so the spider gains browser-emulation capability.
-
-```py
-WAIT_FOR_ELEMENT = "selector" # the page counts as loaded once this element appears
-DOWNLOADER_MIDDLEWARES = {
-    'scrapyplus.middlewares.SplashSpiderMiddleware': 800
-}
-```
-
-### Random UA `scrapyplus.middlewares.RandomUserAgentMiddleware`
-
-Randomizes the User-Agent header.
-
-```python
-DOWNLOADER_MIDDLEWARES = {
-    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
-    'scrapyplus.middlewares.RandomUserAgentMiddleware': 500
-}
-# Add more UAs as needed; the middleware picks one at random
-USER_AGENTS = [
-    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
-    'Mozilla/5.0 (Linux; U; Android 2.2) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
-    'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
-    'Mozilla/5.0 (Linux; Android 6.0.1; SM-G532G Build/MMB29T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.83 Mobile Safari/537.36',
-    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6'
-]
-```
-
-### Tor middleware `scrapyplus.middlewares.TorProxyMiddleware`
-
-Tor proxy middleware; keeps your spider rotating through fresh IP addresses.
-
-Requires tor and privoxy to be installed first; see 《虫术——Python绝技》 for configuration details.
-
-```py
-# Tor proxy
-TOR_PROXY = 'http://127.0.0.1:8118' # 8118 is Privoxy's default proxy port
-TOR_CTRL_PORT = 9051
-TOR_PASSWORD = 'mypassword'
-TOR_CHANGE_AFTER_TIMES = 50 # change the IP address after this many requests
-```
-
-## Pipelines
-
-### MongoDB storage pipeline `scrapy_plus.piplines.MongoDBPipeline`
-
-Writes Items directly into a MongoDB database.
-
-**Defaults**
-
-```py
-ITEM_PIPELINES = {'scrapy_plus.pipelines.MongoDBPipeline':2}
-
-MONGODB_SERVER = "localhost"           # MongoDB server address
-MONGODB_PORT = 27017                   # MongoDB server port
-MONGODB_DB = "database name"           # database name
-MONGODB_COLLECTION = "collection name" # collection name
-```
-
-## Feed storages
-
-### SQL database feed storage `scrapy_plus.extensions.SQLFeedStorage`
-
-```py
-# Data storage
-ORM_MODULE = 'movies.entities'
-ORM_METABASE = 'Base'
-ORM_ENTITY = 'Movie'
-
-FEED_FORMAT = 'entity'
-FEED_EXPORTERS = {
-    'entity': 'scrapyplus.extensions.SQLItemExporter'
-}
-
-FEED_URI = 'dialect+driver://username:password@host:port/database' # storage backend connection URI
-FEED_STORAGES = {
-    'sqlite': 'scrapyplus.extensions.SQLFeedStorage',
-    'postgresql': 'scrapyplus.extensions.SQLFeedStorage',
-    'mysql': 'scrapyplus.extensions.SQLFeedStorage'
-}
-```
-
-### Aliyun OSS feed storage `scrapy_plus.extensions.OSSFeedStorage`
-
-```py
-OSS_ACCESS_KEY_ID = ''
-OSS_SECRET_ACCESS_KEY = ''
-```
+A Scrapy extension toolkit, provided for the [《从0学爬虫专栏》](https://www.imooc.com/read/34) column; see the column for detailed usage instructions.
+
+```
+$ pip install scrapy_plus
+```
+
+Scrapy+ provides the following:
+
+- Filters
+  - Redis dedupe filter
+  - Redis Bloom dedupe filter
+- Middlewares
+  - Auto-login middleware
+  - Huaban-specific middleware
+  - General-purpose Chrome middleware
+  - Splash rendering middleware
+  - Tor middleware
+  - Random UA middleware
+  - Random proxy middleware
+- Pipelines
+  - MongoDB storage pipeline
+  - Image pipeline with Aliyun OSS support
+- SQL storage backend
+- Input/output processors
+- Spiders
+  - `BookSpider`
+  - `NeteaseSpider`
  - `TaobaoSpider`
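
Note: the removed README text above documented how these components are wired into Scrapy settings. A minimal enabling sketch based on that removed text (class paths as documented there; the priority numbers and the particular components chosen are illustrative only):

```py
# settings.py (sketch, based on the removed README text above)
DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisDupeFilter'  # or RedisBloomDupeFilter

DOWNLOADER_MIDDLEWARES = {
    'scrapyplus.middlewares.RandomUserAgentMiddleware': 500,  # random User-Agent
    'scrapyplus.middlewares.ChromeMiddleware': 800,           # headless Chrome rendering
}

ITEM_PIPELINES = {'scrapy_plus.pipelines.MongoDBPipeline': 2}  # write Items to MongoDB
```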

requirements.txt

mode 100644 → 100755, +1 −1

@@ -44,7 +44,7 @@ python-dateutil==2.8.0
 pytz==2018.9
 qt5reactor==0.5
 queuelib==1.5.0
-redis==3.2.0
+redis==3.2.1
 regex==2019.2.21
 requests==2.21.0
 Scrapy==1.6.0

scrapy_plus/__init__.py

mode 100644 → 100755 (file mode changed, no content changes)

scrapy_plus/dupefilters/__init__.py

mode 100644 → 100755, +2 −1

@@ -1,3 +1,4 @@
 from .redis import RedisDupeFilter
-#from .bloom import FileBloomDupeFilter
 from .redisbloom import RedisBloomDupeFilter
+
+__all__ = ["RedisBloomDupeFilter", "RedisDupeFilter"]
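
With `__all__` defined, both filters are importable directly from the package; a minimal sketch of the intended import path (usage code assumed, not part of the commit):

```py
# Import the dedupe filters straight from the scrapy_plus.dupefilters package.
from scrapy_plus.dupefilters import RedisDupeFilter, RedisBloomDupeFilter

print(RedisDupeFilter.__name__, RedisBloomDupeFilter.__name__)
```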

scrapy_plus/dupefilters/redis.py

mode 100644 → 100755, +15 −16

@@ -1,37 +1,36 @@
 # -*- coding: utf-8 -*-
 import logging
-from scrapy.utils.request import request_fingerprint
-from redis import StrictRedis
+from redis import Redis
 from scrapy.dupefilters import BaseDupeFilter
 
-BLOOMFILTER_HASH_NUMBER = 6
-BLOOMFILTER_BIT = 30
 
 
 class RedisDupeFilter(BaseDupeFilter):
     """
-    Redis去重过滤器
+    Redis 去重过滤器
     """
+    def __init__(self, host='localhost', port=6379, db=0):
+        self.redis = Redis(host=host, port=port, db=db)
+        self.logger = logging.getLogger(__name__)
 
     @classmethod
     def from_settings(cls, settings):
-        return cls(host=settings.get('REDIS_HOST'),
-                   port=settings.getint('REDIS_PORT'),
-                   db=settings.get('REDIS_DB'))
-
-    def __init__(self, host, port, db):
-        self.redis = StrictRedis(host=host, port=port, db=db)
-        self.logger = logging.getLogger(__name__)
+        host = settings.get('REDIS_HOST', 'localhost')
+        redis_port = settings.getint('REDIS_PORT')
+        redis_db = settings.get('REDIS_DUP_DB')
+        return cls(host, redis_port, redis_db)
 
     def request_seen(self, request):
-        fp = request_fingerprint(request)
-        key = 'UriFingerprints'
-        if self.redis.sismember(key, fp) is None:
+        fp = request.url
+        key = 'UrlFingerprints'
+        if not self.redis.sismember(key, fp):
             self.redis.sadd(key, fp)
             return False
         return True
 
     def log(self, request, spider):
-        msg = ("已过滤的重复请求:%(request)s")
+        msg = ("已过滤的重复请求: %(request)s")
         self.logger.debug(msg, {'request': request}, extra={'spider': spider})
         spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
+
+
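The rewritten `from_settings` now reads `REDIS_DUP_DB` instead of `REDIS_DB`, and `request_seen` stores raw URLs in the `UrlFingerprints` set. A minimal settings sketch matching the new keys (the concrete values are placeholders, not part of the commit):

```py
# settings.py (sketch): keys read by RedisDupeFilter.from_settings after this commit
DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisDupeFilter'
REDIS_HOST = '127.0.0.1'  # Redis server address
REDIS_PORT = 6379         # Redis server port
REDIS_DUP_DB = 0          # Redis database used for deduplication (renamed from REDIS_DB)
```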

scrapy_plus/dupefilters/redisbloom.py

mode 100644 → 100755, +7 −11

@@ -1,10 +1,13 @@
 # -*- coding: utf-8 -*-
 import logging
 from scrapy.utils.request import request_fingerprint
-from redis import StrictRedis
+from redis import Redis
 from hashlib import md5
 from scrapy.dupefilters import BaseDupeFilter
 
+BLOOMFILTER_HASH_NUMBER = 6
+BLOOMFILTER_BIT = 30
+
 
 class SimpleHash(object):
     def __init__(self, cap, seed):
@@ -21,14 +24,7 @@ def hash(self, value):
 class RedisBloomDupeFilter(BaseDupeFilter):
 
     def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilter'):
-        """
-        :param host: the host of Redis
-        :param port: the port of Redis
-        :param db: witch db in Redis
-        :param blockNum: one blockNum for about 90,000,000; if you have more strings for filtering, increase it.
-        :param key: the key's name in Redis
-        """
-        self.redis = StrictRedis(host=host, port=port, db=db)
+        self.redis = Redis(host=host, port=port, db=db)
 
         self.bit_size = 1 << 31  # Redis的String类型最大容量为512M,现使用256M
         self.seeds = [5, 7, 11, 13, 31, 37, 61]
@@ -44,7 +40,7 @@ def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilt
     def from_settings(cls, settings):
         _port = settings.getint('REDIS_PORT', 6379)
         _host = settings.get('REDIS_HOST', '127.0.0.1')
-        _db = settings.get('REDIS_DB', 0)
+        _db = settings.get('REDIS_DUP_DB', 0)
         key = settings.get('BLOOMFILTER_REDIS_KEY', 'bloomfilter')
         block_number = settings.getint(
             'BLOOMFILTER_BLOCK_NUMBER', 1)
@@ -85,4 +81,4 @@ def log(self, request, spider):
         msg = ("已过滤的重复请求: %(request)s")
         self.logger.debug(msg, {'request': request}, extra={'spider': spider})
         spider.crawler.stats.inc_value(
-            'redisbloomfilter/filtered', spider=spider)
+            'redisbloomfilter/filtered', spider=spider)
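The Bloom filter's `from_settings` reads the same Redis keys plus the two `BLOOMFILTER_*` options visible in the diff. A minimal settings sketch (the concrete values are placeholders, not part of the commit):

```py
# settings.py (sketch): keys read by RedisBloomDupeFilter.from_settings after this commit
DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisBloomDupeFilter'
REDIS_HOST = '127.0.0.1'               # Redis server address
REDIS_PORT = 6379                      # Redis server port
REDIS_DUP_DB = 0                       # Redis database used for deduplication
BLOOMFILTER_REDIS_KEY = 'bloomfilter'  # Redis key holding the bit array
BLOOMFILTER_BLOCK_NUMBER = 1           # number of 256 MB blocks backing the filter
```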

scrapy_plus/extensions/__init__.py

mode 100644 → 100755 (file mode changed, no content changes)
