Comments (5)
@FaizShah Hi:
There is a simple way to solve your problem: Ruia lets you customize a request's parameters through middleware. For example:
#!/usr/bin/env python
import random

from ruia import Spider, Middleware

middleware = Middleware()


@middleware.request
async def print_on_request(spider_ins, request):
    request.request_config = {"RETRIES": 3, "DELAY": (abs(random.gauss(1, 0.5)) * 2)}
    print(f"request: {request.request_config}")


class MiddlewareSpiderDemo(Spider):
    start_urls = ["https://httpbin.org/get"]
    concurrency = 10

    async def parse(self, response):
        pages = [f"https://httpbin.org/get?p={i}" for i in range(1, 3)]
        for page in pages:
            yield self.request(url=page)


if __name__ == "__main__":
    MiddlewareSpiderDemo.start(middleware=middleware)
We can see that the terminal has the following output:
request: {'RETRIES': 3, 'DELAY': 2.4240349739809686}
[2019:11:12 20:52:07] INFO Spider Spider started!
[2019:11:12 20:52:07] INFO Spider Worker started: 4535383992
[2019:11:12 20:52:07] INFO Spider Worker started: 4535384128
[2019:11:12 20:52:10] INFO Request <GET: https://httpbin.org/get>
https://httpbin.org/get?p=1
https://httpbin.org/get?p=2
request: {'RETRIES': 3, 'DELAY': 1.6505360498833879}
request: {'RETRIES': 3, 'DELAY': 1.8258791988736391}
[2019:11:12 20:52:12] INFO Request <GET: https://httpbin.org/get?p=1>
[2019:11:12 20:52:13] INFO Request <GET: https://httpbin.org/get?p=2>
[2019:11:12 20:52:14] INFO Spider Stopping spider: Ruia
[2019:11:12 20:52:14] INFO Spider Total requests: 3
[2019:11:12 20:52:14] INFO Spider Time usage: 0:00:06.218620
[2019:11:12 20:52:14] INFO Spider Spider finished!
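To get a feel for the delays that the expression `abs(random.gauss(1, 0.5)) * 2` produces, here is a small standalone sketch that samples it without Ruia. The helper names `sample_delay` and `summarize` are made up for this illustration, and the seed is only there to make the run repeatable.

```python
import random


def sample_delay(rng=random):
    """Return one delay in seconds, using the same expression as the middleware above."""
    # gauss(1, 0.5) is centered on 1; abs() keeps the value non-negative.
    return abs(rng.gauss(1, 0.5)) * 2


def summarize(n=10_000, seed=0):
    """Sample n delays and return (min, mean, max). Hypothetical helper."""
    rng = random.Random(seed)
    delays = [sample_delay(rng) for _ in range(n)]
    return min(delays), sum(delays) / n, max(delays)


if __name__ == "__main__":
    lo, mean, hi = summarize()
    print(f"min={lo:.3f}s mean={mean:.3f}s max={hi:.3f}s")
```

The delays are never negative and average roughly 2 seconds, which is in line with the values printed in the log above.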
More info about middleware
Hi @FaizShah, sorry for my slow reply.
The Request object provides a parameter named request_config, for example:
from ruia import Spider


async def retry_func(request):
    request.request_config["TIMEOUT"] = 10


class RetryDemo(Spider):
    start_urls = ["http://httpbin.org/get"]
    request_config = {
        "RETRIES": 3,
        # add a delay for scraping
        "DELAY": 0,
        "TIMEOUT": 0.1,
        "RETRY_FUNC": retry_func,
    }

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.start()
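The idea behind pairing a short TIMEOUT with a RETRY_FUNC that raises it can be sketched with plain asyncio. This is not Ruia's implementation, only an illustration of the pattern: `call_with_retries`, `raise_timeout`, and `run_demo` are hypothetical names, and the hook here receives the config dict rather than a Ruia request object.

```python
import asyncio


async def call_with_retries(operation, config, retry_func=None):
    """Await operation() under config["TIMEOUT"], retrying up to config["RETRIES"] times."""
    attempts = config.get("RETRIES", 0) + 1
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout=config["TIMEOUT"])
        except asyncio.TimeoutError:
            if attempt == attempts:
                raise  # out of retries, surface the timeout
            if retry_func is not None:
                await retry_func(config)  # let the hook adjust the config
            await asyncio.sleep(config.get("DELAY", 0))


async def raise_timeout(config):
    """Hypothetical retry hook: relax the timeout before the next attempt."""
    config["TIMEOUT"] = 1


def run_demo():
    calls = []

    async def slow_operation():
        calls.append(1)
        await asyncio.sleep(0.05)  # slower than the initial 0.01 s timeout
        return "ok"

    config = {"RETRIES": 3, "DELAY": 0, "TIMEOUT": 0.01}
    result = asyncio.run(call_with_retries(slow_operation, config, raise_timeout))
    return result, config["TIMEOUT"], len(calls)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the first attempt times out, the hook raises the timeout, and the second attempt succeeds, so the operation runs twice in total.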
Hi @howie6879,
Not a problem, it was actually a really quick reply!
This solution is working perfectly for me.
If I just modify
Lines 100 to 101 in c7eb0f5
Perfect, thank you @howie6879