Comments (7)
@hrbrmstr I've added some basic rate limiting in fffe252 - and a warning about reducing it in the readme.
'basic' might be a bit of an understatement, but it's an improvement. I'll give some more thought to it over the weekend.
I'm going to leave this issue open until I'm comfortable about the solution - maybe even until it fetches, parses and respects the robots.txt.
Thanks again!
from meg.
Thanks for the heads up and super detailed reply!
I'm working on rate limiting at the moment. The tool has been for my own use up until now and I've been putting it off - largely because I've been trying to come up with some kind of requeuing mechanism to keep the pipeline of requests as full as possible and avoid stalling goroutines. Given your input I think I'll implement something simpler but less efficient as an interim measure.
Do you think default 5s delay with an override would be OK for now? I think I'm a bit away from fetching and parsing 🤖 yet.
WRT the user agent: yeah; that's a leftover from the only-for-my-use days. I've dropped that in 4418a31. I'll add a 'meg' Mozilla-like user agent soon and give people the option of overriding it if they want to.
WRT HAR format: I'll look into that. The original idea for this tool was to make it easy for me to use standard tools like grep to filter and find things in the output. Is HAR fairly amenable to that?
w/r/t HAR: aye. It's lovely JSON, so you can even use jq to do cmd-line filtering, querying and selecting. Go has some nice HAR tools, as do R and Python. Def not a "must have", and WARC is far more grep-friendly.
ref: https://twitter.com/hrbrmstr/status/947064265621549058
This is a 👍 issue to bring up.
Given that you're directing this at bug bounty hunters, the per-org bug bounty rules of engagement need to state that it's OK to ignore Disallow rules in robots.txt (using 🤖 from now on for that file ref). Why? LinkedIn (and many others in 2017) have successfully punished (civil court) ppl and orgs for ignoring it, and there are a cpl of suits on the bench that could become case law starting in 2018 which would formalize penalties. Verbiage when running it, or inherent checks and the force of a -- switch to override some default check-and-halt behaviour, would help keep the community, and you as a tool author, from being at risk.
🤖 has optional Crawl-Delay settings in it. If not present, the common standard is 5s and the kind standard is 10s between requests to same-origin domain resources. The same suits that are likely becoming case law in 2018 also used ignoring this as a reason for civil damages recompense. Unless the bug bounty program says it's OK to ignore this, a similar default behaviour with forced manual override should be considered.
Slightly on-topic is Line 37 in 29116f8. A -- switch to override this and use one provided by the user, or to short-name select from an internal list of canned ones, would be safer for users and for you. https://github.com/hrbrmstr/splashr/blob/a7c5406264b91918e60e5abf692b51baf5ab2fb7/R/dsl.r#L422-L468 has a good set of ones to use for that purpose, AND the added benefit is that folks can switch OS and from desktop to mobile, since many sites change behaviour with different UAs present.
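The short-name-or-literal override could be as small as a map lookup with a pass-through. A sketch under stated assumptions: the flag name, short names, and UA strings here are all illustrative examples, not meg's actual list:

```go
package main

import (
	"flag"
	"fmt"
)

// cannedUAs maps short names to full User-Agent strings.
// Both the names and the strings are hypothetical examples.
var cannedUAs = map[string]string{
	"chrome-win":  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
	"firefox-mac": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0",
}

// pickUA resolves the user's argument: empty means the default,
// a known short name expands to its canned string, and anything
// else is treated as a literal User-Agent value.
func pickUA(arg, def string) string {
	if arg == "" {
		return def
	}
	if ua, ok := cannedUAs[arg]; ok {
		return ua
	}
	return arg
}

func main() {
	ua := flag.String("ua", "", "user agent: short name or literal string")
	flag.Parse()
	fmt.Println(pickUA(*ua, "meg/0.1"))
}
```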
Completely off-topic: you're capturing tons of juicy data. Consider using HAR format for the output (https://github.com/CyrusBiotechnology/go-har may help with that). It's slightly better than WARC (IMO) for storing some technical info along with the headers and payload but WARC support might also be nice (https://github.com/slyrz/warc may help) since there are at-scale tools to enable working with that.
If I can carve out some cycles in wk2 of Jan (I'm on a pretty big project until then) I'll gladly lend a hand. This is a super nice and focused alternative to generating URL strings for wget (which can do much of this, but not with the 'cyber' eye and purpose you have).
(replying as I read, apologies :-)
Def: a 5s default is, IMO, 100% 👍 https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/
@hrbrmstr superb; thank you so much :)
I'll definitely look more into HAR in that case (although I may use gron for making it easier to grep - if you'll forgive the shameless self-promotion :-P).