Comments (4)
Original comment by [email protected]
on 1 May 2012 at 10:33
- Changed state: Started
from warrick.
Breaking this into two parts:
The first is the subdirectory issue: only recovering from a particular
subdirectory and not creeping into parent directories.
The second is the ability to exclude patterns of resources, such as the
?page=... example in the feature request. I've started the first.
Original comment by [email protected]
on 1 May 2012 at 10:36
Just to be clear:
1. *Crawling* parent directories may still be necessary to ensure complete
coverage of the subdirectory contents for reconstruction.
2. IMHO, excluding a pattern is the general capability, of which excluding
particular directories is a specific use case.
So I still think that the best form for these capabilities is something like
the following:
--donotcrawl PATTERN
--donotreconstruct PATTERN
...where PATTERN is a regex that is matched against the directory path starting
from the site root (that is, from the initial /, not including the domain).
Of the two, --donotreconstruct is more immediately useful, whereas --donotcrawl
is more of a performance optimization.
It might also be nice to be able to specify a file with several patterns, one
per line, rather than repeating a --donot switch several times on the command line.
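To make the proposal concrete, here is a minimal Python sketch of how a list of patterns (one per line) could be matched against the path starting from the site root. It is only an illustration of the idea, not Warrick's code: the function names are hypothetical, and it assumes the query string is part of what gets matched and that each regex is anchored at the initial /.

```python
import re
from urllib.parse import urlsplit


def load_patterns(lines):
    """Compile one regex per non-empty line (hypothetical helper)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def path_from_root(url):
    """Return path plus query, starting at the initial '/', without the domain."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def is_excluded(url, patterns):
    """True if any pattern matches from the site root (assumption: anchored at the start)."""
    target = path_from_root(url)
    return any(p.match(target) for p in patterns)


patterns = load_patterns([r"/plants/search\?page=.*", "", r"/private/"])
print(is_excluded("http://myfolia.com/plants/search?page=2", patterns))
print(is_excluded("http://myfolia.com/plants/3581", patterns))
```

The same check could back either switch: applied before fetching for --donotcrawl, or before writing files for --donotreconstruct.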
Original comment by [email protected]
on 1 May 2012 at 1:09
Completed the -sd flag, which tells Warrick to recover content only from the
specified subdirectory.
For example, if you provide http://myfolia.com/plants/, Warrick will only recover
resources under the myfolia.com/plants subdirectory (it will not go up to the
parent directories).
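As a rough illustration of the subdirectory restriction, the Python sketch below checks whether a candidate URL lies under the starting URL's directory. This is not how Warrick implements -sd (the thread does not show that); the function name in_subdirectory and the exact comparison rules are assumptions made for the example.

```python
from urllib.parse import urlsplit


def in_subdirectory(url, base_url):
    """True if url is on the same host and at or below base_url's directory."""
    base = urlsplit(base_url)
    cand = urlsplit(url)
    # Normalize the base path so "/plants" and "/plants/" behave the same.
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    same_host = cand.netloc.lower() == base.netloc.lower()
    below = cand.path.startswith(base_path) or cand.path + "/" == base_path
    return same_host and below


base = "http://myfolia.com/plants/"
print(in_subdirectory("http://myfolia.com/plants/3581", base))
print(in_subdirectory("http://myfolia.com/about", base))
```

The trailing-slash normalization keeps a sibling path such as /plantsale from being treated as part of /plants/.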
I also added the -ex|--exclude <FILE> option. It takes a file of regular
expressions, and any URL matching one of them is excluded from the
recovery. For example, my test file looks like this:
myfolia\.com\/plants\/3581.*
staticweb\.archive\.org\/.*
myfolia\.com\/plants\/search\?page=.*
Meaning:
1) I don't want anything that starts with myfolia.com/plants/3581.
2) I don't want any styling or JS from the archive.
3) I don't want any search pages from the plants subdirectory.
So I can call Warrick as:
perl warrick.pl -sd -ex /home/jbrunelle/regex.in http://myfolia.com/plants
And I won't get any of the URLs matching the regexes.
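The effect of such an exclude file can be sketched in a few lines of Python. The three patterns are the ones from the test file above; everything else is an assumption for illustration: the sample URLs are made up, the helper names are hypothetical, and the sketch assumes a pattern may match anywhere in the full URL, which may differ from how warrick.pl applies them.

```python
import re

# Patterns copied from the example exclude file, one regex per line.
EXCLUDE_LINES = [
    r"myfolia\.com\/plants\/3581.*",
    r"staticweb\.archive\.org\/.*",
    r"myfolia\.com\/plants\/search\?page=.*",
]


def compile_excludes(lines):
    """Compile each non-empty line into a regex (hypothetical helper)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def filter_urls(urls, excludes):
    """Keep only URLs that match none of the exclude patterns."""
    return [u for u in urls if not any(p.search(u) for p in excludes)]


# Made-up candidate URLs, for illustration only.
urls = [
    "http://myfolia.com/plants/3581-tomato",
    "http://staticweb.archive.org/css/style.css",
    "http://myfolia.com/plants/search?page=2",
    "http://myfolia.com/plants/12-basil",
]
kept = filter_urls(urls, compile_excludes(EXCLUDE_LINES))
print(kept)
```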
Original comment by [email protected]
on 4 May 2012 at 2:51
- Changed state: Fixed