lingchant / crawl.py Goto Github PK
View Code? Open in Web Editor NEWThis project forked from shaddi/crawl.py
Web crawler based on QtWebKit that clicks on /everything/.
This project forked from shaddi/crawl.py
Web crawler based on QtWebKit that clicks on /everything/.
*Intro* crawl.py: A QtWebKit based web crawler that is designed to click on everything, even things in iframes, for a list of links and to a certain depth, with the goal of investigating distribution of malware via advertising networks. While it can crawl through all links on a set of pages, it includes optimizations to reduce crawling time. Fully headless, but based on a real web browser (WebKit) so it actually renders the page as a user would see it and properly evaluates JavaScript (in theory). get-links.py: A QtWebKit based link scraper for a single page. *Usage* To run: python crawl.py <link depth> <domain with http://> Results come back as: <depth> <site visited> <site's parent> Example: $ python crawl.py 2 http://example.com 0 http://example.com None 1 http://www.icann.org/ http://example.com 2 http://gsa.icann.org/search?access=p&client=icann&proxystylesheet=icann&output=xml_no_dtd&site=icann&q=&proxycustom=%3CADVANCED/%3E http://www.icann.org/ 2 http://twitter.com/icann/ http://www.icann.org/ 2 http://blog.icann.org http://www.icann.org/ 2 http://meetings.icann.org http://www.icann.org/ 2 http://hostedjobs.openhire.com/epostings/submit.cfm?fuseaction=app.allpositions&company_id=16025&version=1 http://www.icann.org/ 2 http://www.root-dnssec.org/ http://www.icann.org/ 2 http://svsf40.icann.org/ http://www.icann.org/ 2 http://www.iana.org http://www.icann.org/ 2 http://www.atlarge.icann.org http://www.icann.org/ 2 http://aso.icann.org/ http://www.icann.org/ 2 http://ccnso.icann.org http://www.icann.org/ 2 http://gac.icann.org/ http://www.icann.org/ 2 http://gnso.icann.org http://www.icann.org/ 2 http://www.internic.net/whois.html http://www.icann.org/ *Crawler class* To create a Crawler object, use this: Crawler(url_list, max_depth, dots=True, skip_same_domain=False, debug=False) where url_list is a list of url strings you want to crawl, and max_depth is the click depths you're interested in. Crawling results are stored as a list of results (Crawler.results), each element of which contains a single URL's crawl tree. There are some options inside the Crawler class for configuring output: - dots=True: turn on some status dots to indicate that progress is occuring. - skip_same_domain=True: skips links to the same domain as the current page. - debug=True: as expected, provides some extra debug verbosity (exactly what depends on the revision you're using!) There are two key functions in the Crawler class: - process(url, ttl=10, log=False, strip_dupes=True, debug=False, round_two=False) This function extracts all the urls that are available for a user to click on a single page. This includes links on the page itself, as well as those contained in any iframes on the page. Because iframes can contain other iframes, redirects, etc, we use the ttl field to prevent gettting lost in a particularly nasty iframe. If log=True, we log the (prettified) HTML we pulled in to a file. strip_dupes=True means we remove duplicate links from the result set. round_two is a marker for handling links inside iframes: think of it as a finishing move. - crawl(url): This starts a crawl at a particular URL. It basically builds a crawl tree using process() and stores it in the results list. *Earl class* The results list of a crawler is a list of Earl objects. Each Earl is a node of a crawl tree. It has four attributes: - value: the URL of this node of the crawl tree - depth: the click depth we were at when we discovered it - parent: the Earl of the page on which on which this URL was discovered - children[]: a list of Earls of URLs which were reached form this page There is one function, show(). Earl.show() prints a crawl tree, as shown in the example output above. *Acks* QtWebKit code taken from here: http://blog.sitescraper.net/2010/06/scraping-javascript-webpages-in-python.html *Author* Shaddi Hasan ([email protected]) March 2011
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.