Web Crawler and Scraper

A web crawler and scraper, built in Python 3.9.
Works with HTTP(S) and FTP(S) links.
Allows you to optionally scrape for email addresses and phone numbers.
This is a pre-release; documentation and command-line flags will come with the full release.

License: GNU General Public License v3.0
When redundant logging is enabled, the second, third, etc. time a URL is logged with its previously crawled results, it shows a different (usually smaller/incomplete) set of URLs than the original entry.
For example, if example.org were crawled, the discovered URLs would be logged; say there were 10 of them.
But if example.org is discovered again with redundant logging enabled, the new entry might show it having only, say, 3 URLs this time, even though those URLs are (or should be) stored in the dict that the redundant info is pulled from.
The first appearance of each URL seems to be reliable. The problem only appears when a URL is redundantly logged.
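A hedged guess at the failure mode (the helper names below are hypothetical, not from the project): if the results dict stores a *reference* to a list that the crawler later mutates and reuses for the next page, redundant log entries replay the mutated list rather than the original one. Storing a copy at log time would freeze each entry.

```python
crawled = {}  # URL -> list of URLs discovered on that page

def log_results_buggy(url, found_urls):
    # Stores a reference: later mutation of found_urls rewrites this entry.
    crawled[url] = found_urls

def log_results_fixed(url, found_urls):
    # Stores a snapshot, so the entry is frozen at log time.
    crawled[url] = list(found_urls)

found = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
log_results_buggy("http://example.org", found)
found.clear()                         # crawler reuses the list for the next page
found.append("http://example.org/x")
print(crawled["http://example.org"])  # ['http://example.org/x'] -- not the original 3
```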
I ran this scraper exactly as it was written, only changing the log path to a .txt file in a Windows folder. It captures about half the email addresses on a given webpage, but never captures phone numbers. I'm running the code against an entire website, not just a single webpage, but the error occurs even when I run it against a single webpage with multiple phone numbers listed.
Windows 10, Python 3.8, PyCharm.
Please note: I'm new to Python, so it's possible the error is on my end.
Edit: Ran scraper against this link because it has lots of phone/email: https://www.hamradio.com/contact.cfm
Result:
```
Crawling https://www.hamradio.com/contact.cfm
Emails:
Phone Numbers:

Process finished with exit code 0
```
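One possible culprit for the empty phone-number result above is an over-strict regex. A minimal sketch, assuming the scraper matches numbers with Python's `re` module (the pattern below is an illustration, not the project's actual pattern), that catches the common US formats seen on contact pages:

```python
import re

# Matches formats like (555) 123-4567, 555-123-4567, 555.123.4567,
# and 555 123 4567. No capturing groups, so findall() returns the
# full matched strings.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

html = "Call (555) 123-4567 or 555-987-6543 for assistance."
print(PHONE_RE.findall(html))  # ['(555) 123-4567', '555-987-6543']
```

If the scraper's pattern requires, say, an exact `XXX-XXX-XXXX` layout, numbers rendered with parentheses or dots on the page would be silently skipped.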
FTP support currently has what appears to be an off-by-one bug: it stores the value of i from the for loop as though it were an element to build a URL from. There is also an occasional index-out-of-bounds error, which is most likely related to this, but could potentially be a separate bug entirely. I will be fixing this the next time I work on this project.
Original URL: http://xts.site.nfoservers.com
Problem URL: http://xts.site.nfoservers.com/#content
Config: Depth 3, redundant logging enabled
Issue: The crawler attempts to crawl http://xts.site.nfoservers.com/#content/#content, despite #content only appearing once on the page.
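A minimal sketch of how this can happen (hypothetical names, assuming discovered hrefs are appended to the page URL by string concatenation): a fragment-only link such as `#content` re-attaches itself on every crawl pass. Stripping fragments with `urllib.parse.urldefrag` before queueing the URL stops the duplication.

```python
from urllib.parse import urldefrag

page_url = "http://xts.site.nfoservers.com/#content"
href = "#content"

# Naive concatenation compounds the fragment on each pass.
naive = page_url + "/" + href
print(naive)   # 'http://xts.site.nfoservers.com/#content/#content'

# Drop the fragment before queueing; fragments never change the fetched resource.
clean, _frag = urldefrag(page_url)
print(clean)   # 'http://xts.site.nfoservers.com/'
```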
Displaying URLs that have already been crawled should show all the previously discovered URLs for that link, but instead usually shows only around 1-3 (even if there were, say, 20). In addition, it says "Crawling..." for a root URL that has already been crawled.