Web Crawler and Scraper

A web crawler and scraper, built in Python 3.9.
Works with HTTP(S) and FTP(S) links.
Allows you to optionally scrape for email addresses and phone numbers.
This is a pre-release; documentation and command-line flags will come with the full release.

License: GNU General Public License v3.0
When redundant logging is enabled, the second, third, etc. time a URL is logged with its previously crawled results, it shows a different (usually smaller/incomplete) set of URLs than the original entry.
For example, if example.org were crawled, the discovered URLs would be logged; say there were 10 of them.
But if example.org is discovered again with redundant logging enabled, the new entry might show it having only, say, 3 URLs this time, even though those URLs are (or should be) stored in the dict that the redundant info is pulled from.
The first appearance of each URL seems to be reliable. The problem only appears when a URL is redundantly logged.
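A hedged guess at the failure mode (the helper names below are hypothetical, not from the project): if the results dict stores a *reference* to a list that the crawler later mutates and reuses for the next page, redundant log entries replay the mutated list rather than the original one. Storing a copy at log time would freeze each entry.

```python
crawled = {}  # URL -> list of URLs discovered on that page

def log_results_buggy(url, found_urls):
    # Stores a reference: later mutation of found_urls rewrites this entry.
    crawled[url] = found_urls

def log_results_fixed(url, found_urls):
    # Stores a snapshot, so the entry is frozen at log time.
    crawled[url] = list(found_urls)

found = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
log_results_buggy("http://example.org", found)
found.clear()                         # crawler reuses the list for the next page
found.append("http://example.org/x")
print(crawled["http://example.org"])  # ['http://example.org/x'] -- not the original 3
```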
I ran this scraper exactly as it was written, only changing the log path to a .txt file in a Windows folder. It captures about half the email addresses on a given webpage, but never captures phone numbers. I'm running the code against an entire website, not just a single webpage, but the error occurs even when I run it against a single webpage with multiple phone numbers listed.
Windows 10, Python 3.8, PyCharm.
Please note: I'm new to Python, so it's possible the error is on my end.
Edit: Ran scraper against this link because it has lots of phone/email: https://www.hamradio.com/contact.cfm
Result:
```
Crawling https://www.hamradio.com/contact.cfm
Emails:
Phone Numbers:

Process finished with exit code 0
```
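One possible culprit for the empty phone-number result above is an over-strict regex. A minimal sketch, assuming the scraper matches numbers with Python's `re` module (the pattern below is an illustration, not the project's actual pattern), that catches the common US formats seen on contact pages:

```python
import re

# Matches formats like (555) 123-4567, 555-123-4567, 555.123.4567,
# and 555 123 4567. No capturing groups, so findall() returns the
# full matched strings.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

html = "Call (555) 123-4567 or 555-987-6543 for assistance."
print(PHONE_RE.findall(html))  # ['(555) 123-4567', '555-987-6543']
```

If the scraper's pattern requires, say, an exact `XXX-XXX-XXXX` layout, numbers rendered with parentheses or dots on the page would be silently skipped.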
FTP support currently has what appears to be an off-by-one bug: it stores the value of i from the for loop as though it were an element to build a URL from. There is also an occasional index-out-of-bounds error, which is most likely related to this, but could potentially be a separate bug entirely. I will be fixing this the next time I work on this project.
Original URL: http://xts.site.nfoservers.com
Problem URL: http://xts.site.nfoservers.com/#content
Config: Depth 3, redundant logging enabled
Issue: The crawler attempts to crawl http://xts.site.nfoservers.com/#content/#content, despite #content only appearing once on the page.
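A minimal sketch of how this can happen (hypothetical names, assuming discovered hrefs are appended to the page URL by string concatenation): a fragment-only link such as `#content` re-attaches itself on every crawl pass. Stripping fragments with `urllib.parse.urldefrag` before queueing the URL stops the duplication.

```python
from urllib.parse import urldefrag

page_url = "http://xts.site.nfoservers.com/#content"
href = "#content"

# Naive concatenation compounds the fragment on each pass.
naive = page_url + "/" + href
print(naive)   # 'http://xts.site.nfoservers.com/#content/#content'

# Drop the fragment before queueing; fragments never change the fetched resource.
clean, _frag = urldefrag(page_url)
print(clean)   # 'http://xts.site.nfoservers.com/'
```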
Displaying URLs that have already been crawled should show all the previously discovered URLs for that link, but instead usually shows only around 1-3 (even if there were, say, 20). In addition, it says "Crawling..." for a root URL that has already been crawled.