
z7r1k3 / creeper

12 stars, 3 watchers, 1 fork, 78 KB

Web Crawler and Scraper

License: GNU General Public License v3.0

Python 100.00%
web-crawler web-scraper cli windows linux macos osint information-gathering redteam python python39 scraper crawler web-crawler-python

creeper's Introduction

Creeper

A Web Crawler and Scraper, built in Python 3.9

Works with HTTP(S) and FTP(S) links.

Allows you to optionally scrape for emails and phone numbers.

This is a pre-release. Documentation and command-line flags to come with full release.
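Until the documentation lands, here is a minimal sketch of the general crawl-and-scrape idea using only the Python standard library. This is an illustration under assumptions, not Creeper's actual code or interface; the regexes and function names are hypothetical.

```
import re
import urllib.request
from urllib.parse import urljoin

# Illustrative patterns, not Creeper's actual regexes
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d{0,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(url, depth=1, seen=None):
    """Fetch a page, follow its HTTP(S) links to the given depth, and collect emails/phones."""
    seen = seen if seen is not None else set()
    if depth < 0 or url in seen:
        return set(), set(), set()
    seen.add(url)
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except (OSError, ValueError):
        return set(), set(), set()
    links = {urljoin(url, href) for href in LINK_RE.findall(html)}
    links = {link for link in links if link.startswith(("http://", "https://"))}
    emails, phones = set(EMAIL_RE.findall(html)), set(PHONE_RE.findall(html))
    for link in sorted(links):
        child_links, child_emails, child_phones = crawl(link, depth - 1, seen)
        links = links | child_links
        emails |= child_emails
        phones |= child_phones
    return links, emails, phones

if __name__ == "__main__":
    found_links, found_emails, found_phones = crawl("https://example.org", depth=1)
    print("Emails:", *sorted(found_emails), sep="\n  ")
    print("Phone Numbers:", *sorted(found_phones), sep="\n  ")
```

Creeper itself also handles FTP(S) links and logging; the sketch above only illustrates the HTTP(S) path.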

creeper's People

Contributors

blueeagle, z7r1k3


Forkers

firmangel8

creeper's Issues

Redundantly Logged URLs Show Conflicting Information

When redundant logging is enabled, the second and subsequent times a URL is logged with its previously crawled results, it shows a different set of URLs (usually smaller and incomplete) than the original entry.

For example, if example.org were crawled, the URLs discovered there would be logged; say there were 10 of them.

But if example.org is discovered again with redundant logging enabled, the new entry might only show, say, 3 URLs this time, even though the full set is (or should be) stored in the dict that the redundant info is pulled from.

The first appearance of each URL seems to be reliable; the problem only occurs when a URL is redundantly logged.
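A hypothetical sketch of the invariant a fix would need to enforce: the complete result set is written to the dict exactly once, and every redundant log entry reads from that same cached set. Names here are illustrative, not Creeper's actual identifiers.

```
crawl_results = {}  # url -> set of URLs discovered the first time the page was crawled

def record(url, discovered):
    # Write the complete result once; later partial re-discoveries must not overwrite it.
    if url not in crawl_results:
        crawl_results[url] = set(discovered)

def log_redundant(url):
    # Both the first and every redundant entry read the same cached set,
    # so they always report the same URLs and the same count.
    cached = crawl_results.get(url, set())
    print(f"{url} ({len(cached)} URLs, previously crawled):")
    for child in sorted(cached):
        print("  " + child)
```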

Unable to Crawl JS-Dependent Sites

I ran this scraper exactly as written, only changing the log path to a .txt file in a Windows folder. It captures about half the email addresses on a given webpage but never captures phone numbers. I am running it against an entire website, not just a single webpage, but the problem occurs even against a single webpage with multiple phone numbers listed.
Windows 10, Python 3.8, PyCharm.
Please note: I'm new to Python, so it's possible the error is on my end.

Edit: I ran the scraper against this link because it has lots of phone numbers and email addresses: https://www.hamradio.com/contact.cfm

Result:

```
Crawling https://www.hamradio.com/contact.cfm

Emails:

Phone Numbers:

Process finished with exit code 0
```
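One quick way to confirm whether this is a JS-rendering problem rather than a regex problem is to check the raw HTML that a plain HTTP fetch returns: if the numbers are not in it, no regex-based scraper can find them. A rough diagnostic sketch follows; the phone regex is an assumption, not Creeper's.

```
import re
import urllib.request

# Loose North American-style pattern, purely illustrative
PHONE_RE = re.compile(r"\+?\d{0,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def raw_html_has_phone(url):
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    return bool(PHONE_RE.search(html))

# False here would mean the numbers are injected by client-side JavaScript,
# which a plain fetch-and-regex scraper cannot see without a headless browser.
print(raw_html_has_phone("https://www.hamradio.com/contact.cfm"))
```

If the raw HTML does contain the numbers, the problem is more likely the phone regex or the markup immediately around the numbers.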

FTP Crawling Has an Off-by-One Bug

FTP support currently has what appears to be an off-by-one bug: the for loop stores the value of the index i as though it were an element to build a URL from. There is also an occasional index-out-of-bounds error, which is most likely related but could be a separate bug entirely. I will fix this the next time I work on this project.
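For illustration only (this is not the actual Creeper code), the class of bug being described looks like this:

```
listing = ["pub", "docs", "readme.txt"]  # e.g. entries from an FTP directory listing

# Buggy pattern: the loop counter i is appended where listing[i] was intended,
# so integers end up among the URL pieces; a bound such as range(len(listing) + 1)
# would additionally raise IndexError on the final pass when listing[i] is used.
pieces = []
for i in range(len(listing)):
    pieces.append(i)                       # should be listing[i]

# Fixed pattern: iterate over the elements themselves, no index arithmetic.
pieces = ["ftp://ftp.example.org"] + listing
print("/".join(pieces))                    # ftp://ftp.example.org/pub/docs/readme.txt
```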

Displaying hasCrawled URLs Does Not Show All Data

Displaying URLs that have already been crawled should show all the previously discovered URLs for the link, but it usually only shows around 1-3 (even if there were, say, 20). In addition, it says "Crawling..." for a root URL that has already been crawled.
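A hypothetical sketch of the intended branch (identifiers are illustrative): "Crawling..." should only be printed for roots not yet in the hasCrawled store, and a cached root should print its complete stored set.

```
def show_root(url, has_crawled):
    if url in has_crawled:
        print(f"Already crawled {url}; {len(has_crawled[url])} URLs on record:")
        print("\n".join(sorted(has_crawled[url])))
    else:
        print(f"Crawling {url}")
        # ...crawl, then store the full result: has_crawled[url] = discovered
```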
