repo-scraper's Introduction

#repo_scraper

Check your projects for possible password (or other sensitive data) leaks.

The library exposes two commands:

  • check-dir - Performs checks on a folder and subdirectories
  • check-repo - Performs a check in a git repository

Both scripts work almost the same from the user's point of view. Run check-dir --help or check-repo --help for more details.

##Example

Check your dummy-project:

check-dir dummy-project

Output:

Checking folder dummy-project...

dummy-project/python_file_with_password.py
ALERT - MATCH ["password = 'qwerty'"]

dummy-project/dangerous_file.json
ALERT - MATCH ['"password": "super-secret-password"']

##How does it work?

Briefly, check-dir lists all files below a folder and applies regular expressions to look for passwords and IPs. Because a blind search could take impractically long (for example, if the repo contains a 50MB csv file), some filters are applied before the regular expressions are matched:

  • File size - If a file is bigger than 1MB, ignore it but print a warning
  • Extension - If the extension is not allowed, ignore the file but print a warning. (See NOTES to know why extension is used instead of mimetype)
  • Base64 - If a file contains Base64 data, remove it. Many plain-text formats (such as Jupyter notebooks) embed data in Base64 format, and applying regular expressions to such files would take impractically long
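The three filters above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the library's actual code: the names `should_skip` and `strip_base64`, the extension allow-list, and the Base64 heuristic (runs of 64 or more Base64 characters) are all assumptions made for the example.

```python
import os
import re

MAX_SIZE_BYTES = 1024 * 1024  # the 1MB limit described above
# Hypothetical allow-list; the real list lives in the library itself
ALLOWED_EXTENSIONS = {".py", ".json", ".txt", ".ipynb"}
# Hypothetical heuristic: a long run of Base64 alphabet characters
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{64,}={0,2}")


def should_skip(path, size_bytes):
    """Return a warning message if the file should be ignored, else None."""
    if size_bytes > MAX_SIZE_BYTES:
        return "WARNING - %s is bigger than 1MB, ignoring" % path
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "WARNING - %s has a non-allowed extension, ignoring" % path
    return None


def strip_base64(text):
    """Remove long Base64-looking runs so the regex pass stays fast."""
    return BASE64_RUN.sub("", text)
```

The size and extension checks only need file metadata, so they can run before the file is read at all. Only files that pass both need to be loaded and stripped of Base64 data.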

check-repo works in a slightly different way. One obvious way to check git history is to check out each commit and apply check-dir, but that approach would be really slow since the script would check the same files many times. Instead, check-repo checks out the first commit, runs check-dir there, and then moves up one commit at a time, using git diff to get only the difference between each consecutive pair of commits.
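A rough sketch of that traversal in Python 3 is shown below. The helper names (`list_commits`, `consecutive_pairs`, `diff_between`) are hypothetical and are not taken from the library. The sketch assumes `git` is on the PATH and uses only `git rev-list --reverse` and `git diff`.

```python
import subprocess


def list_commits(repo_path):
    """Return commit hashes reachable from HEAD, oldest first."""
    out = subprocess.check_output(
        ["git", "rev-list", "--reverse", "HEAD"], cwd=repo_path
    )
    return out.decode("utf-8").split()


def consecutive_pairs(commits):
    """Pair each commit with the one that follows it."""
    return list(zip(commits, commits[1:]))


def diff_between(repo_path, older, newer):
    """Return only what changed between two commits."""
    out = subprocess.check_output(["git", "diff", older, newer], cwd=repo_path)
    return out.decode("utf-8", errors="replace")
```

With these helpers, the first commit would be scanned in full, and every later commit would be scanned only through `diff_between(repo, older, newer)` for each pair from `consecutive_pairs(list_commits(repo))`.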

As in check-dir, the script applies some filters before applying regular expressions to avoid getting stuck on big files. Note that in this case we are not dealing with files but with the git diff output, which prevents us from checking file size directly:

  • Number of lines -
  • Number of characters -
  • Extension - If extension is not allowed, ignore file but print a warning. (See NOTES to know why extension is used instead of mimetype)
  • Base64 - Remove Base64 code.

The project has some limitations. See the NOTES file for information regarding the design of the project and how that limits what the library is able to detect.

##Installation

    pip install git+git://github.com/dssg/repo-scraper.git -r requirements.txt

##Dependencies

  • glob2
  • nose (optional, for running tests)

##Tested with

  • Python 2.7.10
  • Git 2.6.0

##Usage

    cd path/to/your/project
    check-dir

See help for more options available:

    check-dir --help

###Using an IGNORE file with check-dir

Just as with git, you can specify a file to make the program ignore some files/folders. This is especially useful when you have a folder with many log files that you are sure do not contain sensitive data. The library assumes one glob rule per line.

Adding an IGNORE file also makes execution faster, since many regular expressions are matched against every file that passes the filters.

Important: Even though the format is very similar, you cannot use the same rules as in your .gitignore file. For more details, see this.
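To illustrate the "one glob rule per line" idea, here is a minimal Python 3 sketch. It uses the standard library's `fnmatch` rather than glob2, so its matching semantics may differ from what the library actually does. The function names and the sample rules are assumptions made for the example.

```python
import fnmatch


def load_rules(text):
    """Parse IGNORE file contents: one glob rule per line, blanks skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_ignored(path, rules):
    """Return True if the path matches any of the glob rules."""
    return any(fnmatch.fnmatch(path, rule) for rule in rules)


# Hypothetical IGNORE file contents
rules = load_rules("logs/*\n\n*.csv\n")
print(is_ignored("logs/app.log", rules))
print(is_ignored("src/main.py", rules))
```

Note that in `fnmatch` a `*` also matches path separators, which is one of the ways plain glob rules differ from .gitignore patterns.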

##What's done

  • Passwords (using regex). See test_password_check.py
  • IPs
  • URLs on amazonaws.com (it's simple to add more domains if needed)
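As a rough idea of what regex-based checks for the first two items can look like, here is a Python 3 sketch. The two patterns are illustrative assumptions, not the expressions used by the library (see test_password_check.py for the cases it actually covers). They will miss many real-world formats and can produce false positives.

```python
import re

# Hypothetical pattern: an optionally quoted "password" key followed by
# = or : and a quoted value
PASSWORD = re.compile(
    r"""["']?password["']?\s*[:=]\s*["'][^"']+["']""", re.IGNORECASE
)
# Hypothetical pattern: four dot-separated groups of 1-3 digits
# (does not validate the 0-255 range)
IP = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_matches(text):
    """Return every password-like or IP-like substring found in text."""
    return PASSWORD.findall(text) + IP.findall(text)


print(find_matches("password = 'qwerty'"))
print(find_matches('{"password": "super-secret-password"}'))
```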

##What's missing

  • URLs
  • Check other branches apart from master

#TODO

  • Come up with a cool name

repo-scraper's Issues

Extension Error

Getting this after installing requirements:

    root@kali:~/repo-scraper# check-dir
    Traceback (most recent call last):
      File "/usr/local/bin/check-dir", line 4, in <module>
        __import__('pkg_resources').run_script('repo-scraper==0.1', 'check-dir')
      File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 658, in run_script
        self.require(requires)[0].run_script(script_name, ns)
      File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
        exec(code, namespace, namespace)
      File "/usr/local/lib/python2.7/dist-packages/repo_scraper-0.1-py2.7.egg/EGG-INFO/scripts/check-dir", line 2, in <module>
        from repo_scraper.constants.extensions import *
    ImportError: No module named constants.extensions
