
commoncrawler's Introduction

Common Crawler

🕸 A simple and easy way to extract data from Common Crawl with little or no hassle.

Go Version License Build Status Go Report Card

Notice in regards to development

Currently I do not have the capacity to hire full time; however, I do intend to hire someone to help build infrastructure related to Common Crawl. All Gitcoin bounties are currently on hold. When I have time to invest further in this project, I will discuss bringing on a full-time DevOps developer to work on it. All payment will be in DAI, with a resource allocation of approximately 5k/mo.

As a GUI

An Electron-based interface that works with a Go server will be available.

As a library

Install as a dependency:

go get github.com/ChrisCates/CommonCrawler

Access the library functions by importing it:

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}

(The function names are capitalized because Go only exports capitalized identifiers; unexported names like `scan` would not be callable from another package.)

As a command line tool

Install from source:

go install github.com/ChrisCates/CommonCrawler@latest

Or you can download the prebuilt binary with curl from GitHub:

curl -L https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
chmod +x commoncrawler

Then run as a binary:

# Output help
commoncrawler --help

# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths

# Start crawling the web
commoncrawler start --stop -1

Compilation and Configuration

Installing dependencies

go get github.com/logrusorgru/aurora

Downloading data with the application

First configure the type of data you want to extract.

// Config is the preset variables for your extractor
type Config struct {
    baseURI     string
    wetPaths    string
    dataFolder  string
    matchFolder string
    start       int
    stop        int
}

// Defaults
Config{
    start:       0,
    stop:        5,
    baseURI:     "https://commoncrawl.s3.amazonaws.com/",
    wetPaths:    path.Join(cwd, "wet.paths"),
    dataFolder:  path.Join(cwd, "/output/crawl-data"),
    matchFolder: path.Join(cwd, "/output/match-data"),
}

With Docker

docker build -t commoncrawler .
docker run commoncrawler

Without Docker

go build -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler

Or you can simply run it:

go run src/*.go

Resources

  • MIT Licensed

  • If there is interest or need, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca

  • Feel free to post valid issues; I may fund them based on priority.

commoncrawler's People

Contributors

chriscates, iamonuwa, lastpossum


commoncrawler's Issues

Test Coverage with CodeCov

Summary

Full test coverage of all components, passing on Travis CI and on Unix. Branch coverage should be 100%. CodeCov is strongly preferred over other coverage services.

Requirements

Several tests are required for each modular component and functionality.

Tests needed:

  • crawl.go

  • config.go

  • scan.go

  • extract.go

  • analyze.go

These tests should be able to run successfully in Travis CI and upload the results to CodeCov. Please add a badge in the README.md of CodeCov.
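As a starting point, a table-driven test for a small pure helper might look like the following; `buildURL` is a hypothetical helper written for this sketch, not a function taken from the repository.

```go
package main

import (
	"strings"
	"testing"
)

// buildURL is a hypothetical helper under test: it joins the S3 base URI
// with one relative path taken from wet.paths.
func buildURL(baseURI, wetPath string) string {
	return strings.TrimRight(baseURI, "/") + "/" + strings.TrimLeft(wetPath, "/")
}

// TestBuildURL checks the join behaves the same regardless of slashes.
func TestBuildURL(t *testing.T) {
	cases := []struct{ base, path, want string }{
		{"https://commoncrawl.s3.amazonaws.com/", "crawl-data/a.wet.gz",
			"https://commoncrawl.s3.amazonaws.com/crawl-data/a.wet.gz"},
		{"https://commoncrawl.s3.amazonaws.com", "/crawl-data/a.wet.gz",
			"https://commoncrawl.s3.amazonaws.com/crawl-data/a.wet.gz"},
	}
	for _, c := range cases {
		if got := buildURL(c.base, c.path); got != c.want {
			t.Errorf("buildURL(%q, %q) = %q, want %q", c.base, c.path, got, c.want)
		}
	}
}
```

Tests in this style run under `go test ./...`, which is also what Travis CI would invoke before uploading coverage to CodeCov.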

Payment

2 ETH will be paid upon completion of this bounty. Relies on #5 being complete in order to finish.

Electron GUI

Summary

An electron based GUI to query CommonCrawl servers.

Requirements

  • Must be a Typescript based Electron app that can compile on Windows, Linux and Mac OS systems.

  • Must be able to filter data based on keyword searches.

  • Must be able to navigate all historical CommonCrawl data.

Windows Docker CI Build

Summary

Docker for Windows is currently not functioning well. Based on my research, there does not seem to be any reasonable way to run a Linux-based container through Travis CI. This item is backlogged until Travis CI has better support for Windows Docker containers.

Binary is cURLable from web

Summary

Please make this binary downloadable from the internet. Ideally, it should be cURLable from GitHub as a release.

Requirements

  1. Must be able to wget or curl from the internet and be able to run in Linux, Unix and Windows environments.

  2. Must update the Travis CI demonstrating actions for Linux, Unix and Windows.

  3. Must provide a webhook for .git that will automatically build and compile binaries for Linux, Unix and Windows.

Payment

1 ETH will be paid upon completion of this bounty. Please refer to #5 as that must be complete before Payment is provided.

Accessible as a CLI binary

Summary

Make this program accessible both as a Go library and from the terminal, and ensure that it works correctly in the terminal.

Requirements

  • Must have a download archive feature so that you can get latest entries from 2019 and beyond.

  • Must download files autonomously from a certain date range.

  • Must be able to extract compressed .wet files.

  • Please review the README.md for the proposed functionality.

Payment

  • Once #10 is complete, the bounty for this will increase to 2.5 ETH

Docker Container

Summary

A Docker Container that works effectively in Linux, Unix and Windows systems.

Requirements

  • Must update the .travis.yml file with passing build configurations for Linux, Unix, and Windows environments. Please remove: go test -v -race ./... and simply have the program download one .wet file instead.

  • The shell commands clean.sh and extract.sh are tested in the continuous integration suite as Docker containers.

Payment

  • 1 ETH to someone who can meet the following requirements.

  • Related to #5 which will have a 2.5 bounty for the CLI tool upon completion of the Docker Container.

More detailed logging upon failure(s)

Sometimes the network can fail or other things can happen...
However, we don't have detailed logs for when a failure happens on a specific wet file. It would be nice to have detailed logs in the output:

  • Better logging for extract.go

  • Better logging for analyze.go

Preparing CommonCrawl .wet files via IPFS

Summary

Common Crawl data is easily accessible via AWS S3. However, I'm interested in creating some sort of IPFS-based distribution of Common Crawl, so that we can self-host and create our own P2P network for seeding and distributing the data.

Requirements

  • A website with an index that lists all the wet files. I can style it if you need help.

  • An easy to use JSON REST API that you can cURL data from.

Payment

TBD; not under consideration in the near term. The seed network will be hosted under %eaxops infrastructure.

Having trouble building

docker build -t commoncrawler .
returns

Sending build context to Docker daemon  15.68MB
Step 1/9 : FROM golang
 ---> 2421885b04da
Step 2/9 : ENV GO111MODULE=on
 ---> Using cache
 ---> 385def581eff
Step 3/9 : LABEL maintainer="Chris Cates <[email protected]>, Onuwa Nnachi Isaac <[email protected]>"
 ---> Using cache
 ---> 3b73fcb759b7
Step 4/9 : WORKDIR /app
 ---> Using cache
 ---> 199d61a88ad6
Step 5/9 : COPY . .
 ---> e1580e665263
Step 6/9 : RUN go mod init
 ---> Running in c969b25db63f
go: cannot determine module path for source directory /app (outside GOPATH, module path must be specified)

Example usage:
	'go mod init example.com/m' to initialize a v0 or v1 module
	'go mod init example.com/m/v2' to initialize a v2 module

Run 'go help mod init' for more information.
The command '/bin/sh -c go mod init' returned a non-zero code: 1

Error during parse binary warc package

I found that the current CommonCrawler implementation returns wrong results if a WARC file contains binary parts.

It happens because of bufio.Scanner. According to the Go documentation, "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer," and I think it's the third case. One can check this by comparing `cat -v path_to_file | grep some_word` with `cat path_to_file | grep some_word` on any WARC file with large binary sections.

Full usage as a library

Summary

This repository should be accessible via go get and can be included easily into anyone else's project.

Requirements

Must be able to run:

go get github.com/ChrisCates/CommonCrawler

Must be able to access the library as:

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}

The library must be stable and also have a demo example in the repository. Demoing the library should be part of the Travis CI configuration.

Payment

1 ETH will be paid once meeting the requirements.
