
commoncrawler's Introduction

Common Crawler

🕸 A simple and easy way to extract data from Common Crawl with little or no hassle.

Go Version License Build Status Go Report Card

Notice in regards to development

Currently I do not have the capacity to hire full time; however, I do intend to hire someone to help build infrastructure related to Common Crawl. All Gitcoin bounties are currently on hold. When I have time to invest further in this project, I will discuss bringing on a full-time DevOps developer to work on it. All payment will be in DAI, with a resource allocation of approximately 5k/mo.

As a GUI

An Electron-based interface that works with a Go server will be available.

As a library

Install as a dependency:

go get github.com/ChrisCates/CommonCrawler

Access the library functions by importing it:

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}

(The function names are capitalized because Go only exports capitalized identifiers; unexported names like `scan` would not be callable from another package.)

As a command line tool

Install from source:

go install github.com/ChrisCates/CommonCrawler@latest

Or you can download the prebuilt binary with curl from GitHub:

curl -L https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
chmod +x commoncrawler

Then run as a binary:

# Output help
commoncrawler --help

# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths

# Start crawling the web
commoncrawler start --stop -1

Compilation and Configuration

Installing dependencies

go get github.com/logrusorgru/aurora

Downloading data with the application

First configure the type of data you want to extract.

// Config is the preset variables for your extractor
type Config struct {
    baseURI     string
    wetPaths    string
    dataFolder  string
    matchFolder string
    start       int
    stop        int
}

// Defaults
Config{
    start:       0,
    stop:        5,
    baseURI:     "https://commoncrawl.s3.amazonaws.com/",
    wetPaths:    path.Join(cwd, "wet.paths"),
    dataFolder:  path.Join(cwd, "/output/crawl-data"),
    matchFolder: path.Join(cwd, "/output/match-data"),
}

With Docker

docker build -t commoncrawler .
docker run commoncrawler

Without Docker

go build -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler

Or you can simply run it:

go run src/*.go

Resources

  • MIT Licensed

  • If there is interest or need, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca

  • Feel free to post valid issues; I may fund them based on priority.

commoncrawler's People

Contributors

chriscates, iamonuwa, lastpossum


commoncrawler's Issues

Test Coverage with CodeCov

Summary

Full test coverage of all components, passing on Travis CI and on Unix. Branch coverage should be 100%. CodeCov is strongly preferred over other coverage services.

Requirements

Several tests are required for each modular component and functionality.

Tests needed:

  • crawl.go

  • config.go

  • scan.go

  • extract.go

  • analyze.go

These tests should be able to run successfully in Travis CI and upload the results to CodeCov. Please add a badge in the README.md of CodeCov.
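As a starting point, a table-driven test for a small pure helper might look like the following; `buildURL` is a hypothetical helper written for this sketch, not a function taken from the repository.

```go
package main

import (
	"strings"
	"testing"
)

// buildURL is a hypothetical helper under test: it joins the S3 base URI
// with one relative path taken from wet.paths.
func buildURL(baseURI, wetPath string) string {
	return strings.TrimRight(baseURI, "/") + "/" + strings.TrimLeft(wetPath, "/")
}

// TestBuildURL checks the join behaves the same regardless of slashes.
func TestBuildURL(t *testing.T) {
	cases := []struct{ base, path, want string }{
		{"https://commoncrawl.s3.amazonaws.com/", "crawl-data/a.wet.gz",
			"https://commoncrawl.s3.amazonaws.com/crawl-data/a.wet.gz"},
		{"https://commoncrawl.s3.amazonaws.com", "/crawl-data/a.wet.gz",
			"https://commoncrawl.s3.amazonaws.com/crawl-data/a.wet.gz"},
	}
	for _, c := range cases {
		if got := buildURL(c.base, c.path); got != c.want {
			t.Errorf("buildURL(%q, %q) = %q, want %q", c.base, c.path, got, c.want)
		}
	}
}
```

Tests in this style run under `go test ./...`, which is also what Travis CI would invoke before uploading coverage to CodeCov.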

Payment

2 ETH will be paid upon completion of this bounty. Relies on #5 being complete in order to finish.

Electron GUI

Summary

An electron based GUI to query CommonCrawl servers.

Requirements

  • Must be a Typescript based Electron app that can compile on Windows, Linux and Mac OS systems.

  • Must be able to filter data based on keyword searches.

  • Must be able to navigate all historical CommonCrawl data.

Windows Docker CI Build

Summary

Docker for Windows is currently not functioning well. Based on my research, there does not seem to be any reasonable way to run a Linux-based container through Travis CI. This item is backlogged until Travis CI has better support for Windows Docker containers.

Binary is cURLable from web

Summary

Please make this binary downloadable from the internet. Ideally, it should be cURLable from GitHub as a release.

Requirements

  1. Must be able to wget or curl from the internet and be able to run in Linux, Unix and Windows environments.

  2. Must update the Travis CI demonstrating actions for Linux, Unix and Windows.

  3. Must provide a webhook for .git that will automatically build and compile binaries for Linux, Unix and Windows.

Payment

1 ETH will be paid upon completion of this bounty. Please refer to #5 as that must be complete before Payment is provided.

Accessible as a CLI binary

Summary

Make this program accessible both as a Go library and from the terminal, and ensure that it works correctly in the terminal.

Requirements

  • Must have a download archive feature so that you can get latest entries from 2019 and beyond.

  • Must download files autonomously from a certain date range.

  • Must be able to extract compressed .wet files.

  • Please review the README.md for the proposed functionality.

Payment

  • Once #10 is complete, the bounty for this will increase to 2.5 ETH

Docker Container

Summary

A Docker Container that works effectively in Linux, Unix and Windows systems.

Requirements

  • Must update the .travis.yml file with passing build configurations for Linux, Unix, and Windows environments. Please remove: go test -v -race ./... and simply have the program download one .wet file instead.

  • The shell commands clean.sh and extract.sh are tested in the continuous integration suite as Docker containers.

Payment

  • 1 ETH to someone who can meet the following requirements.

  • Related to #5 which will have a 2.5 bounty for the CLI tool upon completion of the Docker Container.

More detailed logging upon failure(s)

Sometimes the network can fail or other things can happen...
However, we don't have detailed logs for when a failure happens on a specific wet file. It would be nice to have detailed logs in the output:

  • Better logging for extract.go

  • Better logging for analyze.go

Preparing CommonCrawl .wet files via IPFS

Summary

Common Crawl data is easily accessible via AWS S3. However, I'm interested in creating some sort of IPFS-based distribution of Common Crawl, so that we can self-host and create our own P2P network for seeding and distributing the data.

Requirements

  • A website with an index that lists all the wet files. I can style it if you need help.

  • An easy to use JSON REST API that you can cURL data from.

Payment

TBD; not under consideration in the near term. The seed network will be hosted under %eaxops infrastructure.

Having trouble building

docker build -t commoncrawler .
returns

Sending build context to Docker daemon  15.68MB
Step 1/9 : FROM golang
 ---> 2421885b04da
Step 2/9 : ENV GO111MODULE=on
 ---> Using cache
 ---> 385def581eff
Step 3/9 : LABEL maintainer="Chris Cates <[email protected]>, Onuwa Nnachi Isaac <[email protected]>"
 ---> Using cache
 ---> 3b73fcb759b7
Step 4/9 : WORKDIR /app
 ---> Using cache
 ---> 199d61a88ad6
Step 5/9 : COPY . .
 ---> e1580e665263
Step 6/9 : RUN go mod init
 ---> Running in c969b25db63f
go: cannot determine module path for source directory /app (outside GOPATH, module path must be specified)

Example usage:
	'go mod init example.com/m' to initialize a v0 or v1 module
	'go mod init example.com/m/v2' to initialize a v2 module

Run 'go help mod init' for more information.
The command '/bin/sh -c go mod init' returned a non-zero code: 1

Error during parse binary warc package

I found that the current CommonCrawler implementation returns wrong results if a WARC file contains binary parts.

It happens because of bufio.Scanner. According to the Go documentation, "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer," and I think it's the third case. One can check this by comparing `cat -v path_to_file | grep some_word` with `cat path_to_file | grep some_word` on any WARC file with large binary sections.

Full usage as a library

Summary

This repository should be accessible via go get and can be included easily into anyone else's project.

Requirements

Must be able to run:

go get github.com/ChrisCates/CommonCrawler

Must be able to access the library as:

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}

The library must be stable and also have a demo example in the repository. Demoing the library should be part of the Travis CI configuration.

Payment

1 ETH will be paid once meeting the requirements.
