
commoncrawler's People

Contributors

chriscates, iamonuwa, lastpossum

commoncrawler's Issues

Preparing CommonCrawl .wet files via IPFS

Summary

Common Crawl data is easily accessible via AWS S3. However, I'm interested in creating an IPFS-based distribution of Common Crawl. This way we can self-host and run our own P2P network for seeding and distributing the data.

Requirements

  • A website with an index that lists all the wet files. I can style it if you need help.

  • An easy-to-use JSON REST API that can be queried with cURL.
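
As a rough illustration of the second requirement, here is a minimal sketch of a JSON index endpoint using only the Go standard library. The WETFile type, its field names, the /wet route, and the placeholder CID value are all assumptions made for this example; nothing in the issue specifies them. The handler is exercised in-process with httptest so the sketch runs without binding a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// WETFile is a hypothetical index entry: the WET file's path in the crawl
// and the IPFS content identifier (CID) it was published under.
type WETFile struct {
	Path string `json:"path"`
	CID  string `json:"cid"`
}

// indexHandler returns a handler that serves the given entries as a JSON array.
// Only GET is accepted; any other method gets 405.
func indexHandler(files []WETFile) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(files); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	}
}

// fetchIndex calls the handler in-process and returns the status code and body.
func fetchIndex(method string, files []WETFile) (int, string) {
	req := httptest.NewRequest(method, "/wet", nil)
	rec := httptest.NewRecorder()
	indexHandler(files).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	// Placeholder data; a real index would be generated from the crawl listing.
	files := []WETFile{{Path: "example-00000.warc.wet.gz", CID: "placeholder-cid"}}
	code, body := fetchIndex(http.MethodGet, files)
	fmt.Println(code)
	fmt.Print(body)
}
```

To serve this for real, the same handler could be registered with http.Handle("/wet", indexHandler(files)) and started with http.ListenAndServe, after which the index would be reachable with a plain curl request.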

Payment

TBD; payment is not under consideration in the near term. The seed network will be hosted on %eaxops infrastructure.

Having trouble building

docker build -t commoncrawler
returns

Sending build context to Docker daemon  15.68MB
Step 1/9 : FROM golang
 ---> 2421885b04da
Step 2/9 : ENV GO111MODULE=on
 ---> Using cache
 ---> 385def581eff
Step 3/9 : LABEL maintainer="Chris Cates <[email protected]>, Onuwa Nnachi Isaac <[email protected]>"
 ---> Using cache
 ---> 3b73fcb759b7
Step 4/9 : WORKDIR /app
 ---> Using cache
 ---> 199d61a88ad6
Step 5/9 : COPY . .
 ---> e1580e665263
Step 6/9 : RUN go mod init
 ---> Running in c969b25db63f
go: cannot determine module path for source directory /app (outside GOPATH, module path must be specified)

Example usage:
	'go mod init example.com/m' to initialize a v0 or v1 module
	'go mod init example.com/m/v2' to initialize a v2 module

Run 'go help mod init' for more information.
The command '/bin/sh -c go mod init' returned a non-zero code: 1

Test Coverage with CodeCov

Summary

Full test coverage of all components. Tests must pass on Travis CI and on Unix. Branch coverage should be 100%. CodeCov is strongly preferred over other coverage tools.

Requirements

Several tests are required for each modular component and functionality.

Tests needed:

  • crawl.go

  • config.go

  • scan.go

  • extract.go

  • analyze.go

These tests should run successfully in Travis CI and upload their results to CodeCov. Please add a CodeCov badge to README.md.

Payment

2 ETH will be paid upon completion of this bounty. This depends on #5 being completed first.

Docker Container

Summary

A Docker container that works reliably on Linux, Unix, and Windows systems.

Requirements

  • Must update the .travis.yml file with passing build configurations for Linux, Unix, and Windows environments. Please remove: go test -v -race ./... and simply have the program download one .wet file instead.

  • The shell scripts clean.sh and extract.sh must be tested in the continuous integration suite as Docker containers.

Payment

  • 1 ETH to someone who can meet the requirements above.

  • Related to #5, which will have a 2.5 ETH bounty for the CLI tool upon completion of the Docker container.

Electron GUI

Summary

An Electron-based GUI to query CommonCrawl servers.

Requirements

  • Must be a TypeScript-based Electron app that can compile on Windows, Linux, and macOS systems.

  • Must be able to filter data based on keyword searches.

  • Must be able to navigate all historical CommonCrawl data.

Binary is cURLable from web

Summary

Please make the compiled binary downloadable from the internet. Ideally it should be cURLable from GitHub as a release.

Requirements

  1. Must be able to wget or curl from the internet and be able to run in Linux, Unix and Windows environments.

  2. Must update the Travis CI configuration to demonstrate builds for Linux, Unix and Windows.

  3. Must provide a webhook for .git that will automatically build and compile binaries for Linux, Unix and Windows.

Payment

1 ETH will be paid upon completion of this bounty. Please refer to #5 as that must be complete before Payment is provided.

More detailed logging upon failure(s)

Sometimes the network fails or something else goes wrong.
However, we don't have detailed logs for when a failure happens on a specific wet file. It would be nice to have detailed logs in the output:

  • Better logging for extract.go

  • Better logging for analyze.go
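
One possible shape for this, sketched under assumptions: wrap every failure in an error that records the stage and the wet file involved, then log it. The stageError type, the errNetwork sentinel, and this extract function are hypothetical stand-ins written for the example; they are not taken from extract.go or analyze.go.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// errNetwork is a stand-in for whatever low-level failure actually occurred.
var errNetwork = errors.New("connection reset")

// stageError records which stage failed and for which wet file.
type stageError struct {
	Stage string
	File  string
	Err   error
}

func (e *stageError) Error() string {
	return fmt.Sprintf("%s failed for %s: %v", e.Stage, e.File, e.Err)
}

// Unwrap lets errors.Is and errors.As reach the underlying cause.
func (e *stageError) Unwrap() error { return e.Err }

// extract is a hypothetical stand-in for the real extract step.
// It fails on demand so the logging path can be shown.
func extract(file string, fail bool) error {
	if fail {
		return &stageError{Stage: "extract", File: file, Err: errNetwork}
	}
	return nil
}

// causedByNetwork reports whether err wraps the network sentinel.
func causedByNetwork(err error) bool { return errors.Is(err, errNetwork) }

func main() {
	logger := log.New(os.Stderr, "commoncrawler: ", log.LstdFlags)
	jobs := []struct {
		name string
		fail bool
	}{
		{"first.warc.wet.gz", false},
		{"second.warc.wet.gz", true},
	}
	for _, j := range jobs {
		if err := extract(j.name, j.fail); err != nil {
			// The log line names the stage, the file, and the cause.
			logger.Printf("error: %v (network=%t)", err, causedByNetwork(err))
			continue
		}
		logger.Printf("ok: %s", j.name)
	}
}
```

Because the wrapper implements Unwrap, callers can still test for specific causes (for example to retry only network failures) while the log output identifies the exact file that failed.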

Error when parsing WARC files with binary content

I found that the current CommonCrawler implementation returns wrong results if a WARC file contains binary parts.

This happens because of bufio.Scanner. According to the Go documentation: "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer." I think this is the third case. One can check this by comparing the output of the following commands on any WARC file with large binary sections: "cat -v path_to_file | grep some_word" vs "cat path_to_file | grep some_word".
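
The suspected failure mode can be reproduced in isolation. The sketch below is not CommonCrawler's code: countLines, tooLong, and sample are names made up for the example. It feeds bufio.Scanner a line longer than the default token limit (bufio.MaxScanTokenSize, 64 KiB), shows that scanning stops early with bufio.ErrTooLong, and then shows that raising the limit with Scanner.Buffer lets the scan finish.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// countLines scans s line by line. If maxToken > 0, the scanner's token limit
// is raised to maxToken bytes; otherwise the default limit applies.
func countLines(s string, maxToken int) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(s))
	if maxToken > 0 {
		sc.Buffer(make([]byte, 0, 64*1024), maxToken)
	}
	n := 0
	for sc.Scan() {
		n++
	}
	// Without this check, an over-long line is indistinguishable from EOF.
	return n, sc.Err()
}

// tooLong reports whether err is the scanner's token-too-long error.
func tooLong(err error) bool { return errors.Is(err, bufio.ErrTooLong) }

// sample builds three lines; the middle one is 100 KiB with no newline,
// standing in for a large binary section.
func sample() string {
	return "header\n" + strings.Repeat("x", 100*1024) + "\nfooter\n"
}

func main() {
	n, err := countLines(sample(), 0)
	fmt.Println(n, tooLong(err)) // prints "1 true": scanning stopped after the first line

	n, err = countLines(sample(), 1<<20)
	fmt.Println(n, err) // prints "3 <nil>": all lines read with a 1 MiB limit
}
```

If this is indeed the cause, checking Scanner.Err() after the loop would at least surface the problem, and either a larger buffer or a bufio.Reader (which has no per-line limit) would avoid it. Whether a fixed larger limit is enough depends on how large the binary sections can get.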

Windows Docker CI Build

Summary

Docker for Windows is currently not functioning well. Based on my research, there doesn't seem to be any reasonable way to run a Linux-based container on Windows through Travis CI. This is a backlogged item until Travis CI has better support for Windows Docker containers.

Full usage as a library

Summary

This repository should be accessible via go get and can be included easily into anyone else's project.

Requirements

Must be able to run:

go get github.com/ChrisCates/CommonCrawler

Must be able to access the library as:

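package main

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  // Go only exports capitalized identifiers, so the proposed functions
  // must be named Scan, Download, and Extract to be callable from here.
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}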

The library must be stable and also have a demo example in the repository. Demoing the library should be part of the Travis CI configuration.

Payment

1 ETH will be paid once the requirements are met.

Accessible as a CLI binary

Summary

Make this program accessible both as a Go package and from the terminal. Ensure that it works correctly in the terminal.

Requirements

  • Must have a download archive feature so that you can get latest entries from 2019 and beyond.

  • Must download files autonomously from a certain date range.

  • Must be able to extract compressed .wet files.

  • Please review the README.md for the proposed functionality.
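
The extraction requirement can be sketched with the standard library alone, assuming the compressed .wet files are gzip archives (the issue only says "compressed"). The function names extractGzip, gzipString, and roundTrip are invented for this example and are not the project's API. Go's gzip.Reader reads concatenated gzip members as a single stream by default, which the round trip below exercises with two members.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// extractGzip decompresses gzip data from r into w and returns the number
// of decompressed bytes written.
func extractGzip(r io.Reader, w io.Writer) (int64, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return 0, fmt.Errorf("open gzip stream: %w", err)
	}
	defer zr.Close()
	n, err := io.Copy(w, zr)
	if err != nil {
		return n, fmt.Errorf("decompress: %w", err)
	}
	return n, nil
}

// gzipString compresses s into a single gzip member (demo helper).
func gzipString(s string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// roundTrip compresses each part as its own gzip member, concatenates the
// members, and extracts the result in one pass.
func roundTrip(parts ...string) (string, error) {
	var in bytes.Buffer
	for _, p := range parts {
		b, err := gzipString(p)
		if err != nil {
			return "", err
		}
		in.Write(b)
	}
	var out bytes.Buffer
	if _, err := extractGzip(&in, &out); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	out, err := roundTrip("record one\n", "record two\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

In a CLI, r would typically be an opened .wet.gz file and w the destination file, so the data streams through without being held in memory.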

Payment

  • Once #10 is complete, the bounty for this will increase to 2.5 ETH.
