chriscates / commoncrawler
A simple way to extract data from Common Crawl
License: MIT License
Common Crawl is easily accessible via AWS S3. However, I'm interested in creating an IPFS-based distribution of Common Crawl, so that we can self-host and run our own P2P network for seeding and distributing the data.
A website with an index that lists all the wet files. I can style it if you need help.
An easy to use JSON REST API that you can cURL data from.
TBD; this is not under consideration in the near term. The seed network will be hosted under %eaxops infrastructure.
docker build -t commoncrawler .
returns
Sending build context to Docker daemon 15.68MB
Step 1/9 : FROM golang
---> 2421885b04da
Step 2/9 : ENV GO111MODULE=on
---> Using cache
---> 385def581eff
Step 3/9 : LABEL maintainer="Chris Cates <[email protected]>, Onuwa Nnachi Isaac <[email protected]>"
---> Using cache
---> 3b73fcb759b7
Step 4/9 : WORKDIR /app
---> Using cache
---> 199d61a88ad6
Step 5/9 : COPY . .
---> e1580e665263
Step 6/9 : RUN go mod init
---> Running in c969b25db63f
go: cannot determine module path for source directory /app (outside GOPATH, module path must be specified)
Example usage:
'go mod init example.com/m' to initialize a v0 or v1 module
'go mod init example.com/m/v2' to initialize a v2 module
Run 'go help mod init' for more information.
The command '/bin/sh -c go mod init' returned a non-zero code: 1
Full test coverage of all components. Must pass on Travis CI and on Unix. Branch coverage included should be 100%. CodeCov would be highly preferred over other testing suites.
Several tests are required for each modular component and functionality.
Tests needed:
crawl.go
config.go
scan.go
extract.go
analyze.go
These tests should be able to run successfully in Travis CI and upload the results to CodeCov. Please add a CodeCov badge to the README.md.
2 ETH will be paid upon completion of this bounty. Relies on #5 being complete in order to finish.
A Docker Container that works effectively in Linux, Unix and Windows systems.
Must update the .travis.yml file with passing build configurations for Linux, Unix, and Windows environments. Please remove go test -v -race ./... and simply have the program download one .wet file instead.
The shell commands clean.sh and extract.sh are tested in the continuous integration suite as Docker containers.
1 ETH to someone who can meet the following requirements.
Related to #5 which will have a 2.5 bounty for the CLI tool upon completion of the Docker Container.
An Electron-based GUI to query CommonCrawl servers.
Must be a TypeScript-based Electron app that can compile on Windows, Linux and Mac OS systems.
Must be able to filter data based on keyword searches.
Must be able to navigate all historical CommonCrawl data.
Please make this binary downloadable from the internet. Ideally it would be cURLable from GitHub as a release.
Must be able to wget or curl from the internet and be able to run in Linux, Unix and Windows environments.
Must update the Travis CI demonstrating actions for Linux, Unix and Windows.
Must provide a webhook for .git that will automatically build and compile binaries for Linux, Unix and Windows.
1 ETH will be paid upon completion of this bounty. Please refer to #5 as that must be complete before Payment is provided.
Sometimes the network can fail or other things can happen...
However, we don't have detailed logs for when a failure happens for a specific wet file. It would be nice to have detailed logs in the output:
Better logging for extract.go
Better logging for analyze.go
I found that the current CommonCrawler implementation returns the wrong result if a WARC file contains binary parts.
This happens because of bufio.Scanner. According to the Go documentation: "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer." I think this is the third case. One can check this by comparing "cat -v path_to_file | grep some_word" with "cat path_to_file | grep some_word" on any WARC file with large binary sections.
Docker for Windows is currently not functioning well. Based on my research, there doesn't seem to be any reasonable way to run a Linux-based container on Windows through Travis CI. This is a backlogged item, since Travis CI needs better support for Windows Docker containers.
The README says "coming soon...". Are any of the Go programs ready for folks to try out today?
This repository should be accessible via go get and be easy to include in anyone else's project.
Must be able to run:
go get github.com/ChrisCates/CommonCrawler
Must be able to access the library as:
package main

import (
	cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
	// Names must be exported (capitalized) to be callable from another package.
	cc.Scan()
	cc.Download()
	cc.Extract()
	// And so forth
}
The library must be stable and also have a demo example in the repository. Demoing the library should be part of the Travis CI configuration.
1 ETH will be paid once meeting the requirements.
Make this program both accessible via Golang and Terminal. Ensure that it works correctly in the terminal.
Must have a download archive feature so that you can get latest entries from 2019 and beyond.
Must download files autonomously from a certain date range.
Must be able to extract compressed .wet files.
Please review the README.md for the proposed functionality.