Common Crawl Downloader

Supported Python versions: 3.7, 3.8, 3.9. License: MIT.

Distributed download scripts for Common Crawl data.

Dependencies

Python >= 3.7 is required.

Install the dependencies with:

pip install -r requirements.txt

On Linux, libmysqlclient-dev (or the equivalent package for your distribution) is also required. For example, on Debian or Ubuntu:

sudo apt install libmysqlclient-dev

Run

Configurations

The default config file is configs/default.conf. It lists every configurable entry; the entries, their descriptions, and their default values are shown below:

[database]
drivername = mysql
username = user
password = password
host = localhost
port = 3306
database = common_crawl

[worker]
; The name of this worker
name = unknown
; The interval between retries, in seconds
retry_interval = 5
; The number of retries before giving up
retries = 10
; The timeout for network connections, in seconds
socket_timeout = 30
; The download root path
download_path = downloaded

[schedule]
; Whether to restrict download time
enabled = false
; The start of the allowed download time
start_time = 20:00:00
; The end of the allowed download time
end_time = 07:59:59
; The interval between retries while downloading is restricted
retry_interval = 300
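
The default window, 20:00:00 to 07:59:59, crosses midnight, so a plain `start <= now <= end` comparison would never match. The following is a minimal sketch of one way to check such a window, assuming both ends are inclusive. The function name is hypothetical and not taken from the project.

```python
from datetime import time


def is_download_allowed(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the allowed window (inclusive)."""
    if start <= end:
        # Window contained in a single day, e.g. 09:00:00 to 17:00:00.
        return start <= now <= end
    # Window that wraps past midnight, e.g. 20:00:00 to 07:59:59.
    return now >= start or now <= end


start = time.fromisoformat("20:00:00")
end = time.fromisoformat("07:59:59")
print(is_download_allowed(time(23, 30), start, end))  # True
print(is_download_allowed(time(12, 0), start, end))   # False
```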

Do not modify the default config file directly. Instead, create your own local.conf in the configs folder and put the entries you want to change in it.

An example of a valid local config file:

[database]
username = common_crawl
password = &WcKLEsX!
host = 10.10.1.217

[schedule]
enabled = true
start_time = 20:00:00
end_time = 07:59:59
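
One way this kind of layering can work with Python's standard configparser is to read the defaults first and the local file second, so that later values override earlier ones entry by entry. The sketch below is an assumption about the mechanism, not the project's actual loading code. It uses inline strings in place of configs/default.conf and configs/local.conf so that it runs on its own.

```python
from configparser import ConfigParser

DEFAULT_CONF = """
[database]
drivername = mysql
username = user
password = password
host = localhost
port = 3306
database = common_crawl

[schedule]
enabled = false
start_time = 20:00:00
end_time = 07:59:59
retry_interval = 300
"""

LOCAL_CONF = """
[database]
username = common_crawl
host = 10.10.1.217

[schedule]
enabled = true
"""

# interpolation=None keeps characters such as '%' in passwords literal.
config = ConfigParser(interpolation=None)
config.read_string(DEFAULT_CONF)  # defaults first
config.read_string(LOCAL_CONF)    # local overrides second

print(config.get("database", "host"))            # 10.10.1.217
print(config.getint("database", "port"))         # 3306
print(config.getboolean("schedule", "enabled"))  # True
```

With real files, `config.read(["configs/default.conf", "configs/local.conf"])` gives the same ordering and silently skips a missing local.conf.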

Execute the download script

Run the following command at the root path of the project:

python src/main.py

Always press Ctrl-C to stop the download process. Killing the process directly will cause data loss and leave the database in an inconsistent state.
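
Ctrl-C raises KeyboardInterrupt inside a Python process, which gives the script a chance to clean up, whereas a hard kill gives it none. A minimal sketch of that pattern follows, assuming an interrupted job should be reset to pending. The job dictionary and the function names are hypothetical stand-ins for the project's database access, which may behave differently.

```python
PENDING, DOWNLOADING, FINISHED = 0, 1, 2


def run_job(job: dict, download) -> None:
    """Run one download and keep job["download_state"] consistent."""
    job["download_state"] = DOWNLOADING
    try:
        download(job["uri"])
    except KeyboardInterrupt:
        # Ctrl-C: put the job back so a later run can pick it up again.
        job["download_state"] = PENDING
        raise
    job["download_state"] = FINISHED


def interrupted_download(uri: str) -> None:
    # Simulates the user pressing Ctrl-C in the middle of a download.
    raise KeyboardInterrupt


job = {"uri": "example.warc.wet.gz", "download_state": PENDING}
try:
    run_job(job, interrupted_download)
except KeyboardInterrupt:
    print("stopped cleanly, state =", job["download_state"])
```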

Database Structure

data

| Field | Type | Description |
| --- | --- | --- |
| id | int | Data ID (primary key) |
| uri | varchar(256) | The URI of the data, which constitutes the download URL and the folder structure |
| size | int | The size of the data in bytes |
| started_at | datetime | Download start time (CST) |
| finished_at | datetime | Download end time (CST) |
| download_state | tinyint | Download state: 0 for pending, 1 for downloading, 2 for finished, 3 for failed |
| id_worker | int | The ID of the worker that downloads this data (foreign key) |
| archive | varchar(30) | The year and month of the data on Common Crawl |

URIs can be obtained from the wet.paths files on the Common Crawl website.

An example of a URI:

crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz
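
To illustrate how one URI can give both the download URL and the local folder structure, here is a small sketch. The base URL https://data.commoncrawl.org/ is Common Crawl's public HTTP endpoint at the time of writing and is an assumption about what the project uses. The helper names are hypothetical, and the root folder is the default download_path from the [worker] section.

```python
from pathlib import PurePosixPath

BASE_URL = "https://data.commoncrawl.org/"  # assumed endpoint
DOWNLOAD_PATH = "downloaded"                # default from [worker]

URI = (
    "crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/"
    "CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz"
)


def download_url(uri: str) -> str:
    """Prefix the URI with the base URL to get the download URL."""
    return BASE_URL + uri


def local_path(uri: str, root: str = DOWNLOAD_PATH) -> PurePosixPath:
    """Mirror the URI's folder structure under the download root."""
    return PurePosixPath(root) / uri


print(download_url(URI))
print(local_path(URI).parent)
print(local_path(URI).name)
```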

worker

| Field | Type | Description |
| --- | --- | --- |
| id | int | Worker ID (primary key) |
| name | varchar(128) | The name of the worker |
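
The two tables and the download states can be sketched as follows. The example uses an in-memory SQLite database only so that it runs without a server; the project itself is configured for MySQL. Nullability, defaults, and the claiming query are assumptions, and the URI value is a made-up placeholder.

```python
import sqlite3
from enum import IntEnum


class DownloadState(IntEnum):
    PENDING = 0
    DOWNLOADING = 1
    FINISHED = 2
    FAILED = 3


SCHEMA = """
CREATE TABLE worker (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(128)
);
CREATE TABLE data (
    id             INTEGER PRIMARY KEY,
    uri            VARCHAR(256),
    size           INTEGER,
    started_at     DATETIME,
    finished_at    DATETIME,
    download_state TINYINT DEFAULT 0,
    id_worker      INTEGER REFERENCES worker(id),
    archive        VARCHAR(30)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO worker (id, name) VALUES (1, 'unknown')")
conn.execute("INSERT INTO data (uri) VALUES ('example/path.warc.wet.gz')")

# A worker claims one pending row by switching it to DOWNLOADING.
conn.execute(
    "UPDATE data SET download_state = ?, id_worker = ? "
    "WHERE id = (SELECT id FROM data WHERE download_state = ? LIMIT 1)",
    (DownloadState.DOWNLOADING.value, 1, DownloadState.PENDING.value),
)

state = conn.execute("SELECT download_state FROM data").fetchone()[0]
print(DownloadState(state).name)  # DOWNLOADING
```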
