Coder Social home page Coder Social logo

cc-warc-io's Introduction

Common Crawl WARC Processor

This script automates the downloading and processing of WARC (Web ARChive) files from the Common Crawl dataset. It extracts content from these archives, performs language detection using fastText, and filters content based on the specified target language (default is Indonesian). Extracted content is then saved in a JSONL format for further analysis or processing.

Features

  • Automatic Downloading: Downloads WARC files from specified URLs.
  • Content Extraction: Uses trafilatura to extract web page content from WARC files.
  • Language Detection: Employs fastText to detect the language of extracted content.
  • Customizable Filtering: Filters content by language (configurable to any language supported by fastText).
  • Multiprocessing Support: Utilizes multiple CPU cores for faster processing of multiple WARC files simultaneously.

Dependencies

  • Python 3.x
  • requests
  • warcio
  • trafilatura
  • fasttext
  • tqdm (optional, for progress bar support)

Setup

  1. Clone the Repository:

git clone cd

  1. Install Required Python Packages:

  2. Download fastText Pre-trained Model:

Download the fastText pre-trained language detection model (lid.176.bin) from the official fastText website and place it in the script's directory or specify its path in the script configuration.

Configuration

Edit the script or create a .env file (if your script is set up to use environment variables) to customize the following settings:

  • DATA_ROOT: The root directory where WARC files will be downloaded and processed.
  • FASTTEXT_MODEL_PATH: Path to the fastText model file for language detection.
  • LANGUAGE_TARGET: Target language code for filtering content (default is id for Indonesian).
  • NUM_WORKERS: Number of worker processes for multiprocessing.

Usage

Run the script with the following command:

python cc_news.py

Ensure you have a file named warc.txt in your DATA_ROOT directory with one WARC file URL per line that you wish to download and process.

Contributing

Contributions to improve the script or add new features are welcome. Please submit a pull request or open an issue to discuss your ideas.

License

Specify your license here or indicate if the project is open-source under a specific license agreement.

cc-warc-io's People

Contributors

acul3 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.