sebastianbodza / crawl4ai

This project forked from unclecode/crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM-Friendly Web Crawler & Scraper

License: Apache License 2.0

JavaScript 5.35% Python 48.40% CSS 1.14% HTML 44.33% Dockerfile 0.78%

crawl4ai's Introduction

Crawl4AI v0.2.6 🕷️🤖

Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

Try it Now!

  • Use as REST API: Open In Colab
  • Use as Python library: Open In Colab

✨ Visit our Documentation Website

Features ✨

  • 🆓 Completely free and open-source
  • 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
  • 🌍 Supports crawling multiple URLs simultaneously
  • 🎨 Extracts and returns all media tags (Images, Audio, and Video)
  • 🔗 Extracts all external and internal links
  • 📚 Extracts metadata from the page
  • 🔄 Custom hooks for authentication, headers, and page modifications before crawling
  • 🕵️ User-agent customization
  • 🖼️ Takes screenshots of the page
  • 📜 Executes multiple custom JavaScripts before crawling
  • 📚 Various chunking strategies: topic-based, regex, sentence, and more
  • 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
  • 🎯 CSS selector support
  • 📝 Passes instructions/keywords to refine extraction
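
To make the "regex" and "sentence" chunking ideas in the list above concrete, here is a minimal standard-library sketch. It is an illustration only, not Crawl4AI's implementation: the function names `sentence_chunks` and `regex_chunks` and the sample text are assumptions made for this example.

```python
import re

def sentence_chunks(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def regex_chunks(text: str, pattern: str = r"\n\s*\n") -> list[str]:
    """Split text wherever the pattern matches (default: blank lines)."""
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

# Hypothetical sample text, used only to show the two behaviours.
sample = "Crawling is easy. Extraction is harder!\n\nChunking helps."
print(sentence_chunks(sample))
print(regex_chunks(sample))
```

Sentence chunking yields small units that suit fine-grained filtering, while a blank-line pattern keeps paragraphs together; which one fits better depends on how much context the downstream model needs.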

Cool Examples 🚀

Quick Start

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")

# Print the extracted content
print(result.markdown)

Extract Structured Data from Web Pages 📊

Extract all OpenAI model names and their fees from the official pricing page.

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract all model names and their fees for input and output tokens."
    ),
)

print(result.extracted_content)
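
If `result.extracted_content` arrives as a JSON string (an assumption; the example above does not state the return type), it could be decoded along the following lines. The helper `parse_extracted`, the `raw` payload, and its field names are hypothetical placeholders for illustration, not real pricing data or part of the Crawl4AI API.

```python
import json

# Hypothetical stand-in for result.extracted_content; names and values are placeholders.
raw = '[{"model": "example-model", "input_fee": "n/a", "output_fee": "n/a"}]'

def parse_extracted(content):
    """Decode a JSON string into a list of dicts; return [] on empty or invalid input."""
    if not content:
        return []
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return []
    # Normalize a single JSON object into a one-element list.
    return data if isinstance(data, list) else [data]

for row in parse_extracted(raw):
    print(row["model"], row["input_fee"], row["output_fee"])
```

Returning an empty list on bad input keeps a batch crawl running when one page yields nothing usable; raising instead would be the better choice if silent failures are a concern.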

Execute JS, Filter Data with CSS Selector, and Clustering

from crawl4ai import WebCrawler
# CosineStrategy is an extraction strategy, so import it from extraction_strategy
from crawl4ai.extraction_strategy import CosineStrategy

js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology")
)

print(result.extracted_content)
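
The idea behind filtering paragraphs by similarity to a keyword such as "technology" can be sketched with plain word counts and cosine similarity. This is a simplified stand-in and not how `CosineStrategy` is implemented; the function names, the threshold, and the sample paragraphs are assumptions made for this example.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_by_topic(paragraphs, query, threshold=0.1):
    """Keep paragraphs whose similarity to the query meets the threshold."""
    q = bag_of_words(query)
    return [p for p in paragraphs if cosine_similarity(bag_of_words(p), q) >= threshold]

# Hypothetical paragraphs, standing in for text selected with css_selector="p".
paragraphs = [
    "New technology startups raised funding this quarter.",
    "The weather was mild across the region.",
]
print(filter_by_topic(paragraphs, "technology"))
```

Word counts only match exact tokens, so a paragraph about "software" would score zero against "technology"; a semantic filter would need embeddings rather than raw counts to catch such cases.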

Documentation 📚

For detailed documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.

Contributing 🤝

We welcome contributions from the open-source community. Check out our contribution guidelines for more information.

License 📄

Crawl4AI is released under the Apache 2.0 License.

Contact 📧

For questions, suggestions, or feedback, feel free to reach out:

Happy Crawling! 🕸️🚀

crawl4ai's People

Contributors

unclecode, sebastianbodza, gkhngyk, qin2dim, ntohidi
