html-to-markdown-scraping

This Python script converts an HTML page into a Markdown file. It can be used for web scraping or for archiving web pages in a Markdown format that's easy to read and write.

The script is designed to be run from the command line and accepts two arguments: the URL of the web page to convert, and a selector that specifies which part of the page to convert.
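As a rough illustration of the two-argument interface described above, here is a minimal sketch using the standard-library `argparse` module. The function name `parse_args` and the argument names `url` and `selector` are assumptions made for this example; the actual `markdowner.py` may read its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments: a page URL and a selector."""
    parser = argparse.ArgumentParser(
        description="Convert part of an HTML page to Markdown."
    )
    # Hypothetical argument names; the real script may name them differently.
    parser.add_argument("url", help="URL of the web page to convert")
    parser.add_argument("selector", help="which part of the page to convert")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the command-line usage shown in this README.
args = parse_args(["https://pageurl", "section"])
print(args.url, args.selector)
```

Calling `parse_args()` with no argument would read `sys.argv` instead, which is what a command-line script would normally do.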

Dependencies

The script depends on the following Python packages:

requests for making HTTP requests
bs4 (BeautifulSoup4) for parsing HTML
html2text for converting HTML to Markdown
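To see which of these packages are already importable before installing anything, a small check along the following lines may help. This is only a sketch: the helper name `missing_packages` is hypothetical, and it is not taken from the project's `setup.sh`. Note that `bs4` is the import name, while the package is usually installed as `beautifulsoup4`.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
REQUIRED = ("requests", "bs4", "html2text")


def missing_packages(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```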

Run this command to check and install the dependencies on your system:

# Fetch and run the setup script
curl -fsSL https://raw.githubusercontent.com/portalnetcar/html-to-markdown-scraping/main/setup.sh | bash

Scraping

To run a locally downloaded copy of the script, use:

python3 markdowner.py "https://pageurl" "section"

Or, to run it without downloading it first:

#!/bin/bash

# Fetch and run the Python script
curl -sSL https://raw.githubusercontent.com/portalnetcar/html-to-markdown-scraping/main/markdowner.py | python3 - "https://pageurl" "section"

Enjoy

Important

Please note that web scraping should be done responsibly and ethically. It's crucial to respect the copyright of the websites you scrape: just because data is publicly available doesn't mean it is legally available for all uses. Websites often publish terms of service or a robots.txt file that limit or prohibit web scraping. Laws about web scraping also vary by country, so always make sure you are scraping data in a way that is legal and respectful. If in doubt, ask for permission before scraping. This script is provided for educational purposes and should not be used to infringe on copyrights or violate any terms of service.
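One practical way to honor a site's robots.txt is to test a URL against its rules before fetching. The sketch below uses the standard-library `urllib.robotparser` module. The helper name `is_allowed` and the sample rules are invented for illustration, and this check is not part of `markdowner.py` as far as this README states.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, invented for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/private/page"))
print(is_allowed(sample_rules, "https://example.com/blog/post"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would download the rules instead of parsing a string. A passing check only covers robots.txt; it says nothing about the site's terms of service or applicable law.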

Contributors

leosavio, portalnetcarhq
