saraswatayu / substack2markdown

This project forked from timf34/substack2markdown

Script to scrape free and premium Substack posts, saving them as Markdown files. Also generates HTML interfaces to allow you to browse and sort the markdown files for each author.

License: MIT License

JavaScript 7.86% Python 81.65% CSS 7.27% HTML 3.21%

Substack2Markdown

Substack2Markdown is a Python tool for scraping free and premium Substack posts and saving them as Markdown files. It saves paid content as long as you are subscribed to that Substack. Most "save for later" apps (such as Pocket) don't save these posts; with this script you can save them and then browse and sort them in a user-friendly HTML interface.

Substack2Markdown Interface

Once you run the script, it creates a folder named after the Substack in /substack_md_files, then scrapes the Substack URL and converts the blog posts into Markdown files. Once all the posts have been saved, it generates an HTML file in the /substack_html_pages directory that lets you browse the posts.
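The output layout described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `writer_name_from_url` and `output_paths`, and the `<name>.html` file name, are assumptions made for the example.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def writer_name_from_url(url: str) -> str:
    """Return the first label of the host, e.g. 'example' for example.substack.com."""
    host = urlparse(url).netloc
    return host.split(".")[0]


def output_paths(url: str,
                 md_root: str = "substack_md_files",
                 html_root: str = "substack_html_pages") -> Tuple[Path, Path]:
    """Compute where the Markdown folder and the HTML index would live.

    The '<name>.html' file name is an assumption for illustration only.
    """
    name = writer_name_from_url(url)
    md_dir = Path(md_root) / name
    html_file = Path(html_root) / f"{name}.html"
    return md_dir, html_file


if __name__ == "__main__":
    md_dir, html_file = output_paths("https://example.substack.com")
    # Create the directories before writing any files into them.
    md_dir.mkdir(parents=True, exist_ok=True)
    html_file.parent.mkdir(parents=True, exist_ok=True)
    print(md_dir, html_file)
```

Keeping the path computation separate from directory creation makes the layout easy to check without touching the file system.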

You can either hardcode the substack URL and the number of posts you'd like to save into the top of the file, or specify them as command line arguments.

Features

  • Converts Substack posts into Markdown files.
  • Generates an HTML file to browse Markdown files.
  • Supports free and premium content (with subscription).
  • The HTML interface allows sorting essays by date or likes.
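The sorting offered by the interface amounts to ordering essays by one of two fields. The sketch below shows that ordering in Python for illustration only; the real interface presumably sorts in the browser, and the `Essay` record and `sort_essays` helper are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Essay:
    """Hypothetical per-post metadata used only for this example."""
    title: str
    published: date
    likes: int


def sort_essays(essays: List[Essay], key: str = "date",
                descending: bool = True) -> List[Essay]:
    """Return a new list sorted by publication date or by like count."""
    if key not in ("date", "likes"):
        raise ValueError(f"unsupported sort key: {key!r}")
    field = "published" if key == "date" else "likes"
    return sorted(essays, key=lambda e: getattr(e, field), reverse=descending)


if __name__ == "__main__":
    sample = [
        Essay("First post", date(2023, 1, 5), 12),
        Essay("Second post", date(2023, 3, 1), 40),
        Essay("Third post", date(2023, 2, 10), 7),
    ]
    for essay in sort_essays(sample, key="likes"):
        print(essay.likes, essay.title)
```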

Installation

Clone the repo and install the dependencies:

git clone https://github.com/yourusername/substack_scraper.git
cd substack_scraper

# # Optionally create a virtual environment
# python -m venv venv
# # Activate the virtual environment
# .\venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux

pip install -r requirements.txt

For the premium scraper, update the config.py in the root directory with your Substack email and password:

EMAIL = "your-email@example.com"
PASSWORD = "your-password"

You'll also need Microsoft Edge installed for the Selenium webdriver.

Usage

Specify the Substack URL and the directory to save the posts to, either by hardcoding them or by passing command line arguments.

To use the Substack URL and number of posts hardcoded at the top of the file, run:

python substack_scraper.py

For free Substack sites:

python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts

For premium Substack sites:

python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --premium

To scrape a specific number of posts:

python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --number 5
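The flags used in the commands above could be parsed along these lines. This is a sketch using the standard `argparse` module, not the script's actual parser: the flag names come from the commands above, while the default values, help strings, and the meaning of `0` for `--number` are assumptions.

```python
import argparse

# Hypothetical stand-ins for the values hardcoded at the top of the script.
DEFAULT_URL = "https://example.substack.com"
DEFAULT_DIRECTORY = "substack_md_files"
DEFAULT_NUM_POSTS = 0  # assumption: 0 is treated as "all posts"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that falls back to the hardcoded defaults."""
    parser = argparse.ArgumentParser(
        description="Scrape a Substack site and save posts as Markdown.")
    parser.add_argument("--url", default=DEFAULT_URL,
                        help="base URL of the Substack site")
    parser.add_argument("--directory", default=DEFAULT_DIRECTORY,
                        help="directory to save the Markdown files to")
    parser.add_argument("--premium", action="store_true",
                        help="log in and scrape subscriber-only posts")
    parser.add_argument("--number", type=int, default=DEFAULT_NUM_POSTS,
                        help="number of posts to scrape")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.directory, args.premium, args.number)
```

Because every flag has a default, running the script with no arguments uses the hardcoded values, which matches the two usage modes described above.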

Viewing Markdown Files in Browser

To read the Markdown files in your browser, install the Markdown Viewer browser extension.

substack2markdown's People

Contributors

timf34, saraswatayu
