zxt-tzx / substack-archives-downloader
A simple Selenium-based web-scraper with a command line interface that downloads the archives of your Substack subscriptions as PDF files.
Is there any chance of saving the articles as Markdown, for example using the MarkDownload - Markdown Web Clipper Chrome extension?
I guess you're using Ctrl+P to save to PDF. The MarkDownload extension has a button to save to a file as well.
Thank you.
Most of the PDFs I get contain a single line saying 'Too many requests'. I tried adding a sleep call inside the while loop in the function _load_articles_in_date_range, but that didn't solve it.
Any other suggestions please?
Great work on this, easy to install and use.
I did find that after about 3 or 4 PDFs the rest were generated very rapidly with just the text "Too many requests."
Adding a time.sleep(5) to the end of convert_article_tuples_to_pdfs() resolved this, if inelegantly.
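A fixed time.sleep(5) works, but it slows every download even when the site is not throttling. One alternative is to retry only when the rate-limit text shows up, doubling the wait each time. The following is a minimal sketch under stated assumptions, not the project's code: fetch_with_backoff and its fetch argument are hypothetical names, and fetch stands in for whatever step returns the page text for one article.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call fetch() until its result no longer looks like a rate-limit page.

    fetch is a hypothetical zero-argument callable returning the page text.
    The wait doubles after each throttled attempt (5 s, 10 s, 20 s, ...).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        content = fetch()
        if "Too many requests" not in content:
            return content
        if attempt < max_attempts:
            sleep(delay)   # back off before the next attempt
            delay *= 2     # wait twice as long next time
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

The sleep parameter is injectable only so the helper can be exercised without real delays; in normal use the default time.sleep applies.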
Enter the URL of the Substack-hosted newsletter you would like to scrape:
[]
Your input of is not a URL.
Please fix the error above or try again later.
What is the correct format anyway? I tried several different versions.
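For what it's worth, another report on this page shows the prompt accepting an input of the form https://XXX.XXX.com, so including the https:// scheme is probably what matters. The project's own validation is not shown here (its requirements list the validators package), so the following is only a standard-library sketch of the kind of check that would reject a bare hostname or empty input; looks_like_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def looks_like_url(text):
    """Return True if text has an http(s) scheme and a host part.

    This is an illustrative check, not the validation the tool performs.
    """
    parts = urlparse(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Under this check, "https://example.substack.com" passes, while "example.substack.com" and an empty string do not.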
I'm trying this project and I'm getting the following:
Enter the URL of the Substack-hosted newsletter you would like to scrape:
https://XXX.XXX.com
Would you like to see the browser while it performs the scraping?
Please type 'Y' or 'N'.
Y
A new window will open during the scraping.
Please enter your Substack account email address:
[email protected]
Please enter your Substack account password:
XXXXX
Please wait while we log in using the credential you provided...
Message: no such element: Unable to locate element: {"method":"css selector","selector":".menu-button > svg"}
(Session info: chrome=101.0.4951.64)
Unexpected error occurred while logging in.
We're sorry, something has gone wrong.
I had no issues during installation:
~/p/substack-archives-downloader> pip install -r requirements.txt
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Requirement already satisfied: requests~=2.27.1 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 1)) (2.27.1)
Collecting selenium==3.141.0
Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 904.6/904.6 KB 5.6 MB/s eta 0:00:00
Collecting soupsieve==2.2.1
Downloading soupsieve-2.2.1-py3-none-any.whl (33 kB)
Collecting urllib3==1.26.6
Using cached urllib3-1.26.6-py2.py3-none-any.whl (138 kB)
Collecting validators~=0.18.2
Downloading validators-0.18.2-py3-none-any.whl (19 kB)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/site-packages (from requests~=2.27.1->-r requirements.txt (line 1)) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/site-packages (from requests~=2.27.1->-r requirements.txt (line 1)) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/site-packages (from requests~=2.27.1->-r requirements.txt (line 1)) (2.0.12)
Requirement already satisfied: six>=1.4.0 in /usr/local/lib/python3.9/site-packages (from validators~=0.18.2->-r requirements.txt (line 5)) (1.16.0)
Collecting decorator>=3.4.0
Downloading decorator-5.1.1-py3-none-any.whl (9.1 kB)
Installing collected packages: urllib3, soupsieve, decorator, validators, selenium
Attempting uninstall: urllib3
Found existing installation: urllib3 1.26.8
Uninstalling urllib3-1.26.8:
Successfully uninstalled urllib3-1.26.8
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Attempting uninstall: soupsieve
Found existing installation: soupsieve 2.3.1
Uninstalling soupsieve-2.3.1:
Successfully uninstalled soupsieve-2.3.1
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Successfully installed decorator-5.1.1 selenium-3.141.0 soupsieve-2.2.1 urllib3-1.26.6 validators-0.18.2
WARNING: You are using pip version 22.0.4; however, version 22.1 is available.
You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.
Also, I'm very sure the login and password were entered correctly.
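The "no such element" message means the selector .menu-button > svg was not on the page at the moment it was looked up. That can happen if the page had not finished loading after login, or if Substack has changed its markup since the scraper was written; the log alone does not say which. If it is a timing problem, polling for the element instead of looking it up once may help. Selenium provides WebDriverWait for this. The sketch below shows the same idea with the standard library only; wait_for and find are hypothetical names, and find stands in for the element lookup.

```python
import time


def wait_for(find, timeout=10.0, interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Call find() repeatedly until it succeeds or timeout seconds pass.

    find is a hypothetical zero-argument callable that returns the element
    or raises an exception while the element is not there yet.
    """
    deadline = clock() + timeout
    while True:
        try:
            return find()
        except Exception as error:
            if clock() >= deadline:
                raise TimeoutError(f"gave up after {timeout} seconds") from error
            sleep(interval)  # pause briefly, then look again
```

If the element still cannot be found after a generous timeout, the selector itself is the more likely culprit and would need updating in the scraper.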