anilabhadatta / educative.io_scraper
Educative.io Course Downloader developed using Python and Selenium. Refer to Readme.md for setup instructions.
License: MIT License
Describe the bug
##############
Scraper Started, Log file can be found in Save directory
Driver Loaded
[Selected config: 0] Starting Scraping: 0, https://www.educative.io/courses/decode-coding-interview-java/3woxv5GoKmQ
Load Webpage Function
Checking Login Function
Checking for captcha Function...
Create Course Folder Function
Getting File Name
This is a module page
Get File name module
Inside Course Folder
Checking Login Function
Checking for captcha Function...
Scrolling Page
Getting File Name
This is a module page
Get File name module
Checking page
Removing Nav tags from page
Node deleted div[class*='ed-grid'] > nav
Node deleted div[class*='ed-grid'] > nav
Show Hints Function
No hints found
Inside find_mark_down_quiz_containers function
No mark down quiz_container found
Show Codebox Answers Function
No Codebox answers found
Finding Slides Function
No Slides Found
Inside take_quiz_screenshot function
Quiz not found
Adding Name Tag in Next Back Button
Fixing SVG Tags inside Object Tags
Get HTML Page Content Using Single File Function
make_code_selectable function
make_code_selectable function executed
Creating HTML File
HTML File Created
HTML Page content taken.
Inside Widget Container Function
No widget container found
Code Container Download Type Function
No Code Container Downloadable Type found
Code Container Clipboard Type Function
No code containers found
Remove Mark completed
Next Page Function
Going Next Page
--------------- 0 Complete-------------------
Checking Login Function
Checking for captcha Function...
Scrolling Page
Getting File Name
This is a module page
Get File name module
Checking page
Removing Nav tags from page
Node deleted div[class*='ed-grid'] > nav
Node deleted div[class*='ed-grid'] > nav
Show Hints Function
No hints found
Inside find_mark_down_quiz_containers function
No mark down quiz_container found
Show Codebox Answers Function
No Codebox answers found
Finding Slides Function
No Slides Found
Inside take_quiz_screenshot function
Quiz not found
Adding Name Tag in Next Back Button
Fixing SVG Tags inside Object Tags
Get HTML Page Content Using Single File Function
make_code_selectable function
make_code_selectable function executed
Creating HTML File
HTML File Created
HTML Page content taken.
Inside Widget Container Function
No widget container found
Code Container Download Type Function
No Code Container Downloadable Type found
Code Container Clipboard Type Function
No code containers found
Remove Mark completed
Next Page Function
Going Next Page
--------------- 1 Complete-------------------
Checking Login Function
Checking for captcha Function...
Scrolling Page
Getting File Name
This is a module page
Get File name module
Checking page
Removing Nav tags from page
Node deleted div[class*='ed-grid'] > nav
Node deleted div[class*='ed-grid'] > nav
Show Hints Function
No hints found
Inside find_mark_down_quiz_containers function
No mark down quiz_container found
Show Codebox Answers Function
No Codebox answers found
Finding Slides Function
No Slides Found
Inside take_quiz_screenshot function
Quiz not found
Adding Name Tag in Next Back Button
Fixing SVG Tags inside Object Tags
Get HTML Page Content Using Single File Function
make_code_selectable function
make_code_selectable function executed
Creating HTML File
HTML File Created
HTML Page content taken.
Inside Widget Container Function
No widget container found
Code Container Download Type Function
No Code Container Downloadable Type found
Code Container Clipboard Type Function
No code containers found
Remove Mark completed
Next Page Function
Going Next Page
--------------- 2 Complete-------------------
Checking Login Function
Checking for captcha Function...
Scrolling Page
Getting File Name
This is a module page
Get File name module
Checking page
Removing Nav tags from page
Node deleted div[class*='ed-grid'] > nav
Node deleted div[class*='ed-grid'] > nav
Show Hints Function
No hints found
Inside find_mark_down_quiz_containers function
No mark down quiz_container found
Show Codebox Answers Function
No Codebox answers found
Finding Slides Function
No Slides Found
Inside take_quiz_screenshot function
Quiz not found
Adding Name Tag in Next Back Button
Fixing SVG Tags inside Object Tags
Get HTML Page Content Using Single File Function
make_code_selectable function
make_code_selectable function executed
Creating HTML File
HTML File Created
HTML Page content taken.
Inside Widget Container Function
No widget container found
Code Container Download Type Function
No Code Container Downloadable Type found
Code Container Clipboard Type Function
No code containers found
Remove Mark completed
Next Page Function
Going Next Page
--------------- 3 Complete-------------------
Checking Login Function
Checking for captcha Function...
Scrolling Page
Getting File Name
This is a module page
Get File name module
Checking page
Removing Nav tags from page
Node deleted div[class*='ed-grid'] > nav
Node deleted div[class*='ed-grid'] > nav
Show Hints Function
No hints found
Inside find_mark_down_quiz_containers function
No right button found in Mark Down Quiz
Clicking on Mark Down Quiz function
Found Issue, Going Next Course Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}
(Session info: headless chrome=98.0.4758.102)
Stacktrace:
#0 0x559e6a0edb33
#1 0x559e69bb66d8
#2 0x559e69bec6f1
#3 0x559e69bec8b1
#4 0x559e69be1067
#5 0x559e69c0a08d
#6 0x559e69be0fa3
#7 0x559e69c0a16e
#8 0x559e69c1d2fb
#9 0x559e69c09f53
#10 0x559e69bdfa0a
#11 0x559e69be0ad5
#12 0x559e6a11f2fd
#13 0x559e6a1384bb
#14 0x559e6a1210d5
#15 0x559e6a139145
#16 0x559e6a114aaf
#17 0x559e6a155ba8
#18 0x559e6a155d28
#19 0x559e6a17048d
#20 0x7f3111490402
Script Execution Complete
Educative Scraper (version 7.2), developed by Anilabha Datta
######################################################################
Main Exception Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /home/osboxes/Desktop/educative.io_scraper-master/Chrome-bin/linux/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x5576c832eb33
#1 0x5576c7df76d8
#2 0x5576c7e1a84c
#3 0x5576c7e15fca
#4 0x5576c7e50e0a
#5 0x5576c7e4af53
#6 0x5576c7e20a0a
#7 0x5576c7e21ad5
#8 0x5576c83602fd
#9 0x5576c83794bb
#10 0x5576c83620d5
#11 0x5576c837a145
#12 0x5576c8355aaf
#13 0x5576c8396ba8
#14 0x5576c8396d28
#15 0x5576c83b148d
#16 0x7f3895f57b43
Describe the bug
Hi Anil,
I am running into an issue where scraping stalls at one point. I have tried multiple times and it gives an error each time.
I tried 3 to 4 times, reconfiguring Ubuntu and MX Linux, but it stops at the same point every time.
https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-python/7A3RLnnRvOy
Hi, I got the error in the title after scraping this course. It only downloads the first 3 chapters and then the browser stops. What should I do? Thank you.
Describe the bug
The bug occurs while trying to download this page: Challenge 1: Implement the Rectangle Class Using the Concepts of Encapsulation
Expected behavior
The page downloads successfully.
Hi @anilabhadatta, after adding the above driver I am able to log in, but the scraper fails on the first page:
0 https://www.educative.io/courses/grokking-adv-system-design-intvw/YQ49jQx9Ry2
list index out of range
Is there any way around this?
Originally posted by @sarveshkhanna in #27 (reply in thread)
log
` 2023-11-26 10:47:32,159 - INFO - ApiUtility - Course URL Selector: //a[contains(@href, 'courses/distributed-systems-practitioners/')]
2023-11-26 10:47:32,185 - INFO - LoginAccount - Checking if logged in...
2023-11-26 10:47:32,194 - INFO - ApiUtility - Getting Course Collections JSON from URL: https://educative.io/api/collection/10370001/4891237377638400?work_type=collection
2023-11-26 10:47:32,194 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-26 10:47:32,946 - INFO - ExtensionScraper - API Urls: 166 == 166 :Topic Urls
2023-11-26 10:47:32,947 - INFO - ExtensionScraper - ----------------------------------------------------------------------------------
Scraping Topic: 165-some more things to discover: https://www.educative.io/courses/distributed-systems-practitioners/some-more-things-to-discover?showContent=true
2023-11-26 10:47:32,954 - INFO - LoginAccount - Checking if logged in...
2023-11-26 10:47:32,961 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4891237377638400/page/6134167328260096?work_type=collection
2023-11-26 10:47:32,962 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-26 10:47:33,008 - INFO - ApiUtility - Successfully fetched JSON API data
2023-11-26 10:47:33,008 - INFO - OSUtility - Sleeping for 10 seconds
2023-11-26 10:47:46,328 - INFO - SeleniumBasicUtility - Loading page and checking if something went wrong
2023-11-26 10:47:46,328 - INFO - OSUtility - Sleeping for 10 seconds
2023-11-26 10:47:56,397 - INFO - SeleniumBasicUtility - Waiting for webdriver to load topic page
2023-11-26 10:47:56,418 - INFO - SeleniumBasicUtility - Adding name attribute in next/back button
2023-11-26 10:47:56,422 - INFO - BrowserUtility - Scrolling Page
2023-11-26 10:47:56,808 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-26 10:47:58,811 - INFO - RemoveUtility - Removing blur with CSS
2023-11-26 10:47:58,837 - INFO - RemoveUtility - Removing mark-as-completed/completed tick mark
2023-11-26 10:47:58,900 - INFO - RemoveUtility - Removing unwanted elements
2023-11-26 10:47:58,905 - INFO - ShowUtility - Showing single markdown quiz solution
2023-11-26 10:47:58,909 - INFO - ShowUtility - No single markdown quiz solution found
2023-11-26 10:47:58,910 - INFO - ShowUtility - Showing code solutions
2023-11-26 10:47:58,914 - INFO - ShowUtility - No code solution found
2023-11-26 10:47:58,914 - INFO - ShowUtility - Showing hints
2023-11-26 10:47:58,917 - INFO - ShowUtility - No hints found
2023-11-26 10:47:58,918 - INFO - ShowUtility - Showing slides
2023-11-26 10:47:58,920 - INFO - ShowUtility - No slides found
2023-11-26 10:47:58,921 - INFO - SingleFileUtility - Fixing all object tags
2023-11-26 10:47:58,924 - INFO - SingleFileUtility - No object tag found
2023-11-26 10:47:58,924 - INFO - SingleFileUtility - Injecting important scripts
2023-11-26 10:47:58,931 - INFO - OSUtility - Sleeping for 5 seconds
2023-11-26 10:48:03,951 - INFO - OSUtility - Sleeping for 5 seconds
2023-11-26 10:48:08,954 - INFO - SingleFileUtility - Making code selectable
2023-11-26 10:48:08,964 - INFO - SingleFileUtility - No code found
2023-11-26 10:48:08,965 - INFO - SingleFileUtility - getSingleFileHtml: Getting SingleFile Html...
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Topic File Successfully Created
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Downloading Code and Quiz Files if found...
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Code and Quiz Files Downloaded if found.
2023-11-26 10:48:10,092 - INFO - ExtensionScraper - Started Scraping from Text File URL: ?showContent=true
2023-11-26 10:48:10,092 - INFO - BrowserUtility - Loading Browser...
2023-11-26 10:48:12,555 - INFO - BrowserUtility - Browser Initiated
2023-11-26 10:48:12,631 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 64: ApiUtility:getCourseUrl: 131: Message: invalid argument
(Session info: chrome=116.0.5845.96)
Stacktrace:
0 chromedriver 0x00000001007da65c chromedriver + 4318812
1 chromedriver 0x00000001007d2d00 chromedriver + 4287744
2 chromedriver 0x0000000100404644 chromedriver + 296516
3 chromedriver 0x00000001003ec430 chromedriver + 197680
4 chromedriver 0x00000001003e9fe0 chromedriver + 188384
5 chromedriver 0x00000001003eaafc chromedriver + 191228
6 chromedriver 0x00000001004067d4 chromedriver + 305108
7 chromedriver 0x000000010047b380 chromedriver + 783232
8 chromedriver 0x000000010047ad28 chromedriver + 781608
9 chromedriver 0x0000000100436178 chromedriver + 500088
10 chromedriver 0x0000000100436fc0 chromedriver + 503744
11 chromedriver 0x000000010079ac40 chromedriver + 4058176
12 chromedriver 0x000000010079f160 chromedriver + 4075872
13 chromedriver 0x0000000100762e68 chromedriver + 3829352
14 chromedriver 0x000000010079fc4c chromedriver + 4078668
15 chromedriver 0x0000000100777f08 chromedriver + 3915528
16 chromedriver 0x00000001007bc140 chromedriver + 4194624
17 chromedriver 0x00000001007bc2c4 chromedriver + 4195012
18 chromedriver 0x00000001007cc4d0 chromedriver + 4261072
19 libsystem_pthread.dylib 0x0000000187ec1034 _pthread_start + 136
20 libsystem_pthread.dylib 0x0000000187ebbe3c thread_start + 8`
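One possible reading of the log above: the line "Started Scraping from Text File URL: ?showContent=true" suggests that a blank line in the URL text file was turned into a URL, which the browser then rejected as an invalid argument. A minimal sketch of dropping such lines before scraping follows. `load_course_urls` is a hypothetical helper, not a function from the scraper, and the cause is an assumption rather than a confirmed diagnosis.

```python
def load_course_urls(text):
    """Return the non-empty URLs from the contents of a URL text file.

    Hypothetical helper: blank lines and surrounding whitespace are dropped,
    so an empty string is never passed on to the scraper.
    """
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if url:
            urls.append(url)
    return urls


# A file with one real URL followed by an empty line and a whitespace-only line.
sample = "https://www.educative.io/courses/distributed-systems-practitioners\n\n   \n"
print(load_course_urls(sample))
```

With this filter in place, a trailing newline at the end of the text file would no longer produce an extra, empty course entry.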
Is your feature request related to a problem? Please describe.
Currently, the code clipboards being downloaded have anonymous file names and aren't easily readable.
Current code clipboard path: ./code_clipboard0/code-0.txt
Describe the solution you'd like
Single code clipboard file for all the code clipboards per course topic. Sample result file attached.
Additional context
For example, to scrape the code "kubectl run db --image mongo" from the snippet below,
we can scrape the caption text from the span tag shown in the second attached image, and create a single file.
Our final code_clipboard.txt file can be like this:
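A rough sketch of how such a merge might work, assuming each clipboard has already been scraped as a (caption, code) pair. `merge_code_clipboards`, the "###" section layout, and the sample caption are all hypothetical and only illustrate the idea of one labeled file per topic.

```python
def merge_code_clipboards(snippets):
    """Join (caption, code) pairs into one text blob with a labeled section each."""
    sections = []
    for index, (caption, code) in enumerate(snippets):
        # Fall back to a numbered label when a snippet has no caption.
        label = caption.strip() or f"code-{index}"
        sections.append(f"### {label}\n{code.rstrip()}\n")
    return "\n".join(sections)


# The caption here is made up for illustration; the command is from the request.
snippets = [("Run a MongoDB pod", "kubectl run db --image mongo")]
print(merge_code_clipboards(snippets))
```

The returned string could then be written once per topic instead of creating one code-N.txt file per clipboard.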
Thanks for taking the time to create this great tool. Mostly, it's working just fine, but sometimes I get the error below. Any idea how I can fix it? Thanks!
Take Screenshot Function
Found Issue, Going Next Course Message: no such element: Unable to locate element: {"method":"css selector","selector":"div[class*='ArticlePage']"}
(Session info: headless chrome=106.0.5249.103)
Stacktrace:
Backtrace:
Ordinal0 [0x01181ED3+2236115]
Ordinal0 [0x011192F1+1807089]
Ordinal0 [0x010266FD+812797]
Ordinal0 [0x010555DF+1005023]
Ordinal0 [0x010557CB+1005515]
Ordinal0 [0x01087632+1209906]
Ordinal0 [0x01071AD4+1120980]
Ordinal0 [0x010859E2+1202658]
Ordinal0 [0x010718A6+1120422]
Ordinal0 [0x0104A73D+960317]
Ordinal0 [0x0104B71F+964383]
GetHandleVerifier [0x0142E7E2+2743074]
GetHandleVerifier [0x014208D4+2685972]
GetHandleVerifier [0x01212BAA+532202]
GetHandleVerifier [0x01211990+527568]
Ordinal0 [0x0112080C+1837068]
Ordinal0 [0x01124CD8+1854680]
Ordinal0 [0x01124DC5+1854917]
Ordinal0 [0x0112ED64+1895780]
BaseThreadInitThunk [0x76C9FA29+25]
RtlGetAppContainerNamedObjectPath [0x77067B5E+286]
RtlGetAppContainerNamedObjectPath [0x77067B2E+238]
Hi, several times I encountered this error:
2023-10-15 15:24:22,586 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 105: SeleniumBasicUtility:setWindowSize: 112: Message: unknown error: failed to change window state to 'normal', current state is 'maximized'
(Session info: chrome=118.0.5993.70)
I'm not sure what the root cause is.
I got some error related to judgeContentPrepend.
2023-10-22 17:17:08,754 - ERROR - ExtensionScraper - ExtensionScraper:scrapeCourse: https://www.educative.io/courses/coderust-hacking-the-coding-interview/find-low-high-index-of-a-key-in-a-sorted-array?showContent=true, 88: ExtensionScraper:scrapeTopic: 138: CodeUtility:downloadCodeFiles: 28: CodeUtility:downloadTabbedCode: 85: CodeUtility:downloadCode: 54: 'judgeContentPrepend'
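The bare 'judgeContentPrepend' at the end of the message looks like a Python KeyError for a key that some topics may not carry. A minimal sketch of a tolerant lookup follows, assuming the code content is a plain dict. `read_prepend_code` and the sample dicts are hypothetical and not taken from CodeUtility.

```python
def read_prepend_code(content):
    """Return the optional 'judgeContentPrepend' text, or "" when it is absent."""
    # dict.get avoids the KeyError that content["judgeContentPrepend"] would raise.
    return content.get("judgeContentPrepend") or ""


print(repr(read_prepend_code({"judgeContentPrepend": "import java.util.*;"})))
print(repr(read_prepend_code({"language": "java"})))
```

Treating the field as optional would let the download continue for topics that lack it, at the cost of silently skipping the prepended code.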
Describe the bug
Hi @anilabhadatta, I am running this script in VirtualBox, where the guest OS is Ubuntu 22.04 and the host is Windows 11. The issue I am facing is that while files are being scraped in the background, the size of the Ubuntu VDI image grows at an alarming speed. It was initially 23 GB; by the time I had downloaded around 500 MB of files, the VDI had grown to 35 GB, and it is now going beyond 40 GB.
So the footprint of the VDI image keeps increasing while nothing changes inside the guest OS. It looks like it is impacting the base.
I have attached the screenshot.
It is still going.
I tried to download the course Learn Git the Hard Way, but it doesn't work.
This is the address in urls.txt: https://www.educative.io/courses/learn-git-hard-way/JYOGn0Bkv5J
Hello,
there seems to be an issue with the text in the courses.
Some paragraphs have white text color.
Is there anything that can be done?
Describe the bug
I am not able to download https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-java/B84xWoY7YB2
I get the error below:
4 https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-java/B84xWoY7YB2
Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}
(Session info: chrome=96.0.4664.0)
Stacktrace:
Backtrace:
Ordinal0 [0x00B66903+2517251]
Ordinal0 [0x00AFF8E1+2095329]
Ordinal0 [0x00A02848+1058888]
Ordinal0 [0x00A2D448+1233992]
Ordinal0 [0x00A2D63B+1234491]
Ordinal0 [0x00A23AB1+1194673]
Ordinal0 [0x00A4650A+1336586]
Ordinal0 [0x00A23A36+1194550]
Ordinal0 [0x00A465BA+1336762]
Ordinal0 [0x00A55BBF+1399743]
Ordinal0 [0x00A4639B+1336219]
Ordinal0 [0x00A227A7+1189799]
Ordinal0 [0x00A23609+1193481]
GetHandleVerifier [0x00CF5904+1577972]
GetHandleVerifier [0x00DA0B97+2279047]
GetHandleVerifier [0x00BF6D09+534521]
GetHandleVerifier [0x00BF5DB9+530601]
Ordinal0 [0x00B04FF9+2117625]
Ordinal0 [0x00B098A8+2136232]
Ordinal0 [0x00B099E2+2136546]
Ordinal0 [0x00B13541+2176321]
BaseThreadInitThunk [0x76427D49+25]
RtlInitializeExceptionChain [0x771FB74B+107]
RtlClearBits [0x771FB6CF+191]
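In Selenium, find_element raises when nothing matches, which is what the message above reports for the "span" selector. A hedged sketch of a lookup that returns None instead is shown below. It relies on find_elements, which returns an empty list when there is no match. `first_or_none` and `FakeContainer` are hypothetical names used only for illustration, and the stand-in class exists so the sketch runs without a browser.

```python
def first_or_none(container, css_selector):
    """Return the first element matching css_selector, or None if none match.

    `container` is any Selenium WebDriver or WebElement. "css selector" is the
    string value behind selenium's By.CSS_SELECTOR.
    """
    matches = container.find_elements("css selector", css_selector)
    return matches[0] if matches else None


class FakeContainer:
    """Stand-in for a WebElement so the sketch runs without a browser."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return list(self._elements.get(value, []))


quiz = FakeContainer({"button": ["next"]})
print(first_or_none(quiz, "button"))  # next
print(first_or_none(quiz, "span"))    # None
```

A caller could then skip the quiz step when None comes back, rather than abandoning the whole course.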
Using the most recent version of the scraper.
I'm running into:
Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98
Current browser version is 110.0.5481.100
I'm using the binary and driver from the most recent release. I'm running on an Apple M1 (ARM64) chip. Any tips?
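The message indicates that the driver and the browser disagree on the major version (98 versus 110). A small sketch of checking this before starting a session follows. `major_version` and `versions_compatible` are hypothetical helpers, and how the two version strings are obtained is left out on purpose.

```python
def major_version(version):
    """Return the major component of a dotted version string, e.g. 110."""
    return int(version.strip().split(".")[0])


def versions_compatible(driver_version, browser_version):
    """True when the driver and the browser share the same major version."""
    return major_version(driver_version) == major_version(browser_version)


# Values taken from the error message above.
print(versions_compatible("98", "110.0.5481.100"))  # False
```

A False result here would point to a browser binary other than the bundled one being launched, or to a driver that needs to be replaced with one matching the installed browser.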
Describe the bug
When I try to download a course, I get the following error:
ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 80: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal
To Reproduce
The problem happens with any course I try to download.
Expected behavior
I expect the course to be downloaded
Screenshots/ScreenRecord
EducativeScraper.log
Add the Log file and mention the timestamp and topicUrl.
2023-11-20 12:26:08,551 - INFO - StartScraper - StartScraper Initiated...
To Terminate, Click on Stop ScraperType Button
2023-11-20 12:26:08,671 - INFO - ExtensionScraper - ExtensionScraper initiated...
2023-11-20 12:26:08,672 - INFO - ExtensionScraper - Started Scraping from Text File URL: https://www.educative.io/courses/machine-learning-numpy-pandas-scikit-learn?showContent=true
2023-11-20 12:26:08,672 - INFO - BrowserUtility - Loading Browser...
2023-11-20 12:26:09,050 - INFO - BrowserUtility - Browser Initiated
2023-11-20 12:26:10,797 - INFO - ApiUtility - Course Type Selector: a[href*='/courses/']
2023-11-20 12:26:10,952 - INFO - ApiUtility - Getting Next Data
2023-11-20 12:26:10,967 - INFO - ApiUtility - Getting Course Topic URLs List from URL: https://www.educative.io/courses/machine-learning-numpy-pandas-scikit-learn/overview?showContent=true
2023-11-20 12:26:12,865 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:12,865 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:14,883 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:14,883 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:16,892 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:16,892 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:18,903 - INFO - ApiUtility - Course URL Selector: //a[contains(@href, 'courses/')]
2023-11-20 12:26:18,911 - INFO - LoginAccount - Checking if logged in...
2023-11-20 12:26:18,917 - INFO - ApiUtility - Getting Course Collections JSON from URL: https://educative.io/api/collection/6083138522447872/5629499534213120?work_type=collection
2023-11-20 12:26:18,917 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-20 12:26:19,479 - INFO - ExtensionScraper - API Urls: 87 == 89 :Topic Urls
2023-11-20 12:26:19,609 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 80: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal
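The log shows 87 URLs from the API against 89 from the topic list, but not which entries differ. A minimal sketch of listing the mismatched entries is shown below. `diff_topic_urls` and the sample paths are hypothetical, and the sketch assumes both sides can be reduced to comparable URL strings.

```python
def diff_topic_urls(api_urls, page_urls):
    """Return (only_in_api, only_in_page), preserving the original order."""
    api_set, page_set = set(api_urls), set(page_urls)
    only_in_api = [url for url in api_urls if url not in page_set]
    only_in_page = [url for url in page_urls if url not in api_set]
    return only_in_api, only_in_page


# Made-up paths standing in for the two URL lists.
api_urls = ["/courses/x/a", "/courses/x/b"]
page_urls = ["/courses/x/a", "/courses/x/b", "/courses/x/c"]
print(diff_topic_urls(api_urls, page_urls))  # ([], ['/courses/x/c'])
```

Logging the two lists returned here alongside the counts would make it easier to see which topics cause the mismatch.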
Followed the readme. Got the below result. I'm on Win64.
0 https://www.educative.io/courses/cpp-fundamentals-for-professionals/qAmyZQB2qm2
Message: no such element: Unable to locate element: {"method":"css selector","selector":"h4"}
(Session info: headless chrome=96.0.4664.110)
Stacktrace:
Backtrace:
Ordinal0 [0x00C96903+2517251]
Ordinal0 [0x00C2F8E1+2095329]
Ordinal0 [0x00B32848+1058888]
Ordinal0 [0x00B5D448+1233992]
Ordinal0 [0x00B5D63B+1234491]
Ordinal0 [0x00B53AB1+1194673]
Ordinal0 [0x00B7650A+1336586]
Ordinal0 [0x00B53A36+1194550]
Ordinal0 [0x00B765BA+1336762]
Ordinal0 [0x00B85BBF+1399743]
Ordinal0 [0x00B7639B+1336219]
Ordinal0 [0x00B527A7+1189799]
Ordinal0 [0x00B53609+1193481]
GetHandleVerifier [0x00E25904+1577972]
GetHandleVerifier [0x00ED0B97+2279047]
GetHandleVerifier [0x00D26D09+534521]
GetHandleVerifier [0x00D25DB9+530601]
Ordinal0 [0x00C34FF9+2117625]
Ordinal0 [0x00C398A8+2136232]
Ordinal0 [0x00C399E2+2136546]
Ordinal0 [0x00C43541+2176321]
BaseThreadInitThunk [0x7610FA29+25]
RtlGetAppContainerNamedObjectPath [0x77127B5E+286]
RtlGetAppContainerNamedObjectPath [0x77127B2E+238]
I had to pause the download while it was saving https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/evaluation-of-a-distributed-caches-design
I resumed with the course link updated to the URL above. It fails with the stack trace below:
source env/bin/activate
❯ python3 EducativeScraper.py
Educative Scraper (v3.1.0 Master Branch), developed by Anilabha Datta
Project Link: https://github.com/anilabhadatta/educative.io_scraper/tree/v3-dev
Check out ReadMe for more information about this project.
Use the GUI to start scraping.
2023-10-05 08:57:12.534 Python[401:1285289] WARNING: Secure coding is not enabled for restorable state! Enable secure coding by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState: and returning YES.
2023-10-05 08:57:12,928 - INFO - HomeScreen - Creating Home Screen...
2023-10-05 08:57:12,942 - DEBUG - HomeScreen - fixGeometry called
2023-10-05 08:57:12,984 - DEBUG - HomeScreen - fixGeometry completed
2023-10-05 08:57:12,986 - DEBUG - HomeScreen - createHomeScreen completed
2023-10-05 08:57:14,841 - INFO - HomeScreen - Starting Chrome Driver...
Path: /Users/amisra/dev/educative.io_scraper/src/ChromeDrivers/mac/chromedriver-mac-arm64/chromedriver
2023-10-05 08:57:14,860 - DEBUG - HomeScreen - startChromeDriver completed
2023-10-05 08:57:27,192 - INFO - HomeScreen - Starting Chrome Driver...
Path: /Users/amisra/dev/educative.io_scraper/src/ChromeDrivers/mac/chromedriver-mac-arm64/chromedriver
2023-10-05 08:57:27,214 - DEBUG - HomeScreen - startChromeDriver completed
2023-10-05 08:57:35,674 - DEBUG - HomeScreen - startScraper called
2023-10-05 08:57:35,734 - DEBUG - HomeScreen - startScraper completed
2023-10-05 08:57:36,074 - INFO - StartScraper - StartScraper Initiated...
To Terminate, Click on Stop Scraper Button
2023-10-05 08:57:36,095 - INFO - ExtensionScraper - ExtensionScraper initiated...
2023-10-05 08:57:36,104 - INFO - ExtensionScraper - Started Scraping from Text File URL: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/evaluation-of-a-distributed-caches-design
2023-10-05 08:57:36,104 - INFO - BrowserUtility - Loading Browser...
2023-10-05 08:57:38,070 - INFO - BrowserUtility - Browser Initiated
2023-10-05 08:57:40,029 - INFO - ApiUtility - Course Type Selector: a[href*='/courses/']
2023-10-05 08:57:40,043 - INFO - ApiUtility - Getting Next Data
2023-10-05 08:57:40,064 - INFO - ApiUtility - Getting Course Topic URLs List from URL: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/introduction-to-modern-system-design
2023-10-05 08:57:43,384 - DEBUG - SeleniumBasicUtility - Expanding all sections function
2023-10-05 08:57:53,849 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 47: ExtensionScraper:scrapeCourse: 61: ApiUtility:getCourseTopicUrlsList: 107: SeleniumBasicUtility:expandAllSections: 33: Message:
Stacktrace:
0 chromedriver 0x00000001013e6d98 chromedriver + 4337048
1 chromedriver 0x00000001013dee14 chromedriver + 4304404
2 chromedriver 0x000000010100ba5c chromedriver + 293468
3 chromedriver 0x0000000101050d50 chromedriver + 576848
4 chromedriver 0x000000010108b908 chromedriver + 817416
5 chromedriver 0x0000000101044a5c chromedriver + 526940
6 chromedriver 0x0000000101045908 chromedriver + 530696
7 chromedriver 0x00000001013acde4 chromedriver + 4099556
8 chromedriver 0x00000001013b12a0 chromedriver + 4117152
9 chromedriver 0x00000001013b752c chromedriver + 4142380
10 chromedriver 0x00000001013b1da0 chromedriver + 4119968
11 chromedriver 0x0000000101389a74 chromedriver + 3955316
12 chromedriver 0x00000001013cea48 chromedriver + 4237896
13 chromedriver 0x00000001013cebc4 chromedriver + 4238276
14 chromedriver 0x00000001013dea8c chromedriver + 4303500
15 libsystem_pthread.dylib 0x000000018a2bf034 _pthread_start + 136
16 libsystem_pthread.dylib 0x000000018a2b9e3c thread_start + 8
2023-10-05 08:57:53,850 - DEBUG - StartScraper - Exiting Scraper..
It happens with any course page. I am not sure whether this is a bug in the scraper, such as a driver issue, or something else, but it is annoying and I cannot print the page.
Any help on this would be appreciated.
On Windows 11, using pyenv and Python 3.11.2. I just cloned the latest version and downloaded the Chrome driver and binary.
I experience the following crash on every course I try.
ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
GetHandleVerifier [0x00007FF6535A8EF2+54786]
(No symbol) [0x00007FF653515612]
(No symbol) [0x00007FF6533CA64B]
(No symbol) [0x00007FF65340B79C]
(No symbol) [0x00007FF65340B91C]
(No symbol) [0x00007FF653446D87]
(No symbol) [0x00007FF65342BEAF]
(No symbol) [0x00007FF653444D02]
(No symbol) [0x00007FF65342BC43]
(No symbol) [0x00007FF653400941]
(No symbol) [0x00007FF653401B84]
GetHandleVerifier [0x00007FF6538F7F52+3524194]
GetHandleVerifier [0x00007FF65394D800+3874576]
GetHandleVerifier [0x00007FF653945D7F+3843215]
GetHandleVerifier [0x00007FF653645086+694166]
(No symbol) [0x00007FF653520A88]
(No symbol) [0x00007FF65351CA94]
(No symbol) [0x00007FF65351CBC2]
(No symbol) [0x00007FF65350CC83]
BaseThreadInitThunk [0x00007FFDE605257D+29]
RtlUserThreadStart [0x00007FFDE82CAA78+40]
2023-10-18 08:48:11,034 - DEBUG - StartScraper - Exiting Scraper...
Hello, thank you for creating an excellent project, it is very helpful to me.
I have an idea: you could add the ability to generate pages with links similar to the one below. Generating images is not conducive to reading.
url-shortening-service-like-tiny-url
Thank you very much.
Hey there,
I was unable to use the executable on either macOS or Windows, so I decided to check out the repo and see whether I could make it work from there.
I followed everything in the instructions and was able to launch Chromium and authenticate, but when it's time to start scraping, it fails with this error:
Main Exception local variable 'driver' referenced before assignment
I did the same on both Windows and macOS, with the same result: I can launch the driver and log in with Chromium, then this error happens.
I solved the problem above. The folder name cannot contain symbols such as '?' or ':'. This renaming problem may only occur on the Chinese edition of Windows 10 or 11.
The code I changed is as follows. I added a regex to replace the offending symbols with spaces. Maybe you have a better solution!
# changed by woden
import re


def replace_filename(match):
    # Map each character that is not allowed in a folder name to a space.
    numDict = {':': ' ', '?': ' ', '|': ' ', '>': ' ', '<': ' ', '/': ' '}
    print(match.group())
    return numDict[match.group()]
# end


def scrape_page(driver, file_index):
    scroll_page(driver)
    wait_webdriver(driver)
    title = get_file_name(driver)
    check_page(title)
    file_name = str(file_index) + "-" + title
    # change by woden: strip invalid characters before creating the folder
    a = re.sub(r'[:?|></]', replace_filename, file_name)
    # a = ''.join(filter(lambda i: i in [' '] or i.isalnum(),file_name))
    # end
    driver.set_window_size(1920, get_current_height(driver))
    remove_nav_tags(driver)
    show_hints_answer(driver)
    mark_down_quiz(driver)
    show_code_box_answer(driver)
    open_slides(driver)
    create_folder(a)
    quiz_html = take_quiz_screenshot(driver)
    # take_screenshot(driver, file_name, quiz_html)
    add_name_tag_in_next_back_button(driver)
    fix_all_svg_tags_inside_object_tags(driver)
    # woden change
    get_pagecontent_using_singleFile(driver, "main", quiz_html)
    # end
    code_widget_type(driver)
    code_container_download_type(driver)
    code_container_clipboard_type(driver)
    demark_as_completed(driver)
    # Stop when there is no next page.
    if not next_page(driver):
        sleep(5)
        return False
    return True


def create_course_folder(driver, url):
    print("Create Course Folder Function")
    course_name = get_file_name(driver, True)
    # Same sanitizing step for the course folder name.
    x = re.sub(r'[:?|></]', replace_filename, course_name)
    create_folder(x)
    print("Inside Course Folder")
I have tested this code and found no problems.
But I am a Python beginner, so the code above may have some bugs.
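As one possible alternative, the replacement can be done with a single compiled pattern and no callback. The sketch below also covers the remaining characters Windows rejects in names (double quote, backslash, asterisk). `sanitize_name` is a hypothetical helper, not part of the scraper.

```python
import re

# Characters that Windows rejects in file and folder names.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def sanitize_name(name, replacement=" "):
    """Replace invalid characters, collapse whitespace, trim the ends."""
    cleaned = INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a dot or a space.
    return re.sub(r"\s+", " ", cleaned).strip(" .")


print(sanitize_name("Challenge 1: Implement the Rectangle Class?"))
# Challenge 1 Implement the Rectangle Class
```

The same helper could be applied to both the topic file name and the course folder name, so the two stay consistent.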
How can I increase the script timeout, so that the scraper doesn't go to the next course without completing the previous one?
If you don't want to implement it here, then please tell me and I'll do it in my local script.
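For a local change, Selenium drivers expose set_script_timeout and set_page_load_timeout. A minimal sketch of raising both follows. `extend_timeouts`, the 120-second value, and `FakeDriver` are assumptions for illustration, and where the scraper creates its driver is not shown here.

```python
def extend_timeouts(driver, seconds=120):
    """Raise the async-script and page-load timeouts on a Selenium driver."""
    driver.set_script_timeout(seconds)     # limit for execute_async_script calls
    driver.set_page_load_timeout(seconds)  # limit for page navigation
    return driver


class FakeDriver:
    """Stand-in that records the calls, so the sketch runs without a browser."""

    def __init__(self):
        self.timeouts = {}

    def set_script_timeout(self, seconds):
        self.timeouts["script"] = seconds

    def set_page_load_timeout(self, seconds):
        self.timeouts["page_load"] = seconds


driver = extend_timeouts(FakeDriver(), seconds=300)
print(driver.timeouts)  # {'script': 300, 'page_load': 300}
```

Calling this once, right after the driver is created, would apply the longer limits to every page in the run.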
Describe the bug
The bug occurs while trying to download this course: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers
As you can see, it downloaded the same module many times.
I downloaded the packages from releases and am running both executables at the same time. I'm not sure what the correct config is, but I created a text file with URL links in it and was able to log in through the code. When I start scraping, however, it says: Main Exception local variable 'driver' referenced before assignment
Anything you want me to try apart from what you've mentioned in the readme? A little help would be appreciated.
Describe the bug
Please Find logs below --
0 chromedriver 0x0000000100a98ee9 chromedriver + 5013225
1 chromedriver 0x0000000100a241d3 chromedriver + 4534739
2 chromedriver 0x00000001005faa68 chromedriver + 170600
3 chromedriver 0x000000010061cc9a chromedriver + 310426
4 chromedriver 0x0000000100618756 chromedriver + 292694
5 chromedriver 0x0000000100652550 chromedriver + 529744
6 chromedriver 0x000000010064c6d3 chromedriver + 505555
7 chromedriver 0x000000010062276e chromedriver + 333678
8 chromedriver 0x0000000100623745 chromedriver + 337733
9 chromedriver 0x0000000100a54efe chromedriver + 4734718
10 chromedriver 0x0000000100a6ea19 chromedriver + 4839961
11 chromedriver 0x0000000100a741c8 chromedriver + 4862408
12 chromedriver 0x0000000100a6f3aa chromedriver + 4842410
13 chromedriver 0x0000000100a49a01 chromedriver + 4688385
14 chromedriver 0x0000000100a8a538 chromedriver + 4953400
15 chromedriver 0x0000000100a8a6c1 chromedriver + 4953793
16 chromedriver 0x0000000100aa0225 chromedriver + 5042725
17 libsystem_pthread.dylib 0x00007ff80a8964e1 _pthread_start + 125
18 libsystem_pthread.dylib 0x00007ff80a891f6b thread_start + 15
I appreciate your efforts to create this, but I have no idea how to start it; the explanation is more confusing than helpful.
I get "Found Issue, Going Next Course Message: javascript error: Maximum call stack size exceeded" when scraping the 22nd lesson in https://www.educative.io/courses/grokking-the-machine-learning-interview/JEOBmok417l.
The problem occurs after "Get HTML Page Content Using Single File Function".
Is this problem caused by my network or something else?
The log message is as follows:
22 https://www.educative.io/courses/grokking-the-machine-learning-interview/R1VOnXE904L
Message: javascript error: Maximum call stack size exceeded
(Session info: headless chrome=96.0.4664.110)
Stacktrace:
Backtrace:
Ordinal0 [0x00D06903+2517251]
Ordinal0 [0x00C9F8E1+2095329]
Ordinal0 [0x00BA2848+1058888]
Ordinal0 [0x00BA4F44+1068868]
Ordinal0 [0x00BA4E0E+1068558]
Ordinal0 [0x00BA56BA+1070778]
Ordinal0 [0x00BF64F9+1402105]
Ordinal0 [0x00BE64D3+1336531]
Ordinal0 [0x00BF5BBF+1399743]
Ordinal0 [0x00BE639B+1336219]
Ordinal0 [0x00BC27A7+1189799]
Ordinal0 [0x00BC3609+1193481]
GetHandleVerifier [0x00E95904+1577972]
GetHandleVerifier [0x00F40B97+2279047]
GetHandleVerifier [0x00D96D09+534521]
GetHandleVerifier [0x00D95DB9+530601]
Ordinal0 [0x00CA4FF9+2117625]
Ordinal0 [0x00CA98A8+2136232]
Ordinal0 [0x00CA99E2+2136546]
Ordinal0 [0x00CB3541+2176321]
BaseThreadInitThunk [0x761E6739+25]
RtlGetFullPathName_UEx [0x77E48FEF+1215]
RtlGetFullPathName_UEx [0x77E48FBD+1165]
Am I missing a step here, or is the Chrome binary supposed to be there?
Platform: Linux
Describe the bug
I have this issue while trying to download a course
Found Issue, Going Next Course Message: no such element: Unable to locate element: {"method":"css selector","selector":"h4"}
Expected behavior
The course is downloaded successfully.
Desktop (please complete the following information):
I found that the course Learn Ruby from Scratch (https://www.educative.io/courses/learn-ruby-from-scratch) contains mini-project sections.
However, when scraping a mini-project page, the scraper fails with the error "Found Issue, Going Next Course Expected to find two og:title elements."
Maybe you could add a condition to skip scraping the project sections?
Describe the bug
Course scrape fails with the following error:
ERROR - StartScraper - start: 20: ExtensionScraper:start: 43: ExtensionScraper:scrapeCourse: 67: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Scraper should scrape course
Desktop (please complete the following information):
Running windows 10
Describe the bug
After one scrape attempt, the next time I try to run the scraper it fails due to a version incompatibility with Chrome. I'm not sure whether this is an issue with the script itself or whether Chrome somehow auto-updates itself after its first use.
I'm getting this error
Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98 Current browser version is 103.0.5060.134 with binary path /Users/deniz946/Documents/Projects/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome
To Reproduce
Steps to reproduce the behavior:
Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98 Current browser version is 103.0.5060.134 with binary path /Users/deniz946/Documents/Projects/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome
Expected behavior
Start scraping the courses listed in the URLs file
Desktop (please complete the following information):
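The error text above already contains both version numbers, so a mismatch can be detected before retrying. The following is a minimal sketch that only parses the message format quoted in this report; `parse_version_mismatch` is a hypothetical helper, not part of the scraper.

```python
import re


def parse_version_mismatch(message: str):
    """Return (driver_major, browser_major) from a Selenium
    'session not created' message, or None if the text does not match."""
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is (\d+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


msg = ("session not created: This version of ChromeDriver only supports "
       "Chrome version 98 Current browser version is 103.0.5060.134")
print(parse_version_mismatch(msg))
```

When the two major versions differ, the usual remedy is to make them match, either by using a ChromeDriver build for the installed browser version or by keeping the bundled browser from updating.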
Why not bundle all the required binaries together and provide a single command to run?
I have to figure out the versions of all the required binaries and solve many other environment problems, which makes the tool hard, even painful, to use.
I am getting this error in my terminal (after successfully logging in and creating a config file):
Exception, Driver exited 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Main Exception local variable 'driver' referenced before assignment
Press Enter to continue
Do you know why I am getting this?
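In Python, the second message typically appears when a local variable is assigned inside a `try` block, the assignment fails, and later cleanup code still refers to that variable. In that reading, the first message (the UTF-8 decode error) is the real failure and the `driver` message is a side effect. A byte `0xff` at position 0 is consistent with a file saved as UTF-16 with a byte order mark, although that is only one possible cause and is not confirmed here. The sketch below shows the general pattern and a common guard; `run_scraper` and `load_driver` are hypothetical names, not the project's actual functions.

```python
def run_scraper(load_driver):
    """Run one scrape attempt and always clean up safely."""
    driver = None  # defined before `try`, so cleanup can always refer to it
    try:
        driver = load_driver()  # may raise, e.g. while reading a config file
        return "scraped"
    except Exception as exc:
        # Report the original failure instead of masking it.
        return f"Exception, Driver exited {exc}"
    finally:
        # Without `driver = None` above, this line would raise
        # "local variable 'driver' referenced before assignment".
        if driver is not None:
            driver.quit()
```

With this guard the original exception stays visible, which makes it easier to find out which file could not be decoded.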
I cloned the latest updated repo
I am trying to download the below course
https://www.educative.io/courses/grokking-coding-interview-patterns-java/g2wXzBm72L6
but always get this error just after starting the download
Hello! I ran into a problem when using version 6.6 to scrape https://www.educative.io/module/lesson/dynamic-programming-cpp/qV7yoAqYgPG.
The log shows '3 https://www.educative.io/module/lesson/dynamic-programming-cpp/qV7yoAqYgPG
too many values to unpack (expected 2)'
Please help!
These are some of the pages where the program crashes:
https://www.educative.io/courses/grokking-the-behavioral-interview/YQXMlDkqpoA
https://www.educative.io/courses/recursion-for-coding-interviews-in-java/m2qwWvOjR0R
2023-11-03 19:07:40,690 - INFO - BrowserUtility - Loading Browser...
2023-11-03 19:07:48,919 - ERROR - LoginAccount - start: 25: BrowserUtility:loadBrowser: 48: HTTPConnectionPool(host='127.0.0.1', port=9515): Max retries exceeded with url: /session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001792B0B2310>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
2023-11-03 19:07:48,919 - DEBUG - LoginAccount - Exiting...
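The `WinError 10061` in this log means that nothing accepted the connection on 127.0.0.1:9515, which is ChromeDriver's default port. A quick way to check whether a driver process is actually listening is a plain socket probe. This is a diagnostic sketch only; `is_port_open` is a hypothetical helper and not part of the scraper.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts.
        return False


if __name__ == "__main__":
    # 9515 is the port shown in the log above.
    print(is_port_open("127.0.0.1", 9515))
```

If this prints `False` while the scraper is trying to load the browser, the ChromeDriver executable most likely did not start or exited early, so the next things to check are its path, its permissions, and whether it is being blocked by security software.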
Hi, I've been trying to use your script on both Windows and macOS, but I've been unable to get it working.
I've run into several problems with both the executables and the source code from this repo.
On Windows I suspect you might have hardcoded Python in the executable. I get this when running it:
No Python at 'C:\Users\anila\AppData\Local\Programs\Python\Python39\python.exe'