anilabhadatta / educative.io_scraper

Educative.io Course Downloader developed using Python and Selenium. Refer Readme.md for setup instructions.

License: MIT License

Python 100.00%
python selenium chrome chromedriver educative-downloader educativeio-downloader html javascript educative-scraper automation

educative.io_scraper's People

Contributors

anhpho, anilabhadatta, boostupstation

educative.io_scraper's Issues

Not working any more

Describe the bug

##############
Scraper Started, Log file can be found in Save directory

Driver Loaded

                        [Selected config: 0] Starting Scraping: 0, https://www.educative.io/courses/decode-coding-interview-java/3woxv5GoKmQ

Load Webpage Function

Checking Login Function

Checking for captcha Function...

Create Course Folder Function

Getting File Name

This is a module page

Get File name module

Inside Course Folder

Checking Login Function

Checking for captcha Function...

Scrolling Page

Getting File Name

This is a module page

Get File name module

Checking page

Removing Nav tags from page

Node deleted div[class*='ed-grid'] > nav

Node deleted div[class*='ed-grid'] > nav

Show Hints Function

No hints found

Inside find_mark_down_quiz_containers function

No mark down quiz_container found

Show Codebox Answers Function

No Codebox answers found

Finding Slides Function

No Slides Found

Inside take_quiz_screenshot function

Quiz not found

Adding Name Tag in Next Back Button

Fixing SVG Tags inside Object Tags

Get HTML Page Content Using Single File Function

make_code_selectable function

make_code_selectable function executed

Creating HTML File

HTML File Created

HTML Page content taken.

Inside Widget Container Function

No widget container found

Code Container Download Type Function

No Code Container Downloadable Type found

Code Container Clipboard Type Function

No code containers found

Remove Mark completed

Next Page Function

Going Next Page

--------------- 0 Complete-------------------

Checking Login Function

Checking for captcha Function...

Scrolling Page

Getting File Name

This is a module page

Get File name module

Checking page

Removing Nav tags from page

Node deleted div[class*='ed-grid'] > nav

Node deleted div[class*='ed-grid'] > nav

Show Hints Function

No hints found

Inside find_mark_down_quiz_containers function

No mark down quiz_container found

Show Codebox Answers Function

No Codebox answers found

Finding Slides Function

No Slides Found

Inside take_quiz_screenshot function

Quiz not found

Adding Name Tag in Next Back Button

Fixing SVG Tags inside Object Tags

Get HTML Page Content Using Single File Function

make_code_selectable function

make_code_selectable function executed

Creating HTML File

HTML File Created

HTML Page content taken.

Inside Widget Container Function

No widget container found

Code Container Download Type Function

No Code Container Downloadable Type found

Code Container Clipboard Type Function

No code containers found

Remove Mark completed

Next Page Function

Going Next Page

--------------- 1 Complete-------------------

Checking Login Function

Checking for captcha Function...

Scrolling Page

Getting File Name

This is a module page

Get File name module

Checking page

Removing Nav tags from page

Node deleted div[class*='ed-grid'] > nav

Node deleted div[class*='ed-grid'] > nav

Show Hints Function

No hints found

Inside find_mark_down_quiz_containers function

No mark down quiz_container found

Show Codebox Answers Function

No Codebox answers found

Finding Slides Function

No Slides Found

Inside take_quiz_screenshot function

Quiz not found

Adding Name Tag in Next Back Button

Fixing SVG Tags inside Object Tags

Get HTML Page Content Using Single File Function

make_code_selectable function

make_code_selectable function executed

Creating HTML File

HTML File Created

HTML Page content taken.

Inside Widget Container Function

No widget container found

Code Container Download Type Function

No Code Container Downloadable Type found

Code Container Clipboard Type Function

No code containers found

Remove Mark completed

Next Page Function

Going Next Page

--------------- 2 Complete-------------------

Checking Login Function

Checking for captcha Function...

Scrolling Page

Getting File Name

This is a module page

Get File name module

Checking page

Removing Nav tags from page

Node deleted div[class*='ed-grid'] > nav

Node deleted div[class*='ed-grid'] > nav

Show Hints Function

No hints found

Inside find_mark_down_quiz_containers function

No mark down quiz_container found

Show Codebox Answers Function

No Codebox answers found

Finding Slides Function

No Slides Found

Inside take_quiz_screenshot function

Quiz not found

Adding Name Tag in Next Back Button

Fixing SVG Tags inside Object Tags

Get HTML Page Content Using Single File Function

make_code_selectable function

make_code_selectable function executed

Creating HTML File

HTML File Created

HTML Page content taken.

Inside Widget Container Function

No widget container found

Code Container Download Type Function

No Code Container Downloadable Type found

Code Container Clipboard Type Function

No code containers found

Remove Mark completed

Next Page Function

Going Next Page

--------------- 3 Complete-------------------

Checking Login Function

Checking for captcha Function...

Scrolling Page

Getting File Name

This is a module page

Get File name module

Checking page

Removing Nav tags from page

Node deleted div[class*='ed-grid'] > nav

Node deleted div[class*='ed-grid'] > nav

Show Hints Function

No hints found

Inside find_mark_down_quiz_containers function

No right button found in Mark Down Quiz

Clicking on Mark Down Quiz function

Found Issue, Going Next Course Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}

(Session info: headless chrome=98.0.4758.102)

Stacktrace:

#0 0x559e6a0edb33

#1 0x559e69bb66d8

#2 0x559e69bec6f1

#3 0x559e69bec8b1

#4 0x559e69be1067

#5 0x559e69c0a08d

#6 0x559e69be0fa3

#7 0x559e69c0a16e

#8 0x559e69c1d2fb

#9 0x559e69c09f53

#10 0x559e69bdfa0a

#11 0x559e69be0ad5

#12 0x559e6a11f2fd

#13 0x559e6a1384bb

#14 0x559e6a1210d5

#15 0x559e6a139145

#16 0x559e6a114aaf

#17 0x559e6a155ba8

#18 0x559e6a155d28

#19 0x559e6a17048d

#20 0x7f3111490402

Script Execution Complete

                    Educative Scraper (version 7.2), developed by Anilabha Datta

######################################################################

Main Exception Message: unknown error: Chrome failed to start: exited abnormally.

Main Exception Message: unknown error: Chrome failed to start: exited abnormally.

(unknown error: DevToolsActivePort file doesn't exist)

(The process started from chrome location /home/osboxes/Desktop/educative.io_scraper-master/Chrome-bin/linux/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Stacktrace:

#0 0x5576c832eb33

#1 0x5576c7df76d8

#2 0x5576c7e1a84c

#3 0x5576c7e15fca

#4 0x5576c7e50e0a

#5 0x5576c7e4af53

#6 0x5576c7e20a0a

#7 0x5576c7e21ad5

#8 0x5576c83602fd

#9 0x5576c83794bb

#10 0x5576c83620d5

#11 0x5576c837a145

#12 0x5576c8355aaf

#13 0x5576c8396ba8

#14 0x5576c8396d28

#15 0x5576c83b148d

#16 0x7f3895f57b43

Scraping stalls at one page

Describe the bug
Hi Anil,

Scraping stalls at one point. I have tried multiple times and it gives an error each time.
I tried 3 to 4 times, reconfiguring Ubuntu and MX Linux, but it stops at the same point every time.

https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-python/7A3RLnnRvOy

image

failure

log

` 2023-11-26 10:47:32,159 - INFO - ApiUtility - Course URL Selector: //a[contains(@href, 'courses/distributed-systems-practitioners/')]
2023-11-26 10:47:32,185 - INFO - LoginAccount - Checking if logged in...
2023-11-26 10:47:32,194 - INFO - ApiUtility - Getting Course Collections JSON from URL: https://educative.io/api/collection/10370001/4891237377638400?work_type=collection
2023-11-26 10:47:32,194 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-26 10:47:32,946 - INFO - ExtensionScraper - API Urls: 166 == 166 :Topic Urls
2023-11-26 10:47:32,947 - INFO - ExtensionScraper - ----------------------------------------------------------------------------------
Scraping Topic: 165-some more things to discover: https://www.educative.io/courses/distributed-systems-practitioners/some-more-things-to-discover?showContent=true

2023-11-26 10:47:32,954 - INFO - LoginAccount - Checking if logged in...
2023-11-26 10:47:32,961 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4891237377638400/page/6134167328260096?work_type=collection
2023-11-26 10:47:32,962 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-26 10:47:33,008 - INFO - ApiUtility - Successfully fetched JSON API data
2023-11-26 10:47:33,008 - INFO - OSUtility - Sleeping for 10 seconds
2023-11-26 10:47:46,328 - INFO - SeleniumBasicUtility - Loading page and checking if something went wrong
2023-11-26 10:47:46,328 - INFO - OSUtility - Sleeping for 10 seconds
2023-11-26 10:47:56,397 - INFO - SeleniumBasicUtility - Waiting for webdriver to load topic page
2023-11-26 10:47:56,418 - INFO - SeleniumBasicUtility - Adding name attribute in next/back button
2023-11-26 10:47:56,422 - INFO - BrowserUtility - Scrolling Page
2023-11-26 10:47:56,808 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-26 10:47:58,811 - INFO - RemoveUtility - Removing blur with CSS
2023-11-26 10:47:58,837 - INFO - RemoveUtility - Removing mark-as-completed/completed tick mark
2023-11-26 10:47:58,900 - INFO - RemoveUtility - Removing unwanted elements
2023-11-26 10:47:58,905 - INFO - ShowUtility - Showing single markdown quiz solution
2023-11-26 10:47:58,909 - INFO - ShowUtility - No single markdown quiz solution found
2023-11-26 10:47:58,910 - INFO - ShowUtility - Showing code solutions
2023-11-26 10:47:58,914 - INFO - ShowUtility - No code solution found
2023-11-26 10:47:58,914 - INFO - ShowUtility - Showing hints
2023-11-26 10:47:58,917 - INFO - ShowUtility - No hints found
2023-11-26 10:47:58,918 - INFO - ShowUtility - Showing slides
2023-11-26 10:47:58,920 - INFO - ShowUtility - No slides found
2023-11-26 10:47:58,921 - INFO - SingleFileUtility - Fixing all object tags
2023-11-26 10:47:58,924 - INFO - SingleFileUtility - No object tag found
2023-11-26 10:47:58,924 - INFO - SingleFileUtility - Injecting important scripts
2023-11-26 10:47:58,931 - INFO - OSUtility - Sleeping for 5 seconds
2023-11-26 10:48:03,951 - INFO - OSUtility - Sleeping for 5 seconds
2023-11-26 10:48:08,954 - INFO - SingleFileUtility - Making code selectable
2023-11-26 10:48:08,964 - INFO - SingleFileUtility - No code found
2023-11-26 10:48:08,965 - INFO - SingleFileUtility - getSingleFileHtml: Getting SingleFile Html...
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Topic File Successfully Created
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Downloading Code and Quiz Files if found...
2023-11-26 10:48:10,015 - INFO - ExtensionScraper - Code and Quiz Files Downloaded if found.
2023-11-26 10:48:10,092 - INFO - ExtensionScraper - Started Scraping from Text File URL: ?showContent=true
2023-11-26 10:48:10,092 - INFO - BrowserUtility - Loading Browser...
2023-11-26 10:48:12,555 - INFO - BrowserUtility - Browser Initiated
2023-11-26 10:48:12,631 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 64: ApiUtility:getCourseUrl: 131: Message: invalid argument
(Session info: chrome=116.0.5845.96)
Stacktrace:
0 chromedriver 0x00000001007da65c chromedriver + 4318812
1 chromedriver 0x00000001007d2d00 chromedriver + 4287744
2 chromedriver 0x0000000100404644 chromedriver + 296516
3 chromedriver 0x00000001003ec430 chromedriver + 197680
4 chromedriver 0x00000001003e9fe0 chromedriver + 188384
5 chromedriver 0x00000001003eaafc chromedriver + 191228
6 chromedriver 0x00000001004067d4 chromedriver + 305108
7 chromedriver 0x000000010047b380 chromedriver + 783232
8 chromedriver 0x000000010047ad28 chromedriver + 781608
9 chromedriver 0x0000000100436178 chromedriver + 500088
10 chromedriver 0x0000000100436fc0 chromedriver + 503744
11 chromedriver 0x000000010079ac40 chromedriver + 4058176
12 chromedriver 0x000000010079f160 chromedriver + 4075872
13 chromedriver 0x0000000100762e68 chromedriver + 3829352
14 chromedriver 0x000000010079fc4c chromedriver + 4078668
15 chromedriver 0x0000000100777f08 chromedriver + 3915528
16 chromedriver 0x00000001007bc140 chromedriver + 4194624
17 chromedriver 0x00000001007bc2c4 chromedriver + 4195012
18 chromedriver 0x00000001007cc4d0 chromedriver + 4261072
19 libsystem_pthread.dylib 0x0000000187ec1034 _pthread_start + 136
20 libsystem_pthread.dylib 0x0000000187ebbe3c thread_start + 8`
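
The log above shows the scraper starting on the URL `?showContent=true` just before the `invalid argument` error, which suggests (an assumption, not something the log confirms) that the URL text file contained an empty line. A minimal sketch of filtering such lines before scraping; `load_course_urls` is a hypothetical helper and not part of the project:

```python
from urllib.parse import urlparse


def load_course_urls(lines):
    """Return only the entries that look like absolute http(s) URLs.

    Hypothetical helper: `lines` is any iterable of strings, for example
    the lines of the URL text file given to the scraper.
    """
    urls = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


sample = [
    "https://www.educative.io/courses/distributed-systems-practitioners\n",
    "\n",
    "?showContent=true\n",
]
print(load_course_urls(sample))
```

Only the first entry survives; the blank line and the bare query string are dropped before they can reach the browser.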

no chrome binary

Describe the bug
The program can't find the Chrome binary.

To Reproduce
Steps to reproduce the behavior:
When I try to log in at step 3

Screenshots
image

Desktop (please complete the following information):

  • OS: Ubuntu. I also tried on Mint, but the result is the same.

Thanks for your help

code clipboard files can be enhanced with naming convention

Is your feature request related to a problem? Please describe.
Currently, the downloaded code clipboard files have anonymous file names and aren't easily readable.
Current code clipboard path: ./code_clipboard0/code-0.txt

Describe the solution you'd like
Single code clipboard file for all the code clipboards per course topic. Sample result file attached.

Additional context
For example: to scrape the code "kubectl run db --image mongo" from the snippet below,
1
we can scrape the caption text from the second attached image, which shows the span tag, and create a single file.
2

Our final code_clipboard.txt file can be like this:

code_clipboard.txt
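
One way the requested single file could be assembled is sketched below. This is only an illustration of the idea: `merge_clipboards` is a hypothetical function, the `###` section marker is an arbitrary choice, and the caption "Run a MongoDB pod" is made up (only the `kubectl` command comes from the request above).

```python
def merge_clipboards(snippets):
    """Join (caption, code) pairs into one text block, one section per snippet."""
    sections = []
    for index, (caption, code) in enumerate(snippets):
        # Fall back to the current anonymous name when no caption was found.
        title = caption.strip() or f"code-{index}"
        sections.append(f"### {title}\n{code.rstrip()}\n")
    return "\n".join(sections)


# Hypothetical caption; the command is the one quoted in the request.
example = [("Run a MongoDB pod", "kubectl run db --image mongo")]
print(merge_clipboards(example))
```

The returned string could then be written once per topic to `code_clipboard.txt` instead of creating one `code-N.txt` file per snippet.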

no such element: Unable to locate element:

Thanks for taking the time to create this great tool. Mostly it works just fine, but sometimes I get the error below. Any idea how I can fix it? Thanks!

Take Screenshot Function
Found Issue, Going Next Course Message: no such element: Unable to locate element: {"method":"css selector","selector":"div[class*='ArticlePage']"}
  (Session info: headless chrome=106.0.5249.103)
Stacktrace:
Backtrace:
        Ordinal0 [0x01181ED3+2236115]
        Ordinal0 [0x011192F1+1807089]
        Ordinal0 [0x010266FD+812797]
        Ordinal0 [0x010555DF+1005023]
        Ordinal0 [0x010557CB+1005515]
        Ordinal0 [0x01087632+1209906]
        Ordinal0 [0x01071AD4+1120980]
        Ordinal0 [0x010859E2+1202658]
        Ordinal0 [0x010718A6+1120422]
        Ordinal0 [0x0104A73D+960317]
        Ordinal0 [0x0104B71F+964383]
        GetHandleVerifier [0x0142E7E2+2743074]
        GetHandleVerifier [0x014208D4+2685972]
        GetHandleVerifier [0x01212BAA+532202]
        GetHandleVerifier [0x01211990+527568]
        Ordinal0 [0x0112080C+1837068]
        Ordinal0 [0x01124CD8+1854680]
        Ordinal0 [0x01124DC5+1854917]
        Ordinal0 [0x0112ED64+1895780]
        BaseThreadInitThunk [0x76C9FA29+25]
        RtlGetAppContainerNamedObjectPath [0x77067B5E+286]
        RtlGetAppContainerNamedObjectPath [0x77067B2E+238]
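
A general Selenium pattern for elements that may be absent is to call `find_elements` (plural), which returns an empty list instead of raising `NoSuchElementException`. The sketch below assumes the page element is genuinely optional; whether skipping it is the right behavior for this scraper is not confirmed here. `find_optional` and `FakeDriver` are hypothetical names, and the fake driver exists only so the example runs without a browser.

```python
CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def find_optional(driver, selector):
    """Return the first element matching `selector`, or None if absent.

    `find_elements` returns an empty list when nothing matches, so no
    exception handling is needed for the missing case.
    """
    matches = driver.find_elements(CSS_SELECTOR, selector)
    return matches[0] if matches else None


class FakeDriver:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return self._elements.get(value, [])


driver = FakeDriver({"span": ["<span>"]})
print(find_optional(driver, "span"))
print(find_optional(driver, "div[class*='ArticlePage']"))
```

With a real WebDriver the same function can be called unchanged, and the caller decides what to do when it returns `None`.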

failed to change window state to 'normal', current state is 'maximized'

Hi, several times I encountered this error:

 2023-10-15 15:24:22,586 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 105: SeleniumBasicUtility:setWindowSize: 112: Message: unknown error: failed to change window state to 'normal', current state is 'maximized'
  (Session info: chrome=118.0.5993.70)

I'm not sure what the root cause is.

Script increases the VDI size in the background while scraping files

Describe the bug
Hi @anilabhadatta, I am running this script in VirtualBox, where the guest OS is Ubuntu 22.04 and the host is Windows 11. While the scraper downloads files in the background, the size of the Ubuntu VDI image grows at an alarming speed. It was initially 23 GB; after downloading around 500 MB of files, the VDI had grown to 35 GB midway, and at the end it was going beyond 40 GB.
So the footprint of the VDI image itself is increasing while nothing changes inside the guest OS. It looks like it is impacting the base.
I have attached screenshots.

vdi-snapshot-start-time

vdi-snapshot-middle

Still Going

image

Handle Quiz just like Slides

So far most things seem to work, except the Quiz page.
image
Can you please extract all the quiz questions, just like the slides are extracted?

Unreadable text in some pages

Hello,
there seems to be an issue with the text in the courses.
Some paragraphs have white text color.
Is there anything that can be done?

no such element: Unable to locate element: {"method":"css selector","selector":"span"}

Describe the bug
Not able to download https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-java/B84xWoY7YB2

Getting the error below:

4 https://www.educative.io/courses/grokking-dynamic-programming-a-deep-dive-using-java/B84xWoY7YB2 
Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}
  (Session info: chrome=96.0.4664.0)
Stacktrace:
Backtrace:
	Ordinal0 [0x00B66903+2517251]
	Ordinal0 [0x00AFF8E1+2095329]
	Ordinal0 [0x00A02848+1058888]
	Ordinal0 [0x00A2D448+1233992]
	Ordinal0 [0x00A2D63B+1234491]
	Ordinal0 [0x00A23AB1+1194673]
	Ordinal0 [0x00A4650A+1336586]
	Ordinal0 [0x00A23A36+1194550]
	Ordinal0 [0x00A465BA+1336762]
	Ordinal0 [0x00A55BBF+1399743]
	Ordinal0 [0x00A4639B+1336219]
	Ordinal0 [0x00A227A7+1189799]
	Ordinal0 [0x00A23609+1193481]
	GetHandleVerifier [0x00CF5904+1577972]
	GetHandleVerifier [0x00DA0B97+2279047]
	GetHandleVerifier [0x00BF6D09+534521]
	GetHandleVerifier [0x00BF5DB9+530601]
	Ordinal0 [0x00B04FF9+2117625]
	Ordinal0 [0x00B098A8+2136232]
	Ordinal0 [0x00B099E2+2136546]
	Ordinal0 [0x00B13541+2176321]
	BaseThreadInitThunk [0x76427D49+25]
	RtlInitializeExceptionChain [0x771FB74B+107]
	RtlClearBits [0x771FB6CF+191]

Using the most recent version of the scraper.

This version of ChromeDriver only supports Chrome version 98

I'm running into:

Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98
Current browser version is 110.0.5481.100

I'm using the binary and driver from the most recent release. I'm running M1 ARM64 chip from Apple. Any tips?
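
The error text itself states both versions: the driver targets Chrome 98 while the browser is 110, and ChromeDriver requires the major versions to match. A small sketch that pulls the two numbers out of such a message; `parse_versions` is a hypothetical helper written for illustration:

```python
import re


def parse_versions(message):
    """Extract (supported_major, browser_major) from the error text, or None."""
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is (\d+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


error = (
    "session not created: This version of ChromeDriver only supports "
    "Chrome version 98\nCurrent browser version is 110.0.5481.100"
)
supported_major, browser_major = parse_versions(error)
if supported_major != browser_major:
    print(f"ChromeDriver targets Chrome {supported_major}, "
          f"but the browser is Chrome {browser_major}")
```

When the numbers differ, either the driver or the browser binary has to be swapped for one with the same major version.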

Error: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal

Describe the bug
When I try to download a course, I get the following error:
ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 80: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal

To Reproduce
The problem happens with any course I try to download.

Expected behavior
I expect the course to be downloaded

Screenshots/ScreenRecord

EducativeScraper.log
Add the Log file and mention the timestamp and topicUrl.
2023-11-20 12:26:08,551 - INFO - StartScraper - StartScraper Initiated...
To Terminate, Click on Stop ScraperType Button

2023-11-20 12:26:08,671 - INFO - ExtensionScraper - ExtensionScraper initiated...
2023-11-20 12:26:08,672 - INFO - ExtensionScraper - Started Scraping from Text File URL: https://www.educative.io/courses/machine-learning-numpy-pandas-scikit-learn?showContent=true
2023-11-20 12:26:08,672 - INFO - BrowserUtility - Loading Browser...
2023-11-20 12:26:09,050 - INFO - BrowserUtility - Browser Initiated
2023-11-20 12:26:10,797 - INFO - ApiUtility - Course Type Selector: a[href*='/courses/']
2023-11-20 12:26:10,952 - INFO - ApiUtility - Getting Next Data
2023-11-20 12:26:10,967 - INFO - ApiUtility - Getting Course Topic URLs List from URL: https://www.educative.io/courses/machine-learning-numpy-pandas-scikit-learn/overview?showContent=true
2023-11-20 12:26:12,865 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:12,865 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:14,883 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:14,883 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:16,892 - INFO - SeleniumBasicUtility - Expanding all sections
2023-11-20 12:26:16,892 - INFO - OSUtility - Sleeping for 2 seconds
2023-11-20 12:26:18,903 - INFO - ApiUtility - Course URL Selector: //a[contains(@href, 'courses/')]
2023-11-20 12:26:18,911 - INFO - LoginAccount - Checking if logged in...
2023-11-20 12:26:18,917 - INFO - ApiUtility - Getting Course Collections JSON from URL: https://educative.io/api/collection/6083138522447872/5629499534213120?work_type=collection
2023-11-20 12:26:18,917 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-20 12:26:19,479 - INFO - ExtensionScraper - API Urls: 87 == 89 :Topic Urls
2023-11-20 12:26:19,609 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 52: ExtensionScraper:scrapeCourse: 80: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal

Desktop (please complete the following information):

  • OS: Linux
  • ChromeVersion Version 119.0.6045.159 (Official Build) (64-bit)
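
The log reports 87 API URLs against 89 topic URLs, but not which entries differ. A minimal sketch of how the two lists could be compared to find the extra entries; `diff_urls` and the sample paths are hypothetical and only illustrate the comparison, not the scraper's actual check:

```python
def diff_urls(api_urls, topic_urls):
    """Return (only_in_api, only_in_page), preserving input order."""
    api_set, topic_set = set(api_urls), set(topic_urls)
    only_in_api = [u for u in api_urls if u not in topic_set]
    only_in_page = [u for u in topic_urls if u not in api_set]
    return only_in_api, only_in_page


# Made-up sample data standing in for the two lists the scraper builds.
api = ["/courses/x/a", "/courses/x/b"]
page = ["/courses/x/a", "/courses/x/b", "/courses/x/c", "/courses/x/d"]
missing, extra = diff_urls(api, page)
print(len(api), "==", len(page), "->", missing, extra)
```

Logging the two difference lists alongside the counts would show whether the page lists topics the API omits, or the other way around.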

Unable to locate element

Followed the readme. Got the below result. I'm on Win64.

0 https://www.educative.io/courses/cpp-fundamentals-for-professionals/qAmyZQB2qm2 
Message: no such element: Unable to locate element: {"method":"css selector","selector":"h4"}
  (Session info: headless chrome=96.0.4664.110)
Stacktrace:
Backtrace:
	Ordinal0 [0x00C96903+2517251]
	Ordinal0 [0x00C2F8E1+2095329]
	Ordinal0 [0x00B32848+1058888]
	Ordinal0 [0x00B5D448+1233992]
	Ordinal0 [0x00B5D63B+1234491]
	Ordinal0 [0x00B53AB1+1194673]
	Ordinal0 [0x00B7650A+1336586]
	Ordinal0 [0x00B53A36+1194550]
	Ordinal0 [0x00B765BA+1336762]
	Ordinal0 [0x00B85BBF+1399743]
	Ordinal0 [0x00B7639B+1336219]
	Ordinal0 [0x00B527A7+1189799]
	Ordinal0 [0x00B53609+1193481]
	GetHandleVerifier [0x00E25904+1577972]
	GetHandleVerifier [0x00ED0B97+2279047]
	GetHandleVerifier [0x00D26D09+534521]
	GetHandleVerifier [0x00D25DB9+530601]
	Ordinal0 [0x00C34FF9+2117625]
	Ordinal0 [0x00C398A8+2136232]
	Ordinal0 [0x00C399E2+2136546]
	Ordinal0 [0x00C43541+2176321]
	BaseThreadInitThunk [0x7610FA29+25]
	RtlGetAppContainerNamedObjectPath [0x77127B5E+286]
	RtlGetAppContainerNamedObjectPath [0x77127B2E+238]

Error while resuming download from mid

I had to pause download while it was saving https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/evaluation-of-a-distributed-caches-design

I resumed with the course link updated to the URL above. It fails with the stack trace below.

 source env/bin/activate
❯ python3 EducativeScraper.py

                Educative Scraper (v3.1.0 Master Branch), developed by Anilabha Datta
                Project Link: https://github.com/anilabhadatta/educative.io_scraper/tree/v3-dev
                Check out ReadMe for more information about this project.
                Use the GUI to start scraping.
        
2023-10-05 08:57:12.534 Python[401:1285289] WARNING: Secure coding is not enabled for restorable state! Enable secure coding by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState: and returning YES.
 2023-10-05 08:57:12,928 - INFO - HomeScreen - Creating Home Screen...
 2023-10-05 08:57:12,942 - DEBUG - HomeScreen - fixGeometry called
 2023-10-05 08:57:12,984 - DEBUG - HomeScreen - fixGeometry completed
 2023-10-05 08:57:12,986 - DEBUG - HomeScreen - createHomeScreen completed
 2023-10-05 08:57:14,841 - INFO - HomeScreen -   Starting Chrome Driver...
                                Path:  /Users/amisra/dev/educative.io_scraper/src/ChromeDrivers/mac/chromedriver-mac-arm64/chromedriver
                          
 2023-10-05 08:57:14,860 - DEBUG - HomeScreen - startChromeDriver completed
 2023-10-05 08:57:27,192 - INFO - HomeScreen -   Starting Chrome Driver...
                                Path:  /Users/amisra/dev/educative.io_scraper/src/ChromeDrivers/mac/chromedriver-mac-arm64/chromedriver
                          
 2023-10-05 08:57:27,214 - DEBUG - HomeScreen - startChromeDriver completed
 2023-10-05 08:57:35,674 - DEBUG - HomeScreen - startScraper called
 2023-10-05 08:57:35,734 - DEBUG - HomeScreen - startScraper completed
 2023-10-05 08:57:36,074 - INFO - StartScraper - StartScraper Initiated...
                            To Terminate, Click on Stop Scraper Button
                        
 2023-10-05 08:57:36,095 - INFO - ExtensionScraper - ExtensionScraper initiated...
 2023-10-05 08:57:36,104 - INFO - ExtensionScraper - Started Scraping from Text File URL: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/evaluation-of-a-distributed-caches-design
 2023-10-05 08:57:36,104 - INFO - BrowserUtility - Loading Browser...
 2023-10-05 08:57:38,070 - INFO - BrowserUtility - Browser Initiated
 2023-10-05 08:57:40,029 - INFO - ApiUtility - Course Type Selector: a[href*='/courses/']
 2023-10-05 08:57:40,043 - INFO - ApiUtility - Getting Next Data
 2023-10-05 08:57:40,064 - INFO - ApiUtility - Getting Course Topic URLs List from URL: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/introduction-to-modern-system-design
 2023-10-05 08:57:43,384 - DEBUG - SeleniumBasicUtility - Expanding all sections function
 2023-10-05 08:57:53,849 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 47: ExtensionScraper:scrapeCourse: 61: ApiUtility:getCourseTopicUrlsList: 107: SeleniumBasicUtility:expandAllSections: 33: Message: 
Stacktrace:
0   chromedriver                        0x00000001013e6d98 chromedriver + 4337048
1   chromedriver                        0x00000001013dee14 chromedriver + 4304404
2   chromedriver                        0x000000010100ba5c chromedriver + 293468
3   chromedriver                        0x0000000101050d50 chromedriver + 576848
4   chromedriver                        0x000000010108b908 chromedriver + 817416
5   chromedriver                        0x0000000101044a5c chromedriver + 526940
6   chromedriver                        0x0000000101045908 chromedriver + 530696
7   chromedriver                        0x00000001013acde4 chromedriver + 4099556
8   chromedriver                        0x00000001013b12a0 chromedriver + 4117152
9   chromedriver                        0x00000001013b752c chromedriver + 4142380
10  chromedriver                        0x00000001013b1da0 chromedriver + 4119968
11  chromedriver                        0x0000000101389a74 chromedriver + 3955316
12  chromedriver                        0x00000001013cea48 chromedriver + 4237896
13  chromedriver                        0x00000001013cebc4 chromedriver + 4238276
14  chromedriver                        0x00000001013dea8c chromedriver + 4303500
15  libsystem_pthread.dylib             0x000000018a2bf034 _pthread_start + 136
16  libsystem_pthread.dylib             0x000000018a2b9e3c thread_start + 8

 2023-10-05 08:57:53,850 - DEBUG - StartScraper - Exiting Scraper..
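
For context, resuming from a topic URL amounts to skipping every topic that comes before it in the course's ordered topic list. The sketch below shows only that slicing step under stated assumptions: `topics_from` is a hypothetical function, and it says nothing about the `expandAllSections` failure in the trace above.

```python
def topics_from(topic_urls, resume_url):
    """Return the topics from `resume_url` onward, or all topics if it is not listed."""
    # Compare without query strings or trailing slashes.
    normalized = [u.split("?", 1)[0].rstrip("/") for u in topic_urls]
    target = resume_url.split("?", 1)[0].rstrip("/")
    if target in normalized:
        return topic_urls[normalized.index(target):]
    return list(topic_urls)


base = "https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/"
topics = [
    base + "introduction-to-modern-system-design",
    base + "evaluation-of-a-distributed-caches-design",
]
resume = base + "evaluation-of-a-distributed-caches-design?showContent=true"
print(topics_from(topics, resume))
```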

On print command in chrome/edge the page gets blurred

It happens with any course page. I am not sure if this is a bug in this scraper, a driver issue, or something else, but it is annoying and I cannot print the page.

Steps to reproduce:

  1. Download any course
  2. Open the HTML page in Chrome and try to print -> the preview page gets blurred -> even the printed PDF gets blurred

image

Any help on this would be appreciated.

Crash

On Windows 11, using pyenv and Python 3.11.2. I just cloned the latest version and downloaded the Chrome driver and binary.

I experience the following crash on every course I tried.

ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
GetHandleVerifier [0x00007FF6535A8EF2+54786]
(No symbol) [0x00007FF653515612]
(No symbol) [0x00007FF6533CA64B]
(No symbol) [0x00007FF65340B79C]
(No symbol) [0x00007FF65340B91C]
(No symbol) [0x00007FF653446D87]
(No symbol) [0x00007FF65342BEAF]
(No symbol) [0x00007FF653444D02]
(No symbol) [0x00007FF65342BC43]
(No symbol) [0x00007FF653400941]
(No symbol) [0x00007FF653401B84]
GetHandleVerifier [0x00007FF6538F7F52+3524194]
GetHandleVerifier [0x00007FF65394D800+3874576]
GetHandleVerifier [0x00007FF653945D7F+3843215]
GetHandleVerifier [0x00007FF653645086+694166]
(No symbol) [0x00007FF653520A88]
(No symbol) [0x00007FF65351CA94]
(No symbol) [0x00007FF65351CBC2]
(No symbol) [0x00007FF65350CC83]
BaseThreadInitThunk [0x00007FFDE605257D+29]
RtlUserThreadStart [0x00007FFDE82CAA78+40]

2023-10-18 08:48:11,034 - DEBUG - StartScraper - Exiting Scraper...

Main Exception local variable 'driver' referenced before assignment

Hey there,

I was unable to use the executable on either macOS or Windows, so I decided to check out the repo and see if I could make it work from there.

I followed everything in the instructions and was able to launch chromium and authenticate, but when it's time to start scraping, it fails with this error:

Main Exception local variable 'driver' referenced before assignment

I did the same on both Windows and macOS, with the same result: I can launch the driver and log in with Chromium, then this error happens.
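
Python raises this kind of error (`UnboundLocalError`) when a local variable is read before any value was assigned to it. A common cause is an assignment inside a `try` block that failed. Whether that is what happens inside the scraper is an assumption; the sketch below only reproduces the pattern with a hypothetical `open_browser` that always fails.

```python
def open_browser():
    """Stand-in for browser start-up that fails, e.g. a missing Chrome binary."""
    raise RuntimeError("Chrome failed to start")


def scrape_unsafe():
    try:
        driver = open_browser()  # raises before `driver` is bound
    except RuntimeError:
        pass
    return driver  # raises UnboundLocalError: `driver` was never assigned


def scrape_safe():
    driver = None  # bind the name before the risky call
    try:
        driver = open_browser()
    except RuntimeError as exc:
        return f"browser did not start: {exc}"
    return driver


try:
    scrape_unsafe()
except UnboundLocalError as exc:
    print("unsafe:", exc)
print("safe:", scrape_safe())
```

In the safe variant the original failure is reported instead of being hidden behind the secondary `UnboundLocalError`, which makes the real cause easier to find.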

I solved the problem of creating folder names with symbols like '?:\'

explorer_8Bt9NnpZoO
I solved the problem above. Folder names cannot contain symbols like '?:'. This renaming problem may only occur in the Chinese edition of Windows 10 or 11.

The code I changed is as follows. I added a regex to replace the problematic symbols with spaces. Maybe you have a better solution!

# changed by woden
import re

def replace_filename(match):
    # re.sub calls this once per matched character; each illegal symbol becomes a space
    numDict = {':':' ','?':' ','|':' ','>':' ','<':' ','/':' '}
    return numDict[match.group()]
# end

def scrape_page(driver, file_index):
    scroll_page(driver)
    wait_webdriver(driver)
    title = get_file_name(driver)
    check_page(title)
    file_name = str(file_index) + "-" + title

    # change by woden
    a = re.sub(r'[:?|></]', replace_filename, file_name)
    # a = ''.join(filter(lambda i: i in [' '] or i.isalnum(),file_name))
    # end

    driver.set_window_size(1920, get_current_height(driver))
    remove_nav_tags(driver)
    show_hints_answer(driver)
    mark_down_quiz(driver)
    show_code_box_answer(driver)
    open_slides(driver)
    create_folder(a)
    quiz_html = take_quiz_screenshot(driver)
    # take_screenshot(driver, file_name, quiz_html)
    add_name_tag_in_next_back_button(driver)
    fix_all_svg_tags_inside_object_tags(driver) 

    #woden change
    get_pagecontent_using_singleFile(driver, "main", quiz_html)
    #end

    code_widget_type(driver)
    code_container_download_type(driver)
    code_container_clipboard_type(driver)
    demark_as_completed(driver)

    if not next_page(driver):
        sleep(5)
        return False
    return True

def create_course_folder(driver, url):
    print("Create Course Folder Function")
    course_name = get_file_name(driver, True)

    x = re.sub(r'[:?|></]', replace_filename, course_name)

    create_folder(x)
    print("Inside Course Folder")

I have tested this code and found no problems.
However, I am a Python beginner, so the code above may still have bugs.
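
One possible refinement, offered as a sketch rather than as the project's own fix: the character class above does not cover the backslash, `*` or `"`, which Windows also rejects in file and folder names. The helper below, with the hypothetical name `sanitize_name`, replaces the full set in a single pass and also trims trailing spaces and dots. It does not handle reserved device names such as CON or NUL.

```python
import re

# Characters Windows rejects in file and folder names, plus ASCII control characters
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_name(name, replacement=" "):
    """Return a version of `name` that is safe to use as a folder name."""
    cleaned = _ILLEGAL.sub(replacement, name)
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse runs of whitespace
    cleaned = cleaned.strip(" .")  # Windows also rejects trailing spaces and dots
    return cleaned or "untitled"  # never return an empty name
```

It could be used in place of the two `re.sub` calls, for example `create_folder(sanitize_name(file_name))`.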

Not sure how to use it

I downloaded the packages from the releases page and ran both executables at the same time. I'm not sure what the correct config is, but I created a text file with the URL links in it and was able to log in through the tool. When I start scraping, though, it says: Main Exception local variable 'driver' referenced before assignment

Anything you want me to try apart from what you've mentioned in the readme? A little help would be appreciated.

Mac OS 12.+ Intel Based. Not working

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Smartphone (please complete the following information):

  • Device: [e.g. iPhone6]
  • OS: [e.g. iOS8.1]
  • Browser [e.g. stock browser, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

Please Find logs below --
Main Exception Message: unknown error: Chrome failed to start: was killed.
(unknown error: DevToolsActivePort file doesn't exist)
Main Exception Message: unknown error: Chrome failed to start: was killed.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /Users/penguin/Desktop/EducativeScraper/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
Main Exception Message: unknown error: Chrome failed to start: was killed.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /Users/penguin/Desktop/EducativeScraper/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
0 chromedriver 0x0000000100a98ee9 chromedriver + 5013225
1 chromedriver 0x0000000100a241d3 chromedriver + 4534739
2 chromedriver 0x00000001005faa68 chromedriver + 170600
3 chromedriver 0x000000010061cc9a chromedriver + 310426
4 chromedriver 0x0000000100618756 chromedriver + 292694
5 chromedriver 0x0000000100652550 chromedriver + 529744
6 chromedriver 0x000000010064c6d3 chromedriver + 505555
7 chromedriver 0x000000010062276e chromedriver + 333678
8 chromedriver 0x0000000100623745 chromedriver + 337733
9 chromedriver 0x0000000100a54efe chromedriver + 4734718
10 chromedriver 0x0000000100a6ea19 chromedriver + 4839961
11 chromedriver 0x0000000100a741c8 chromedriver + 4862408
12 chromedriver 0x0000000100a6f3aa chromedriver + 4842410
13 chromedriver 0x0000000100a49a01 chromedriver + 4688385
14 chromedriver 0x0000000100a8a538 chromedriver + 4953400
15 chromedriver 0x0000000100a8a6c1 chromedriver + 4953793
16 chromedriver 0x0000000100aa0225 chromedriver + 5042725
17 libsystem_pthread.dylib 0x00007ff80a8964e1 _pthread_start + 125
18 libsystem_pthread.dylib 0x00007ff80a891f6b thread_start + 15


Terrible documentation

I appreciate your efforts to create this, but I have no idea how to start it; the explanation is more confusing than helpful.

Found Issue, Going Next Course Message: javascript error: Maximum call stack size exceeded

I get "Found Issue, Going Next Course Message: javascript error: Maximum call stack size exceeded" when scraping the 22nd course in https://www.educative.io/courses/grokking-the-machine-learning-interview/JEOBmok417l.
The problem occurs after "Get HTML Page Content Using Single File Function".
Is this problem caused by my network or by something else?

The log message is as follows:
22 https://www.educative.io/courses/grokking-the-machine-learning-interview/R1VOnXE904L
Message: javascript error: Maximum call stack size exceeded
(Session info: headless chrome=96.0.4664.110)
Stacktrace:
Backtrace:
Ordinal0 [0x00D06903+2517251]
Ordinal0 [0x00C9F8E1+2095329]
Ordinal0 [0x00BA2848+1058888]
Ordinal0 [0x00BA4F44+1068868]
Ordinal0 [0x00BA4E0E+1068558]
Ordinal0 [0x00BA56BA+1070778]
Ordinal0 [0x00BF64F9+1402105]
Ordinal0 [0x00BE64D3+1336531]
Ordinal0 [0x00BF5BBF+1399743]
Ordinal0 [0x00BE639B+1336219]
Ordinal0 [0x00BC27A7+1189799]
Ordinal0 [0x00BC3609+1193481]
GetHandleVerifier [0x00E95904+1577972]
GetHandleVerifier [0x00F40B97+2279047]
GetHandleVerifier [0x00D96D09+534521]
GetHandleVerifier [0x00D95DB9+530601]
Ordinal0 [0x00CA4FF9+2117625]
Ordinal0 [0x00CA98A8+2136232]
Ordinal0 [0x00CA99E2+2136546]
Ordinal0 [0x00CB3541+2176321]
BaseThreadInitThunk [0x761E6739+25]
RtlGetFullPathName_UEx [0x77E48FEF+1215]
RtlGetFullPathName_UEx [0x77E48FBD+1165]

ERROR - StartScraper - start: 20: ExtensionScraper:start: 43: ExtensionScraper:scrapeCourse: 67: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal

Describe the bug
Course scrape fails with the following error:

ERROR - StartScraper - start: 20: ExtensionScraper:start: 43: ExtensionScraper:scrapeCourse: 67: CourseCollectionsJson and CourseTopicUrlsList Urls are not equal

To Reproduce
Steps to reproduce the behavior:

  1. Add the following url for system design course: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/introduction-to-modern-system-design
  2. Start scraper
  3. See error

Expected behavior
Scraper should scrape course

Desktop (please complete the following information):
Running windows 10

Chrome auto-update itself?

Describe the bug
After one scrape attempt, the next time I try to run the scraper it fails due to a version incompatibility with Chrome. I'm not sure whether this is an issue with the script itself or whether Chrome somehow auto-updates itself after its first use.

I'm getting this error
Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98 Current browser version is 103.0.5060.134 with binary path /Users/deniz946/Documents/Projects/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome

To Reproduce
Steps to reproduce the behavior:

  1. Run the script to scrape something, it works fine
  2. Try starting another scraping process later on
  3. See error
    Found Issue, Going Next Course Message: session not created: This version of ChromeDriver only supports Chrome version 98 Current browser version is 103.0.5060.134 with binary path /Users/deniz946/Documents/Projects/educative.io_scraper/Chrome-bin/mac/Chromium.app/Contents/MacOS/Google Chrome

Expected behavior
Start scraping the course listed in the urls file

Screenshots
Screenshot 2022-07-24 at 21 07 31

Desktop (please complete the following information):

  • OS: Mac OS 12.1 Monterey
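
The error text itself states the mismatch: the ChromeDriver in use supports Chrome 98, while the bundled browser reports 103.0.5060.134. ChromeDriver generally has to match the browser's major version. The sketch below only pulls the two major versions out of such a message so they can be compared or logged. The function name `parse_version_mismatch` is hypothetical and not part of the scraper.

```python
import re


def parse_version_mismatch(message):
    """Return (driver_major, browser_major) from a 'session not created' message, or None."""
    driver = re.search(r"only supports Chrome version (\d+)", message)
    browser = re.search(r"Current browser version is (\d+)", message)
    if not driver or not browser:
        return None
    return int(driver.group(1)), int(browser.group(1))
```

If the two numbers differ, the usual remedy is to make the driver and the browser binary agree, either by downloading a matching ChromeDriver or by restoring the browser build the driver expects.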

So difficult to use

Why not bundle all the required binaries together and provide a single command to run?
I have to figure out all the required binary versions and solve many other environment problems, which makes it painful to use.

utf8 Encoding/Decoding error

I am getting this error in my terminal (after successfully logging in and with a config file in place):
Exception, Driver exited 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Main Exception local variable 'driver' referenced before assignment
Press Enter to continue

Do you know why I am getting this?
I cloned the latest updated repo
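
A possible cause, stated as an assumption because the issue does not say which file was being read: a byte 0xff at position 0 is what a UTF-16 little-endian byte order mark (FF FE) looks like, and some Windows editors and shell redirections save text files that way. If the config or URL list was saved as UTF-16, decoding it as UTF-8 fails exactly like this. The sketch below decodes bytes while honoring a byte order mark. The helper name `decode_with_bom` is hypothetical.

```python
import codecs


def decode_with_bom(raw):
    """Decode bytes, honoring a UTF-8 or UTF-16 byte order mark if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")  # strips the UTF-8 BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # the utf-16 codec consumes the BOM
    return raw.decode("utf-8")
```

Re-saving the text file as plain UTF-8 in an editor would be the simpler fix if this turns out to be the cause.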

v3 release when?

THE CHROME BROWSER DOES NOT WORK

2023-11-03 19:07:40,690 - INFO - BrowserUtility - Loading Browser...
2023-11-03 19:07:48,919 - ERROR - LoginAccount - start: 25: BrowserUtility:loadBrowser: 48: HTTPConnectionPool(host='127.0.0.1', port=9515): Max retries exceeded with url: /session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001792B0B2310>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
2023-11-03 19:07:48,919 - DEBUG - LoginAccount - Exiting...
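
The log shows a refused connection to 127.0.0.1 on port 9515, which is ChromeDriver's default port. That means nothing was listening there when the script tried to create a session, for instance because the ChromeDriver process never started or had already exited. A small check like the one below can confirm whether the port is reachable before retrying. It is a sketch and the function name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False
```

Calling `is_port_open("127.0.0.1", 9515)` while the scraper is starting would show whether the driver process is actually up.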

No Python at 'C:\Users\anila\AppData\Local\Programs\Python\Python39\python.exe'

Hi, I've been trying to use your script a little on both Windows and MacOS but I've been unable to use it.
I've found several problems, using both the executables and the source code from this repo. I've been absolutely unable to use it.

On Windows I suspect you might have hardcoded Python in the executable. I get this when running it:

No Python at 'C:\Users\anila\AppData\Local\Programs\Python\Python39\python.exe'
