geminidsystems / googlenewsscraper

A Python package that scrapes Google News article data while remaining undetected by Google. The scraper can read results up to the last page without triggering a CAPTCHA (download stats: https://pepy.tech/project/GoogleNewsScraper)

Home Page: https://pypi.org/project/GoogleNewsScraper/

License: MIT License

Python 100.00%
googlenews webscraper webcrawler googleautomator googlescraper googlenewsscraper selenium python scraping crawler

googlenewsscraper's People

Contributors

abnoviello23, alexkhazzam, karlgunst

googlenewsscraper's Issues

Stability Enhancement - Replace selection by class name

@abnoviello23

For stability reasons, we want to stop using find_elements_by_class_name and div[contains(@class)] selectors. Google changes the class names regularly, and each change breaks our script.

We should be able to select what we need using one of the following:

  • select by id (preferred, as ids are unlikely to change)
  • select by tag name (for example, <img/> for the image_url and <a/> for the url can certainly be used)
  • select by tag position (for example, we know the text content we want is under a > div > div > [div, div, div], where the three divs contain the source, title, and description we need)
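The three strategies above can be sketched without a browser. The example below is only an illustration: it uses the standard library's ElementTree on a hypothetical, simplified markup sample (the real Google results page is not reproduced here), and the names SAMPLE_MARKUP and parse_results are made up for this sketch. With Selenium 4, similar expressions could be passed to find_elements(By.XPATH, ...) on the driver or on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the structure described in the
# issue: a > div > div > [div, div, div]. It is NOT the real Google page.
SAMPLE_MARKUP = """
<html><body>
  <div id="rso">
    <div>
      <a href="https://example.com/story">
        <div>
          <img src="https://example.com/thumb.png"/>
          <div>
            <div>Example Source</div>
            <div>Example Title</div>
            <div>Example description.</div>
          </div>
        </div>
      </a>
    </div>
  </div>
</body></html>
"""


def parse_results(markup):
    """Extract articles using id, tag name, and tag position (no class names)."""
    root = ET.fromstring(markup.strip())

    # 1. Select the results container by id.
    container = root.find(".//div[@id='rso']")
    if container is None:
        return []

    articles = []
    # 2. Select links and images by tag name.
    for link in container.iter("a"):
        image = link.find(".//img")
        # 3. Select the text fields by their position under the link.
        fields = [(div.text or "").strip() for div in link.findall("./div/div/div")]
        if len(fields) != 3:
            continue  # Structure differs from what we expect; skip this entry.
        source, title, description = fields
        articles.append({
            "url": link.get("href"),
            "image_url": image.get("src") if image is not None else None,
            "source": source,
            "title": title,
            "description": description,
        })
    return articles


if __name__ == "__main__":
    for article in parse_results(SAMPLE_MARKUP):
        print(article)
```

Skipping entries whose structure does not match, rather than raising, is a design choice for this sketch: a layout change then produces fewer results instead of a crash.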

Error running example.py

So I tried the example but I got the following error:

Message: javascript error: Cannot read properties of null (reading 'classList')
  (Session info: headless chrome=103.0.5060.114)

What can I do to fix this?

Bug : Empty Results

Debug why the results are empty
Report on the ticket what is going on
Try to fix it

Example

Hi there,
Thanks for this great script. I was able to get the output text in the terminal, but I am unable to figure out how to save the results in a spreadsheet format; my Python skills are obviously limited. Could you expand your example code to include storing the results in CSV? Thanks.
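One possible way to do this with Python's built-in csv module is sketched below. It assumes the scraper's results are available as a list of dictionaries; the field names in FIELDNAMES and the helper names are hypothetical and would need to match whatever keys the package actually returns.

```python
import csv
import io

# Hypothetical column names; adjust them to the keys in your results.
FIELDNAMES = ["title", "source", "description", "url"]


def results_to_csv_text(results, fieldnames=FIELDNAMES):
    """Render a list of dicts as CSV text, ignoring keys not in fieldnames."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        writer.writerow(row)  # Missing keys are written as empty cells.
    return buffer.getvalue()


def save_results(results, path, fieldnames=FIELDNAMES):
    """Write the CSV text to a file that spreadsheet programs can open."""
    # newline="" stops Python from translating the csv module's line endings.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(results_to_csv_text(results, fieldnames))


if __name__ == "__main__":
    example = [{
        "title": "Example Title",
        "source": "Example Source",
        "description": "Example description.",
        "url": "https://example.com/story",
    }]
    save_results(example, "results.csv")
```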

Feature - Add support for ChromeDriverManager

@abnoviello23

Our scraper currently allows the caller to use their own driver or the static driver.

We want to replace the static driver with ChromeDriverManager (see the working usage in example/app.py).

The user must still be allowed to pass in their own driver, as shown below:

GoogleNewsScraper(my_driver)

If no driver is passed in, we want to use ChromeDriverManager to install the latest version of the Chrome driver.
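A minimal sketch of that fallback, assuming Selenium 4 and the webdriver_manager package are installed. ScraperSketch and build_default_driver are hypothetical names used for illustration only; they are not the package's actual class or internals.

```python
def build_default_driver():
    """Create a Chrome driver whose binary is installed by ChromeDriverManager."""
    # Imported lazily so callers who pass their own driver need neither package.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service)


class ScraperSketch:
    """Hypothetical stand-in for GoogleNewsScraper's constructor logic."""

    def __init__(self, driver=None):
        # Use the caller's driver when given; otherwise fall back to the
        # ChromeDriverManager-installed driver.
        self.driver = driver if driver is not None else build_default_driver()
```

Under this sketch, ScraperSketch(my_driver) keeps the caller's driver untouched, while ScraperSketch() would download and start a managed Chrome driver.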

Python Selenium Code Throws Errors

I am working on the following.

This method of selecting HTML elements fails to work.

'//div[@id="rso"]/div/div/div/a/div/div[2]/div[4]/p/span'

Additionally, these HTML selectors need to be updated.

driver.execute_script("""
        const menu = document.getElementById('hdtbMenus');
        menu.classList.remove('p4DDCd');
        menu.classList.add('yyoM4d');
""")
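If document.getElementById('hdtbMenus') returns null, reading menu.classList throws a "Cannot read properties of null" error in the browser. One hedged way to make the call fail softly is to guard the lookup, as in the sketch below. The function name update_menu_classes is hypothetical, and the id and class names are copied from the snippet above; they may no longer exist on the live page.

```python
# JavaScript run in the page; returns False instead of throwing when the
# element with id 'hdtbMenus' is missing.
GUARDED_MENU_SCRIPT = """
const menu = document.getElementById('hdtbMenus');
if (!menu) { return false; }
menu.classList.remove('p4DDCd');
menu.classList.add('yyoM4d');
return true;
"""


def update_menu_classes(driver):
    """Run the guarded script; return True only if the menu was found."""
    # driver is expected to be a Selenium WebDriver (anything that provides
    # execute_script works for this sketch).
    return bool(driver.execute_script(GUARDED_MENU_SCRIPT))
```

The caller can then log or retry when the function returns False, rather than crashing on a JavaScript exception.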
