Coder Social home page Coder Social logo

autoscraper's Introduction

Self-learning Web Scraping with AutoScraper in Python

AutoScraper is a Smart, Automatic, Fast and Lightweight Web Scraper for Python.

Developed by Alireza Mika, it can be downloaded at https://github.com/alirezamika/autoscraper

Concept and problem solved

Despite the availability of tools such as Beautiful Soup Web Scraping is difficult.

A library such Beautiful Soup helps you to:

  • query a Web page
  • parse the result of the query into a structured data structure: a tree
  • query the resulting tree with an idiomatic way

But a Web Scraper doesn't write the query for you.

The purpose of a web page is to be consumed by humans not machines:

  • the format of the page can change over time, and so the query could stop to work
  • writing a "web scraping query" is time-consuming

What if a library could learn from an example and then can write the scrap query for you : it's "the reason d'être of AutoScraper".

How to use AutoScraper

Problem

I want to create a Web scraper for the Web site Quora to all the questions about a subject.

Phase 1 : the learning phase

  • We train our library on a specific page: "https://www.amazon.in/s?i=aps&k=iphones"
  • We give to the library an example about what we seek on the page for example "When will deep learning finally die out?"

Install AutoScraper from Github

pip install git+https://github.com/alirezamika/autoscraper.git

Create the file amazon_scrapper.ipynb

from autoscraper import AutoScraper

# Parameters
url = "https://www.amazon.in/s?i=aps&k=iphones"
model_name = "amazon-search"

wanted_list = ["₹58,400","New Apple iPhone 11 (128GB) - Black"]

# We instanciate the AutoScraper
scraper = AutoScraper()

# We train the Scraper
# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)

# We display the results if any
if(result):
  print("🚀 Great a query has been inferred !! Great gob.")
  print(result)

# If no result we leave with an error code
if(result == None):
  print("Sorry no query can be inferred ... 😿")
  exit(-1)

# We save the model for future use
print(f"💿 > Save the model {model_name}")
scraper.save(model_name)

We execute the file amazon_scrapper.ipynb

python3 amazon_scrapper.ipynb
🚀 Great a query has been inferred !! Great gob.
['₹58,400', '₹59,900', '₹1,25,900', '₹1,29,900', '₹82,400', '₹84,900', '₹66,900', '₹69,900', '₹1,35,900', '₹1,39,900', '₹93,900', '₹1,23,900', '₹1,15,900', '₹1,19,900', '₹77,900', '₹79,900', '₹82,900', '₹1,31,900', 'New Apple iPhone 11 (128GB) - Black', 'New Apple iPhone 12 Pro Max (128GB) - Pacific Blue', 'New Apple iPhone 12 (128GB) - Blue', 'New Apple iPhone 11 (128GB) - White', 'New Apple iPhone 11 (128GB) - Green', 'New Apple iPhone 11 (128GB) - (Product) RED', 'New Apple iPhone 12 Mini (64GB) - Blue', 'New Apple iPhone 12 Pro Max (256GB) - Graphite', 'Apple iPhone 11 Pro Max (256GB) - Midnight Green', 'New Apple iPhone 12 Pro (128GB) - Pacific Blue', 'New Apple iPhone 12 (64GB) - Blue', 'New Apple iPhone 12 Mini (64GB) - White', 'New Apple iPhone 12 (128GB) - Black', 'New Apple iPhone 12 (64GB) - Black', 'Apple iPhone 11 Pro Max (256GB) - Gold']
💿 > Save the model amazon-search

Phase 2 : the usage phase

A model has been saved in the preceding step that contains all the rules of scraping.

Now, we can apply our model on a page that shares the same structure with the page we have used during the training phase.

We create a new file called amazon_scrapper.ipynb

from autoscraper import AutoScraper

# AutoScraper must be installed with
#  pip install git+https://github.com/alirezamika/autoscraper.git

question = "iphone"
time = "year"
url = f"https://www.amazon.in/s?i=aps&k=mi"
model_name = "amazon-search"

scraper = AutoScraper()
scraper.load(f"./{model_name}")
# Get all the results in the page similar to our model
results = scraper.get_result_similar(url)

# if no results
if results:
  for r in results:
    print(r)
else:
  print("No result found")

We test if the scraper works

python3 amazon_scrapper.ipynb

Result of the execution of amazon_scrapper.ipynb

['Mi 10i 5G (Atlantic Blue, 6GB RAM, 128GB Storage) - 108MP Quad Camera | Snapdragon 750G Processor | Upto 6 Months No Cost EMI',
 'Redmi Note 9 Pro (Champagne Gold, 4GB RAM, 64GB Storage) - Latest 8nm Snapdragon 720G & Alexa Hands-Free',
 'Redmi Note 9 Pro Max (Champagne Gold, 6GB RAM, 64GB Storage) - 64MP Quad Camera & Latest 8nm Snapdragon 720G & Alexa Hands...',
 'Mi Redmi 6A (Black, 2GB RAM, 16GB Storage)',
 'Redmi 9 (Sky Blue, 4GB RAM, 64GB Storage)',
 'Mi 11X Pro (Celestial Silver, 8GB RAM, 256GB Storage) | Upto INR 4000 Off on HDFC Bank Cards and EMI | Upto 12 Months No C...',
 'Redmi Note 10 (Shadow Black, 6GB RAM, 128GB Storage)',
 'Redmi 9A (Nature Green, 2GB Ram, 32GB Storage) | 2GHz Octa-core Helio G25 Processor',
 'Samsung Galaxy M31 (Ocean Blue, 6GB RAM, 128GB Storage)',
 'Samsung Galaxy M31s (Mirage Blue, 6GB RAM, 128GB Storage)']

Conclusion

The scraper must be trained again if the structure of the page changes.

The real advantage of the is approach is to be very reactive when a new format is available and to propose a new model quickly to continue the data extraction.

The library is very new. It's not perfect, but a big thanks to Alireza Mika for this great approach.

Writen by Raphaël MANSUY CTO at https://www.elitizon.com

{"mode":"full","isActive":false}

autoscraper's People

Contributors

zabhitak avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.