m-murasovs / amazon-bestsellers-scraper

Amazon Bestsellers page scraper built using the Apify platform

Home Page: https://apify.com/mihails/amazon-bestsellers-scraper

License: Apache License 2.0

Languages: JavaScript 90.72%, Dockerfile 9.28%

Topics: actor, apify-platform, amazon-bestsellers-scraper

amazon-bestsellers-scraper's Introduction

What does Amazon Best Sellers Scraper do?

Our free Amazon Best Sellers Scraper allows you to extract the names, prices, URLs, and thumbnail images of the 100 top-selling items on Amazon.

The actor can currently extract from the .com, .co.uk, .de, .fr, .es, and .it domains. If you would like to add support for another domain, please get in touch, or edit the source code yourself.

If you would prefer a more general Amazon product or data scraper, you should try Amazon Scraper.

Why you should scrape Amazon Best Sellers

If you're web scraping Amazon for retail or market research, the Amazon Best Sellers list features the top-selling items across Amazon, which can tell you a lot about current trends in e-commerce. Competing directly against these products can be difficult, but the Best Sellers list is a source of inspiration for new products and helps e-commerce retailers stay ahead of the competition. Getting your item onto the Best Sellers list and keeping it there is one of the surest ways to guarantee sales for your business, and retailers increasingly turn to web scraping to track up-and-coming products and adjust their own offerings to compete.

How much will it cost me to scrape Amazon Best Sellers?

For every 100 pages scraped, the actor will consume 0.6 compute units, which means you can scrape around 160 pages for 1 compute unit. At $0.25 per compute unit, that will cost you just 25 cents.
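As a back-of-the-envelope check, the arithmetic above can be sketched as follows (assuming the stated rate of 0.6 compute units per 100 pages and a price of $0.25 per compute unit):

```python
# Cost estimate based on the rates stated above:
# 0.6 compute units (CU) per 100 pages, $0.25 per CU.
CU_PER_100_PAGES = 0.6
USD_PER_CU = 0.25

def estimate_cost(pages: int) -> tuple[float, float]:
    """Return (compute units consumed, cost in USD) for a given page count."""
    compute_units = pages / 100 * CU_PER_100_PAGES
    return compute_units, compute_units * USD_PER_CU

cu, usd = estimate_cost(100)
print(f"{cu:.2f} CU, ${usd:.3f}")  # 100 pages -> 0.60 CU, $0.150
```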

Input settings

  • Domain you want to extract
  • Depth of extraction - how many subcategories you want to scrape
  • Proxy
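Put together, a run input covering these settings might look like the following. The field names here are a sketch based on the options above, not taken from the actor's actual input schema, so check the input tab of the actor before copying them:

```json
{
    "domain": "com",
    "depthOfCrawl": 2,
    "proxy": { "useApifyProxy": true }
}
```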

Tips

  • By default, this Amazon scraper extracts the top 37 Best Seller subcategories. A deeper level of extraction can be enabled to scrape the top-selling items from the first level of each main category's subdivisions.

  • The default depth of the crawl is limited to two subcategories. To work around this restriction, start from the main category and scrape two levels deep, then remove duplicate category URLs from the results and feed them back into the scraper for another run.

  • Make sure that memory is set to at least 1024 MB so the scraper will have enough power to complete the task in a timely manner. If your machine allows, feel free to increase the memory allocation for more speed.
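The deduplication step in the second tip above can be sketched as follows, assuming you have exported the dataset as a JSON array of records and that each record carries a categoryUrl field, as in the sample result shown in the Results section:

```python
import json

def unique_category_urls(path: str) -> list[str]:
    """Load an exported dataset (a JSON array of records) and return
    the category URLs with duplicates removed, preserving order."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    seen = set()
    urls = []
    for record in records:
        url = record.get("categoryUrl")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The deduplicated list can then be fed back into the scraper as the start URLs for a second run.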

Proxy configuration

The proxy configuration (proxyConfiguration) option enables you to set the proxies the scraper will use to avoid detection by target websites. You can use both Apify Proxy and custom HTTP or SOCKS5 proxy servers.

The proxy configuration setting offers the following options:

  • None: The scraper will not use any proxies. All web pages will be loaded directly from the IP addresses of Apify servers running on Amazon Web Services.

  • Apify Proxy (automatic): The scraper will load all web pages using Apify Proxy in automatic mode. In this mode, the proxy uses all proxy groups available to the user and, for each new web page, automatically selects the proxy that hasn't been used in the longest time for the specific hostname, to reduce the chance of detection by the website. You can view the list of available proxy groups on the Proxy page in the app.

  • Apify Proxy (selected groups): The scraper will load all web pages using Apify Proxy with specific groups of target proxy servers.

  • Custom proxies: The scraper will use a custom list of proxy servers. The proxies must be specified in the scheme://user:password@host:port format; multiple proxies should be separated by spaces or new lines. The URL scheme can be either HTTP or SOCKS5. The user and password may be omitted, but the port must always be present.
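For illustration, the proxyConfiguration input object typically takes one of the following shapes on the Apify platform, corresponding to automatic mode, selected groups, and custom proxies respectively (the group name and proxy URL are placeholders):

```json
{ "useApifyProxy": true }

{ "useApifyProxy": true, "apifyProxyGroups": ["GROUP_NAME"] }

{ "proxyUrls": ["http://user:password@proxy.example.com:8000"] }
```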

Results

The actor stores its result in the default dataset associated with the actor run. You can export it from there to various formats, such as JSON, XML, CSV, or Excel.

The API will return results like this (in JSON format):

{
    "category": "Amazon.co.uk Best Sellers: The most popular items in Books",
    "categoryUrl": "https://www.amazon.co.uk/Best-Sellers-Books/zgbs/books/ref=zg_bs_nav_0/261-6986927-7102013",
    "items": {
        "0": {
            "name":  "The Mirror and the Light (The Wolf Hall Trilogy)",
            "price":  "£15.49",
            "url":  "https://www.amazon.co.uk/Mirror-Light-Wolf-Hall-Trilogy/dp/0007480997/ref=zg_bs_books_1?_encoding=UTF8&psc=1&refRID=3PNZSWBH3A0H1QCWYPP6",
            "thumbnail":  "https://images-eu.ssl-images-amazon.com/images/I/91-UvTTh4lL._AC_UL200_SR200,200_.jpg"
        }
    }
}
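Note that the items field is an object keyed by stringified indices rather than an array, so iterating over it in consumer code takes a small adjustment. A minimal sketch in Python, using a trimmed copy of the sample record above:

```python
import json

# Trimmed copy of the sample result record shown above.
record = json.loads("""
{
    "category": "Amazon.co.uk Best Sellers: The most popular items in Books",
    "items": {
        "0": {"name": "The Mirror and the Light (The Wolf Hall Trilogy)",
              "price": "£15.49"}
    }
}
""")

# Sort by numeric index so items come out in rank order.
for index, item in sorted(record["items"].items(), key=lambda kv: int(kv[0])):
    print(f"#{int(index) + 1}: {item['name']} - {item['price']}")
```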

The results can be downloaded using the Get dataset items API endpoint.
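Dataset items are fetched from the Apify API's dataset items endpoint. A small sketch that builds the download URL, where the dataset ID is a placeholder you would replace with the ID from your actor run:

```python
from urllib.parse import urlencode

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the download URL for the 'Get dataset items' API endpoint."""
    query = urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"

print(dataset_items_url("YOUR_DATASET_ID", fmt="csv"))
# https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv
```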

During the run

  • During the run, the actor will output messages notifying you of which page is being extracted. When the items are extracted, the actor will notify you that they are being saved.
  • Due to concurrent extraction of pages, these notifications may not be displayed in order.
  • In the event of an error, the actor will complete its run immediately, without adding any data to the dataset.

amazon-bestsellers-scraper's People

Contributors

andreybykov, davidjohnbarton, levent91, lhotanok, m-murasovs, metalwarrior665, zpelechova


amazon-bestsellers-scraper's Issues

Add more depth to categories

At the moment, the actor can only get results from the main category plus one level of subcategories. It would be cool if it went deeper.

Page not found error

Hi,

Love using this best sellers scraper! Recently, though, it doesn't seem to run anymore. It keeps saying that the page can't be found when I try to run it. Please see the attached log.

error log.txt

Any idea why this keeps happening?

Require more categories

Hello,
I want to know how I can request more depth in categories than the single level currently offered. I only need all of the category and subcategory names.
