amazon-downloader's Introduction

Amazon reviews downloader and parser

Amazon no longer offers a simple public API to access its reviews. Anyone who wants to quickly build a dataset of Amazon reviews has to download them by crawling the HTML pages.

I wrote a Perl script that, given a top-level domain (e.g., "com", "it") and a list of Amazon product IDs, automatically downloads, from the Amazon server dedicated to that domain, all and only the HTML pages that contain the reviews of those products.

Then I wrote another Perl script that, given a list of the downloaded HTML files, extracts all the reviews contained in them and outputs, for each review, a record with the following information:

  • A counter of the extracted reviews so far (can be used as a unique ID for the dataset).
  • Date of the review in YYYYMMDD format (note: this field does not work on non-English domains unless you edit the script to set the month names in the desired language).
  • ID of the reviewed product.
  • Star rating assigned by the reviewer.
  • Count of "yes" helpfulness votes.
  • Count of total helpfulness votes ("yes"+"no").
  • Date of the review in human readable format (will be in the language used by the specified domain).
  • ID of the author of the review.
  • Title of the review.
  • Content of the review.
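
The YYYYMMDD field depends on a table of month names, which is why it only works for English unless the script is edited. The following is a minimal Python sketch of that kind of conversion, not the code of the Perl scripts: the names MONTHS_EN and to_yyyymmdd and the regular expression are assumptions made for illustration, and only the "Month D, YYYY" form seen on the ".com" domain is handled.

```python
import re

# Month names for the English ".com" domain. Passing a table with another
# language's month names is how a different domain could be supported
# (hypothetical table, for illustration only).
MONTHS_EN = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

def to_yyyymmdd(human_date, months=MONTHS_EN):
    """Convert a date such as 'January 18, 2012' to '20120118'.

    Returns None when the text does not match the expected form
    or the month name is not in the table.
    """
    match = re.fullmatch(r"\s*(\S+)\s+(\d{1,2}),\s*(\d{4})\s*", human_date)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = months.get(month_name)
    if month is None:
        return None
    return f"{year}{month:02d}{int(day):02d}"
```

Here to_yyyymmdd("January 18, 2012") returns "20120118", while a date written with month names missing from the table returns None.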

Example

Given the products with IDs B0040JHVCC and B00004ZDB1, for example, their reviews on the ".com" domain are downloaded with the command:

./downloadAmazonReviews.pl com B0040JHVCC B00004ZDB1

Reviews are automatically downloaded into the ./amazonreviews/com/B0040JHVCC and ./amazonreviews/com/B00004ZDB1 directories. The script automatically adjusts the delay between download requests in order to be polite to Amazon, and it retries failed downloads (503 errors) until every page has been downloaded.
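
The general idea of retrying on 503 errors with an adaptive delay can be sketched as follows. This is a Python illustration only, not the logic of downloadAmazonReviews.pl: the function name fetch_with_retry, the fetch callable, the starting delay, and the doubling policy are all assumptions.

```python
import time

def fetch_with_retry(fetch, url, delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch(url) until it stops returning HTTP 503.

    `fetch` is any callable returning a (status_code, body) pair.
    The delay doubles after each 503 (capped at max_delay), so the
    client slows down while the server is refusing requests.
    Returns (body, delay) so the caller can reuse the adapted delay
    for the next page.
    """
    while True:
        status, body = fetch(url)
        if status != 503:
            return body, delay
        sleep(delay)                       # wait before retrying
        delay = min(delay * 2, max_delay)  # back off further next time
```

Passing the fetch function and the sleep function as parameters keeps the sketch independent of any particular HTTP library and makes it easy to try without touching the network.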

Then the reviews are extracted from the HTML files by issuing the command:

./extractAmazonReviews-DivLayout.pl ./amazonreviews/com/B0040JHVCC/ ./amazonreviews/com/B00004ZDB1/

The script writes one review per line to standard output, in CSV format:

"0","20120118","B0040JHVCC","1.0","1","2","January 18, 2012","A1E5SQ7VA3I8OI","Not worth the price","I purchased the... (removed for brevity)  ...and definitely no."
"1","20120116","B0040JHVCC","5.0","0","0","January 16, 2012","A34FBZLFAU88UI","Compact version of the 7D, and neck and neck with D7000","This camera is... (removed for brevity) ...focus hunting issues."
"2",...

Redirection can be used to save the output to a file:

./extractAmazonReviews-DivLayout.pl ./amazonreviews/com/B0040JHVCC/ ./amazonreviews/com/B00004ZDB1/ > reviews.csv 

This saves the CSV output to the "reviews.csv" file.
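
Because every field is quoted, the saved file can be read with a standard CSV parser. Below is a minimal Python sketch, assuming the field order given in the record description and assuming that quotes inside a review are escaped in a way Python's csv module understands. The Review type, its field names, and parse_reviews are hypothetical names chosen here; they are not part of the scripts.

```python
import csv
from typing import NamedTuple

class Review(NamedTuple):
    # Field names are illustrative; the order follows the record description.
    counter: int
    date: str           # YYYYMMDD
    product_id: str
    stars: float
    helpful_yes: int
    helpful_total: int
    date_text: str      # human-readable date
    author_id: str
    title: str
    content: str

def parse_reviews(lines):
    """Yield a Review for each CSV line that has exactly 10 fields."""
    for row in csv.reader(lines):
        if len(row) != 10:
            continue  # skip blank or malformed lines
        yield Review(int(row[0]), row[1], row[2], float(row[3]),
                     int(row[4]), int(row[5]), row[6], row[7], row[8], row[9])
```

For example, parse_reviews(open("reviews.csv", newline="", encoding="utf-8")) iterates over the saved reviews; the UTF-8 encoding is also an assumption and may need adjusting.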

Note that there are two versions of the extract script: "extractAmazonReviews-DivLayout.pl" and "extractAmazonReviews-TableLayout.pl". The DivLayout version works on the latest Amazon layout (as of April 2015), which replaced tables with divs. As of April 2015 this layout is used on the ".com" domain, while the other domains still use the table-based layout, for which the TableLayout version is designed.

Disclaimer

I provide you with the tool to download the reviews, not the right to download them. You have to respect Amazon's rights to its own data. Do not release the data you download without Amazon's consent.

History

April, 2015: added different versions of extract script to work with both table- and div-based layouts.

January, 2015: moved to GitHub

November, 2013: (version 2.0) added the extraction of helpfulness scores. Major version update due to the change in output format.

November, 2013: (version 1.1) the extractAmazonReviews.pl script can now process all the files in a directory when given just the directory name. This is motivated by the fact that the * character on the Windows command line is not expanded the way it is in Unix shells. The previous syntax, with *, continues to work.

June, 2013: (version 1.0) moved to fossil scm.

November, 2012: added the ability to download reviews from many Amazon domains, not just the com one (when last tested, it worked for European languages and Chinese, but not for Japanese).
