The task for this take-home test was to:
- Use a basic starter Node.js Axios web scraper to download the HTML from an imaginary cheese store products page: https://cdn.adimo.co/clients/Adimo/test/index.html
- Then process the HTML and save out a JSON file containing:
  - Each product as its own object, with:
    - title
    - image URL
    - price and any discount
  - The total number of items
  - The average price of all items
The task also included two challenges:
- Extract the same data as above from https://www.thewhiskyexchange.com/search?q=cider
- Consider how the query parameter could be made dynamic so a user can provide their own search term.
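One way the second challenge could be approached is to build the search URL from user input. A minimal sketch (the function name and base URL constant are illustrative, not the actual implementation):

```javascript
// Illustrative sketch: turning a user-supplied search term into a safe URL.
const BASE_URL = 'https://www.thewhiskyexchange.com/search';

function buildSearchUrl(term) {
  // encodeURIComponent keeps spaces and special characters URL-safe.
  return `${BASE_URL}?q=${encodeURIComponent(term)}`;
}
```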
- Products are represented by a Product class, with an instance method to calculate any discount.
- The Product class also includes static methods for returning an average current price, and saving a JSON file.
- The scrapers directory contains two files with scraper functions for axios and puppeteer.
- The app.js file uses Node's Readline module to prompt the user for input, then calls the relevant scraper function.
- Any errors produced are logged to the console.
- I considered using a regular expression to extract the currency value and convert it to a number, but ultimately used the currency.js library to streamline currency calculations.
- JSON files are exported to an output directory with an ISO datetime as the filename.
- The JSON includes an array of all products, the URL, date retrieved, total number of products, and the average price of all products.
- A 403 Forbidden response was returned when using Axios with the Whisky Exchange site; setting User-Agent headers, using a proxy, and connecting to the host IP directly all returned the same result.
- While Puppeteer does work for the Whisky Exchange site, it is much slower than Axios.
- Other HTTP libraries designed to work around Cloudflare protection could be explored, such as Cloudscraper. The products could then be extracted via a request for each page in turn, using the pagination URL parameter.
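Sketched out, that page-by-page approach might look like the following. The `fetchPage` wrapper and the `pg` parameter name are assumptions; `fetchPage` stands in for whichever HTTP client (e.g. Cloudscraper) ends up working:

```javascript
// Hypothetical sketch: paging through search results one request at a time.
function pageUrl(baseUrl, term, page) {
  return `${baseUrl}?q=${encodeURIComponent(term)}&pg=${page}`;
}

async function scrapeAllPages(fetchPage, baseUrl, term, maxPages = 10) {
  const products = [];
  for (let page = 1; page <= maxPages; page++) {
    // fetchPage wraps the HTTP client and returns the products
    // parsed from a single results page.
    const batch = await fetchPage(pageUrl(baseUrl, term, page));
    if (batch.length === 0) break; // no more results
    products.push(...batch);
  }
  return products;
}
```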
- In production I would implement this as an HTTP REST API using Express.js, with input received via URL parameters.
- URL parameters for the DOM selectors and variable names could also be accepted, making the app more dynamic and usable across different websites.
- A Dockerfile could be used to deploy this to a production server as a Docker container image (an example is included), so that the app and its dependencies are easier to maintain and faster to deploy.
- Mocha - testing framework
- Chai - assertion library for tests
- Axios - HTTP client
- Cheerio - HTML parsing library for Axios responses
- Puppeteer - headless browser control
- File System (fs) - Node's built-in file system module
- Currency.js - library for working with currency values
- Install npm dependencies:
npm install
- An 'output' directory will be created automatically the first time the app is executed.
- To run the tests:
npm run test
- To run the program:
node app.js
or
npm start