
lambda-scraper's Introduction

What?

Lambda Scraper is an AWS Lambda function that scrapes any number of web pages you define, searches each page for new links, and (optionally) filters the results by keyword. If it finds results, it emails them to you via AWS SES.

Use it to add notification functionality to sites that don't natively have notifications. Use it to be notified of new product listings, cheap flights, job listings, or whatever you dream up.
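
To make the "new links, filtered by keyword" idea concrete, here is a minimal sketch of that logic. The helper names and the result shape are hypothetical, and it assumes previously seen URLs are available as a plain array; how the project actually stores and compares them isn't covered here.

```javascript
// Hypothetical helpers; names and result shape are assumptions, not the project's code.

// Keep only results whose URL has not been seen on a previous run.
function findNewResults(results, seenUrls) {
  const seen = new Set(seenUrls);
  return results.filter((result) => !seen.has(result.url));
}

// Keep only results whose title mentions at least one keyword (case-insensitive).
// An empty or missing keyword list keeps everything, since keywords are optional.
function filterByKeywords(results, keywords) {
  if (!keywords || keywords.length === 0) return results;
  const lowered = keywords.map((k) => k.toLowerCase());
  return results.filter((result) => {
    const title = String(result.title || "").toLowerCase();
    return lowered.some((k) => title.includes(k));
  });
}

const scraped = [
  { title: "Marketing Assistant", url: "https://example.com/jobs/1" },
  { title: "Warehouse Lead", url: "https://example.com/jobs/2" },
  { title: "Sales Associate", url: "https://example.com/jobs/3" },
];
const previouslySeen = ["https://example.com/jobs/1"];

const fresh = findNewResults(scraped, previouslySeen);
const matches = filterByKeywords(fresh, ["sales", "marketing"]);
console.log(matches.map((m) => m.title)); // [ 'Sales Associate' ]
```

The marketing posting is dropped because it was already seen, and the warehouse posting is dropped because it matches no keyword.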

The initial use case was as a Careers page scraper for my 👫, because it's hard out there for fashion students, and a lot of Careers pages offer no way of being notified of new postings. I'm not sure what else you might use this for, but if you come up with something good, let me know.

Install

Experience working with AWS will be super handy if you want to set this up for yourself. If my instructions are unclear and you'd like to get this set up, ping me and I might put together a video walkthrough if you pressure me enough.

  1. Download this repo and create a config.js file (see config.example.js for the initial format of this file). Go ahead and set email_to (your email address), email_subject, and your AWS API key and secret. You'll set the other values in the next steps.
  2. Create an S3 bucket, then enter the bucket name in config.js from step 1.
  3. Create and verify an SES sending email or domain, then enter the sending email address (email_from) in config.js. Set the aws_region in your config.js to the region you used for your SES email.
  4. Define the pages and their selectors in pages.js (see below).
  5. Locally, run npm install
  6. Zip your project folder and upload to Lambda (See below)
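
As a rough illustration of steps 1 to 3, the sketch below shows what a filled-in config might look like, plus a small check for missing values. The keys email_to, email_subject, email_from and aws_region are named in the steps above; the bucket and credential key names are assumptions, and the checker function is hypothetical, so treat config.example.js as the source of truth.

```javascript
// Hypothetical config shape. email_to, email_subject, email_from and aws_region
// are named in the steps above; the other key names are assumptions, so
// follow config.example.js for the real ones.
const config = {
  email_to: "you@example.com",
  email_subject: "New listings found",
  email_from: "scraper@example.com", // must be a verified SES sender
  aws_region: "us-east-1",           // region used for the SES email
  aws_bucket: "my-scraper-bucket",   // assumed key name
  aws_key: "YOUR_ACCESS_KEY_ID",     // assumed key name
  aws_secret: "YOUR_SECRET_KEY",     // assumed key name
};

// Return the names of any required keys that are missing or empty.
function missingKeys(cfg, required) {
  return required.filter((key) => !cfg[key]);
}

const required = ["email_to", "email_subject", "email_from", "aws_region"];
console.log(missingKeys(config, required)); // []
console.log(missingKeys({ email_to: "you@example.com" }, required));
// [ 'email_subject', 'email_from', 'aws_region' ]
```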

HTML pages

As an example, if the HTML you're trying to scrape from a page (e.g. https://example.com/listing) looks like this:

<ul>
  <li class="posting">
    <h1>Hello world</h1>
    <a class="more" href="http://example.com">Read more</a>
    <span class="location">NYC</span>
  </li>

  <li class="posting">
    <h1>Hello world</h1>
    <a class="more" href="http://example.com">Read more</a>
    <span class="location">NYC</span>
  </li>
</ul>

In pages.js, you'd enter:

{
  url: "https://example.com/listing", // required
  keywords: ["sales", "marketing", "fashion", "jewelry"], // optional
  parent: ".posting", // required
  selectors: {
    title: "h1", // required
    url: ".more", // required
    location: ".location" // optional key/value (title/selector)
  }
}

Quick note on images: If you enter a key of image or thumbnail, the selector must point to an img tag.
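
Since a page definition with a missing required field is an easy mistake to make, a small check along these lines can catch it before you deploy. This validator is a hypothetical helper, not part of the project; the required fields simply mirror the "// required" comments in the example above.

```javascript
// Hypothetical validator; the required fields mirror the "// required"
// comments in the pages.js example. It is not part of the project.
function validatePage(page) {
  const problems = [];
  if (!page.url) problems.push("url is required");
  if (!page.parent) problems.push("parent is required");
  const selectors = page.selectors || {};
  if (!selectors.title) problems.push("selectors.title is required");
  if (!selectors.url) problems.push("selectors.url is required");
  // keywords are optional, but should be a list when given
  if (page.keywords !== undefined && !Array.isArray(page.keywords)) {
    problems.push("keywords must be an array when present");
  }
  return problems;
}

const page = {
  url: "https://example.com/listing",
  parent: ".posting",
  selectors: { title: "h1", location: ".location" }, // url selector forgotten
};
console.log(validatePage(page)); // [ 'selectors.url is required' ]
```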

JSON pages

For JSON endpoints, parent names the array of items in the response, and $ in a selector refers to each item of that array.

For example, given a JSON endpoint (e.g. https://example.com/listing.json) with a response of:

{
  "postings": [{
    "title": "Hello world",
    "link": "http://example.com",
    "location": { "title": "NYC" }
  }]
}

In pages.js you'd enter:

{
  url: "https://example.com/listing.json",
  json: true, // required for JSON
  keywords: ["sales", "marketing", "fashion", "jewelry"],
  parent: "postings",
  selectors: {
    title: "$.title",
    url: "$.link",
    center: "$.location.title"
  }
}
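
To show how selectors like these could map onto the response, here is a minimal sketch. It assumes $ stands for one item of the parent array and that the rest of the selector is a dot-separated property path; the function names are hypothetical, and the project's real lookup may differ.

```javascript
// Hypothetical resolver, assuming "$" stands for one item of the parent array
// and the rest of the selector is a dot-separated property path.
function resolveSelector(item, selector) {
  const path = selector.split(".").slice(1); // drop the leading "$"
  return path.reduce(
    (value, key) => (value == null ? undefined : value[key]),
    item
  );
}

// Build one result object per item in the parent array.
function extractResults(response, page) {
  const items = response[page.parent] || [];
  return items.map((item) => {
    const result = {};
    Object.keys(page.selectors).forEach((name) => {
      result[name] = resolveSelector(item, page.selectors[name]);
    });
    return result;
  });
}

const response = {
  postings: [
    { title: "Hello world", link: "http://example.com", location: { title: "NYC" } },
  ],
};
const page = {
  parent: "postings",
  selectors: { title: "$.title", url: "$.link", center: "$.location.title" },
};
console.log(extractResults(response, page));
// [ { title: 'Hello world', url: 'http://example.com', center: 'NYC' } ]
```

A path that doesn't exist on an item resolves to undefined rather than throwing, which keeps one malformed posting from breaking the whole run.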

Deploy

Create the zip package

You can zip the package as you normally would, or you can run the zip npm script:

$ npm install
$ npm run zip

(You can ignore the .env file this creates in your root directory)

Create the Lambda function

  1. Start with the "canary" blueprint
  2. Create a CloudWatch event and set the rate (e.g. the cron expression 20 2 * * ? * runs once a day at 02:20 UTC)
  3. In the final review step of creating the Lambda function, make sure to enable the event source.

Lambda settings:

  • Runtime: Node.js 4.3
  • Handler: index.handler
  • Memory: 128 MB should be enough
  • Timeout: 2 minutes should be plenty (mine hasn't gone beyond 15 seconds)

Debugging

In index.js, set var debug to true.

In your terminal, you can then run the following from the root of the project to see what results are found for the pages you've defined:

$ node
 > require('./index').handler()
 Found results for...

Contributors

sawyerh
