

Document Search Engine On AWS

Indexer and Scraper

Medium blog for detailed explanation

https://medium.com/@m.zaradzki/build-your-own-document-search-engine-using-amazon-web-services-82d5b165d96c

Notes on using non-default packages in Lambda Nodejs

To use non-default packages such as cheerio (an HTML DOM parser), you need to upload your Lambda code to AWS as a zip file. If the packages make the zip too large, you won't be able to edit or test the code from the console.

In that case you can use lambda-local to emulate Lambda locally. See: https://github.com/ashiina/lambda-local
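For a quick check without any extra tooling, a handler can also be called directly from a small Node script. The sketch below is not lambda-local itself; it only imitates the callback-style invocation `(event, context, callback)` that Lambda uses for Node.js handlers. The handler, the event shape, and the `runLocally` helper are hypothetical.

```javascript
// Hypothetical handler, standing in for the exported function of a Lambda zip.
function handler(event, context, callback) {
  if (!event || typeof event.url !== 'string') {
    return callback(new Error('event.url is required'));
  }
  callback(null, { requested: event.url, requestId: context.awsRequestId });
}

// Minimal stand-in for a local runner: builds a fake context and calls the handler.
function runLocally(fn, event, done) {
  const context = {
    functionName: 'local-test',           // assumed name, for illustration only
    awsRequestId: 'local-' + Date.now(),
    getRemainingTimeInMillis: () => 3000  // fixed value; no real timeout is enforced
  };
  fn(event, context, done);
}

runLocally(handler, { url: 'https://example.com/doc.html' }, (err, result) => {
  if (err) {
    console.error('handler failed:', err.message);
  } else {
    console.log('handler returned:', result);
  }
});
```

This only covers the calling convention; packaging, timeouts, and AWS credentials are what a tool like lambda-local adds on top.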

Useful links

Lambda functionality so far

  • Function that writes updates to DynamoDB
  • Function that writes files to S3
  • Function that reads file metadata from S3
  • Function that reads HTML
  • Function that downloads files from the web
  • Function that writes/reads messages to/from SQS
  • Function that queries CloudSearch
  • Function that indexes documents in CloudSearch

Description (changing quickly)

  • legiscrap0 writes all scraped links to SQS
  • legiscrap_manager1 picks up one SQS message and delegates it to scrap1 (triggered by CloudWatch CRON)
  • legiscrap1 fetches HTML or an attachment online and saves it to S3
  • docIndexer listens for S3 file-addition events and sends index commands to CloudSearch
  • docSearcher could be exposed through an API to query CloudSearch
  • indexCleaner scans CloudSearch documents and checks that each has a matching S3 file (triggered by CloudWatch CRON)
  • indexCatcher scans S3 and checks that all files are in the CloudSearch index (triggered by CloudWatch CRON)
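As an illustration of the docIndexer step, the sketch below turns an S3 "object created" event into a CloudSearch document batch. The S3 event layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) and the batch shape (`type`, `id`, `fields`) follow the AWS formats; the field names `bucket`, `s3_key` and `size`, and hashing the key to get a document id, are assumptions rather than the repository's actual schema.

```javascript
const crypto = require('crypto');

// S3 event keys are URL-encoded, with spaces sent as '+'.
function decodeS3Key(rawKey) {
  return decodeURIComponent(rawKey.replace(/\+/g, ' '));
}

// Hashing keeps the id short (40 hex characters) and free of unusual characters.
function documentId(bucket, key) {
  return crypto.createHash('sha1').update(bucket + '/' + key).digest('hex');
}

// Builds one "add" operation per S3 record; field names are hypothetical.
function toCloudSearchBatch(s3Event) {
  return (s3Event.Records || []).map((record) => {
    const bucket = record.s3.bucket.name;
    const key = decodeS3Key(record.s3.object.key);
    return {
      type: 'add',
      id: documentId(bucket, key),
      fields: { bucket: bucket, s3_key: key, size: record.s3.object.size }
    };
  });
}

// Hand-written sample event with a single record.
const sampleEvent = {
  Records: [
    { s3: { bucket: { name: 'my-docs' }, object: { key: 'laws/act+2016.pdf', size: 1024 } } }
  ]
};
console.log(JSON.stringify(toCloudSearchBatch(sampleEvent), null, 2));
```

Deriving the id from the bucket and key means that re-indexing the same file overwrites its existing document instead of creating a duplicate.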

Note that:

  • indexCatcher uses an SQS queue to keep track of its position in the S3 bucket, which it processes in chunks
  • indexCleaner uses an SQS queue to keep track of its position in the CloudSearch index
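The cursor idea can be sketched with in-memory stand-ins. Here `FakeQueue` plays the role of the SQS queue and `listPage` the role of a paginated S3 listing; both are hypothetical helpers, and the real functions would call the AWS SDK instead. Each run reads the last saved position, processes one chunk, and writes the next position back, so the following scheduled run resumes where this one stopped.

```javascript
// In-memory stand-in for the SQS queue that carries the cursor message.
class FakeQueue {
  constructor() { this.messages = []; }
  send(body) { this.messages.push(body); }
  receive() { return this.messages.shift(); } // undefined when the queue is empty
}

// Stand-in for a paginated bucket listing: returns a chunk plus the next position.
function listPage(allKeys, start, pageSize) {
  const keys = allKeys.slice(start, start + pageSize);
  const next = start + pageSize < allKeys.length ? start + pageSize : null;
  return { keys, next };
}

// One scheduled run: resume from the stored cursor, check a chunk, save the new cursor.
function runOnce(queue, allKeys, indexedKeys, pageSize) {
  const message = queue.receive();
  const start = message ? JSON.parse(message).position : 0;
  const page = listPage(allKeys, start, pageSize);
  const missing = page.keys.filter((key) => !indexedKeys.has(key));
  // When the end is reached, the next run restarts from the beginning.
  queue.send(JSON.stringify({ position: page.next === null ? 0 : page.next }));
  return missing;
}

const queue = new FakeQueue();
const bucketKeys = ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'];
const indexed = new Set(['a.pdf', 'c.pdf', 'd.pdf']);
for (let run = 1; run <= 3; run++) {
  console.log('run', run, 'missing from index:', runOnce(queue, bucketKeys, indexed, 2));
}
```

Keeping the position outside the function matters because each Lambda invocation is stateless and time-limited, so a full scan has to be spread over several scheduled runs.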

Browser code credentials

To invoke Lambda from the browser, the page needs to provide credentials through an Identity Pool managed by AWS Cognito (see http://docs.aws.amazon.com/cognito/latest/developerguide/identity-pools.html). The pool lets you control permissions for both authenticated and unauthenticated users via specific roles.
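A minimal sketch of that browser-side flow, assuming the AWS SDK for JavaScript (v2) is loaded on the page as the global `AWS`. `AWS.CognitoIdentityCredentials` and `AWS.Lambda#invoke` are SDK calls; the region, the identity pool id, the payload shape sent to `docSearcher`, and the `searchDocuments` helper are placeholders. The SDK object is passed in as a parameter so the function can be exercised with a stub.

```javascript
// Calls a Lambda function from the browser with Cognito Identity Pool credentials.
// `sdk` is the AWS SDK v2 global (window.AWS); all ids used here are placeholders.
function searchDocuments(sdk, options, callback) {
  sdk.config.region = options.region;
  // With no Logins map these are guest credentials, which the pool must allow.
  sdk.config.credentials = new sdk.CognitoIdentityCredentials({
    IdentityPoolId: options.identityPoolId
  });

  const lambda = new sdk.Lambda();
  const params = {
    FunctionName: options.functionName,
    InvocationType: 'RequestResponse',
    Payload: JSON.stringify({ query: options.query }) // assumed event shape
  };

  lambda.invoke(params, (err, data) => {
    if (err) return callback(err);
    callback(null, JSON.parse(data.Payload));
  });
}

// Usage in a page that has loaded the SDK (not executed here):
// searchDocuments(AWS, {
//   region: 'eu-west-1',
//   identityPoolId: 'eu-west-1:00000000-0000-0000-0000-000000000000',
//   functionName: 'docSearcher',
//   query: 'tax law'
// }, (err, hits) => console.log(err || hits));
```

The role attached to unauthenticated identities would need permission to invoke only the search function, so that guest visitors cannot reach the scraping or indexing functions.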

Contributors

mzaradzki
