atlas-vector-search-pdf's Introduction

Atlas Vector Search Across PDFs

Introduction

This demo is a prototype of how Atlas Vector Search could be used to find relevant PDF documents.

To begin, the text is extracted from the PDFs, split into sentences, and mapped into a 384-dimensional dense vector space. The PDF sentences, along with their vectors, are stored in MongoDB Atlas.
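The split-and-encode step can be sketched roughly as follows. This is only an illustration under assumptions, not the demo's actual code: the function names, the `pdf`, `page`, and `sentence` field names, and the regex-based sentence splitter are all hypothetical, and `placeholder_encode` is a stand-in so the sketch runs without a model. In the real pipeline, the sentence transformer's encode call would take its place.

```python
import re
from typing import Callable, Dict, List, Sequence

VECTOR_DIMENSIONS = 384  # dimensionality stated in this README


def split_into_sentences(text: str) -> List[str]:
    """Naively split on '.', '!' or '?' followed by whitespace (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_documents(
    pdf_name: str,
    page_number: int,
    page_text: str,
    encode: Callable[[str], Sequence[float]],
) -> List[Dict]:
    """Turn one page of extracted text into one document per sentence."""
    documents = []
    for sentence in split_into_sentences(page_text):
        vector = list(encode(sentence))
        if len(vector) != VECTOR_DIMENSIONS:
            raise ValueError(
                f"expected {VECTOR_DIMENSIONS} dimensions, got {len(vector)}"
            )
        documents.append({
            "pdf": pdf_name,           # hypothetical field name
            "page": page_number,       # hypothetical field name
            "sentence": sentence,      # hypothetical field name
            "sentenceVector": vector,  # field name used by the search index
        })
    return documents


def placeholder_encode(sentence: str) -> List[float]:
    """Stand-in encoder so the sketch runs without a model; NOT a real embedding."""
    return [float(len(sentence))] * VECTOR_DIMENSIONS


docs = build_documents(
    "example.pdf", 1, "Atlas stores vectors. Queries find neighbors.", placeholder_encode
)
print(len(docs), docs[0]["sentence"])
```

Each resulting dictionary has the shape of one document that could be inserted into the Atlas collection.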

An Atlas Vector Search index then allows the PDFs to be queried, finding the PDFs that are relevant to the query.

Architecture

Setup

PDFs to Query

For this demo, the text extractor reads the PDFs from a local directory. To get started, I've supplied 5 MongoDB whitepapers, but please try with your own PDFs.

Atlas

Open params.py and configure your connection to Atlas, along with the name of the database and collection you'd like to store your text.

Extract and Encode the PDFs

Install the requirements. This implementation uses:

pip install -r requirements.txt

Run extract_and_encode.py:

python3 extract_and_encode.py

Create Search Index

Create a default search index on the collection:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "sentenceVector": {
        "type": "knnVector",
        "dimensions": 384,
        "similarity": "euclidean"
      }
    }
  }
}
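If you would rather generate the index definition than paste it by hand, a small helper along these lines could build the same JSON. This is a sketch: `knn_index_definition` is a hypothetical name and not part of the demo, and the check only covers the similarity options that the `knnVector` type accepts (`euclidean`, `cosine`, `dotProduct`).

```python
import json
from typing import Dict

ALLOWED_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def knn_index_definition(
    path: str = "sentenceVector",
    dimensions: int = 384,
    similarity: str = "euclidean",
) -> Dict:
    """Return the index definition shown above as a dict (hypothetical helper)."""
    if dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "mappings": {
            "dynamic": True,
            "fields": {
                path: {
                    "type": "knnVector",
                    "dimensions": dimensions,
                    "similarity": similarity,
                }
            },
        }
    }


definition = knn_index_definition()
print(json.dumps(definition, indent=2))
```

The `dimensions` value has to match the length of the stored vectors, which is why it defaults to 384 here.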

Demo

You are now ready to search your vast PDF library for the PDFs that may hold the answers to your questions.

Your query will be mapped using the same sentence transformer that was used to encode the data and then submitted to Atlas Search, returning the top 3 matches.

For example:

$ python3 find_pdf.py -q "Can I query data that resides in AWS S3?"

The following PDFs may contain the answers you seek:
----------------------------------------------------
PDF:      MongoDB UseCase Guidance.pdf
Page:     4
Sentence: With Atlas Data Lake you can query, combine, and analyze data across AWS S3 and MongoDB Atlas Databases without complex integrations, working with data in its native format using the MongoDB Query API. 

PDF:      MongoDB Atlas Search- Transforming Customer Experience.pdf
Page:     12
Sentence: Query  and combine MongoDB Atlas  application data with other data  assets stored on Amazon S3. 

PDF:      MongoDB Atlas Search- Transforming Customer Experience.pdf
Page:     17
Sentence: Tier aged business data to S3 by using Atlas  Online Archive, then federate queries across  storage tiers via Atlas Data Lake. 

The Search Query

This is the simple query passed to MongoDB:

[
    {
        "$search": {
            "knnBeta": {
                "vector": <generated query vector>,
                "path": "sentenceVector",
                "k": 150  // Number of nearest neighbors (nn) to return
            }
        }
    },
    {
        "$limit": 3      
    }
]
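In Python, the same pipeline could be assembled as a plain list of dictionaries. The sketch below assumes a hypothetical helper name, `build_knn_pipeline`, and uses a zero vector in place of the encoded query. With PyMongo, the result would typically be passed to `collection.aggregate(pipeline)`.

```python
from typing import Dict, List, Sequence


def build_knn_pipeline(
    query_vector: Sequence[float],
    k: int = 150,
    limit: int = 3,
    path: str = "sentenceVector",
) -> List[Dict]:
    """Build the $search/knnBeta pipeline shown above (hypothetical helper)."""
    if limit > k:
        raise ValueError("limit cannot exceed k, the number of neighbors requested")
    return [
        {
            "$search": {
                "knnBeta": {
                    "vector": list(query_vector),  # the encoded query
                    "path": path,                  # field holding the stored vectors
                    "k": k,                        # nearest neighbors to retrieve
                }
            }
        },
        {"$limit": limit},  # keep only the top matches
    ]


# Placeholder query vector; a real one would come from the sentence transformer.
pipeline = build_knn_pipeline([0.0] * 384)
print(pipeline[1])
```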

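The knnBeta operator uses the Hierarchical Navigable Small World (HNSW) algorithm to run an approximate nearest-neighbor search over the stored vectors, which is what makes semantic search possible. You can use the same Atlas Search kNN support for other use cases, such as finding items similar to a selected product or searching for images.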
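For intuition, the exact search that a Hierarchical Navigable Small World index approximates can be written as a brute-force scan. The sketch below is not how Atlas implements the search. It compares the query against every stored vector using Euclidean distance (the similarity configured in the index above) and keeps the k closest. The function names and the tiny 2-dimensional vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    distances = [(i, euclidean_distance(query, v)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return distances[:k]


# Tiny 2-dimensional vectors for readability; the demo's vectors have 384 dimensions.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = exact_knn([0.9, 1.2], stored, k=2)
print(nearest)
```

A scan like this touches every vector on every query. An HNSW index avoids that by walking a layered graph of neighbors, trading a small amount of accuracy for speed.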
