usaid-bin-rehan / fast_resources_reverse_indexing
Search-Engine for FAST-Resources
License: Apache License 2.0
Index on two new columns along with previous word column:
Category (String) { Dropdown on UI }: Outline, Book, Paper, Slide, Assignment, Practice, Quiz, Mid1, Mid2, Final, Proposal, Report, Presentation or Misc
Topic (Strings) { Textbox on UI where user enters comma separated strings }
This will allow users to filter files using just the Category dropdown (for example, all files in the PDC subdirectory) or by typing comma-separated topics (for example, "Fall, 2021" to display all files containing the words "Fall" and "2021" or "21" in the results).
This should work on files that follow the naming convention Category_Topic1_Topic2..._TopicN.extension; for files that don't follow the convention, only Category and Topic extraction is skipped, NOT word extraction.
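As a sketch of the intended extraction (the function name parse_category_and_topics and the exact fallback behavior are assumptions, not part of the existing code):

```python
# Sketch: extract Category and Topics from a file name that follows the
# Category_Topic1_Topic2..._TopicN.extension convention. Names that do not
# follow the convention yield (None, []) so word extraction still proceeds.
CATEGORIES = {
    "Outline", "Book", "Paper", "Slide", "Assignment", "Practice", "Quiz",
    "Mid1", "Mid2", "Final", "Proposal", "Report", "Presentation", "Misc",
}

def parse_category_and_topics(file_name):
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    parts = stem.split("_")
    if parts[0] not in CATEGORIES:
        return None, []                  # name does not follow the convention
    return parts[0], parts[1:]
```

For example, `parse_category_and_topics("Quiz_Fall_2021.pdf")` yields `("Quiz", ["Fall", "2021"])`, while a non-conforming name yields `(None, [])`.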
File changes
1. search_lambda/lambda_function.py
def query_dynamo_db(keyword):
    response = table.query(
        KeyConditionExpression=Key('pk').eq(keyword)
    )
    rows = {item['file_path']: item for item in response['Items']}
    for value in rows.values():
        del value['pk']
        del value['file_path']
    return rows
search_lambda/lambda_function.py
def lambda_handler(event, context):
    operations = {
        'GET': lambda dynamo, x: dynamo.scan(**x),
    }
    operation = event['httpMethod']
    if operation in operations:
        payload = event['queryStringParameters'] if operation == 'GET' else json.loads(event['body'])
        print(event)
        query_words = [individual_word.strip() for individual_word in payload['query'].lower().split(',')]
        category = payload.get('category', None)
        topic = payload.get('topic', None)
        results = [query_dynamo_db(word) for word in query_words]
        if category:
            category_results = query_dynamo_db(category)
            results.append(category_results)
        if topic:
            topic_results = query_dynamo_db(topic)
            results.append(topic_results)
        return respond(None, intersect_result(results))
    else:
        return respond(ValueError('Unsupported method "{}"'.format(operation)))
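The intersect_result helper is referenced above but not shown in this excerpt; a minimal sketch of what it might do, assuming it keeps only file paths present in every per-word result (the merge behavior is an assumption):

```python
def intersect_result(results):
    # Each element of `results` maps file_path -> item attributes for one word.
    # Keep only the file paths that appear in every per-word result set.
    if not results:
        return {}
    common_paths = set(results[0])
    for result in results[1:]:
        common_paths &= set(result)
    # Merge attributes for the surviving paths (later results win on conflicts).
    merged = {}
    for path in common_paths:
        merged[path] = {}
        for result in results:
            merged[path].update(result[path])
    return merged
```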
iac/main.tf
resource "aws_api_gateway_resource" "reverse_index" {
  # ... existing configuration ...
  # (note: request_parameters is not a valid argument on
  #  aws_api_gateway_resource; it is declared on the method below)
}
resource "aws_api_gateway_method" "reverse_index_GET" {
  # ... existing configuration ...
  request_parameters = {
    "method.request.querystring.query"    = true
    "method.request.querystring.category" = false
    "method.request.querystring.topic"    = false
  }
}
frontend S3
const query = document.getElementById('query').value;
const category = document.getElementById('category').value;
const topic = document.getElementById('topic').value;
const queryParams = {
query: query,
category: category,
topic: topic
};
const queryString = Object.keys(queryParams)
.map(key => `${encodeURIComponent(key)}=${encodeURIComponent(queryParams[key])}`)
.join('&');
fetch(`https://your-api-gateway-url/FAST-Resources_Reverse-Index?${queryString}`, {
method: 'GET',
headers: {
// ... existing headers ...
}
})
.then(response => response.json())
.then(data => {
// Process search results
})
.catch(error => {
// Handle error
});
We can also migrate the frontend to Next.js for efficiency.
Can you please check whether search performance improves by replacing the code in search.py with the code below?
How it works:
First, combine the relevance scores of the individual query words into a vector for each document. Next, build a matching query vector over the same words. Then compute the cosine similarity between the query vector and each document vector (cosine similarity normalizes by vector magnitude). Lastly, sort the search results by similarity score, with the most similar documents ranked first.
search.py:
import math
from database import LiteDatabase
db = LiteDatabase()
search_query = "schema mysql database"
def get_keys_from_dict(source_dict):
    return list(source_dict)
def calculate_cosine_similarity(query_vector, document_vector):
    # Dot product of the query and document vectors
    dot_product = sum(query_vector[word] * document_vector.get(word, 0) for word in query_vector)
    # Magnitudes of the query and document vectors
    query_magnitude = math.sqrt(sum(value * value for value in query_vector.values()))
    document_magnitude = math.sqrt(sum(value * value for value in document_vector.values()))
    # Guard against empty vectors to avoid division by zero
    if query_magnitude == 0 or document_magnitude == 0:
        return 0.0
    return dot_product / (query_magnitude * document_magnitude)
def search(query):
    words = query.split()
    results = [db.search(word) for word in words]
    if len(results) == 1:
        return get_keys_from_dict(results[0])
    # Build one vector per document: the relevance of each query word in it.
    # (The original draft keyed document_scores by doc_id but then looked up
    # words in it and re-queried the database by doc_id; both lookups would
    # always miss, so the vectors are built directly from the word results.)
    document_vectors = {}
    for word, result in zip(words, results):
        for doc_id, doc_data in result.items():
            document_vectors.setdefault(doc_id, {})[word] = doc_data['relevance']
    # Query vector: uniform weight per query word (cosine similarity already
    # normalizes by magnitude, so no separate max-score normalization is needed)
    query_vector = {word: 1.0 for word in words}
    # Score each document against the query vector
    search_results = {
        doc_id: calculate_cosine_similarity(query_vector, document_vector)
        for doc_id, document_vector in document_vectors.items()
    }
    # Sort the search results by cosine similarity (highest similarity first)
    sorted_results = sorted(search_results.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in sorted_results]
search_result = search(search_query)
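As a quick sanity check on the ranking math, cosine similarity can be verified on two small hand-built vectors (this standalone helper mirrors the scoring above, just for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[w] * v.get(w, 0) for w in u)
    mag = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / mag if mag else 0.0
```

A document containing both query words at equal relevance scores a perfect 1.0 against a uniform query vector, regardless of scale: `cosine({"schema": 1, "mysql": 1}, {"schema": 2, "mysql": 2})` is 1.0, while a document sharing no words scores 0.0.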
One potential bottleneck is inserting data into the database inside the loop for each file. This produces a large number of individual write operations, which are slow. Consider batching the insertions and performing bulk inserts instead of one insert per file; this reduces the number of database operations and improves overall throughput.
main.py:
import pathlib
import PyPDF2
from database import LiteDatabase
from extract_text import extract_text_from_word, extract_text_from_pdf, extract_text_from_powerpoint
from process_words import remove_stop_words
from collections import Counter
class Main:
    def __init__(self):
        self.repo_path = '/Users/jazib/Desktop/workrepo/FAST-Resources/'
        self.fast_resources = pathlib.Path(self.repo_path)
        self.db = LiteDatabase()
        # Number of index rows to buffer before performing a batch insert
        self.batch_insert_size = 100

    def run(self):
        batch_values = []
        count = 0
        for file_path in self.fast_resources.rglob("*"):
            text = self.extract_text_from_file(file_path)
            if text is None:
                continue
            topic_name = self.extract_topic_name(str(file_path))
            filtered_words = remove_stop_words(text)
            for word, relevance in Counter(filtered_words).items():
                batch_values.append((word, topic_name, str(file_path), relevance))
                count += 1
                if count >= self.batch_insert_size:
                    self.db.insert_index_batch(batch_values)
                    batch_values = []
                    count = 0
            print(file_path)
        # Flush any remaining buffered rows before closing
        if batch_values:
            self.db.insert_index_batch(batch_values)
        self.db.close()

    def extract_text_from_file(self, file_path):
        if str(file_path).endswith(".pdf"):
            try:
                return extract_text_from_pdf(file_path)
            except (PyPDF2.utils.PdfReadError, ValueError) as e:
                print(f"Skipped non-PDF file: {file_path} ({str(e)})")
        elif str(file_path).endswith(".docx"):
            try:
                return extract_text_from_word(file_path)
            except Exception as e:
                print(f"Skipped non-Word file: {file_path} ({str(e)})")
        elif str(file_path).endswith(".pptx"):
            try:
                return extract_text_from_powerpoint(file_path)
            except Exception as e:
                print(f"Skipped non-PowerPoint file: {file_path} ({str(e)})")
        else:
            print(f"file not supported {file_path}")
        return None

    def extract_topic_name(self, file_path):
        file_path = file_path.replace(self.repo_path, "")
        endpoint = file_path.find('/')
        return file_path[:endpoint]
Main().run()
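LiteDatabase.insert_index_batch is not shown here; assuming LiteDatabase wraps SQLite, a batched insert could be sketched with executemany (the table and column names are assumptions):

```python
import sqlite3

def insert_index_batch(conn, rows):
    # rows: iterable of (word, topic, file_path, relevance) tuples.
    # executemany issues all inserts inside one transaction and a single
    # commit, which is far faster than committing each row individually.
    conn.executemany(
        "INSERT INTO reverse_index (word, topic, file_path, relevance) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```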
Modify the website to Display Sponsor Advertisements (Images / Gifs + Links) uploaded by Frontend (S3) Administrator:
The purpose of using embedded ads is to retain control over content and revenue, avoid bad UI/UX, and avoid being blocked by ad-blockers.
html
<!DOCTYPE html>
<html>
<head>
<!-- Head content here -->
</head>
<body>
<header>
<!-- Heading content here -->
</header>
<nav>
<!-- Navigation and filters content here -->
</nav>
<main>
<!-- Search results content here -->
</main>
<aside class="advertisements">
<!-- Advertisements content will be inserted here -->
</aside>
<footer>
<!-- Footer content here -->
</footer>
</body>
</html>
CSS
/* Add styles to position and style the advertisements */
.advertisements {
position: fixed;
top: 100px; /* Adjust as needed to avoid covering other content */
right: 20px;
width: 300px;
padding: 10px;
background-color: #f0f0f0;
border: 1px solid #ccc;
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
z-index: 9999; /* Ensure advertisements are above other content */
}
/* Add styles for the advertisement images */
.advertisement-image {
max-width: 100%;
height: auto;
}
JS
// Example JavaScript code to fetch and insert advertisements
const advertisementsSection = document.querySelector('.advertisements');
// Assume adData is an array of objects with image URLs and links
const adData = [
{ imageUrl: 'ad1.jpg', link: 'https://www.example.com/ad1' },
{ imageUrl: 'ad2.gif', link: 'https://www.example.com/ad2' },
// Add more ad objects as needed
];
adData.forEach(ad => {
    const adContainer = document.createElement('div');
    adContainer.classList.add('advertisement');
    const adLink = document.createElement('a');
    adLink.href = ad.link;
    adLink.target = '_blank';
    adLink.rel = 'noopener noreferrer'; // avoid reverse-tabnabbing with target=_blank
    const adImage = document.createElement('img');
    adImage.classList.add('advertisement-image');
    adImage.src = ad.imageUrl;
    adImage.alt = 'Sponsor advertisement';
    adLink.appendChild(adImage);
    adContainer.appendChild(adLink);
    advertisementsSection.appendChild(adContainer);
});
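Since the ads are "uploaded by Frontend (S3) Administrator", one way to feed adData is a JSON manifest the administrator publishes to the frontend bucket; a sketch (the build_ad_manifest helper, bucket name, and ads.json key are all assumptions):

```python
import json

def build_ad_manifest(ads):
    # ads: list of (image_url, link) pairs supplied by the administrator.
    # Returns the JSON body the frontend fetch() would consume as adData.
    return json.dumps(
        [{"imageUrl": image_url, "link": link} for image_url, link in ads]
    )

# Uploading the manifest (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_object(
#       Bucket="your-frontend-bucket",   # assumption: frontend bucket name
#       Key="ads.json",                  # assumption: manifest key
#       Body=build_ad_manifest(ads),
#       ContentType="application/json",
#   )
```

The frontend would then fetch ads.json and iterate over it exactly as the hardcoded adData array above.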
Integrate Free Version of Google Analytics:
Step 1: Create a Google Analytics Account
Step 2: Integrate Google Analytics into Your Website
Open the <head> section of your HTML document (usually in the index.html file) and paste the Google Analytics tracking code just before the closing </head> tag:
<head>
<!-- Other head content -->
<!-- Google Analytics Tracking Code -->
<script async src="https://www.googletagmanager.com/gtag/js?id=YOUR_TRACKING_ID"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'YOUR_TRACKING_ID');
</script>
</head>
Replace YOUR_TRACKING_ID with the actual Tracking ID you received from Google Analytics.
Step 3: Set Up Google Analytics Views and Reports
Step 4: View Search Trends for Academic and Advertisement Analytics
Step 5: Prevent Bots / Scripts Access
Use the cost-effective AWS Shield (the Standard tier is included at no extra charge).
Note that the free version of Google Analytics may have limitations. Review its terms of service and privacy policy to ensure compliance. It may collect user data, so follow applicable data-protection laws and obtain user consent where necessary.
The following features need to be added in the final version of the application to deal with storage limits & file updates:
2. CloudWatch Logging-Monitoring
search_lambda/lambda_function.py
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_cloudwatch_metric(value, metric_name):
    cloudwatch.put_metric_data(
        Namespace='Custom/SearchApp',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': 'Count'
            },
        ]
    )
Call publish_cloudwatch_metric whenever a relevant event occurs, such as a search being executed or an item being indexed.
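For example, a call site in lambda_handler might look like the comment below; splitting out a small pure helper makes the MetricData payload easy to verify without calling AWS (the helper name and metric names are assumptions):

```python
# Sketch: build the MetricData entry separately so it can be unit-tested.
def build_metric_datum(metric_name, value):
    return {"MetricName": metric_name, "Value": value, "Unit": "Count"}

# Inside lambda_handler, after a successful query (uses the cloudwatch
# client and publish_cloudwatch_metric from the snippet above):
#   publish_cloudwatch_metric(1, "SearchExecuted")
```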