usaid-bin-rehan / fast_resources_reverse_indexing
Search-Engine for FAST-Resources
License: Apache License 2.0
Index on two new columns along with previous word column:
Category (String) { Dropdown on UI }: Outline, Book, Paper, Slide, Assignment, Practice, Quiz, Mid1, Mid2, Final, Proposal, Report, Presentation or Misc
Topic (Strings) { Textbox on UI where user enters comma separated strings }
This will allow users to filter files using just the Category dropdown (for example, all files in the PDC subdirectory) or by typing comma-separated topics (for example, "Fall, 2021" to display all files containing the words "Fall" and "2021" or "21" in the results).
This should work on files that follow the naming convention Category_Topic1_Topic2..._TopicN.extension; for files that don't follow the convention, only Category and Topic extraction is skipped, NOT word extraction.
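As a sketch of the intended extraction (the function name parse_category_and_topics and the exact fallback behavior are assumptions, not part of the existing code):

```python
# Sketch: extract Category and Topics from a file name that follows the
# Category_Topic1_Topic2..._TopicN.extension convention. Names that do not
# follow the convention yield (None, []) so word extraction still proceeds.
CATEGORIES = {
    "Outline", "Book", "Paper", "Slide", "Assignment", "Practice", "Quiz",
    "Mid1", "Mid2", "Final", "Proposal", "Report", "Presentation", "Misc",
}

def parse_category_and_topics(file_name):
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    parts = stem.split("_")
    if parts[0] not in CATEGORIES:
        return None, []                  # name does not follow the convention
    return parts[0], parts[1:]
```

For example, `parse_category_and_topics("Quiz_Fall_2021.pdf")` yields `("Quiz", ["Fall", "2021"])`, while a non-conforming name yields `(None, [])`.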
File changes
1. search_lambda/lambda_function.py
def query_dynamo_db(keyword):
    response = table.query(
        KeyConditionExpression=Key('pk').eq(keyword)
    )
    rows = {item['file_path']: item for item in response['Items']}
    for value in rows.values():
        del value['pk']
        del value['file_path']
    return rows
search_lambda/lambda_function.py
def lambda_handler(event, context):
    operations = {
        'GET': lambda dynamo, x: dynamo.scan(**x),
    }
    operation = event['httpMethod']
    if operation in operations:
        payload = event['queryStringParameters'] if operation == 'GET' else json.loads(event['body'])
        print(event)
        query_words = [individual_word.strip() for individual_word in payload['query'].lower().split(',')]
        category = payload.get('category', None)
        topic = payload.get('topic', None)
        results = [query_dynamo_db(word) for word in query_words]
        if category:
            category_results = query_dynamo_db(category)
            results.append(category_results)
        if topic:
            topic_results = query_dynamo_db(topic)
            results.append(topic_results)
        return respond(None, intersect_result(results))
    else:
        return respond(ValueError('Unsupported method "{}"'.format(operation)))
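The intersect_result helper is referenced above but not shown in this excerpt; a minimal sketch of what it might do, assuming it keeps only file paths present in every per-word result (the merge behavior is an assumption):

```python
def intersect_result(results):
    # Each element of `results` maps file_path -> item attributes for one word.
    # Keep only the file paths that appear in every per-word result set.
    if not results:
        return {}
    common_paths = set(results[0])
    for result in results[1:]:
        common_paths &= set(result)
    # Merge attributes for the surviving paths (later results win on conflicts).
    merged = {}
    for path in common_paths:
        merged[path] = {}
        for result in results:
            merged[path].update(result[path])
    return merged
```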
iac/main.tf
resource "aws_api_gateway_resource" "reverse_index" {
  # ... existing configuration ...
  # (note: request_parameters is not a valid argument on
  #  aws_api_gateway_resource; it is declared on the method below)
}
resource "aws_api_gateway_method" "reverse_index_GET" {
  # ... existing configuration ...
  request_parameters = {
    "method.request.querystring.query"    = true
    "method.request.querystring.category" = false
    "method.request.querystring.topic"    = false
  }
}
frontend S3
const query = document.getElementById('query').value;
const category = document.getElementById('category').value;
const topic = document.getElementById('topic').value;
const queryParams = {
query: query,
category: category,
topic: topic
};
const queryString = Object.keys(queryParams)
.map(key => `${encodeURIComponent(key)}=${encodeURIComponent(queryParams[key])}`)
.join('&');
fetch(`https://your-api-gateway-url/FAST-Resources_Reverse-Index?${queryString}`, {
method: 'GET',
headers: {
// ... existing headers ...
}
})
.then(response => response.json())
.then(data => {
// Process search results
})
.catch(error => {
// Handle error
});
We can also migrate the frontend to Next.js for efficiency.
Can you please check whether search performance improves by replacing the code in search.py with the code below?
How it works:
First, combine the relevance scores of the individual query words into a vector for each document. Next, build a matching query vector over the same words. Then compute the cosine similarity between the query vector and each document vector (cosine similarity normalizes by vector magnitude). Lastly, sort the search results by similarity score, with the most similar documents ranked first.
search.py:
import math
from database import LiteDatabase
db = LiteDatabase()
search_query = "schema mysql database"
def get_keys_from_dict(source_dict):
    return list(source_dict)
def calculate_cosine_similarity(query_vector, document_vector):
    # Dot product of the query and document vectors
    dot_product = sum(query_vector[word] * document_vector.get(word, 0) for word in query_vector)
    # Magnitudes of the query and document vectors
    query_magnitude = math.sqrt(sum(value * value for value in query_vector.values()))
    document_magnitude = math.sqrt(sum(value * value for value in document_vector.values()))
    # Guard against empty vectors to avoid division by zero
    if query_magnitude == 0 or document_magnitude == 0:
        return 0.0
    return dot_product / (query_magnitude * document_magnitude)
def search(query):
    words = query.split()
    results = [db.search(word) for word in words]
    if len(results) == 1:
        return get_keys_from_dict(results[0])
    # Build one vector per document: the relevance of each query word in it.
    # (The original draft keyed document_scores by doc_id but then looked up
    # words in it and re-queried the database by doc_id; both lookups would
    # always miss, so the vectors are built directly from the word results.)
    document_vectors = {}
    for word, result in zip(words, results):
        for doc_id, doc_data in result.items():
            document_vectors.setdefault(doc_id, {})[word] = doc_data['relevance']
    # Query vector: uniform weight per query word (cosine similarity already
    # normalizes by magnitude, so no separate max-score normalization is needed)
    query_vector = {word: 1.0 for word in words}
    # Score each document against the query vector
    search_results = {
        doc_id: calculate_cosine_similarity(query_vector, document_vector)
        for doc_id, document_vector in document_vectors.items()
    }
    # Sort the search results by cosine similarity (highest similarity first)
    sorted_results = sorted(search_results.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in sorted_results]
search_result = search(search_query)
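As a quick sanity check on the ranking math, cosine similarity can be verified on two small hand-built vectors (this standalone helper mirrors the scoring above, just for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[w] * v.get(w, 0) for w in u)
    mag = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / mag if mag else 0.0
```

A document containing both query words at equal relevance scores a perfect 1.0 against a uniform query vector, regardless of scale: `cosine({"schema": 1, "mysql": 1}, {"schema": 2, "mysql": 2})` is 1.0, while a document sharing no words scores 0.0.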
One potential bottleneck is inserting data into the database inside the loop for each file. This produces a large number of individual write operations, which are slow. Consider batching the insertions and performing bulk inserts instead of one insert per file; this reduces the number of database operations and improves overall throughput.
main.py:
import pathlib
import PyPDF2
from database import LiteDatabase
from extract_text import extract_text_from_word, extract_text_from_pdf, extract_text_from_powerpoint
from process_words import remove_stop_words
from collections import Counter
class Main:
    def __init__(self):
        self.repo_path = '/Users/jazib/Desktop/workrepo/FAST-Resources/'
        self.fast_resources = pathlib.Path(self.repo_path)
        self.db = LiteDatabase()
        # Number of index rows to buffer before performing a batch insert
        self.batch_insert_size = 100

    def run(self):
        batch_values = []
        count = 0
        for file_path in self.fast_resources.rglob("*"):
            text = self.extract_text_from_file(file_path)
            if text is None:
                continue
            topic_name = self.extract_topic_name(str(file_path))
            filtered_words = remove_stop_words(text)
            for word, relevance in Counter(filtered_words).items():
                batch_values.append((word, topic_name, str(file_path), relevance))
                count += 1
                if count >= self.batch_insert_size:
                    self.db.insert_index_batch(batch_values)
                    batch_values = []
                    count = 0
            print(file_path)
        # Flush any remaining buffered rows before closing
        if batch_values:
            self.db.insert_index_batch(batch_values)
        self.db.close()

    def extract_text_from_file(self, file_path):
        if str(file_path).endswith(".pdf"):
            try:
                return extract_text_from_pdf(file_path)
            except (PyPDF2.utils.PdfReadError, ValueError) as e:
                print(f"Skipped non-PDF file: {file_path} ({str(e)})")
        elif str(file_path).endswith(".docx"):
            try:
                return extract_text_from_word(file_path)
            except Exception as e:
                print(f"Skipped non-Word file: {file_path} ({str(e)})")
        elif str(file_path).endswith(".pptx"):
            try:
                return extract_text_from_powerpoint(file_path)
            except Exception as e:
                print(f"Skipped non-PowerPoint file: {file_path} ({str(e)})")
        else:
            print(f"file not supported {file_path}")
        return None

    def extract_topic_name(self, file_path):
        file_path = file_path.replace(self.repo_path, "")
        endpoint = file_path.find('/')
        return file_path[:endpoint]
Main().run()
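LiteDatabase.insert_index_batch is not shown here; assuming LiteDatabase wraps SQLite, a batched insert could be sketched with executemany (the table and column names are assumptions):

```python
import sqlite3

def insert_index_batch(conn, rows):
    # rows: iterable of (word, topic, file_path, relevance) tuples.
    # executemany issues all inserts inside one transaction and a single
    # commit, which is far faster than committing each row individually.
    conn.executemany(
        "INSERT INTO reverse_index (word, topic, file_path, relevance) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```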
Modify the website to Display Sponsor Advertisements (Images / Gifs + Links) uploaded by Frontend (S3) Administrator:
The purpose of using embedded ads is to retain control over content and revenue, avoid bad UI/UX, and avoid being blocked by ad-blockers.
html
<!DOCTYPE html>
<html>
<head>
<!-- Head content here -->
</head>
<body>
<header>
<!-- Heading content here -->
</header>
<nav>
<!-- Navigation and filters content here -->
</nav>
<main>
<!-- Search results content here -->
</main>
<aside class="advertisements">
<!-- Advertisements content will be inserted here -->
</aside>
<footer>
<!-- Footer content here -->
</footer>
</body>
</html>
CSS
/* Add styles to position and style the advertisements */
.advertisements {
position: fixed;
top: 100px; /* Adjust as needed to avoid covering other content */
right: 20px;
width: 300px;
padding: 10px;
background-color: #f0f0f0;
border: 1px solid #ccc;
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
z-index: 9999; /* Ensure advertisements are above other content */
}
/* Add styles for the advertisement images */
.advertisement-image {
max-width: 100%;
height: auto;
}
JS
// Example JavaScript code to fetch and insert advertisements
const advertisementsSection = document.querySelector('.advertisements');
// Assume adData is an array of objects with image URLs and links
const adData = [
{ imageUrl: 'ad1.jpg', link: 'https://www.example.com/ad1' },
{ imageUrl: 'ad2.gif', link: 'https://www.example.com/ad2' },
// Add more ad objects as needed
];
adData.forEach(ad => {
    const adContainer = document.createElement('div');
    adContainer.classList.add('advertisement');
    const adLink = document.createElement('a');
    adLink.href = ad.link;
    adLink.target = '_blank';
    adLink.rel = 'noopener noreferrer'; // avoid reverse-tabnabbing with target=_blank
    const adImage = document.createElement('img');
    adImage.classList.add('advertisement-image');
    adImage.src = ad.imageUrl;
    adImage.alt = 'Sponsor advertisement';
    adLink.appendChild(adImage);
    adContainer.appendChild(adLink);
    advertisementsSection.appendChild(adContainer);
});
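Since the ads are "uploaded by Frontend (S3) Administrator", one way to feed adData is a JSON manifest the administrator publishes to the frontend bucket; a sketch (the build_ad_manifest helper, bucket name, and ads.json key are all assumptions):

```python
import json

def build_ad_manifest(ads):
    # ads: list of (image_url, link) pairs supplied by the administrator.
    # Returns the JSON body the frontend fetch() would consume as adData.
    return json.dumps(
        [{"imageUrl": image_url, "link": link} for image_url, link in ads]
    )

# Uploading the manifest (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_object(
#       Bucket="your-frontend-bucket",   # assumption: frontend bucket name
#       Key="ads.json",                  # assumption: manifest key
#       Body=build_ad_manifest(ads),
#       ContentType="application/json",
#   )
```

The frontend would then fetch ads.json and iterate over it exactly as the hardcoded adData array above.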
Integrate Free Version of Google Analytics:
Step 1: Create a Google Analytics Account
Step 2: Integrate Google Analytics into Your Website
Open the <head> section of your HTML document (usually in the index.html file) and paste the Google Analytics tracking code just before the closing </head> tag:
<head>
<!-- Other head content -->
<!-- Google Analytics Tracking Code -->
<script async src="https://www.googletagmanager.com/gtag/js?id=YOUR_TRACKING_ID"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'YOUR_TRACKING_ID');
</script>
</head>
Replace YOUR_TRACKING_ID with the actual Tracking ID you received from Google Analytics.
Step 3: Set Up Google Analytics Views and Reports
Step 4: View Search Trends for Academic and Advertisement Analytics
Step 5: Prevent Bots / Scripts Access
Use the cost-effective AWS Shield (the Standard tier is included at no extra charge).
Note that the free version of Google Analytics may have limitations. Review its terms of service and privacy policy to ensure compliance. It may collect user data, so follow applicable data-protection laws and obtain user consent where necessary.
The following features need to be added in the final version of the application to deal with storage limits & file updates:
2. CloudWatch Logging-Monitoring
search_lambda/lambda_function.py
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_cloudwatch_metric(value, metric_name):
    cloudwatch.put_metric_data(
        Namespace='Custom/SearchApp',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': 'Count'
            },
        ]
    )
Call publish_cloudwatch_metric whenever a relevant event occurs, such as a search being executed or an item being indexed.
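For example, a call site in lambda_handler might look like the comment below; splitting out a small pure helper makes the MetricData payload easy to verify without calling AWS (the helper name and metric names are assumptions):

```python
# Sketch: build the MetricData entry separately so it can be unit-tested.
def build_metric_datum(metric_name, value):
    return {"MetricName": metric_name, "Value": value, "Unit": "Count"}

# Inside lambda_handler, after a successful query (uses the cloudwatch
# client and publish_cloudwatch_metric from the snippet above):
#   publish_cloudwatch_metric(1, "SearchExecuted")
```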