Generative Question and Answer Large Language Model
Project Overview
DataSpeak, one of the industry's largest providers of predictive analytics solutions, needed a proof-of-concept machine learning model that could automatically generate answers to user-input questions.
Machine Learning Skills/Technologies
Text2TextGeneration, Transformers, Tokenizers, PyTorch, Hugging Face, Flan-T5 LLM, spaCy, Streamlit, Render, GPU, BeautifulSoup, Google Colab
Project Conclusions
- Developed a generative language model using google/flan-t5-base, fine-tuned on Stack Overflow data.
- Ran a cosine similarity analysis over a database of generated vector embeddings to identify the 5 questions in the dataset most similar to each user-input question.
- Developed a web application featuring a chatbot UI that provides generative answers from the model and generates 5 alternative answers based on cosine similarity, along with percent similarity scores.
- Improved training set quality by pre-processing and normalizing raw text data.
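The top-5 similarity lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `top_k_similar`, the toy 3-dimensional vectors, and the rounding of percent scores are all assumptions, and real embeddings would come from whatever embedding model the project used. It also assumes no vector is all zeros.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return (row_index, percent_similarity) pairs for the k closest rows.

    Hypothetical helper: assumes every vector has non-zero length.
    """
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Scale to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Sort by descending similarity and keep the k best matches.
    top = np.argsort(-scores)[:k]
    return [(int(i), round(float(scores[i]) * 100, 1)) for i in top]

# Toy embeddings standing in for real question vectors.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
results = top_k_similar([1.0, 0.0, 0.0], corpus, k=2)
```

Normalizing once up front keeps the lookup to a single matrix-vector product, which is usually fast enough for a dataset that fits in memory.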
Screenshot of Web Application UI
Performance & Evaluation
- Achieved a 19% ROUGE-1 score and an average perplexity of 1.96.
- Demonstrated high efficiency, with response times under 15 seconds.
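For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed. It rests on assumptions the source does not confirm: ROUGE-1 is taken as the F1 form with lowercase whitespace tokenization, and perplexity is taken as the exponential of the mean negative log-likelihood per token. The function names and example strings are hypothetical.

```python
import math
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    # Assumption: lowercase whitespace tokenization, no stemming.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection gives clipped unigram matches.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

score = rouge1_f1("use a list comprehension",
                  "you can use a list comprehension here")
ppl = perplexity([math.log(0.5)] * 4)
```

In practice an established package would likely be used for ROUGE, and the per-token log probabilities would come from the model's loss on held-out data.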
Requirements
Python libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, nltk, transformers, spacy, torch