This is a proof-of-concept (PoC) application that lets you ask questions to a Python chatbot fine-tuned on Hyperledger standard documents. I implemented this first version as a mentee during the Hyperledger Mentorship Program 2023.
This NLP application gives people access to the Hyperledger standard documentation. The goal of the lab is to support Hyperledger users (end users, developers, etc.) in their work, so that they do not have to wade through oceans of documents to find the information they are looking for. Large Language Models, both paid and open-source, have yielded remarkable results. Today we can implement a conversational AI tool that answers questions about a specific context.
The model is XLM-R (HuggingFace deepset/xlm-roberta-large-squad2), a pre-trained model fine-tuned on the SQuAD 2.0 dataset. Below is the architecture of the model:
In this PoC I use Haystack (Haystack by Deepset) to build the QA pipeline.
Below is an image of the architecture:
I use Elasticsearch (Elasticsearch website) as the Retriever component.
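To give an intuition for what the Retriever does before the reader model extracts an answer, here is a minimal pure-Python sketch of BM25 ranking, the scoring function Elasticsearch uses by default. It is a toy illustration, not the code of this PoC and not how Haystack or Elasticsearch are implemented: the whitespace tokenizer, the function names and the three sample documents are assumptions made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, saturated by k1 and normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-corpus, only for illustration.
docs = [
    "Hyperledger Fabric is a permissioned blockchain framework",
    "Hyperledger Besu is an Ethereum client",
    "Chaincode is the smart contract in Fabric",
]
ranking = bm25_rank("what is chaincode", docs)
best_index = ranking[0][0]
print(docs[best_index])
```

In the real pipeline the top-ranked documents are passed to the reader model, which extracts the answer span from them.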
For the installation instructions, read the links below:
Haystack installation
Elastic Search Windows installation
In the ingest folder, you can find two kinds of files:
- ES format files (Elasticsearch), which contain the data for the unstructured documents
- one SQuAD format file (Stanford Question Answering Dataset) for the fine-tuning process
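As a rough illustration of the SQuAD format used for fine-tuning, the sketch below builds a one-question dataset following the public SQuAD v2.0 JSON layout. The helper name, the id, the question and the context sentence are hypothetical and are not taken from the files in the ingest folder.

```python
import json


def make_squad_example(title, context, question, answer_text, qid):
    """Build one SQuAD-style article containing a single question."""
    # SQuAD stores the character offset of the answer inside the context.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer text must appear verbatim in the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qid,
                        "question": question,
                        "answers": [{"text": answer_text, "answer_start": start}],
                        "is_impossible": False,
                    }
                ],
            }
        ],
    }


# Hypothetical example data, only for illustration.
context = "Hyperledger Fabric is a permissioned distributed ledger framework."
example = make_squad_example(
    title="Hyperledger Fabric",
    context=context,
    question="What is Hyperledger Fabric?",
    answer_text="a permissioned distributed ledger framework",
    qid="example-0001",
)
dataset = {"version": "v2.0", "data": [example]}
print(json.dumps(dataset, indent=2))
```

Computing `answer_start` from the context instead of typing it by hand avoids the off-by-one errors that are a common cause of failed fine-tuning runs.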
This is the first version of the PoC. Below is a list of improvements planned for the near future:
- Model: a more sophisticated model (e.g. Zephyr 7B alpha)
- Dataset: I currently ingest only 2 documents as an example, but real systems work with hundreds of documents
- Retriever: more sophisticated techniques that use embeddings
- QA type: I will use generative QA (RAG) instead of extractive QA
- Hardware: the system currently takes 10 minutes to ingest the files; a GPU can save much of that time