
geeks-of-data / knowledge-gpt


Extract knowledge from all information sources using gpt and other language models. Index and make Q&A session with information sources.

Home Page: https://pypi.org/project/knowledgegpt/

License: MIT License

Python 99.54% Dockerfile 0.46%
gpt openai context embedding embedding-vectors gpt3-turbo gpt4 huggingface huggingface-transformers information-extraction

knowledge-gpt's Introduction

knowledgegpt


knowledgegpt gathers information from various sources, including the internet and local data, and uses it to build prompts. OpenAI's GPT-3 model then answers these prompts, and the answers are stored in a database for future reference.

To accomplish this, the source text is first transformed into fixed-size vectors (embeddings) using either open source or OpenAI models. When a query is submitted, the query text is also transformed into a vector and compared to the stored knowledge embeddings. The most relevant pieces of text are then selected and used as the prompt context.
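The retrieval step described above can be illustrated with a minimal sketch. It is not knowledgegpt's actual implementation: the helper names (cosine_similarity, top_k), the toy 3-dimensional vectors, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the k stored texts whose embeddings are closest to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy (text, embedding) pairs; real embedding models produce far more dimensions.
stored = [
    ("A bombard is a large cannon.", [0.9, 0.1, 0.0]),
    ("Paris is the capital of France.", [0.0, 0.2, 0.9]),
    ("Bombards fired stone balls.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of "What is a bombard?"

# The selected texts are joined to form the prompt context.
context = "\n".join(top_k(query_vec, stored, k=2))
print(context)
```

In this toy data the two bombard sentences rank above the unrelated one, so only they end up in the context.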

knowledgegpt supports various information sources including websites, PDFs, PowerPoint files (PPTX), and documents (Docs). Additionally, it can extract text from YouTube subtitles and audio (using speech-to-text technology) and use it as a source of information. This allows for a diverse range of information to be gathered and used for generating prompts and answers.

Installation

  1. Install from PyPI by running in a terminal: pip install knowledgegpt

  2. Alternatively, install the latest version from a clone of the repository: pip install -r requirements.txt and then pip install .

  3. Download the spaCy language model needed for parsing: python3 -m spacy download en_core_web_sm

How to use

RESTful API

uvicorn server:app --reload

Set Your API Key

  1. Go to OpenAI > Account > API Keys
  2. Create a new secret key and copy it
  3. Enter the key in example_config.py
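As an alternative to pasting the key into the file, example_config.py could read it from the environment. This is only a sketch of one possible layout: the variable name SECRET_KEY comes from the usage example in this README, while the environment variable name OPENAI_API_KEY and the empty-string fallback are assumptions.

```python
# example_config.py (hypothetical variant that avoids hard-coding the key)
import os

# Read the key from the environment; fall back to an empty string so that
# importing this module never raises, even when the variable is unset.
SECRET_KEY = os.environ.get("OPENAI_API_KEY", "")
```

With this layout the key stays out of version control, and from example_config import SECRET_KEY keeps working unchanged.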

How to use the library

# Import the library
from knowledgegpt.extractors.web_scrape_extractor import WebScrapeExtractor

# Import OpenAI and set the API key
import openai
from example_config import SECRET_KEY
openai.api_key = SECRET_KEY

# Define the target website
url = "https://en.wikipedia.org/wiki/Bombard_(weapon)"

# Initialize the WebScrapeExtractor
scrape_website = WebScrapeExtractor(url=url, embedding_extractor="hf", model_lang="en")

# Prompt the OpenAI model
# NOTE: `db` is not defined in this snippet. It must be a MongoDB handle that you
# created beforehand; presumably it is where the answer is stored when to_save=True.
answer, prompt, messages = scrape_website.extract(query="What is a bombard?", max_tokens=300, to_save=True, mongo_client=db)

# See the answer
print(answer)

# Output: 'A bombard is a type of large cannon used during the 14th to 15th centuries.'

More examples can be found in the examples folder. The snippets below give a quick overview of how each extractor is used:

# NOTE: the extractor classes are assumed to be imported already, and the variables
# df, pdf_file_path, ppt_file_path, url and query are assumed to be defined by you.
# See the examples folder for complete scripts.

# Basic usage
basic_extractor = BaseExtractor(df)
answer, prompt, messages = basic_extractor.extract("What is the title of this PDF?", max_tokens=300)

# PDF extraction
pdf_extractor = PDFExtractor(pdf_file_path, extraction_type="page", embedding_extractor="hf", model_lang="en")
answer, prompt, messages = pdf_extractor.extract(query, max_tokens=1500)

# PPTX extraction
ppt_extractor = PowerpointExtractor(file_path=ppt_file_path, embedding_extractor="hf", model_lang="en")
answer, prompt, messages = ppt_extractor.extract(query, max_tokens=500)

# DOCX extraction
docs_extractor = DocsExtractor(file_path="../example.docx", embedding_extractor="hf", model_lang="en", is_turbo=False)
answer, prompt, messages = docs_extractor.extract(query="What is an object detection system?", max_tokens=300)

# Extraction from a YouTube video (audio)
scrape_yt_audio = YoutubeAudioExtractor(video_id=url, model_lang="tr", embedding_extractor="hf")
answer, prompt, messages = scrape_yt_audio.extract(query=query, max_tokens=1200)

# Extraction from a YouTube video (transcript)
scrape_yt_subs = YTSubsExtractor(video_id=url, embedding_extractor="hf", model_lang="en")
answer, prompt, messages = scrape_yt_subs.extract(query=query, max_tokens=1200)

Docker Usage

docker build -t knowledgegptimage .
docker run -p 8888:8888 knowledgegptimage

How to contribute

  1. Open an issue
  2. Fork the repo
  3. Create a new branch
  4. Make your changes
  5. Create a pull request

FEATURES

  • Extract knowledge from the internet (e.g. Wikipedia)
  • Extract knowledge from local data sources - PDF
  • Extract knowledge from local data sources - DOCX
  • Extract knowledge from local data sources - PPTX
  • Extract knowledge from YouTube audio (when captions are not available)
  • Extract knowledge from YouTube transcripts
  • Extract knowledge from a whole YouTube playlist

TODO

  • FAISS support
  • Add a vector database (Pinecone, Milvus, Qdrant etc.)
  • Add Whisper Model
  • Add Whisper Local Support (not over openai API)
  • Add Whisper for audio longer than 25MB
  • Add a web interface
  • Migrate to Promptify for prompt generation
  • Add ChatGPT support
  • Add ChatGPT support with a better infrastructure and planning
  • Increase the number of prompts
  • Increase the number of supported knowledge sources
  • Increase the number of supported languages
  • Increase the number of open source models
  • Advanced web scraping
  • Prompt-Answer storage (the odds are that this will be done in a separate project)
  • Add better documentation
  • Add a better logging system
  • Add a better error handling system
  • Add a better testing system
  • Add a better CI/CD system
  • Dockerize the project
  • Add search engine support, such as Google, Bing, etc.
  • Add support for open-source OpenAI alternatives (for answer generation)
  • Evaluate dependencies and remove unnecessary ones
  • Provide prompt flexibility so that prompts can be used with any model

(To be extended...)

System Architecture

(To be updated with a better image)

knowledge-gpt's People

Contributors

0xcakin, eren23, kaanozbudak, yemregundogmus, younver


knowledge-gpt's Issues

Prompting Enhancement

The number of pre/post prompting options we provide is extremely limited. We should adopt a better, more flexible way of constructing prompts.
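One possible direction is a small template with optional pre and post sections. The sketch below only illustrates the idea and does not reflect the library's current prompt code: the function build_prompt, its parameters, and the default template layout are all hypothetical.

```python
from string import Template

# Hypothetical default layout: optional instructions, context, question, optional suffix.
DEFAULT_TEMPLATE = Template("$pre\n\nContext:\n$context\n\nQuestion: $question\n\n$post")

def build_prompt(context, question, pre="", post="", template=DEFAULT_TEMPLATE):
    """Fill the template and trim whitespace left behind by empty pre/post parts."""
    filled = template.substitute(pre=pre, context=context, question=question, post=post)
    return filled.strip()

example_prompt = build_prompt(
    context="A bombard is a large cannon.",
    question="What is a bombard?",
    pre="Answer using only the context below.",
    post="Answer:",
)
print(example_prompt)
```

Passing a different Template object lets callers change the layout without touching the function itself.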

Improving Parsing Quality

All of our information parsing mechanisms are quite "primitive" at the moment, and some of them could benefit from a rewrite, e.g. ytaudio, ytsubs (not too bad), pdf/doc/ppt extraction, and web scraping.

Hybrid File Reader

When a directory contains more than one of the supported file types, create the knowledge base from all of them.
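A first step could be to group the files in a directory by supported type, so that each group can be handed to the matching extractor. The sketch below is an assumption about how this might look, not existing library code: the SUPPORTED mapping and the function group_supported_files are hypothetical, and the demo runs against a temporary directory with placeholder files.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to a source-type label.
SUPPORTED = {".pdf": "pdf", ".docx": "docx", ".pptx": "pptx"}

def group_supported_files(directory):
    """Group files directly inside `directory` by supported type (case-insensitive)."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        kind = SUPPORTED.get(path.suffix.lower())
        if path.is_file() and kind:
            groups[kind].append(path)
    return dict(groups)

# Demo with placeholder files; unsupported types such as .txt are skipped.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("notes.pdf", "slides.pptx", "report.DOCX", "readme.txt"):
        (Path(tmp) / name).write_text("placeholder")
    groups = group_supported_files(tmp)
    summary = {kind: [p.name for p in paths] for kind, paths in groups.items()}

print(summary)
```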

Language Support

A user emailed us asking for Italian and Russian language support. We can start adding languages one after another.

  • Italian
  • Russian
  • German
  • Spanish
  • Chinese
    (To be extended on demand)

Whisper Audio Length Problem

Audio-to-text transcription does not work for files above 25 MB, because the OpenAI API does not accept larger files and requires them to be split into smaller pieces. I have not managed to make this work so far and left it out since it is not pressing, but it will have to be addressed at some point.
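One way to approach the chunking is to plan time windows whose estimated size stays within the limit, and then cut the audio at those boundaries with an audio tool. The sketch below covers only the planning arithmetic and rests on stated assumptions: plan_chunks is a hypothetical helper, the limit is taken as 25 * 1024 * 1024 bytes, and the size estimate assumes a roughly constant bitrate.

```python
import math

def plan_chunks(total_bytes, duration_s, limit_bytes=25 * 1024 * 1024):
    """Return (start, end) windows in seconds, each estimated to fit under limit_bytes.

    Assumes a roughly constant bitrate, so size is proportional to duration.
    """
    if total_bytes <= limit_bytes:
        return [(0.0, float(duration_s))]
    n_chunks = math.ceil(total_bytes / limit_bytes)
    step = duration_s / n_chunks
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(n_chunks)]

# A hypothetical 60 MiB recording that lasts one hour needs three windows.
windows = plan_chunks(60 * 1024 * 1024, 3600)
print(windows)
```

With variable-bitrate audio a window can come out larger than estimated, so a real implementation would likely use a lower limit as a safety margin.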

Better Error Handling

Error handling is currently ad hoc, based on what felt right at the time. A systematic rewrite of the error handling sections would help.

Search Engine Support

Add support for calling actual search engines to connect, create, or update the knowledge bases used for our prompts.
