anarchy-ai / llm-vm
irresponsible innovation. Try now at https://chat.dev/
Home Page: https://anarchy.ai/
License: MIT License
Should have a flag that downloads the model from one of the OSS repos.
Currently the config setup doesn't check that settings make sense.
localhost because that easily becomes a remote interface
Currently the library depends on the web server, when there's no good reason to require it.
Hello, I downloaded the project (without installing it) as I wanted to edit it for the bounties.
As soon as I tried to run it, it gave me an error like this:
Code: from llm_vm.utils.keys import *
Error: No module named 'llm_vm'
If I had installed it, there's no doubt it would've worked. However, when doing imports from the same project I think the usual way to navigate through folders is using dots, like:
from ...utils.tools import *
Is there another way to go about this?
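The usual fix is an editable install (`pip install -e .` from the repo root), after which `from llm_vm.utils.keys import *` works from anywhere. As a quick, unofficial workaround without installing, you can also put the directory that contains the `llm_vm/` package on `sys.path` first (the path below is a placeholder, adjust it to your clone):

```python
# Quick workaround (not the project's official guidance) for running from a clone
# without installing: make the directory that contains llm_vm/ importable.
import sys
sys.path.insert(0, "/path/to/LLM-VM/src")  # adjust to wherever the llm_vm/ folder lives in your clone

from llm_vm.utils.keys import *  # should now resolve, same as after a real install
```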
Add linting and auto-formatting to keep code clean and improve efficiency
QLoRA (an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance).
Converting LLMs to 4-bit quantization with QLoRA will reduce the required resources, speed up the fine-tuning and inference steps, and be much cheaper.
Refer to the following link for reference:
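A minimal sketch of what 4-bit loading for QLoRA fine-tuning might look like with Hugging Face transformers + bitsandbytes; the model name and settings below are placeholders, not part of this repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization config so a large base model fits in limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "EleutherAI/gpt-neo-2.7B"  # placeholder; any causal LM on the Hub
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

LoRA adapters would then be attached on top of this quantized base model for the actual fine-tuning step.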
Currently data_synthesis.py asks GPT to generate all the datapoints in one call. This leads to many issues: lots of incorrect datapoints and not enough datapoints overall. I fixed this in the experimentation_finetuning branch, and that approach to data synthesis needs to be brought to main and made robust (my current implementation is for research, not deployment).
One thing that still needs to be done is semantic-similarity checking, the way the current data_synthesis.py does it. My implementation in experimentation_finetuning also does not support lists as inputs, and this needs to be added as well.
@itsmeashutosh43 and @VictorOdede I would love your feedback and ideas on this.
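A rough sketch of the one-datapoint-per-call approach (the helper names generate_one and is_valid are hypothetical, not the current API):

```python
# Hypothetical sketch: request datapoints one at a time instead of all in one call,
# validating and deduplicating as we go, until we have enough.
def synthesize(prompt_example, n_points, generate_one, is_valid):
    seen, accepted = set(), []
    attempts = 0
    while len(accepted) < n_points and attempts < 10 * n_points:
        attempts += 1
        candidate = generate_one(prompt_example)   # one LLM call -> one (input, output) pair
        if candidate is None or not is_valid(candidate):
            continue                               # skip malformed datapoints instead of failing the batch
        key = repr(candidate)
        if key in seen:
            continue                               # drop exact duplicates
        seen.add(key)
        accepted.append(candidate)
    return accepted
```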
Prompted by thinking about what we want out of regex/grammar-constrained LLM inference, we've realized that we should just embrace a more generic interface, of which those would be examples.
I'm very much not familiar with Python best practices, but here's my attempt at specifying it from a slightly simplified types perspective.
"this is crudely our current api"
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Set

class LLM(ABC):
    model: "AbstractHF_Model"
    tokenizer: "Abstract_HF_Tokenizer"

    @abstractmethod
    def gen_token(self, prefix: List[int]) -> Dict[int, float]:
        """Map each candidate token id to its log-probability given the prefix."""

    @abstractmethod
    async def generate_simple(self, *args, **kwargs) -> str:
        """Our current generate; everything else is passed through to hf generate."""

    @abstractmethod
    async def generate_with_constraints(self, constraint: "TokenConstraint", *args, **kwargs) -> str:
        """Same as generate_simple, plus a TokenConstraint. So let's talk about TokenConstraint!"""

class TokenConstraint(ABC):
    constraint_type: type  # the type of constraints we want to have
    state_type: type       # ideally frozen/immutable, but described so it kind of works either way

    @abstractmethod
    def is_valid(self, constraint, prefix: List[int]) -> bool: ...

    @abstractmethod
    def allowed_transitions(self, constraint, prefix: List[int], tk) -> Set[int]: ...

    @abstractmethod
    def construct_state(self, constraint, prefix: List[int], tk) -> Optional[object]:
        """Return a state that *may* allow faster enumeration/checking of which tokens are allowed next."""

    @abstractmethod
    def construct_crude_filter_set(self, constraint, tk) -> Set[int]:
        """E.g. for a regex, something like "the set of tokens built only from the character classes
        referenced in the regexp". If constructing the current state is at all expensive, this should
        be *much* faster for filtering tokens."""

    @abstractmethod
    def allowed_transitions_from_state(self, constraint, tk, state) -> Set[int]: ...

    @abstractmethod
    def copy_state(self, state):
        """If the state is immutable/frozen this should be the identity function."""
Something like this could maybe be the generic interface for constrained inference. I'm glossing over a lot of details: we're actually going to be zeroing out the token indices that aren't in these sets in the log-probability vectors, all the mapping back and forth between token ids and strings, and the fact that we can do the token-constraint computation in parallel with running the gen_next_token_inference step.
Also, in principle, things should be such that we can have instances of this interface that logically And, Or, Xor, or Nand together any two constraint languages, I think?
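For instance, an And-combination could just intersect the allowed-token sets of its two sub-constraints. A purely hypothetical sketch against the interface above (class and attribute names are illustrative):

```python
# Hypothetical sketch: a token is allowed only if both sub-constraints allow it.
class AndConstraint:
    def __init__(self, c1, impl1, c2, impl2):
        self.c1, self.impl1 = c1, impl1   # first constraint value and its TokenConstraint impl
        self.c2, self.impl2 = c2, impl2   # second constraint value and its TokenConstraint impl

    def is_valid(self, prefix):
        return self.impl1.is_valid(self.c1, prefix) and self.impl2.is_valid(self.c2, prefix)

    def allowed_transitions(self, prefix, tk):
        return (self.impl1.allowed_transitions(self.c1, prefix, tk)
                & self.impl2.allowed_transitions(self.c2, prefix, tk))
```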
The first version uses JSON, which can often be malformed, and there's no good error recovery in that case. We need to identify and switch to a more "error tolerant", self-aligning format (meaning we can skip a bad pair and still recover useful outputs).
Example:
> r = complete(prompt="how many eyes do cats have, and how many eyes do spiders have?", output_filter_regex="\{ 'cats': [0-9]*, 'spiders': [0-9]* \}")
> print(r)
{ 'cats': 2, 'spiders': 8 }
Excuse my regex though, I'm a bit rusty.
Like https://lmql.ai/ but with easier languages to use
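One possible direction for the error-tolerant format, purely illustrative and not the current implementation: a line-oriented key/value layout where a single malformed pair can be skipped without losing the rest of the output.

```python
# Illustrative sketch of an "error tolerant" alternative to JSON: each pair lives on its
# own line, so one malformed line can be dropped without discarding the whole response.
def parse_pairs(text):
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue                      # skip lines that aren't key/value pairs
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key and value:
            result[key] = value
    return result

# parse_pairs("cats: 2\nspiders 8\nspiders: 8") -> {'cats': '2', 'spiders': '8'}
```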
Currently, fine-tuning doesn't persist models after they've been fine-tuned.
The current onsite LLM class uses full-parameter fine-tuning, which is costly. LoRA fine-tuning will require less memory and help prevent overfitting by freezing the pretrained weights.
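A minimal sketch of attaching LoRA adapters with peft (the model name and hyperparameters are placeholders, not the repo's actual choices):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")  # placeholder model name

# Only the small LoRA adapter matrices are trainable; the pretrained weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```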
File "quickstart_finetune.py", line 5, in <module>
from llm_vm.config import settings
File "C:\Users\Abhigya Sodani\Anaconda3\lib\site-packages\llm_vm\config.py", line 55, in <module>
if settings.big_model not in MODELS_AVAILABLE:
File "C:\Users\Abhigya Sodani\Anaconda3\lib\site-packages\dynaconf\base.py", line 144, in __getattr__
value = getattr(self._wrapped, name)
File "C:\Users\Abhigya Sodani\Anaconda3\lib\site-packages\dynaconf\base.py", line 309, in __getattribute__
return super().__getattribute__(name)
AttributeError: 'Settings' object has no attribute 'BIG_MODEL'
We need to replicate and fix these issues with dynaconf. I have this issue when I clone the repo and run pip run . .
Open bounty for working demo applications to add to a curated example gallery.
We should use the larger LLM to synthesize data for training the small LLM in the optimizing API.
List of open-source LLMs:
Name | Release date[a] | Developer | Number of parameters[b] | Corpus size | License[c] | Notes |
---|---|---|---|---|---|---|
BERT | 2018 | Google | 340 million[42] | 3.3 billion words[42] | Apache 2.0[43] | An early and influential language model,[2] but encoder-only and thus not built to be prompted or generative[44] |
XLNet | 2019 | Google | ~340 million[45] | 33 billion words | | An alternative to BERT; designed as encoder-only[46][47] |
GPT-2 | 2019 | OpenAI | 1.5 billion[48] | 40GB[49] (~10 billion tokens)[50] | MIT[51] | General-purpose model based on the transformer architecture |
GPT-3 | 2020 | OpenAI | 175 billion[25] | 300 billion tokens[50] | Public web API | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[52] |
GPT-Neo | March 2021 | EleutherAI | 2.7 billion[53] | 825 GiB[54] | MIT[55] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[55] |
GPT-J | June 2021 | EleutherAI | 6 billion[56] | 825 GiB[54] | Apache 2.0 | GPT-3-style language model |
Megatron-Turing NLG | October 2021[57] | Microsoft and Nvidia | 530 billion[58] | 338.6 billion tokens[58] | Restricted web access | Standard architecture but trained on a supercomputing cluster. |
Ernie 3.0 Titan | December 2021 | Baidu | 260 billion[59] | 4 TB | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
Claude[60] | December 2021 | Anthropic | 52 billion[61] | 400 billion tokens[61] | Closed beta | Fine-tuned for desirable behavior in conversations.[62] |
GLaM (Generalist Language Model) | December 2021 | Google | 1.2 trillion[63] | 1.6 trillion tokens[63] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
Gopher | December 2021 | DeepMind | 280 billion[64] | 300 billion tokens[65] | Proprietary | |
LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137 billion[66] | 1.56T words,[66] 168 billion tokens[65] | Proprietary | Specialized for response generation in conversations. |
GPT-NeoX | February 2022 | EleutherAI | 20 billion[67] | 825 GiB[54] | Apache 2.0 | Based on the Megatron architecture |
Chinchilla | March 2022 | DeepMind | 70 billion[68] | 1.4 trillion tokens[68][65] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. |
PaLM (Pathways Language Model) | April 2022 | Google | 540 billion[69] | 768 billion tokens[68] | Proprietary | Aimed to reach the practical limits of model scale |
OPT (Open Pretrained Transformer) | May 2022 | Meta | 175 billion[70] | 180 billion tokens[71] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron |
YaLM 100B | June 2022 | Yandex | 100 billion[72] | 1.7TB[72] | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
Minerva | June 2022 | Google | 540 billion[73] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[73] | Proprietary | LLM trained for solving "mathematical and scientific questions using step-by-step reasoning".[74] Minerva is based on the PaLM model, further trained on mathematical and scientific data. |
BLOOM | July 2022 | Large collaboration led by Hugging Face | 175 billion[75] | 350 billion tokens (1.6TB)[76] | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English excluding programming languages) |
Galactica | November 2022 | Meta | 120 billion | 106 billion tokens[77] | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
AlexaTM (Teacher Models) | November 2022 | Amazon | 20 billion[78] | 1.3 trillion[79] | Public web API[80] | Bidirectional sequence-to-sequence architecture |
LLaMA (Large Language Model Meta AI) | February 2023 | Meta | 65 billion[81] | 1.4 trillion[81] | Non-commercial research[e] | Trained on a large 20-language corpus to aim for better performance with fewer parameters.[81] Researchers from Stanford University trained a fine-tuned model based on LLaMA weights, called Alpaca.[82] |
GPT-4 | March 2023 | OpenAI | Exact number unknown, approximately 1 trillion[f] | Unknown | Public web API | Available for ChatGPT Plus users and used in several products. |
Cerebras-GPT | March 2023 | Cerebras | 13 billion[84] | | Apache 2.0 | Trained with the Chinchilla formula. |
Falcon | March 2023 | Technology Innovation Institute | 40 billion[85] | 1 trillion tokens (1TB)[85] | Apache 2.0[86] | The model is claimed to use only 75% of GPT-3's training compute, 40% of Chinchilla's, and 80% of PaLM-62B's. |
BloombergGPT | March 2023 | Bloomberg L.P. | 50 billion | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets[87] | Proprietary | LLM trained on financial data from proprietary sources, that "outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks" |
PanGu-Σ | March 2023 | Huawei | 1.085 trillion | 329 billion tokens[88] | Proprietary | |
OpenAssistant[89] | March 2023 | LAION | 17 billion | 1.5 trillion tokens | Apache 2.0 | Trained on crowdsourced open data |
PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340 billion[90] | 3.6 trillion tokens[90] | Proprietary | Used in the Bard chatbot.[91] |
Separate into a "prompt template" (a term GPT-3 recognizes) and variables. Then ask GPT-3 to change specific parameters for the template.
Prompt template: What is the [variable] of the [object]?
- variable-object pairs = (currency, Myanmar), (price, Bitcoin)
Now, for generation, we might specify differences for either the prompt template or the variable/object pairs, based on the following parameters (not a comprehensive list):
i. Style: Keep the meaning of the question the same but ask in different styles.
e.g. introduce spelling mistakes, code-mixing, cultural tone differences, etc.
ii. Semantic diversity: The task itself should change.
Add the above NLP metrics in code, set thresholds, and keep generating until the thresholds are satisfied.
Reference:
e.g. keep generating similar sentences with a cosine-similarity threshold of 0.5 and a diversity threshold of 0.4 until "n" sentences are reached or "x" attempts are exhausted.
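A sketch of that generate-until-thresholds loop, assuming sentence-transformers for the similarity metric; the model name, thresholds, and the generate_candidate callable are assumptions, not part of the repo:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def generate_until(seed, generate_candidate, n=10, max_attempts=100, sim_min=0.5, div_max=0.9):
    """Keep generating until n sentences are accepted or max_attempts are exhausted."""
    kept, kept_embs = [], []
    seed_emb = embedder.encode(seed, convert_to_tensor=True)
    for _ in range(max_attempts):
        if len(kept) >= n:
            break
        cand = generate_candidate(seed)                       # one LLM call per candidate
        cand_emb = embedder.encode(cand, convert_to_tensor=True)
        if util.cos_sim(seed_emb, cand_emb).item() < sim_min:
            continue                                          # drifted too far from the seed task
        if any(util.cos_sim(cand_emb, e).item() > div_max for e in kept_embs):
            continue                                          # too similar to something already kept
        kept.append(cand)
        kept_embs.append(cand_emb)
    return kept
```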
Tasks for end of June and through July 2023
Can clean up in-progress fine-tunings when Ctrl+C aborts the local version of LLM-VM (#23)
Can retry data synthesis when given malformed JSON results (#23)
Maybe later: consider a more robust encoding than JSON, where mismatched brackets from unquoted responses currently break parsing (#29)
The current main branch has no issues from mypy's perspective; would that be a helpful lint for contributors?
We need documentation for the entire project, for each sub-folder and part, and for each function in the code itself, plus a standard/programmatic way to display documentation.
quickstart_finetune.py currently demonstrates a successful finetuning of a local model.
While finetune is set to true in each completion call (lines 8, 14, 20), only the third call (line 20) results in fine-tuning and saving of the local model.
It is currently not obvious why we need 2 prior calls to completion before fine-tuning successfully starts.
If this is because we need prior examples, this should be detailed in the documentation and reflected in the feedback from the call itself in some way.
We should probably surface the right hooks for this, though hobbyists don't really need or care about it.
Create a Small_Local_Bert class capable of managing the BERT LLM, similar to the other model classes in the file.
We're starting to add lots of parameters for configuring LLM invocation for data synthesis and the agents; these need to be documented and sanity-checked.
Perhaps https://www.dynaconf.com/, plus XDG base-directory support (https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html).
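dynaconf also supports declarative validators, which could cover the "settings make sense" check mentioned above. A sketch with made-up setting names and allowed values (not the repo's actual config keys):

```python
from dynaconf import Dynaconf, Validator

# Sketch only: the setting names and allowed values here are illustrative.
settings = Dynaconf(
    settings_files=["settings.toml"],
    validators=[
        Validator("BIG_MODEL", must_exist=True, is_in=["chat_gpt", "gpt", "neo"]),
        Validator("PORT", is_type_of=int, gte=1, lte=65535),
    ],
)
settings.validators.validate()  # raises if any validator fails
```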
Check entry points for consistency
Rework the app directory structure into src/llm_vm and rename the __init__.py file.
All Hugging Face arguments need to work on the optimizer.complete endpoint as well. For the complete endpoint, this means that all arguments for the Hugging Face .generate function must be able to be passed to optimizer.complete too. Currently, this is not something I was able to do with the max_new_tokens parameter, for example.
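A rough sketch of the pass-through (method and attribute names are assumptions, not the repo's actual code): forward arbitrary generation kwargs from complete() down to the Hugging Face .generate() call.

```python
# Illustrative sketch: forward arbitrary Hugging Face generation kwargs through complete().
def complete(self, prompt, **generation_kwargs):
    inputs = self.tokenizer(prompt, return_tensors="pt")
    output_ids = self.model.generate(**inputs, **generation_kwargs)  # e.g. max_new_tokens=64
    return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```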
Only Small_Local_Neo successfully fine-tunes. This logic needs to be extended to the other models in the file.
Right now in optimizer.py on line 238
completion = self.call_small(prompt = dynamic_prompt.strip(), model=model, **kwargs)
we call the small model (accessing the generate function) with the model parameter. Right now this does not do anything; we need to be able to provide a .pt file to that parameter and then call load_model (also defined in onsite_llm.py) in order to load that .pt file as the model to use. This needs to be done in the finetune function as well.
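A rough sketch of what resolving that parameter could look like (the helper name and logic are illustrative, not the repo's load_model):

```python
import torch

# Illustrative sketch: if `model` is a path to a fine-tuned .pt checkpoint, load its
# weights into the small local model before generating; otherwise keep the default weights.
def resolve_small_model(default_model, model=None):
    if isinstance(model, str) and model.endswith(".pt"):
        state_dict = torch.load(model, map_location="cpu")
        default_model.load_state_dict(state_dict)
    return default_model
```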
The GPT3 and Chat_GPT model classes do not provide a way to store and load a fine-tuned model into the completion pipeline. Adjust class attributes and methods to allow fine-tuned models with c_ids to be accessed in the OpenAI call.
Each agent should have a research-quality README including test results, individualized usage, and explanations.
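A hypothetical sketch of the idea, using the pre-1.0 openai client style in use at the time (the class structure and model ids are illustrative, not the repo's actual GPT3 class):

```python
import openai  # pre-1.0 openai-python client style

# Illustrative sketch: once a fine-tune job finishes, store the returned model id on
# the class and pass it as the model in subsequent completion calls.
class FineTunableGPT3:
    def __init__(self, fine_tuned_model=None):
        self.fine_tuned_model = fine_tuned_model  # e.g. "davinci:ft-personal-2023-06-01-..."

    def complete(self, prompt, **kwargs):
        model = self.fine_tuned_model or "text-davinci-003"
        return openai.Completion.create(model=model, prompt=prompt, **kwargs)
```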
It's as simple as changing
print(s)
to
print(s, file=sys.stderr)
At least, I think that is a pretty reasonable change and good engineering to do.
Outputs should be documented in the README of each agent.
By setting up python-dotenv, the environment variables can be loaded from a .env file in the project root. I would like to work on this.
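A minimal sketch of what this looks like with python-dotenv (the variable name below is illustrative):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root / current working directory
openai_key = os.getenv("OPENAI_API_KEY")  # variable name is illustrative
```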
Add example tests:
Please brainstorm 2 other possible uses!
We need to control how many examples are generated and how dissimilar/similar we want them to be, confirming that we comply with the desired amount of similarity between generated data points.
System requirements list RAM, but I'd expect an LLM client to be using VRAM. Can it use either? How do I configure the client to use VRAM if I have a GPU for example?