awesome-cheap-llms

💛 Costs of RAG based applications
💙 Follow Joanna on LinkedIn · Follow Magdalena on LinkedIn
🤍 Sign up to DataTalksClub LLM Zoomcamp
⭐ Give this repository a star to support the initiative!



👉 Let’s make sure that your LLM application doesn’t burn a hole in your pocket.
👉 Let’s instead make sure your LLM application generates a positive ROI for you, your company and your users.
👉 A nice side effect of choosing cheaper models over expensive models: the response time is shorter!

Techniques to reduce costs


1) 📘 Choose model family and type

Selecting a suitable model or combination of models based on factors such as specialty, size and benchmark results builds the foundation for a cost-sensible LLM application. The aim is to choose a model that fits the complexity of the task: just as you wouldn't drive a BMW 8 Series M to the grocery store, you don't need a high-end LLM for simple tasks.
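One way to make this choice explicit is a small task-to-model catalogue. A minimal sketch in Python, where the model names and prices are purely illustrative assumptions, not real offerings:

```python
# Hypothetical model catalogue; names and per-token prices are illustrative only.
MODEL_CATALOGUE = {
    "classification": {"model": "small-instruct-7b", "usd_per_1k_tokens": 0.0002},
    "summarization": {"model": "mid-instruct-13b", "usd_per_1k_tokens": 0.0010},
    "complex_reasoning": {"model": "frontier-model", "usd_per_1k_tokens": 0.0150},
}

def pick_model(task_type):
    # Default to the cheapest model for unknown task types; escalate manually.
    entry = MODEL_CATALOGUE.get(task_type, MODEL_CATALOGUE["classification"])
    return entry["model"]
```

Keeping the mapping in one place also makes it easy to review which tasks are (over)paying for a premium model.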

Papers

Tools & frameworks

Blog posts & courses

2) 📘 Reduce model size

After choosing a suitable model family, consider models with fewer parameters as well as techniques that reduce model size:

  • Model parameter size (i.e. 7B, 13B ... 175B)
  • Quantization (= reducing the precision of the model's parameters)
  • Pruning (= removing unnecessary weights, neurons, channels or layers)
  • Knowledge Distillation (= training a smaller model that mimics a larger model)
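To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric int8 quantization of a weight vector. Real frameworks quantize whole tensors with calibrated scales; this toy version only illustrates the precision/size trade-off:

```python
def quantize_int8(weights):
    # Symmetric int8 quantization: map floats into [-127, 127] using one scale.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float weights from the int8 representation.
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now needs 1 byte instead of 4 (float32), at the cost of a small reconstruction error bounded by the scale.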

Papers

Tools & frameworks

Blog posts & courses

3) 📘 Use open source models

Consider self-hosting open source models instead of using proprietary models if you have capable developers in house. Still, keep the Total Cost of Ownership in view when benchmarking managed LLMs against setting up everything on your own.
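A rough break-even calculation helps frame that decision. The sketch below compares a self-hosted GPU bill against a managed per-token price; the numbers are placeholders, and real TCO also includes engineering time, storage and egress:

```python
def breakeven_tokens_per_month(gpu_usd_per_hour, hours_per_month,
                               managed_usd_per_1k_tokens):
    # Monthly token volume at which self-hosting costs the same as the
    # managed API (illustrative; ignores staffing and operational overhead).
    self_hosted_monthly = gpu_usd_per_hour * hours_per_month
    return self_hosted_monthly / managed_usd_per_1k_tokens * 1000

# Example: a $2/h GPU running 720 h/month vs. $0.002 per 1K managed tokens.
tokens = breakeven_tokens_per_month(2.0, 720, 0.002)
```

Below the break-even volume the managed API is usually cheaper; above it, self-hosting starts to pay off, provided you can staff it.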

Papers

  • 🗣️ call-for-contributions 🗣️

Tools & frameworks

Blog posts & courses

4) 📘 Reduce input/output tokens

A key cost driver is the number of input tokens (user prompt + context) and output tokens that you allow for your LLM. Various techniques to reduce the token count help to save costs.
Input tokens:

  • Chunking of input documents
  • Compression of input tokens
  • Summarization of input tokens
  • Test viability of zero-shot prompting before adding few-shot examples
  • Experiment with simple, concise prompts before adding verbose explanations and details

Output tokens:

  • Prompting to instruct the LLM how many output tokens are desired
  • Prompting to instruct the LLM to be concise in the answer, adding no explanatory text to the expected answer
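On the input side, a simple token budget for retrieved context already caps costs. A minimal sketch, using the common rough heuristic of ~4 characters per English token (a real system would use the model's tokenizer):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def trim_context(chunks, budget_tokens):
    # Greedily keep the highest-ranked chunks that still fit the token budget.
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

This assumes chunks arrive sorted by relevance, so the budget is spent on the most useful context first.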

Papers

  • 🗣️ call-for-contributions 🗣️

Tools & frameworks

  • LLMLingua by Microsoft to compress input prompts
  • 🗣️ call-for-contributions 🗣️

Blog posts & courses

5) 📘 Prompt and model routing

Send your incoming user prompts to a model router (= Python logic + a small language model) that automatically chooses a suitable model for actually answering the question. Follow the Least-Model Principle: by default, use the simplest possible logic or LM to answer a user's question, and only route to more complex LMs if necessary (aka "LLM cascading").
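A minimal cascading router can be sketched as follows. The `cheap_llm` and `premium_llm` callables are hypothetical stand-ins for your actual model clients, and the confidence signal would in practice come from a classifier or the model itself:

```python
def route_prompt(prompt, cheap_llm, premium_llm, min_confidence=0.7):
    # LLM cascading: try the cheap model first, escalate only when its
    # self-reported confidence falls below the threshold.
    answer, confidence = cheap_llm(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = premium_llm(prompt)
    return answer, "premium"
```

Logging which tier answered each prompt also tells you how often you actually need the premium model.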

Tools & frameworks

Blog posts & courses

6) 📘 Caching

If your users tend to send semantically similar or repetitive prompts to your LLM system, you can reduce costs with different caching techniques. The key lies in a caching strategy that does not only look for exact matches but also detects semantic overlap, so you achieve a decent cache hit ratio.
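A toy semantic cache can be sketched with a bag-of-words embedding and cosine similarity; a production system would use a proper embedding model and a vector store instead:

```python
import math

def embed(text):
    # Toy bag-of-words embedding; a real cache would use an embedding model.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []  # list of (embedding, cached_response)
        self.threshold = threshold

    def get(self, prompt):
        # Return a cached response if any stored prompt is similar enough.
        query = embed(prompt)
        for emb, response in self.entries:
            if cosine(query, emb) >= self.threshold:
                return response
        return None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))
```

The threshold is the main tuning knob: too low and users get stale or wrong answers, too high and the cache rarely hits.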

  • 🗣️ call-for-contributions 🗣️

Tools & frameworks

Blog posts & courses

  • 🗣️ call-for-contributions 🗣️

7) 📘 Rate limiting

Make sure a single customer cannot hammer your LLM and skyrocket your bill. Track the number of prompts per user per month and either enforce a hard limit on prompts or throttle responses once a user hits the limit. In addition, detect unnatural or sudden spikes in user requests: similar to DDoS attacks, users or competitors can harm your business by flooding your model with requests.
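As the tools list below notes, simple rate limiting fits in native Python. A minimal sliding-window sketch (per-user, in-memory; a multi-instance deployment would need shared state, e.g. Redis):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    # Sliding-window limiter: at most max_requests per window_seconds, per user.
    def __init__(self, max_requests=100, window_seconds=3600):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        timestamps = self.history[user_id]
        # Drop requests that have fallen out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # over the limit: reject or throttle
        timestamps.append(now)
        return True
```

Call `allow(user_id)` before every LLM request and serve an error (or a cheaper fallback) when it returns `False`.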

Tools & frameworks

  • Simple tracking and rate limiting logic can be implemented in native Python
  • 🗣️ call-for-contributions 🗣️

Blog posts & courses

8) 📘 Cost tracking

"You can't improve what you don't measure": make sure you know where your costs are coming from. Is it a handful of super active users? Is it a premium model? Etc.
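A minimal cost attribution sketch in native Python; the model names and per-token prices are placeholder assumptions, so plug in your provider's actual price sheet:

```python
from collections import defaultdict

# Hypothetical (input, output) USD prices per 1K tokens; illustrative only.
PRICES = {"small-model": (0.0005, 0.0015), "premium-model": (0.01, 0.03)}

class CostTracker:
    def __init__(self):
        self.costs = defaultdict(float)  # (user_id, model) -> accumulated USD

    def record(self, user_id, model, input_tokens, output_tokens):
        # Attribute the cost of one request to a user/model pair.
        in_price, out_price = PRICES[model]
        cost = (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
        self.costs[(user_id, model)] += cost
        return cost

    def top_spenders(self, n=3):
        # Answer "where are my costs coming from?" with the biggest line items.
        return sorted(self.costs.items(), key=lambda kv: -kv[1])[:n]
```

Aggregating by user and model makes it immediately visible whether the bill is driven by super active users or by premium-model usage.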

Tools & frameworks

  • Simple tracking and cost attribution logic can be implemented in native Python
  • 🗣️ call-for-contributions 🗣️

Blog posts & courses

  • 🗣️ call-for-contributions 🗣️

9) 📘 During development time

  • Make sure not to send endless API calls to your LLM during development and manual testing.
  • Make sure not to trigger automated API calls to your LLM via CI/CD workflows, integration tests, etc.
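One simple guard is to stub the LLM client whenever no real client is wired up or a CI environment is detected. A hedged sketch (the function and environment check are illustrative, not a specific library's API):

```python
import os

def make_llm_call(prompt, real_client=None):
    # Guard against accidental paid API calls in development and CI:
    # only call the real client when one is explicitly provided outside CI.
    if os.environ.get("CI") == "true" or real_client is None:
        return "[stubbed response]"
    return real_client(prompt)
```

Integration tests can then run against the stub by default, and only a deliberately configured job talks to the paid API.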

Contributions welcome

  • We’re happy to review and accept your Pull Request on LLM cost reduction techniques and tools.
  • We plan to divide the content into subpages to further structure all chapters.

