Topic: llm-evaluation Goto Github
Something interesting about llm-evaluation
llm-evaluation,Upload, score, and visually compare multiple LLM-graded summaries simultaneously!
User: adamcoscia
Home Page: https://arxiv.org/abs/2403.04760
llm-evaluation,The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.
Organization: agenta-ai
Home Page: http://www.agenta.ai
llm-evaluation,Template for an AI application that extracts job information from a job description using OpenAI functions and LangChain
Organization: agenta-ai
Home Page: https://agenta.ai
llm-evaluation,Evaluating LLMs with CommonGen-Lite
Organization: allenai
Home Page: https://inklab.usc.edu/CommonGen/
llm-evaluation,A collection of hands-on notebooks for LLM practitioners
User: antoniogr7
llm-evaluation,FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.
User: armingh2000
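FactScore-style metrics decompose a generated text into atomic facts and report the fraction supported by a source. The following is a minimal, library-free sketch of that final aggregation step only, not FactScoreLite's actual implementation; in practice the per-fact support judgments come from an LLM or retrieval check, whereas here they are supplied directly:

```python
def factscore(fact_judgments):
    """Fraction of atomic facts judged as supported.

    fact_judgments: list of booleans, one per atomic fact extracted
    from a generated text (True = supported by the source).
    """
    if not fact_judgments:
        return 0.0
    return sum(fact_judgments) / len(fact_judgments)

# Example: 3 of the 4 extracted facts are supported by the source.
score = factscore([True, True, False, True])  # 0.75
```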
llm-evaluation,Python SDK for running evaluations on LLM generated responses
Organization: athina-ai
Home Page: https://docs.athina.ai
llm-evaluation,FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts
Organization: aws-samples
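A common way such leaderboards are built is to aggregate pairwise judgments (e.g. an LLM judge picking the better of two model outputs per task) into per-model win rates. A minimal sketch of that aggregation, with hypothetical model names, not the repository's actual code:

```python
from collections import defaultdict

def leaderboard(pairwise_results):
    """Rank models by win rate over pairwise comparisons.

    pairwise_results: list of (winner, loser) model-name tuples,
    e.g. produced by a judge comparing two outputs for each task.
    """
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in pairwise_results:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Sort models by fraction of comparisons won, best first.
    return sorted(
        ((m, wins[m] / games[m]) for m in games),
        key=lambda entry: entry[1],
        reverse=True,
    )

results = [("gpt", "llama"), ("gpt", "mistral"), ("mistral", "llama")]
board = leaderboard(results)  # gpt 1.0, mistral 0.5, llama 0.0
```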
llm-evaluation,A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) 2024.
User: azminewasi
llm-evaluation,Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Organization: babelscape
Home Page: https://arxiv.org/abs/2404.08676
llm-evaluation,Cookbooks and tutorials on Literal AI
Organization: chainlit
Home Page: https://cloud.getliteral.ai/
llm-evaluation,The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"
User: chanliang
Home Page: https://arxiv.org/abs/2310.07289
llm-evaluation,The LLM Evaluation Framework
Organization: confident-ai
Home Page: https://docs.confident-ai.com/
llm-evaluation,For familiarization and learning. Uses the LangChain framework, LangSmith for tracing, OpenAI LLM models, and a Pinecone serverless vector DB, built with Jupyter Notebook and Python.
User: davidgir
llm-evaluation,Visualize LLM Evaluations for OpenAI Assistants
User: euskoog
Home Page: https://openai-assistants-evals-dash.vercel.app/
llm-evaluation,Link your OpenAI Assistants to a custom store + Evaluate Assistant responses
User: euskoog
llm-evaluation,Large Model Evaluation Experiments
Organization: evaluation-tools
llm-evaluation,Exploring the depths of LLMs 🚀
User: giacomomeloni
llm-evaluation,🐢 Open-Source Evaluation & Testing framework for LLMs and ML models
Organization: giskard-ai
Home Page: https://docs.giskard.ai
llm-evaluation,Awesome papers involving LLMs in Social Science.
User: henry-yeh
llm-evaluation,DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
Organization: intuit-ai-research
llm-evaluation,[Personalize@EACL 2024] LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models.
User: ivarfresh
llm-evaluation,A prompt collection for testing and evaluation of LLMs.
User: kwinkunks
llm-evaluation,A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
Organization: llm-evaluation-s-always-fatiguing
llm-evaluation,Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Organization: minnesotanlp
Home Page: https://minnesotanlp.github.io/cobbler-project-page/
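One well-known bias in LLMs used as evaluators is order (position) bias in pairwise judging. A standard mitigation, sketched below with a hypothetical `judge` callable rather than the paper's own code, is to query the judge twice with the candidates swapped and only accept a verdict when both orderings agree:

```python
def debiased_winner(judge, a, b):
    """Ask the judge twice with positions swapped; return the
    consistent winner, or None if the two verdicts disagree
    (a symptom of position bias).

    judge(x, y) returns "first" or "second" for the pair (x, y).
    """
    first_pass = a if judge(a, b) == "first" else b
    second_pass = b if judge(b, a) == "first" else a
    return first_pass if first_pass == second_pass else None

# A toy judge that always prefers whichever answer it sees first:
biased = lambda x, y: "first"
verdict = debiased_winner(biased, "ans_a", "ans_b")  # disagrees -> None

# A toy judge with a consistent (content-based) preference:
fair = lambda x, y: "first" if x < y else "second"
agreed = debiased_winner(fair, "ans_a", "ans_b")  # "ans_a" both times
```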
llm-evaluation,Code for the paper Prediction-Powered Ranking of Large Language Models, Arxiv 2024.
Organization: networks-learning
llm-evaluation,Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Organization: parea-ai
Home Page: https://docs.parea.ai/sdk/python
llm-evaluation,TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Organization: parea-ai
Home Page: https://docs.parea.ai/sdk/typescript
llm-evaluation,Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Organization: promptfoo
Home Page: https://www.promptfoo.dev/
llm-evaluation,Framework for LLM evaluation, guardrails and security
Organization: raga-ai-hub
Home Page: https://www.raga.ai/llms
llm-evaluation,A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Organization: re-align
Home Page: https://allenai.github.io/re-align/
llm-evaluation,Open-Source Evaluation for GenAI Application Pipelines
Organization: relari-ai
Home Page: https://docs.relari.ai/
llm-evaluation,This repository contains the lab work for Coursera course on "Generative AI with Large Language Models".
User: rochitasundar
Home Page: https://www.coursera.org/account/accomplishments/certificate/8JAYVEUAQF56
llm-evaluation,Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
Organization: rungalileo
Home Page: https://www.rungalileo.io/hallucinationindex
llm-evaluation,
User: sharathhebbar
llm-evaluation,EnsembleX uses the knapsack algorithm to optimize Large Language Model (LLM) ensembles for quality-cost trade-offs, offering tailored suggestions across domains through a Streamlit dashboard.
User: vidhyavarshanyjs
Home Page: https://ensemblex.streamlit.app
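The quality-cost trade-off described above maps naturally onto the 0/1 knapsack problem: pick the subset of models that maximizes total quality without exceeding a cost budget. A self-contained sketch of that formulation (the model names, scores, and integer costs are illustrative, not EnsembleX's data or code):

```python
def best_ensemble(models, budget):
    """0/1 knapsack over (name, quality, cost) tuples: select the
    subset maximizing total quality within a cost budget.
    Costs are assumed to be non-negative integers (e.g. cents).
    """
    # dp[c] = (best_quality, chosen_names) achievable at cost <= c
    dp = [(0.0, [])] * (budget + 1)
    for name, quality, cost in models:
        new_dp = dp[:]
        for c in range(cost, budget + 1):
            q, names = dp[c - cost]  # built on pre-item table: 0/1, no reuse
            if q + quality > new_dp[c][0]:
                new_dp[c] = (q + quality, names + [name])
        dp = new_dp
    return dp[budget]

models = [("small", 0.6, 1), ("medium", 0.75, 3), ("large", 0.9, 6)]
quality, picked = best_ensemble(models, budget=4)
# "large" alone exceeds the budget; "small" + "medium" fit and win.
```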
llm-evaluation,The Calibration Game helps you get better at identifying hallucinations in LLMs.
User: viktour19
Home Page: https://calibrationgame.vercel.app
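Calibration of this kind is commonly quantified with the Brier score: the mean squared gap between stated confidence and the 0/1 outcome, where lower is better and overconfident misses are penalized hardest. A minimal sketch (not the game's actual scoring code):

```python
def brier_score(predictions):
    """Mean squared error between confidence and outcome.

    predictions: list of (confidence, outcome) pairs, where
    confidence is in [0, 1] and outcome is 1 (you were right)
    or 0 (you were wrong).
    """
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

# 0.9 confidence on a miss costs 0.81; a cautious 0.2 miss costs 0.04.
score = brier_score([(0.9, 1), (0.9, 0), (0.2, 0)])
```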
llm-evaluation,Superpipe - optimized LLM pipelines for structured data
Organization: villagecomputing
Home Page: https://superpipe.ai
llm-evaluation,[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing llms: The truth is rarely pure and never simple.
Organization: vita-group
Home Page: https://arxiv.org/abs/2310.01382
llm-evaluation,Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Organization: yandex-research
Home Page: https://arxiv.org/abs/2401.06766