
mlabonne / llm-datasets



High-quality datasets, tools, and concepts for LLM fine-tuning.

llm-datasets's Introduction

🐦 Follow me on X • 🤗 Hugging Face • 💻 Blog • 📙 Hands-on GNN

Hi, I'm a Machine Learning Scientist, Author, Blogger, and LLM Developer.

💼 Projects

The LLM Course: A popular curated list of resources to get into LLMs (>29k ⭐).
Hands-on GNNs: My book about graph neural networks published by Packt (all the code is open source).
LLM Datasets: Curated list of high-quality datasets for LLM fine-tuning.
LLM Tools: Automate LLM pipelines with Colab notebooks like LLM AutoEval, LazyMergekit, LazyAxolotl, and AutoQuant.

🤗 Models

AlphaMonarch-7B: Top performer in terms of reasoning + conversational abilities on a variety of benchmarks. [Demo]
NeuralBeagle14-7B: The most powerful 7B model (rank 10 on the entire Open LLM Leaderboard). [Demo]
Phixtral: Novel Mixture of Experts architecture with phi-2 models. [Demo]
Beyonder-4x7B-v3: Mixture of Experts with four excellent fine-tuned Mistral-7b models. [Demo]
NeuralMarcoro14: My previous best 7B model (rank 1 among 7B-parameter models on the Open LLM Leaderboard). [Demo]
NeuralHermes: A DPO fine-tuned version of OpenHermes (extremely cost-efficient). [Demo]


llm-datasets's Issues

How to create an instruction dataset from .pdf and .docx documents

Hello, I'm fine-tuning a Large Language Model (LLM) for an NGO, and I need to build an instruction dataset from the text contained in .pdf and .docx documents.

The objective is to extract instructions from these documents and organize them into a structured dataset suitable for fine-tuning the LLM. This involves parsing .pdf and .docx files, extracting relevant text segments, and annotating them.
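To make the target concrete, here is a minimal sketch of what the "structured dataset" could look like: text is split into paragraph-based segments and each segment becomes one JSON Lines row. The field names (`instruction`, `input`, `output`, `source`) follow a common Alpaca-style convention and are an assumption, not something this repository prescribes; the helper names and the `max_chars` limit are hypothetical.

```python
import json
import re


def split_into_segments(text, max_chars=1000):
    """Group paragraphs into segments of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    segments, current = [], ""
    for para in paragraphs:
        # Close the current segment if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            segments.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        segments.append(current)
    return segments


def make_record(segment, source, instruction="", output=""):
    """Build one dataset row (assumed schema); instruction and output
    are left empty here and filled in during annotation."""
    return {
        "instruction": instruction,
        "input": segment,
        "output": output,
        "source": source,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Keeping the `source` field makes it possible to trace every row back to the document it came from, which helps when reviewing annotations.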

I'm seeking guidance and recommendations from the community on how to efficiently create this dataset. Specifically, I'm interested in:

Techniques and libraries for parsing .pdf and .docx documents in Python.
Strategies for extracting instructional content from the parsed documents while maintaining context and fidelity.
Approaches for annotating the extracted text segments as instructional content, including identifying key actions, steps, and contextual information.
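
As a rough illustration of the first and third points, the sketch below reads paragraphs from a .docx file using only the Python standard library (a .docx file is a ZIP archive containing `word/document.xml`) and flags paragraphs that look like steps with a simple regular expression. The function names and the pattern are hypothetical, and the heuristic is an assumption: it only sees numbering typed into the text, not numbers generated by Word's automatic lists. PDFs are not covered here; they would need a third-party library such as pypdf, and python-docx is a more convenient alternative for .docx files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Heuristic (assumption): a step starts with "1.", "2)", a bullet, or "Step 3:".
STEP_PATTERN = re.compile(
    r"^\s*(?:\d+[.)]|[-*•]|step\s+\d+[:.]?)\s+", re.IGNORECASE
)


def docx_paragraphs(path_or_file):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def looks_like_step(paragraph):
    """Return True if the paragraph starts like a numbered or bulleted step."""
    return bool(STEP_PATTERN.match(paragraph))
```

Paragraphs flagged this way are only candidates; a manual review pass is still needed before they are treated as instructional content.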

Any advice, best practices, or resources you can provide to assist in this endeavor would be greatly appreciated. Thank you for your support!
