
community's Introduction

Welcome to the InstructLab Community repository🔬

The mission of the InstructLab (Large-scale Alignment for chatBots) project is to leverage innovative techniques that overcome challenges in Large Language Model (LLM) training. InstructLab uses a taxonomy-based curation process, along with synthetic data generation, that allows the open source community to submit contributions to existing LLMs in an accessible way.

InstructLab is made up of several projects that are defined as codebases and services with different release cycles. Collectively, these enable large-model development. This repository shares InstructLab's activity and collaboration details across the community and includes the most current information about the project. Related repositories include the following:

Contributions that add new features, resolve bugs and issues, or refine the documentation experience are welcome as pull requests. More information about contributing to the InstructLab Project, contributor roles, governance and legal, and licenses can be found in the following sections of this document.

Community Goals

The goals of this open source community include the following:

  • Drive adoption of the InstructLab tooling and model API standard.
  • Grow an ecosystem of contribution-driven open models.
  • Establish deployable patterns, practices, and evidence for sophisticated use cases.

Project Communication Channels

There are many ways to engage with InstructLab project maintainers and community members outside of GitHub. You can find all of these, including timing for our community meetings and office hours, on our Collaboration page.

Getting Started with the InstructLab Project workstreams🥼

InstructLab (Large-scale Alignment for chatBots) is an open source initiative by Red Hat and IBM. It provides a platform for easy engagement with Large Language Models (LLMs) through the ilab command-line interface (CLI) tool. Users can augment an LLM's capabilities by creating a pull request that submits the skills and knowledge they have tested to the project’s taxonomy repository on GitHub.

The following documentation gives an overview of the workflow and the resources needed to get started with InstructLab.

💻 InstructLab (ilab) Workflow

Installing and interacting with the ilab CLI tool

The ilab tool allows you to interact with IBM's Merlinite or Granite AI models, contribute your own information, and train the model locally.

Note: Before proceeding, it might be beneficial to check out the Contributing guide for an overview of contributing practices and expectations. Additionally, you should consider joining the InstructLab community Slack channel.

  1. Navigate to the ilab CLI repository and follow the instructions in the README.md. The README.md instructs you on how to perform the following:

    a. In the Getting started section of the README.md file, you can install the ilab tool, set up your local environment, and download the IBM Merlinite-7b (default) AI model. If you run into any issues, you can find many solutions in the CLI repository's discussion board.

    b. You can then create your own data sets to feed into and train the model. In the taxonomy project, there are two types of data you can serve to the model: skills and knowledge. There are a few different types of skills and knowledge you can create. For more detailed information on the types, see the Taxonomy README.md.

    c. In your local taxonomy repository, generated after the Initialize ilab step, navigate to the path that you want to add information to. You can see a flow chart of the paths in this file taxonomy_diagram. Create a qna.yaml file in that path with your contributions.

    d. Serve and train the model with your contributions to see if the model can answer questions more accurately.

    e. Congratulations! You trained an AI model locally!
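
As a rough illustration of step c, the sketch below renders a small qna.yaml file and writes it under a path in a local taxonomy checkout. The field names (`created_by`, `task_description`, `seed_examples`, `question`, `answer`) and the target directory are assumptions made for this example, not taken from this document; the Taxonomy README.md is the authoritative source for the real schema. The helpers `render_qna` and `write_qna` are hypothetical and are not part of the ilab tool.

```python
import json
import tempfile
from pathlib import Path


def render_qna(created_by, task_description, seed_examples):
    """Render a small qna.yaml document as text.

    The field names used here are assumptions for illustration only.
    json.dumps quotes each string; a JSON string is also a valid YAML
    double-quoted scalar, so the result remains parseable YAML.
    """
    if not seed_examples:
        raise ValueError("at least one seed example is required")
    lines = [
        f"created_by: {json.dumps(created_by)}",
        f"task_description: {json.dumps(task_description)}",
        "seed_examples:",
    ]
    for example in seed_examples:
        lines.append(f"  - question: {json.dumps(example['question'])}")
        lines.append(f"    answer: {json.dumps(example['answer'])}")
    return "\n".join(lines) + "\n"


def write_qna(taxonomy_root, relative_dir, text):
    """Write the rendered text to <taxonomy_root>/<relative_dir>/qna.yaml."""
    target_dir = Path(taxonomy_root) / relative_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "qna.yaml"
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the local taxonomy repository,
    # and "skills/example_topic" is a hypothetical path within it.
    with tempfile.TemporaryDirectory() as fake_taxonomy:
        content = render_qna(
            "your-github-username",
            "Answer simple arithmetic questions.",
            [{"question": "What is 2 + 2?", "answer": "4"}],
        )
        print(write_qna(fake_taxonomy, "skills/example_topic", content))
        print(content)
```

Writing the file by hand in a text editor works just as well; the point of the sketch is only to show the general shape of a contribution file and where it sits relative to the taxonomy root.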

Opening a pull request in the taxonomy repository with your new skills or knowledge

If your contributions improved the model locally, you can contribute your files to the main AI model through the taxonomy repository. For more information see CONTRIBUTING.md in the taxonomy repository.

  1. To contribute your knowledge and skills to the taxonomy repository, follow the documentation in Contribute knowledge and skills to the taxonomy.

    IMPORTANT: Ensure that your files and contributions follow the proper YAML format. For examples, see the Skills: YAML format file.

Getting reviews on pull requests

There are teams of contributors from Red Hat and IBM that will review your pull request and determine if it can be merged in the taxonomy repository. For more information, see the Triaging contributions documentation.

See your contributions impact an AI model

The Merlinite-7b and Granite-7b models are built regularly. Sometime after your pull request is merged, Merlinite is updated and you can see locally that the model improved with the skill or knowledge you taught it.

Contribution

Help on open source projects is always welcome and there is always something that can be improved. For example, documentation (like the text you are reading now) can always use improvement, code can always be clarified, variables or functions can always be renamed or commented on, and there is always a need for more test coverage. If you see something that you think should be fixed, take ownership!

To contribute code or documentation, please submit a pull request to the relevant repository. Note that each repository has its own set of requirements and expectations, and contributors should familiarize themselves with those expectations before contributing.

Contributor roles

The project welcomes new contributors. Not all contributors are able to provide sustained contributions, but they are always welcome. The contributor roles document outlines the various roles that support contributors and help them take on more responsibility in the various InstructLab projects. These roles are subject to change, and new roles will be added as necessary.

Maintainers

Project Maintainers are first and foremost contributors that have shown they are committed to the long term success of a project. Maintainership is about building trust with the community and being a person that everyone can depend on to make consistent decisions in the best interest of the project. With enough time and experience, contributors can apply to become Maintainers. The current list of Maintainers can be found in the Maintainers file.

Governance & Legal

  • InstructLab Community Governance

  • InstructLab Code of Conduct

  • You must agree to the terms of the Developer Certificate of Origin (DCO) by signing off your commits in your pull requests. The Developer Certificate of Origin (DCO) is a lightweight way for contributors to certify that they wrote or otherwise have the right to submit the code they are contributing to the project. Here is the full text of the DCO, reformatted for readability:

    By making a contribution to this project, I certify that:

    a. The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

    b. The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

    c. The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

    d. I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

Contributors certify that they adhere to these requirements by adding a Signed-off-by line to their commit messages. For more information about how the DCO works with this project, see Developer Certificate of Origin (DCO).
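
Git can add the Signed-off-by line for you when you commit with `git commit -s` (the `--signoff` option). As a minimal sketch of what a sign-off check might look for, the Python below scans a commit message for lines of the form `Signed-off-by: Name <email>`. The helper names `find_signoffs` and `has_signoff` are hypothetical, and the pattern is an approximation for illustration; it is not the check that the project's DCO tooling actually performs.

```python
import re

# Matches a line such as: Signed-off-by: Jane Doe <jane@example.com>
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>[^<>\n]+?) <(?P<email>[^<>\s]+@[^<>\s]+)>\s*$",
    re.MULTILINE,
)


def find_signoffs(message):
    """Return (name, email) pairs for every sign-off line in a commit message."""
    return [(m.group("name"), m.group("email")) for m in SIGNOFF_RE.finditer(message)]


def has_signoff(message):
    """Return True if the commit message contains at least one sign-off line."""
    return bool(find_signoffs(message))


if __name__ == "__main__":
    example = "docs: fix typo in README\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
    print(has_signoff(example))
    print(find_signoffs(example))
```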

Licenses

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0

For the full license text, see LICENSE.
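
Because the SPDX identifier is a fixed, machine-readable string, its presence in files can be checked mechanically. The sketch below walks a directory and lists files whose first few lines do not contain the identifier. The function name `files_missing_spdx`, the set of file suffixes, and the ten-line window are all assumptions chosen for illustration; this document does not prescribe which files must carry the header.

```python
from pathlib import Path

SPDX_TAG = "SPDX-License-Identifier: Apache-2.0"


def files_missing_spdx(root, suffixes=(".py", ".yaml", ".md"), head_lines=10):
    """Return paths (relative to root) of files lacking the SPDX tag near the top.

    Only files whose suffix is in `suffixes` are inspected, and only their
    first `head_lines` lines are read.
    """
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            # zip stops after head_lines lines without reading the whole file.
            head = [line for _, line in zip(range(head_lines), handle)]
        if not any(SPDX_TAG in line for line in head):
            missing.append(path.relative_to(root))
    return missing


if __name__ == "__main__":
    # Scan the current directory and report files without the identifier.
    for relative_path in files_missing_spdx("."):
        print(relative_path)
```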


community's Issues

Code of Conduct - Link from all repos to one CoC

The CoC should live in the .github repo so that it will be automagically added to all repos if necessary.

Each repo should have its own CoC file (I believe and have had this argument many times 😄 ) but they should all point to the single source of truth in the .github repo

Tutorial/Hello World example

@mairin said that there is a video of a RHer walking through soup to nuts on this whole process. And she said that there is an army of technical writers who can help go through docs.

Get GitHub details in place for website (GitHub Pages) and custom domain to work

I messaged Kate about doing the DNS mapping when the domain name is purchased.

Tasks:

  • Register domain name (happening elsewhere)
  • Point domain name DNS to github IP addresses
  • Add domain name CNAME record pointing to github org dot io
  • Add CNAME file to github repo
  • Update settings in github repo to serve the site

Update legal section in contributing docs

lhawthorn 4 hours ago
We will need to update this section to state that we expect contributors to submit a DCO with contributions.

We also need to update this section with expectations on how contributors must annotate submissions which include content, which is still under discussion in instructlab/taxonomy#182

Proceeding under the assumption that these areas are called out in the pull request template - which they should be - linking to the PR template would also be good to do in this section.

cc @lhawthorn

FAQ

Steven Smith (RH) is working on this

Create skills contribution documentation

There is a little bit of work done on the skills contribution guidance at the very bottom of the taxonomy README. That should be moved to a CONTRIBUTING.md file and expanded upon. There is very little there now, so there is a lot of work to do on this task. @-kate.soule has a document with learnings from the skills 100 process. That would likely be a good start. Let's get something together and published so folks can add to it as we learn more.
https://github.com/open-labrador/taxonomy/tree/main?tab=readme-ov-file#ways-to-contribute

Scope of license declaration?

The taxonomy repository links this community repository for "general practices for the InstructLab community". This repository says:

Each source file must include a license header for the Apache Software License 2.0. Using the SPDX format is the simplest approach. e.g.

/*
Copyright <holder> All Rights Reserved.

SPDX-License-Identifier: Apache-2.0
*/

But there are currently no such declarations in the taxonomy repository:

$ git --no-pager log --oneline -1
2f189ae (HEAD -> main, origin/pr/39, origin/main, origin/HEAD) doc: add skills triage (#40)
$ git --no-pager grep -l Apache
LICENSE

Should there be? As I understand it, part of the push here is to shift models/training/data into the "things covered by open source license/community" bucket, and having skills YAML and such explicitly licensed as source might help make that case. But skills YAML are expected to be short, so maybe legal doesn't think they're copyrightable at all? I dunno. Just floating this because I don't understand where the licensed-source vs. random-auxiliary-content boundary is, and I'm hoping to have it crisped up by folks who have spent longer thinking about it :)

Update Org level read me

I created an org README pulling together some info but that was a week ago and this needs to be fixed up before launch

DEFINE SECURITY POLICY

Required

  • turn on private vulnerability reporting within GitHub
  • update community/SECURITY.md to point to private vulnerability reporting
  • turn on secret scanning (required by IBM open sourcing policy)
  • turn on Dependabot for all repos (required by IBM open sourcing policy)

Optional

  • (Optional) Run Snyk for security scans; I think Red Hat also runs Snyk. We do not have a license to run anything else on github.com
  • (Recommended) Badge/Certify with OpenSSF - industry standard for best practice and security. At the very minimum we should get a passing badge

The InstructLab repo does not have a brief project description

We need to update this post-haste.

In the interim prior to having official copy, I propose the description be:
The InstructLab project is a novel way to allow contributions to an existing large language model without the need to fully fork and fine-tune the result.

This could be better worded. Suggestions welcome and encouraged.

Contribution Policy

Standard OSS Contribution policy -- look for LF version.
Combine with anything Betsy (legal) may have.

License

Which license should we use for the taxonomy? I don't think Apache really makes sense since it's not really code. I would think we would rather want something like a CC-BY license. It is crucial that we have the appropriate license before people start making contributions because changing license afterwards is very difficult.

For that matter, although it is less critical, I think the same is true for this repo (community) if all of the content is documentation.

Badge/Certify with OpenSSF

NOT REQUIRED/NICE TO HAVE:
Badge/Certify with OpenSSF - industry standard for best practice and security. At the very minimum we should get a passing badge
