minbpe.🔥

This project is a port of Andrej Karpathy's minbpe to Mojo, currently in beta.

Minbpe implements the Byte Pair Encoding (BPE) algorithm, which is commonly used for tokenization in large language models (LLMs). For a comprehensive explanation of the original project, visit its GitHub page at https://github.com/karpathy/minbpe.
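To make the idea concrete, here is a minimal sketch of byte-level BPE training and encoding in plain Python. It is an illustration only, not the code of this port or of minbpe: the helper names (pair_counts, merge, train, encode) are hypothetical, and ties between equally frequent pairs are assumed to be broken in favor of the pair seen first.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn vocab_size - 256 merges on top of the 256 raw byte tokens."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for new_id in range(256, vocab_size):
        counts = pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        # Most frequent pair; the first-seen pair wins a tie (assumption).
        pair = max(counts, key=counts.get)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges to new text, in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


merges = train("aaabdaaabac", 256 + 3)
print(encode("aaabdaaabac", merges))  # [258, 100, 258, 97, 99]
```

Each training round replaces the most frequent adjacent pair with a new token id, so three rounds on this text shrink eleven bytes to five tokens.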

Not all features of minbpe are available yet; the remaining ones will be added as the project evolves. Currently, the main focus is on improving the performance of the core functionality.

Implementation

Due to differences in language capabilities, the architecture of this port has been modified to fit the constraints and features of Mojo. While the architecture differs, the core functionality and behavior remain the same as in the original. As Mojo's language features continue to evolve, we expect to further refine and redesign the project.

Available Tokenizers

Tokenizers in minbpe.mojo are implemented by conforming to the TokenizationStrategy trait, which defines the methods required for the tokenization process.

  • BasicTokenizationStrategy: Implements the BasicTokenizer, the simplest implementation of the BPE algorithm that runs directly on text.
  • RegexTokenizationStrategy: Implements the RegexTokenizer, which first splits the input text with a regex pattern into categories (think: letters, numbers, punctuation) before tokenization. This preprocessing stage ensures that no merges happen across category boundaries. It was introduced in the GPT-2 paper and is still in use as of GPT-4. This strategy also handles special tokens, if any.
  • GPT4TokenizationStrategy: To be implemented.
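The effect of the regex pre-splitting step can be sketched in a few lines of Python. The pattern below is a deliberately simplified stand-in written for the standard re module; it is an assumption for illustration and not the GPT-2 or GPT-4 split pattern. The function names are hypothetical as well.

```python
import re
from collections import Counter

# Simplified stand-in pattern (assumption): letters, digits, other symbols,
# each optionally preceded by one space, or a run of whitespace.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_chunks(text):
    """Split text into category chunks before any BPE merging happens."""
    return SPLIT_PATTERN.findall(text)


def chunked_pair_counts(text):
    """Count adjacent byte pairs inside each chunk only, never across chunks."""
    counts = Counter()
    for chunk in split_chunks(text):
        ids = list(chunk.encode("utf-8"))
        counts.update(zip(ids, ids[1:]))
    return counts


chunks = split_chunks("Hello world 123!!")
print(chunks)  # ['Hello', ' world', ' 123', '!!']

counts = chunked_pair_counts("Hello world 123!!")
# The pair ("o", " ") spans the boundary between "Hello" and " world",
# so it is never counted and can never be merged.
print((ord("o"), ord(" ")) in counts)  # False
```

Because pair statistics are gathered per chunk, a merge such as a letter followed by a space or a digit followed by punctuation cannot be learned.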

Quick Start

  • First make sure you have Mojo 24.3 installed.
  • In addition, you need to install the Python library regex. We rely on regex because Mojo currently lacks a powerful native regular expression library, and Mojo's ability to use Python libraries lets us fill that gap. For more on this language feature, see the Python Integration section in the official Mojo documentation.
pip install regex
  • The quick start example from minbpe can be implemented with minbpe.mojo as follows:
from mojobpe import Tokenizer, BasicTokenizationStrategy
from mojobpe.utils.tat import print_list_int

fn main() raises:
   var text = "aaabdaaabac"

   var tokenizer = Tokenizer[BasicTokenizationStrategy]()
   tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
   print_list_int(tokenizer.encode(text))
   # [258, 100, 258, 97, 99]

   print(tokenizer.decode(List[Int](258, 100, 258, 97, 99)))
   # aaabdaaabac

   tokenizer.save("toy")
   # writes toy.model (for loading) 
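The decode step in the example can be traced by hand with a short Python sketch. The merge table below is an assumption: it is what three greedy merges on "aaabdaaabac" produce when ties go to the pair seen first, and the tokenizer in this port may store its merges differently. The helper names are hypothetical.

```python
# Assumed merge table for the toy example: (left_id, right_id) -> new_id.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}


def build_vocab(merges):
    """Map every token id to the byte string it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}  # the 256 raw byte tokens
    # Merges are stored in learning order, so both halves already exist.
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def decode(ids, vocab):
    """Concatenate the bytes of each token and decode them as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


vocab = build_vocab(merges)
print(vocab[258])  # b'aaab'
print(decode([258, 100, 258, 97, 99], vocab))  # aaabdaaabac
```

Token 258 expands to the bytes "aaab", which is why five tokens are enough to reproduce the eleven-character input.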

Benchmarks

A detailed benchmark analysis will be available soon.

For now we have included a Mojo port of train.py from the original repository, which times the training of both the Basic and Regex Tokenizer with the text from Taylor Swift's Wikipedia page. In our preliminary tests, the Mojo version proves to be approximately three times faster than the original Python implementation. You can run this training benchmark test using the following command:

mojo train.mojo

Changelog

  • 2024.05.14
    • Status: Beta
    • Performance improvements
  • 2024.05.12
    • Switch to MoString for String concatenation
  • 2024.05.04
    • Initial repository setup and commit.

Remarks

  • We achieved a significant performance boost by using Maxim Zaks' Mojo library CompactDict, which provides very fast dictionary implementations. We've incorporated a slightly modified version of this library in the mojobpe.utils folder (generic_dict and string_dict); all credit goes to him.
  • Gregor Purdy has implemented a Rust port of minbpe. In our initial tests, Gregor's port performs similarly to our current Mojo port.

License

MIT
