
Comments (6)

ravenscroftj avatar ravenscroftj commented on July 18, 2024

Thanks for the ticket. I think this could be a bit of a tricky one to debug: the GGML GPT-J tokenizer is implemented from scratch, whereas the Huggingface Codegen tokenizer also has a bunch of token-merging logic which I don't think GGML's tokenizer has (I will try to confirm).
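To illustrate the kind of merge logic in question, here is a toy sketch of BPE merging (purely illustrative, not GGML's or Huggingface's actual code): a tokenizer with learned merge rules collapses common adjacent pairs into single tokens, so a tokenizer missing those merges emits more, shorter tokens for the same text.

```python
def apply_bpe_merges(tokens, merge_ranks):
    """Repeatedly apply the best-ranked adjacent-pair merge (toy BPE)."""
    tokens = list(tokens)
    while True:
        # find the adjacent pair with the lowest (best) merge rank
        best = None
        for i in range(len(tokens) - 1):
            rank = merge_ranks.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            return tokens
        _, i = best
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Hypothetical merge table: lower rank merges first
ranks = {("d", "e"): 0, ("de", "f"): 1}
print(apply_bpe_merges(list("def foo"), ranks))  # ['def', ' ', 'f', 'o', 'o']
```

With the merges applied, "def foo" becomes 5 tokens instead of 7 characters; skipping merges is exactly the kind of divergence that inflates token counts.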

I can't comment on whether this is likely to significantly impact the performance of the model - that would need testing empirically.

Was there a specific use case you have in mind that this is blocking?

from turbopilot.

thakkarparth007 avatar thakkarparth007 commented on July 18, 2024

Hey, yeah I was planning to use this for benchmarking the 4-bit performance of Codegen models. Most of my prompts are 1500 tokens or more, and these overflow 2048 tokens when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.
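On the caller side, a stopgap is to detect the overflow and truncate from the left so the most recent context survives (a sketch; the 2048 context size and the left-truncation policy are assumptions here, not turbopilot behaviour):

```python
def fit_context(token_ids, max_new_tokens, context_size=2048):
    """Trim oldest tokens so prompt + generation fits the context window."""
    budget = context_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens exceeds the context window")
    return token_ids[-budget:]  # keep only the most recent tokens

prompt = list(range(2100))                       # pretend these are token ids
trimmed = fit_context(prompt, max_new_tokens=100)
print(len(trimmed))                              # 1948
```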


ravenscroftj avatar ravenscroftj commented on July 18, 2024

Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of tokens as a JSON list. Then you can pretokenize your input using the Huggingface tokenizer. I'll keep you posted!
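A client for such an endpoint might look like the following sketch; the URL, path, and JSON field names are placeholders, not the actual API:

```python
import json
import urllib.request

def build_payload(token_ids, n_predict=64):
    """Serialize pretokenized input as a JSON body (field names hypothetical)."""
    return json.dumps({"tokens": token_ids, "n_predict": n_predict}).encode("utf-8")

def complete_pretokenized(token_ids, url="http://localhost:18080/v1/pretokenized"):
    """POST token ids to a hypothetical pretokenized-completion endpoint."""
    req = urllib.request.Request(
        url,
        data=build_payload(token_ids),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# token ids would come from the Huggingface Codegen tokenizer, e.g.:
# complete_pretokenized([318, 257, 1332])
```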


thakkarparth007 avatar thakkarparth007 commented on July 18, 2024

Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2

It seems to work fine for me.


ravenscroftj avatar ravenscroftj commented on July 18, 2024

That's really cool, thank you for your contribution - I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.

Sidenote - I'd be really interested in your evaluation of the 4 bit model if you're willing to share it!


thakkarparth007 avatar thakkarparth007 commented on July 18, 2024

Thanks!

I have performed a preliminary evaluation of the 6B-4bit model on Python. I ran the model on ~2000 code completion scenarios in Python (I have a custom dataset) and found about a 15% degradation in the exact-match metric at the first line. Here's what the graph looks like:
[graph image]
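The first-line exact-match check described above can be as simple as the following sketch (the real dataset and metric pipeline are not shown):

```python
def first_line_exact_match(prediction, reference):
    """True iff the first non-empty line of each completion matches exactly."""
    def first_line(text):
        for line in text.splitlines():
            if line.strip():
                return line.rstrip()
        return ""
    return first_line(prediction) == first_line(reference)

print(first_line_exact_match("return x + 1\n", "return x + 1\npass"))  # True
print(first_line_exact_match("return x+1", "return x + 1"))            # False
```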

I manually looked at some of the mispredictions and they seemed okay to me, but they were getting penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.
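That comparison could be done by measuring, per position, how far the 4-bit model's next-token distribution drifts from the fp16 one, e.g. via KL divergence over the softmaxed logits (a generic sketch, not tied to turbopilot's internals; the example logits are made up):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given raw logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fp16_logits = [2.0, 1.0, 0.1]   # hypothetical fp16 logits for three tokens
q4_logits = [1.8, 1.1, 0.3]     # hypothetical 4-bit logits for the same tokens
print(kl_divergence(fp16_logits, q4_logits))  # small positive drift
```

Averaging this over positions (or just over the top-1 token's probability gap) would show whether the 4-bit model is merely less confident or actually ranking different tokens first.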

