Comments (5)
Fixed by this gist: https://gist.github.com/jneuff/682d47b786329f19291d166957b3274a
Seems to be an issue with the tokenizer.json file.
from tokenizers.
Which files on the hub are you using? And which tokenizers version?
It's a bit weird and should not be happening.
@ArthurZucker, I am using tokenizers version 0.19.1 and this tokenizer file:
tokenizers = "0.19.1"
Edit:
Loading with this function demonstrates the issue:
```rust
pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    Tokenizer::from_file(p).map_err(anyhow::Error::msg)
}
```
But this fixes it:
```rust
// Imports this snippet relies on:
use std::collections::HashMap;
use std::path::Path;

use anyhow::Result;
use serde_json::Value;
use tokenizers::{AddedToken, Tokenizer};

pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    // Write the patched tokenizer.json next to the original, so the fix
    // only has to run once per file.
    let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
    let fixed_path = Path::new(&fixed_path);
    if !fixed_path.exists() {
        let raw = std::fs::read(p.clone()).map_err(anyhow::Error::msg)?;
        let mut tokenizer: Value = serde_json::from_slice(&raw).unwrap();
        let added_tokens: Vec<AddedToken> =
            serde_json::from_value(tokenizer["added_tokens"].clone()).unwrap();
        let vocab: HashMap<String, usize> =
            serde_json::from_value(tokenizer["model"]["vocab"].clone()).unwrap();
        for token in added_tokens {
            // An added token that is absent from the model vocab is what
            // breaks loading, so insert it under its declared id.
            if !vocab.contains_key(&token.content) {
                tokenizer["model"]["vocab"]
                    .as_object_mut()
                    .unwrap()
                    .insert(token.content, token.id.into())
                    // `insert` must return `None` here, since we just
                    // checked the key is absent; this line asserts that.
                    .ok_or(())
                    .unwrap_err();
            }
        }
        let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
        std::fs::write(fixed_path, raw_fixed).unwrap();
    }
    Tokenizer::from_file(fixed_path).map_err(anyhow::Error::msg)
}
```
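At its core the workaround is just a JSON patch: copy every entry from `added_tokens` that is missing from `model.vocab`. A minimal std-only sketch of that merge logic (the `AddedToken` struct here is a simplified stand-in for the real tokenizer.json entries, keeping only the two fields the patch needs):

```rust
use std::collections::HashMap;

/// Simplified stand-in for a tokenizer.json `added_tokens` entry
/// (assumption: only `id` and `content` matter for this patch).
struct AddedToken {
    id: u32,
    content: String,
}

/// Insert every added token the vocab does not already contain.
/// Returns how many entries were patched in. Existing entries are
/// never overwritten.
fn patch_vocab(vocab: &mut HashMap<String, u32>, added: &[AddedToken]) -> usize {
    let mut patched = 0;
    for tok in added {
        if !vocab.contains_key(&tok.content) {
            vocab.insert(tok.content.clone(), tok.id);
            patched += 1;
        }
    }
    patched
}

fn main() {
    let mut vocab = HashMap::from([("hello".to_string(), 0u32)]);
    let added = [AddedToken { id: 32000, content: "<|im_start|>".to_string() }];
    let n = patch_vocab(&mut vocab, &added);
    println!("patched {n} entries; vocab size is now {}", vocab.len());
}
```

Running the real function a second time is a no-op, since the `_mistralrs_fixed` file already exists; the sketch mirrors that idempotency, as a second `patch_vocab` call with the same tokens patches nothing.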