tjake / jlama

Jlama is a modern Java inference engine for LLMs

License: Apache License 2.0

Java 54.86% Shell 0.22% Makefile 0.15% C 43.23% JavaScript 0.96% CSS 0.12% HTML 0.34% Dockerfile 0.09% Batchfile 0.05%
ai java llm simd transformers gpt llama llama2 openai huggingface

jlama's Issues

CodeLlama loading is broken?

This worked in the Oct 15 jlama:

$ ./run-cli.sh complete -p "def fib(" -t 0.2 -tc 24 -n 100 models/CodeLlama-7b-hf

Now it OOMs (note that I have doubled the default Xmx, which was not necessary in October):

Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.github.tjake.jlama.cli.commands.CompleteCommand@32b260fa): java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at picocli.CommandLine.executeUserObject(CommandLine.java:2035)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
	at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2264)
	at picocli.CommandLine.parseWithHandlers(CommandLine.java:2664)
	at picocli.CommandLine.parseWithHandler(CommandLine.java:2599)
	at com.github.tjake.jlama.cli.JlamaCli.main(JlamaCli.java:30)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at com.github.tjake.jlama.model.ModelSupport.loadModel(ModelSupport.java:111)
	at com.github.tjake.jlama.model.ModelSupport.loadModel(ModelSupport.java:66)
	at com.github.tjake.jlama.cli.commands.CompleteCommand.run(CompleteCommand.java:16)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2026)
	... 8 more
Caused by: java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:74)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at com.github.tjake.jlama.model.ModelSupport.loadModel(ModelSupport.java:107)
	... 11 more
Caused by: java.lang.OutOfMemoryError
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfInt.evaluateParallel(ForEachOps.java:189)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.IntPipeline.forEach(IntPipeline.java:463)
	at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:620)
	at com.github.tjake.jlama.model.llama.LlamaModel.loadTransformerBlockWeights(LlamaModel.java:56)
	at com.github.tjake.jlama.model.AbstractModel.<init>(AbstractModel.java:109)
	at com.github.tjake.jlama.model.llama.LlamaModel.<init>(LlamaModel.java:31)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	... 14 more
Caused by: java.lang.OutOfMemoryError: Cannot reserve 180355136 bytes of direct buffer memory (allocated: 25708094948, limit: 25769803776)
	at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
	at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
	at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360)
	at com.github.tjake.jlama.util.UnsafeDirectByteBuffer.allocateAlignedByteBuffer(UnsafeDirectByteBuffer.java:36)
	at com.github.tjake.jlama.tensor.FloatBufferTensor.<init>(FloatBufferTensor.java:73)
	at com.github.tjake.jlama.safetensors.Weights.load(Weights.java:112)
	at com.github.tjake.jlama.safetensors.WeightLoader.load(WeightLoader.java:16)
	at com.github.tjake.jlama.safetensors.SafeTensorIndex.load(SafeTensorIndex.java:172)
	at com.github.tjake.jlama.model.llama.LlamaModel.lambda$loadTransformerBlockWeights$1(LlamaModel.java:70)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfInt.accept(ForEachOps.java:205)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:712)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
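The final frame shows the allocation hitting the JVM's direct-buffer ceiling, not the heap: jlama's FloatBufferTensor allocates off-heap via ByteBuffer.allocateDirect, which is capped by -XX:MaxDirectMemorySize and defaults to roughly the -Xmx value when that flag is unset. A minimal sketch (the class name is illustrative, not part of jlama) to check what the running JVM was given:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectMemoryCheck {
    public static void main(String[] args) {
        // Direct ByteBuffers (used by jlama's FloatBufferTensor) are capped by
        // -XX:MaxDirectMemorySize, which defaults to roughly the -Xmx value
        // when not set explicitly -- so raising -Xmx alone raises both limits,
        // but the two pools still compete for the same budget.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        boolean explicit = jvmArgs.stream()
                .anyMatch(a -> a.startsWith("-XX:MaxDirectMemorySize"));
        System.out.println("MaxDirectMemorySize set explicitly: " + explicit);
        System.out.println("Heap max (fallback limit): "
                + Runtime.getRuntime().maxMemory() + " bytes");
    }
}
```

If the flag is not set explicitly, passing e.g. `-XX:MaxDirectMemorySize=30g` may be a lighter-weight workaround than doubling Xmx, since the weights live off-heap.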

Windows build failures

[ERROR] testSaxpy(com.github.tjake.jlama.tensor.operations.TestOperations)  Time elapsed: 0.051 s  <<< ERROR!
java.lang.ClassCastException: a Vector<class java.lang.Integer>: required Species[int, 16, S_512_BIT] but found Species[int, 8, S_256_BIT]
        at com.github.tjake.jlama.tensor.operations.TestOperations.testSaxpy(TestOperations.java:180)

[ERROR] testSxpby(com.github.tjake.jlama.tensor.operations.TestOperations)  Time elapsed: 0.031 s  <<< ERROR!
java.lang.ClassCastException: a Vector<class java.lang.Integer>: required Species[int, 16, S_512_BIT] but found Species[int, 8, S_256_BIT]
        at com.github.tjake.jlama.tensor.operations.TestOperations.testSxpby(TestOperations.java:214)

[ERROR] testAccumulate(com.github.tjake.jlama.tensor.operations.TestOperations)  Time elapsed: 0.019 s  <<< ERROR!
java.lang.ClassCastException: a Vector<class java.lang.Integer>: required Species[int, 16, S_512_BIT] but found Species[int, 8, S_256_BIT]
        at com.github.tjake.jlama.tensor.operations.TestOperations.testAccumulate(TestOperations.java:118)

[ERROR] testDotProduct(com.github.tjake.jlama.tensor.operations.TestOperations)  Time elapsed: 0.144 s  <<< ERROR!
java.lang.ClassCastException: a Vector<class java.lang.Integer>: required Species[int, 16, S_512_BIT] but found Species[int, 8, S_256_BIT]
        at com.github.tjake.jlama.tensor.operations.TestOperations.testDotProduct(TestOperations.java:85)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   TestOperations.testAccumulate:118 » ClassCast a Vector<class java.lang.Integer...
[ERROR]   TestOperations.testDotProduct:85 » ClassCast a Vector<class java.lang.Integer>...
[ERROR]   TestOperations.testSaxpy:180 » ClassCast a Vector<class java.lang.Integer>: re...
[ERROR]   TestOperations.testSxpby:214 » ClassCast a Vector<class java.lang.Integer>: re...
[INFO]
[ERROR] Tests run: 17, Failures: 0, Errors: 4, Skipped: 6
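These failures are typical of code that pins a 512-bit vector species on a machine whose CPU only supports 256-bit lanes (AVX2, common on Windows desktops). A hedged sketch of the usual fix, querying the platform's preferred species at runtime instead of assuming S_512_BIT (class and field names here are illustrative, not jlama's actual code; requires JDK 16+ with --add-modules jdk.incubator.vector):

```java
// Compile/run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesCheck {
    // Ask the JVM for the widest species this CPU actually supports, so an
    // AVX2-only machine gets Species[int, 8, S_256_BIT] instead of tripping a
    // ClassCastException against a hardcoded Species[int, 16, S_512_BIT].
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        System.out.println("Preferred int species: " + SPECIES);
        System.out.println("Lanes: " + SPECIES.length());
    }
}
```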

./download-hf-model.sh won't work on Windows

Writing it in Java itself solves that, so I did. Happy to contribute it; here it is, FWIW:

package com.github.tjake.jlama.cli;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DownloadModel {
    private static final String HF_ACCESS_TOKEN = System.getenv("HF_ACCESS_TOKEN");
    private static final String MODEL_DIR = "models";

    public static void main(String[] args) throws IOException {
        if (args.length != 1 || args[0].equals("-h")) {
            usage();
            System.exit(1);
        }

        String hfModel = args[0];
        String authHeader = null;
        if (HF_ACCESS_TOKEN != null && !HF_ACCESS_TOKEN.isEmpty()) {
            // Header value only; the "Authorization" name is set at the connection.
            authHeader = "Bearer " + HF_ACCESS_TOKEN;
        }

        InputStream modelInfoStream = getResponse("https://huggingface.co/api/models/" + hfModel, authHeader);
        String modelInfo = readInputStream(modelInfoStream);

        if (modelInfo == null) {
            System.out.println("No valid model found or trying to access a restricted model (use HF_ACCESS_TOKEN env. var.)");
            System.exit(1);
        }

        List<String> allFiles = parseFileList(modelInfo);
        if (allFiles.isEmpty()) {
            System.out.println("No valid model found");
            System.exit(1);
        }

        List<String> tensorFiles = new ArrayList<>();
        for (String currFile : allFiles) {
            if (currFile.contains("safetensor")) {
                tensorFiles.add(currFile);
            }
        }

        if (tensorFiles.isEmpty()) {
            System.out.println("Model is not available in safetensor format");
            System.exit(1);
        }

        tensorFiles.addAll(Arrays.asList("config.json", "vocab.json", "tokenizer.json"));

        Path modelDir = Paths.get(MODEL_DIR, hfModel);
        try {
            Files.createDirectories(modelDir);
        } catch (IOException e) {
            System.out.println("Error creating directory: " + modelDir);
            System.exit(1);
        }

        // Download only the safetensor weights plus config/tokenizer files,
        // not every file in the repo (which may include pytorch .bin duplicates).
        for (String currFile : tensorFiles) {
            System.out.println("Downloading file: " + modelDir.resolve(currFile));
            downloadFile(hfModel, currFile, authHeader, modelDir.resolve(currFile));
        }

        System.out.println("Downloading file: " + modelDir.resolve("tokenizer.model") + " (if it exists)");
        try {
            downloadFile(hfModel, "tokenizer.model", authHeader, modelDir.resolve("tokenizer.model"));
        } catch (IOException e) {
            // Optional file; many safetensor-only repos don't ship it.
        }

        System.out.println("Done! Model downloaded in ./" + MODEL_DIR + "/" + hfModel);
    }

    private static void usage() {
        System.out.println("""
                usage: java DownloadModel [-h] owner/model_name

                This program downloads a model's safetensor files and inference configuration from Hugging Face.
                To download restricted models set the HF_ACCESS_TOKEN environment variable to a valid HF access token.
                To create a token see https://huggingface.co/settings/tokens

                OPTIONS:
                   -h   Show this message

                EXAMPLES:
                    java DownloadModel gpt2-medium
                    java DownloadModel meta-llama/Llama-2-7b-chat-hf""");
    }

    private static List<String> parseFileList(String modelInfo) {
        List<String> fileList = new ArrayList<>();
        try {
            ObjectMapper objectMapper = new ObjectMapper();
            JsonNode rootNode = objectMapper.readTree(modelInfo);
            JsonNode siblingsNode = rootNode.path("siblings");
            if (siblingsNode.isArray()) {
                for (JsonNode siblingNode : siblingsNode) {
                    String rFilename = siblingNode.path("rfilename").asText();
                    fileList.add(rFilename);
                }
            }
        } catch (IOException e) {
            System.out.println("Error parsing JSON: " + e.getMessage());
        }
        return fileList;
    }

    public static InputStream getResponse(String urlString, String authHeader) {
        try {
            URL url = new URL(urlString);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();

            // Set the request method
            connection.setRequestMethod("GET");

            // Set the request header
            if (authHeader != null)
                connection.setRequestProperty("Authorization", authHeader);

            // Get the response code
            int responseCode = connection.getResponseCode();

            if (responseCode == HttpURLConnection.HTTP_OK) {
                // If the response code is 200 (HTTP_OK), return the input stream
                return connection.getInputStream();
            } else {
                // If the response code is not 200, throw an IOException
                throw new IOException("HTTP response code: " + responseCode);
            }
        } catch (IOException ioe) {
            System.out.println("WARNING: Fetch of URL " + urlString + " failed due to " + ioe);
            return null;
        }
    }

    public static String readInputStream(InputStream inStream) throws IOException {
        if (inStream == null) return null;

        BufferedReader inReader = new BufferedReader(new InputStreamReader(inStream));
        StringBuilder stringBuilder = new StringBuilder();

        String currLine;
        while ((currLine = inReader.readLine()) != null) {
            stringBuilder.append(currLine);
            stringBuilder.append(System.lineSeparator());
        }

        return stringBuilder.toString();
    }
    private static void downloadFile(String hfModel, String currFile, String authHeader, Path outputPath) throws IOException {
        InputStream inStream = getResponse("https://huggingface.co/" + hfModel + "/resolve/main/" + currFile, authHeader);
        if (inStream == null)
            throw new IOException("WARNING: Fetch of file " + currFile + " failed.");
        Files.copy(inStream, outputPath, StandardCopyOption.REPLACE_EXISTING);
    }
}

streaming server support?

Is there a way to run jlama as an API server that streams responses, compatible with the OpenAI API specification?
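Whether jlama exposes this is for the maintainer to answer, but for reference, an OpenAI-compatible streaming endpoint emits server-sent events whose data payloads are "chat.completion.chunk" JSON objects, terminated by "data: [DONE]". A minimal sketch of that wire framing (the SseChunk class is illustrative, not part of jlama; field names follow the public OpenAI API):

```java
public class SseChunk {
    // Build one SSE event carrying a single streamed token in the
    // OpenAI "chat.completion.chunk" shape.
    static String chunk(String id, String token) {
        String json = String.format(
            "{\"id\":\"%s\",\"object\":\"chat.completion.chunk\","
            + "\"choices\":[{\"index\":0,\"delta\":{\"content\":\"%s\"}}]}",
            id, token);
        return "data: " + json + "\n\n"; // SSE framing: blank line ends the event
    }

    public static void main(String[] args) {
        System.out.print(chunk("cmpl-1", "Hello"));
        System.out.print(chunk("cmpl-1", " world"));
        System.out.print("data: [DONE]\n\n"); // stream terminator per the OpenAI spec
    }
}
```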

Feature request: support for the smallest reasonable codegen model

I want to build a local Copilot with jlama, but generalist models are too big and slow.

Three candidates I found:
replit-code-v1_5-3b:

Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.github.tjake.jlama.cli.commands.CompleteCommand@32b260fa): java.lang.IllegalArgumentException: No enum constant com.github.tjake.jlama.model.ModelSupport.ModelType.MPT

codegen-2B-multi:

Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.github.tjake.jlama.cli.commands.CompleteCommand@32b260fa): java.lang.IllegalArgumentException: No enum constant com.github.tjake.jlama.model.ModelSupport.ModelType.CODEGEN

WizardCoder-1B-V1.0 (using the safetensors branch):

Exception in thread "main" picocli.CommandLine$ExecutionException: Error while running command (com.github.tjake.jlama.cli.commands.CompleteCommand@693fe6c9): java.lang.IllegalArgumentException: No enum constant com.github.tjake.jlama.model.ModelSupport.ModelType.GPT_BIGCODE
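All three errors come from the same place: the architecture name from the model's config is looked up as a ModelSupport.ModelType enum constant, and architectures jlama doesn't implement (MPT, CODEGEN, GPT_BIGCODE) fall through to IllegalArgumentException. A sketch of that pattern, with a hypothetical enum standing in for jlama's actual list:

```java
public class ModelTypeDemo {
    // Hypothetical subset standing in for jlama's supported-architecture enum;
    // valueOf() on an unlisted name throws IllegalArgumentException, which is
    // exactly the "No enum constant ... ModelType.MPT" failure above.
    enum ModelType { GPT2, LLAMA, BERT }

    public static void main(String[] args) {
        for (String t : new String[]{"llama", "mpt"}) {
            try {
                System.out.println("Supported: " + ModelType.valueOf(t.toUpperCase()));
            } catch (IllegalArgumentException e) {
                System.out.println("Unsupported architecture: " + t);
            }
        }
    }
}
```

So supporting any of these candidates means adding a new model implementation, not just a config tweak.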

File model.safetensors.index.json not found

I downloaded the model directly from Meta's repo, not Hugging Face, but the code looks for a file called model.safetensors.index.json when loading with loadWithWeights.

I do not have this file. Where is it coming from? There is a file called params.json:

{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": -1}

Is that the same?
