Jlama

Project setup

To install langchain4j to your project, add the following dependency:

For Maven project pom.xml

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j</artifactId>
    <version>1.3.0</version>
</dependency>

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-jlama</artifactId>
    <version>1.3.0-beta9</version>
</dependency>

<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-native</artifactId>
    <!-- for faster inference. supports linux-x86_64, macos-x86_64/aarch_64, windows-x86_64 
       Use https://github.com/trustin/os-maven-plugin to detect os and arch -->
    <classifier>${os.detected.name}-${os.detected.arch}</classifier>
    <version>${jlama.version}</version> <!-- Version from langchain4j-jlama pom -->
</dependency>

For Gradle project build.gradle

implementation 'dev.langchain4j:langchain4j:1.3.0'
implementation 'dev.langchain4j:langchain4j-jlama:1.3.0-beta9'

Model Selection

You can use most safetensor models on HuggingFace and specify them using the owner/model-name format. Jlama maintains a list of pre-quantized popular models under http://huggingface.co/tjake

Models that use the following architecture are supported:

Gemma Models
Llama Models
Mistral Models
Mixtral Models
GPT-2 Models
BERT Models

Chat Completion

The chat models allow you to generate human-like responses with a model fined-tuned on conversational data.

Synchronous

Create a class and add the following code.

import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.jlama.JlamaChatModel;

public class HelloWorld {
    public static void main(String[] args) {
        ChatModel model = JlamaChatModel.builder()
                .modelName("tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4")
                .build();

        String response = model.chat("Say 'Hello World'");
        System.out.println(response);
    }
}

Running the program will generate a variant of the following output

Hello World! How can I assist you today?

Streaming

Create a class and add the following code.

import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.jlama.JlamaStreamingChatModel;
import dev.langchain4j.model.output.Response;

import java.util.concurrent.CompletableFuture;

public class HelloWorld {
    public static void main(String[] args) {
        StreamingChatModel model = JlamaStreamingChatModel.builder()
                .modelName("tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4")
                .build();

        CompletableFuture<ChatResponse> futureResponse = new CompletableFuture<>();         
        model.chat("Tell me a joke about Java", new StreamingChatResponseHandler() {

            @Override
            public void onPartialResponse(String partialResponse) {
                System.out.print(partialResponse);
            }

            @Override
            public void onCompleteResponse(ChatResponse completeResponse) {
                futureResponse.complete(completeResponse);
            }

            @Override
            public void onError(Throwable error) {
                futureResponse.completeExceptionally(error);
            }    
        });

        futureResponse.join();
    }
}

You will receive each chunk of text (token) as it is generated by the LLM on the onPartialResponse method.

You can see that output below is streamed in real-time.

"Why do Java developers wear glasses? Because they can't C#"

Of course, you can combine Jlama chat completion with other features like Set Model Parameters and Chat Memory to get more accurate responses.

In Chat Memory you will learn how to pass along your chat history, so the LLM knows what has been said before. If you don't pass the chat history, like in this simple example, the LLM will not know what has been said before, so it won't be able to correctly answer the second question ('What did I just ask?').

A lot of parameters are set behind the scenes, such as timeout, model type and model parameters. In Set Model Parameters you will learn how to set these parameters explicitly.

Jlama has some special model parameters that you can set

modelCachePath parameter, which allows you to specify a path to a directory where the model will be cached once downloaded. Default is ~/.jlama.
workingDirectory parameter, which allows you to keep a persistent ChatMemory on disk for a given model instance. This is faster than using Chat Memory.
quantizeModelAtRuntime parameter, which will quantize the model at runtime. The current quantization is always Q4. You can also pre-quantize the model using jlama project tools (See Jlama Project for more information).

Function Calling

Jlama supports function calling for models that support it (Mistral, Llama-3.1, etc). See Jlama Examples

JSON mode

Jlama does not support JSON mode (yet). But you can always ask the model nicely to return JSON.

Examples

Jlama Examples

Project setup​

Model Selection​

Chat Completion​

Synchronous​

Streaming​

Function Calling​

JSON mode​

Examples​