AI Services

So far, we have been covering low-level components like ChatModel, ChatMessage, ChatMemory, etc. Working at this level is very flexible and gives you total freedom, but it also forces you to write a lot of boilerplate code. Since LLM-powered applications usually require not just a single component but multiple components working together (e.g., prompt templates, chat memory, LLMs, output parsers, RAG components: embedding models and stores) and often involve multiple interactions, orchestrating them all becomes even more cumbersome.

We want you to focus on business logic, not on low-level implementation details. Thus, there are currently two high-level concepts in LangChain4j that can help with that: AI Services and Chains.

Chains (legacy)

The concept of Chains originates from Python's LangChain (before the introduction of LCEL). The idea is to have a Chain for each common use case, like a chatbot, RAG, etc. Chains combine multiple low-level components and orchestrate interactions between them. The main problem with them is that they are too rigid if you need to customize something. LangChain4j has only two Chains implemented (ConversationalChain and ConversationalRetrievalChain), and we do not plan to add more at this moment.

AI Services

We propose another solution called AI Services, tailored for Java. The idea is to hide the complexities of interacting with LLMs and other components behind a simple API.

This approach is very similar to Spring Data JPA or Retrofit: you declaratively define an interface with the desired API, and LangChain4j provides an object (proxy) that implements this interface. You can think of AI Service as a component of the service layer in your application. It provides AI services. Hence the name.

AI Services handle the most common operations:

Formatting inputs for the LLM
Parsing outputs from the LLM

They also support more advanced features:

Chat memory
Tools
RAG

AI Services can be used to build stateful chatbots that facilitate back-and-forth interactions, as well as to automate processes where each call to the LLM is isolated.

Let's take a look at the simplest possible AI Service. After that, we will explore more complex examples.

Simplest AI Service

First, we define an interface with a single method, chat, which takes a String as input and returns a String.

interface Assistant {

    String chat(String userMessage);
}

Then, we create our low-level components. These components will be used under the hood of our AI Service. In this case, we just need the ChatModel:

ChatModel model = OpenAiChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(GPT_4_O_MINI)
    .build();

Finally, we can use the AiServices class to create an instance of our AI Service:

Assistant assistant = AiServices.create(Assistant.class, model);

note

In Quarkus and Spring Boot applications, the autoconfiguration handles creating an Assistant bean. This means you do not need to call AiServices.create(...), you can simply inject/autowire the Assistant wherever it is needed.

Now we can use Assistant:

String answer = assistant.chat("Hello");
System.out.println(answer); // Hello, how can I help you?

How does it work?

You provide the Class of your interface to AiServices along with the low-level components, and AiServices creates a proxy object implementing this interface. Currently, it uses reflection, but we are considering alternatives as well. This proxy object handles all the conversions for inputs and outputs. In this case, the input is a single String, but we are using a ChatModel which takes ChatMessage as input. So, AiService will automatically convert it into a UserMessage and invoke ChatModel. Since the output type of the chat method is a String, after ChatModel returns AiMessage, it will be converted into a String before being returned from the chat method.

AI Services in Quarkus Application

LangChain4j Quarkus extension greatly simplifies using AI Services in Quarkus applications.

More information can be found here.

AI Services in Spring Boot Application

LangChain4j Spring Boot starter greatly simplifies using AI Services in Spring Boot applications.

@SystemMessage

Now, let's look at a more complicated example. We'll force the LLM reply using slang 😉

This is usually achieved by providing instructions in the SystemMessage.

interface Friend {

    @SystemMessage("You are a good friend of mine. Answer using slang.")
    String chat(String userMessage);
}

Friend friend = AiServices.create(Friend.class, model);

String answer = friend.chat("Hello"); // Hey! What's up?

In this example, we have added the @SystemMessage annotation with a system prompt template we want to use. This will be converted into a SystemMessage behind the scenes and sent to the LLM along with the UserMessage.

@SystemMessage can also load a prompt template from resources: @SystemMessage(fromResource = "my-prompt-template.txt")

System Message Provider

System messages can also be defined dynamically with the system message provider:

Friend friend = AiServices.builder(Friend.class)
    .chatModel(model)
    .systemMessageProvider(chatMemoryId -> "You are a good friend of mine. Answer using slang.")
    .build();

As you can see, you can provide different system messages based on a chat memory ID (user or conversation).

@UserMessage

Now, let's assume the model we use does not support system messages, or maybe we just want to use UserMessage for that purpose.

interface Friend {

    @UserMessage("You are a good friend of mine. Answer using slang. {{it}}")
    String chat(String userMessage);
}

Friend friend = AiServices.create(Friend.class, model);

String answer = friend.chat("Hello"); // Hey! What's shakin'?

We have replaced the @SystemMessage annotation with @UserMessage and specified a prompt template containing the variable it that refers to the only method argument.

It's also possible to annotate the String userMessage with @V and assign a custom name to the prompt template variable:

interface Friend {

    @UserMessage("You are a good friend of mine. Answer using slang. {{message}}")
    String chat(@V("message") String userMessage);
}

note

Please note that using @V is not necessary when using LangChain4j with Quarkus or Spring Boot. This annotation is necessary only when the -parameters option is not enabled during Java compilation.

@UserMessage can also load a prompt template from resources: @UserMessage(fromResource = "my-prompt-template.txt")

Examples of valid AI Service methods

Below are some examples of valid AI service methods.

UserMessage

String chat(String userMessage);

String chat(@UserMessage String userMessage);

String chat(@UserMessage String userMessage, @V("country") String country); // userMessage contains "{{country}}" template variable

@UserMessage("What is the capital of Germany?")
String chat();

@UserMessage("What is the capital of {{it}}?")
String chat(String country);

@UserMessage("What is the capital of {{country}}?")
String chat(@V("country") String country);

@UserMessage("What is the {{something}} of {{country}}?")
String chat(@V("something") String something, @V("country") String country);

@UserMessage("What is the capital of {{country}}?")
String chat(String country); // this works only in Quarkus and Spring Boot applications

SystemMessage and UserMessage

@SystemMessage("Given a name of a country, answer with a name of it's capital")
String chat(String userMessage);

@SystemMessage("Given a name of a country, answer with a name of it's capital")
String chat(@UserMessage String userMessage);

@SystemMessage("Given a name of a country, {{answerInstructions}}")
String chat(@V("answerInstructions") String answerInstructions, @UserMessage String userMessage);

@SystemMessage("Given a name of a country, answer with a name of it's capital")
String chat(@UserMessage String userMessage, @V("country") String country); // userMessage contains "{{country}}" template variable

@SystemMessage("Given a name of a country, {{answerInstructions}}")
String chat(@V("answerInstructions") String answerInstructions, @UserMessage String userMessage, @V("country") String country); // userMessage contains "{{country}}" template variable

@SystemMessage("Given a name of a country, answer with a name of it's capital")
@UserMessage("Germany")
String chat();

@SystemMessage("Given a name of a country, {{answerInstructions}}")
@UserMessage("Germany")
String chat(@V("answerInstructions") String answerInstructions);

@SystemMessage("Given a name of a country, answer with a name of it's capital")
@UserMessage("{{it}}")
String chat(String country);

@SystemMessage("Given a name of a country, answer with a name of it's capital")
@UserMessage("{{country}}")
String chat(@V("country") String country);

@SystemMessage("Given a name of a country, {{answerInstructions}}")
@UserMessage("{{country}}")
String chat(@V("answerInstructions") String answerInstructions, @V("country") String country);

Multimodality

AI services currently do not support multimodality, please use the low-level API for this.

Return Types

AI Service method can return one of the following types:

String - in this case LLM-generated output is returned without any processing/parsing
Any type supported by Structured Outputs - in this case AI service will parse LLM-generated output into a desired type before returning

Any type can be additionally wrapped into a Result<T> to get extra metadata about AI Service invocation:

TokenUsage - total number of tokens used during AI service invocation. If AI service did multiple calls to the LLM (e.g., because tools were executed), it will summ token usages of all calls.
Sources - Contents retrieved during RAG retrieval
Executed tools
FinishReason

An example:

interface Assistant {
    
    @UserMessage("Generate an outline for the article on the following topic: {{it}}")
    Result<List<String>> generateOutlineFor(String topic);
}

Result<List<String>> result = assistant.generateOutlineFor("Java");

List<String> outline = result.content();
TokenUsage tokenUsage = result.tokenUsage();
List<Content> sources = result.sources();
List<ToolExecution> toolExecutions = result.toolExecutions();
FinishReason finishReason = result.finishReason();

Structured Outputs

If you want to receive a structured output (e.g., a complex Java object, as opposed to an unstructured text inside the String) from the LLM, you can change the return type of your AI Service method from String to some other type.

note

More info on Structured Outputs can be found here.

A few examples:

`boolean` as return type

interface SentimentAnalyzer {

    @UserMessage("Does {{it}} has a positive sentiment?")
    boolean isPositive(String text);

}

SentimentAnalyzer sentimentAnalyzer = AiServices.create(SentimentAnalyzer.class, model);

boolean positive = sentimentAnalyzer.isPositive("It's wonderful!");
// true

`Enum` as return type

enum Priority {
    CRITICAL, HIGH, LOW
}

interface PriorityAnalyzer {
    
    @UserMessage("Analyze the priority of the following issue: {{it}}")
    Priority analyzePriority(String issueDescription);
}

PriorityAnalyzer priorityAnalyzer = AiServices.create(PriorityAnalyzer.class, model);

Priority priority = priorityAnalyzer.analyzePriority("The main payment gateway is down, and customers cannot process transactions.");
// CRITICAL

POJO as a return type

class Person {

    @Description("first name of a person") // you can add an optional description to help an LLM have a better understanding
    String firstName;
    String lastName;
    LocalDate birthDate;
    Address address;
}

@Description("an address") // you can add an optional description to help an LLM have a better understanding
class Address {
    String street;
    Integer streetNumber;
    String city;
}

interface PersonExtractor {

    @UserMessage("Extract information about a person from {{it}}")
    Person extractPersonFrom(String text);
}

PersonExtractor personExtractor = AiServices.create(PersonExtractor.class, model);

String text = """
            In 1968, amidst the fading echoes of Independence Day,
            a child named John arrived under the calm evening sky.
            This newborn, bearing the surname Doe, marked the start of a new journey.
            He was welcomed into the world at 345 Whispering Pines Avenue
            a quaint street nestled in the heart of Springfield
            an abode that echoed with the gentle hum of suburban dreams and aspirations.
            """;

Person person = personExtractor.extractPersonFrom(text);

System.out.println(person); // Person { firstName = "John", lastName = "Doe", birthDate = 1968-07-04, address = Address { ... } }

JSON mode

When extracting custom POJOs (actually JSON, which is then parsed into the POJO), it is recommended to enable a "JSON mode" in the model configuration. This way, the LLM will be forced to respond with a valid JSON.

note

Please note that JSON mode and tools/function calling are similar features but have different APIs and are used for distinct purposes.

JSON mode is useful when you always need a response from the LLM in a structured format (valid JSON). Additionally, there is usually no state/memory required, so each interaction with the LLM is independent of others. For instance, you might want to extract information from a text, such as the list of people mentioned in this text or convert a free-form product review into a structured form with fields like String productName, Sentiment sentiment, List<String> claimedProblems, etc.

On the other hand, tools/functions are useful when LLM should be able to perform some action(s) (e.g. lookup the database, search the web, cancel user's booking, etc.) In this case, a list of tools with their expected JSON schemas is provided to the LLM, and it autonomously decides whether to call any of them to satisfy user request.

Earlier, function calling was often used for structured data extraction, but now we have the JSON mode feature, which is more suitable for this purpose.

Here is how to enable JSON mode:

For OpenAI:

For newer models that support Structured Outputs (e.g., gpt-4o-mini, gpt-4o-2024-08-06):

OpenAiChatModel.builder()
    ...
    .supportedCapabilities(RESPONSE_FORMAT_JSON_SCHEMA)
    .strictJsonSchema(true)
    .build();

See more details here.

For older models (e.g. gpt-3.5-turbo, gpt-4):

OpenAiChatModel.builder()
    ...
    .responseFormat("json_object")
    .build();

For Azure OpenAI:

AzureOpenAiChatModel.builder()
    ...
    .responseFormat(new ChatCompletionsJsonResponseFormat())
    .build();

For Vertex AI Gemini:

VertexAiGeminiChatModel.builder()
    ...
    .responseMimeType("application/json")
    .build();

Or by specifying an explicit schema from a Java class:

VertexAiGeminiChatModel.builder()
    ...
    .responseSchema(SchemaHelper.fromClass(Person.class))
    .build();

From a JSON schema:

VertexAiGeminiChatModel.builder()
    ...
    .responseSchema(Schema.builder()...build())
    .build();

For Google AI Gemini:

GoogleAiGeminiChatModel.builder()
    ...
    .responseFormat(ResponseFormat.JSON)
    .build();

Or by specifying an explicit schema from a Java class:

GoogleAiGeminiChatModel.builder()
    ...
    .responseFormat(ResponseFormat.builder()
        .type(JSON)
        .jsonSchema(JsonSchemas.jsonSchemaFrom(Person.class).get())
        .build())
    .build();

From a JSON schema:

GoogleAiGeminiChatModel.builder()
    ...
    .responseFormat(ResponseFormat.builder()
        .type(JSON)
        .jsonSchema(JsonSchema.builder()...build())
        .build())
    .build();

For Mistral AI:

MistralAiChatModel.builder()
    ...
    .responseFormat(MistralAiResponseFormatType.JSON_OBJECT)
    .build();

For Ollama:

OllamaChatModel.builder()
    ...
    .responseFormat(JSON)
    .build();

For other model providers: if the underlying model provider does not support JSON mode, prompt engineering is your best bet. Also, try lowering the temperature for more determinism.

More examples

Streaming

The AI Service can stream response token-by-token when using the TokenStream return type:

interface Assistant {

    TokenStream chat(String message);
}

StreamingChatModel model = OpenAiStreamingChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName(GPT_4_O_MINI)
    .build();

Assistant assistant = AiServices.create(Assistant.class, model);

TokenStream tokenStream = assistant.chat("Tell me a joke");

tokenStream.onPartialResponse((String partialResponse) -> System.out.println(partialResponse))
    .onRetrieved((List<Content> contents) -> System.out.println(contents))
    .onToolExecuted((ToolExecution toolExecution) -> System.out.println(toolExecution))
    .onCompleteResponse((ChatResponse response) -> System.out.println(response))
    .onError((Throwable error) -> error.printStackTrace())
    .start();

Flux

You can also use Flux<String> instead of TokenStream. For this, please import langchain4j-reactor module:

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-reactor</artifactId>
    <version>1.1.0-beta7</version>
</dependency>

interface Assistant {

  Flux<String> chat(String message);
}

Streaming example

Chat Memory

The AI Service can use chat memory in order to "remember" previous interactions:

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
    .build();

In this scenario, the same ChatMemory instance will be used for all invocations of the AI Service. However, this approach will not work if you have multiple users, as each user would require their own instance of ChatMemory to maintain their individual conversation.

The solution to this issue is to use ChatMemoryProvider:

interface Assistant  {
    String chat(@MemoryId int memoryId, @UserMessage String message);
}

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .chatMemoryProvider(memoryId -> MessageWindowChatMemory.withMaxMessages(10))
    .build();

String answerToKlaus = assistant.chat(1, "Hello, my name is Klaus");
String answerToFrancine = assistant.chat(2, "Hello, my name is Francine");

In this scenario, two distinct instances of ChatMemory will be provided by ChatMemoryProvider, one for each memory ID.

When using ChatMemory in this way it's also important to evict the memory of a no longer needed conversations in order to avoid memory leaks. To make the chat memories internally used by an AI service accessible it's enough that the interface defining it extends the ChatMemoryAccess one.

interface Assistant extends ChatMemoryAccess {
    String chat(@MemoryId int memoryId, @UserMessage String message);
}

This makes it possible to both access the ChatMemory instance of a single conversation and to get rid of it when the conversation is terminated.

String answerToKlaus = assistant.chat(1, "Hello, my name is Klaus");
String answerToFrancine = assistant.chat(2, "Hello, my name is Francine");

List<ChatMessage> messagesWithKlaus = assistant.getChatMemory(1).messages();
boolean chatMemoryWithFrancineEvicted = assistant.evictChatMemory(2);

note

Please note that if an AI Service method does not have a parameter annotated with @MemoryId, the value of memoryId in ChatMemoryProvider will default to a string "default".

note

Please note that AI Service should not be called concurrently for the same @MemoryId, as it can lead to corrupted ChatMemory. Currently, AI Service does not implement any mechanism to prevent concurrent calls for the same @MemoryId.

Tools (Function Calling)

AI Service can be configured with tools that LLM can use:

class Tools {
    
    @Tool
    int add(int a, int b) {
        return a + b;
    }

    @Tool
    int multiply(int a, int b) {
        return a * b;
    }
}

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .tools(new Tools())
    .build();

String answer = assistant.chat("What is 1+2 and 3*4?");

In this scenario, the LLM will request to execute the add(1, 2) and multiply(3, 4) methods before providing a final answer. LangChain4j will execute these methods automatically.

More details about tools can be found here.

RAG

AI Service can be configured with a ContentRetriever in order to enable naive RAG:

EmbeddingStore embeddingStore  = ...
EmbeddingModel embeddingModel = ...

ContentRetriever contentRetriever = new EmbeddingStoreContentRetriever(embeddingStore, embeddingModel);

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .contentRetriever(contentRetriever)
    .build();

Configuring a RetrievalAugmentor provides even more flexibility, enabling advanced RAG capabilities such as query transformation, re-ranking, etc:

RetrievalAugmentor retrievalAugmentor = DefaultRetrievalAugmentor.builder()
        .queryTransformer(...)
        .queryRouter(...)
        .contentAggregator(...)
        .contentInjector(...)
        .executor(...)
        .build();

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(model)
    .retrievalAugmentor(retrievalAugmentor)
    .build();

More details about RAG can be found here.

More RAG examples can be found here.

Auto-Moderation

Example

Chaining multiple AI Services

The more complex the logic of your LLM-powered application becomes, the more crucial it is to break it down into smaller parts, as is common practice in software development.

For instance, stuffing lots of instructions into the system prompt to account for all possible scenarios is prone to errors and inefficiency. If there are too many instructions, LLMs may overlook some. Additionally, the sequence in which instructions are presented matters, making the process even more challenging.

This principle also applies to tools, RAG, and model parameters such as temperature, maxTokens, etc.

Your chatbot likely doesn't need to be aware of every tool you have at all times. For example, when a user simply greets the chatbot or says goodbye, it is costly and sometimes even dangerous to give the LLM access to the dozens or hundreds of tools (each tool included in the LLM call consumes a significant number of tokens) and might lead to unintended results (LLMs can hallucinate or be manipulated to invoke a tool with unintended inputs).

Regarding RAG: similarly, there are times when it's necessary to provide some context to the LLM, but not always, as it incurs additional costs (more context = more tokens) and increases response times (more context = higher latency).

Concerning model parameters: in certain situations, you may need LLM to be highly deterministic, so you would set a low temperature. In other cases, you might opt for a higher temperature, and so on.

The point is, smaller and more specific components are easier and cheaper to develop, test, maintain, and understand.

Another aspect to consider involves two extremes:

Do you prefer your application to be highly deterministic, where the application controls the flow and the LLM is just one of the components?
Or do you want the LLM to have complete autonomy and drive your application?

Or perhaps a mix of both, depending on the situation? All these options are possible when you decompose your application into smaller, more manageable parts.

AI Services can be used as and combined with regular (deterministic) software components:

You can call one AI Service after another (aka chaining).
You can use deterministic and LLM-powered if/else statements (AI Services can return a boolean).
You can use deterministic and LLM-powered switch statements (AI Services can return an enum).
You can use deterministic and LLM-powered for/while loops (AI Services can return int and other numerical types).
You can mock an AI Service (since it is an interface) in your unit tests.
You can integration test each AI Service in isolation.
You can evaluate and find the optimal parameters for each AI Service separately.
etc

Let's consider a simple example. I want to build a chatbot for my company. If a user greets the chatbot, I want it to respond with the pre-defined greeting without relying on an LLM to generate the greeting. If a user asks a question, I want the LLM to generate the response using internal knowledge base of the company (aka RAG).

Here is how this task can be decomposed into 2 separate AI Services:

interface GreetingExpert {

    @UserMessage("Is the following text a greeting? Text: {{it}}")
    boolean isGreeting(String text);
}

interface ChatBot {

    @SystemMessage("You are a polite chatbot of a company called Miles of Smiles.")
    String reply(String userMessage);
}

class MilesOfSmiles {

    private final GreetingExpert greetingExpert;
    private final ChatBot chatBot;
    
    ...
    
    public String handle(String userMessage) {
        if (greetingExpert.isGreeting(userMessage)) {
            return "Greetings from Miles of Smiles! How can I make your day better?";
        } else {
            return chatBot.reply(userMessage);
        }
    }
}

GreetingExpert greetingExpert = AiServices.create(GreetingExpert.class, llama2);

ChatBot chatBot = AiServices.builder(ChatBot.class)
    .chatModel(gpt4)
    .contentRetriever(milesOfSmilesContentRetriever)
    .build();

MilesOfSmiles milesOfSmiles = new MilesOfSmiles(greetingExpert, chatBot);

String greeting = milesOfSmiles.handle("Hello");
System.out.println(greeting); // Greetings from Miles of Smiles! How can I make your day better?

String answer = milesOfSmiles.handle("Which services do you provide?");
System.out.println(answer); // At Miles of Smiles, we provide a wide range of services ...

Notice how we used the cheaper Llama2 for the simple task of identifying whether the text is a greeting or not, and the more expensive GPT-4 with a content retriever (RAG) for a more complex task.

This is a very simple and somewhat naive example, but hopefully, it demonstrates the idea.

Now, I can mock both GreetingExpert and ChatBot and test MilesOfSmiles in isolation Also, I can integration test GreetingExpert and ChatBot separately. I can evaluate both of them separately and find the most optimal parameters for each subtask, or, in the long run, even fine-tune a small specialized model for each specific subtask.

Testing

An example of integration testing for a Customer Support Agent

LangChain4j AiServices Tutorial by Siva

Chains (legacy)​

AI Services​

Simplest AI Service​

How does it work?​

AI Services in Quarkus Application​

AI Services in Spring Boot Application​

@SystemMessage​

System Message Provider​

@UserMessage​

Examples of valid AI Service methods​

Multimodality​

Return Types​

Structured Outputs​

boolean as return type​

Enum as return type​

POJO as a return type​

JSON mode​

Streaming​

Flux​

Chat Memory​

Tools (Function Calling)​

RAG​

Auto-Moderation​

Chaining multiple AI Services​

Testing​

Related Tutorials​

Chains (legacy)

AI Services

Simplest AI Service

How does it work?

AI Services in Quarkus Application

AI Services in Spring Boot Application

@SystemMessage

System Message Provider

@UserMessage

Examples of valid AI Service methods

Multimodality

Return Types

Structured Outputs

`boolean` as return type

`Enum` as return type

POJO as a return type

JSON mode

Streaming

Flux

Chat Memory

Tools (Function Calling)

RAG

Auto-Moderation

Chaining multiple AI Services

Testing

Related Tutorials