Class DocumentSplitters

java.lang.Object
dev.langchain4j.data.document.splitter.DocumentSplitters

public class DocumentSplitters extends Object
  • Constructor Details

    • DocumentSplitters

      public DocumentSplitters()
  • Method Details

    • recursive

      public static DocumentSplitter recursive(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
      Returns the recommended DocumentSplitter for generic text, with sizes measured in tokens. It first splits the document into paragraphs and fits as many paragraphs as possible into a single TextSegment. Any paragraph that is too long is recursively split into lines, then sentences, then words, and finally characters, until each piece fits into a segment.
      Parameters:
      maxSegmentSizeInTokens - The maximum size of the segment, defined in tokens.
      maxOverlapSizeInTokens - The maximum size of the overlap, defined in tokens. Only full sentences are considered for the overlap.
      tokenizer - The tokenizer that is used to count tokens in the text.
      Returns:
      recursive document splitter
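
      The rule that only full sentences are considered for the overlap can be illustrated with a minimal, library-independent sketch. It is not LangChain4j's implementation: the class OverlapSketch, the method overlapFrom, the whitespace-based countTokens (a stand-in for a real Tokenizer), and the regex-based sentence detection are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: choose the overlap carried into the next segment,
// using only whole sentences from the end of the previous segment.
class OverlapSketch {

    // Assumption: a stand-in for a real tokenizer that counts
    // whitespace-separated words as tokens.
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Walks the sentences of the previous segment from last to first and keeps
    // each one while the running token total stays within maxOverlapTokens.
    // A sentence that does not fit is dropped whole, never truncated.
    static String overlapFrom(String previousSegment, int maxOverlapTokens) {
        // Naive sentence detection: break after '.', '!' or '?' followed by whitespace.
        String[] sentences = previousSegment.trim().split("(?<=[.!?])\\s+");
        Deque<String> kept = new ArrayDeque<>();
        int used = 0;
        for (int i = sentences.length - 1; i >= 0; i--) {
            int cost = countTokens(sentences[i]);
            if (used + cost > maxOverlapTokens) {
                break;
            }
            kept.addFirst(sentences[i]); // preserve the original sentence order
            used += cost;
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String previous = "Alpha beta gamma. Delta epsilon. Zeta eta theta iota.";
        System.out.println(overlapFrom(previous, 6)); // Delta epsilon. Zeta eta theta iota.
        System.out.println(overlapFrom(previous, 5)); // Zeta eta theta iota.
        System.out.println(overlapFrom(previous, 3).isEmpty()); // true
    }
}
```

      In this sketch the overlap is empty whenever the last sentence alone exceeds the limit, which is why the actual overlap can be smaller than the configured maximum.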
    • recursive

      public static DocumentSplitter recursive(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
      Returns the recommended DocumentSplitter for generic text, with sizes measured in characters rather than tokens. It first splits the document into paragraphs and fits as many paragraphs as possible into a single TextSegment. Any paragraph that is too long is recursively split into lines, then sentences, then words, and finally characters, until each piece fits into a segment.
      Parameters:
      maxSegmentSizeInChars - The maximum size of the segment, defined in characters.
      maxOverlapSizeInChars - The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.
      Returns:
      recursive document splitter
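
      The paragraph, line, sentence, word, character hierarchy described above can be sketched in plain Java. This is a simplified illustration, not the library's code: the class RecursiveSplitSketch, its regular expressions, and its greedy packing are assumptions, and overlap handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hierarchical splitting with a character limit.
class RecursiveSplitSketch {

    // Split patterns from coarsest to finest: paragraphs, lines, sentences, words.
    static final String[] PATTERNS = {"\\n\\s*\\n", "\\n", "(?<=[.!?])\\s+", "\\s+"};
    // Separator used when packing several parts of a level into one segment.
    static final String[] JOINERS = {"\n\n", "\n", " ", " "};

    static List<String> split(String text, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> out = new ArrayList<>();
        split(text.trim(), maxChars, 0, out);
        return out;
    }

    private static void split(String text, int maxChars, int level, List<String> out) {
        if (text.isEmpty()) {
            return;
        }
        if (text.length() <= maxChars) {
            out.add(text);
            return;
        }
        if (level == PATTERNS.length) {
            // Last resort: cut the text into fixed-size character chunks.
            for (int i = 0; i < text.length(); i += maxChars) {
                out.add(text.substring(i, Math.min(text.length(), i + maxChars)));
            }
            return;
        }
        StringBuilder current = new StringBuilder();
        for (String part : text.split(PATTERNS[level])) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;
            }
            if (part.length() > maxChars) {
                // Too long for a segment: emit what is packed, then go one level finer.
                flush(current, out);
                split(part, maxChars, level + 1, out);
            } else if (current.length() == 0) {
                current.append(part);
            } else if (current.length() + JOINERS[level].length() + part.length() <= maxChars) {
                // Greedy packing: the part still fits into the current segment.
                current.append(JOINERS[level]).append(part);
            } else {
                flush(current, out);
                current.append(part);
            }
        }
        flush(current, out);
    }

    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "First paragraph here.\n\nSecond one.\n\n"
                + "A much longer third paragraph. It has two sentences.";
        // The first two paragraphs share a segment; the third is split by sentence.
        split(text, 40).forEach(segment -> System.out.println("[" + segment + "]"));
    }
}
```

      With a limit of 40 characters, the sample text in main yields three segments: the first two paragraphs packed together, followed by each sentence of the third paragraph on its own.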