Class DocumentByCharacterSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByCharacterSplitter
All Implemented Interfaces:
DocumentSplitter

public class DocumentByCharacterSplitter extends HierarchicalDocumentSplitter
Splits the provided Document into individual characters and fits as many characters as possible into each TextSegment, subject to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.

If multiple characters fit within maxSegmentSize, they are joined together without delimiters.

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
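
The behavior described above can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the library's implementation: the class CharacterSplitSketch, its nested Segment type, and the split method are hypothetical stand-ins, metadata is modeled as a Map<String, String>, the limit is counted in characters only, and overlap and sub-splitters are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NOT the langchain4j implementation.
class CharacterSplitSketch {

    // Hypothetical stand-in for TextSegment: text plus a metadata map.
    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Packs as many characters as fit into each segment, in order,
    // joining them without delimiters (assumption: limit is in characters).
    static List<Segment> split(String text,
                               Map<String, String> documentMetadata,
                               int maxSegmentSizeInChars) {
        if (maxSegmentSizeInChars <= 0) {
            throw new IllegalArgumentException("maxSegmentSizeInChars must be positive");
        }
        List<Segment> segments = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxSegmentSizeInChars) {
            int end = Math.min(start + maxSegmentSizeInChars, text.length());
            // Each segment copies the document metadata and adds a 0-based "index".
            Map<String, String> metadata = new HashMap<>(documentMetadata);
            metadata.put("index", String.valueOf(segments.size()));
            segments.add(new Segment(text.substring(start, end), metadata));
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Segment> segments = split("abcdefgh", Map.of("source", "example.txt"), 3);
        for (Segment s : segments) {
            // Prints "0: abc", "1: def", "2: gh" on separate lines.
            System.out.println(s.metadata.get("index") + ": " + s.text);
        }
    }
}
```

With a limit of 3 characters, the 8-character input yields three segments. Each one carries the document's "source" entry plus its own "index".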

  • Constructor Details

    • DocumentByCharacterSplitter

      public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
    • DocumentByCharacterSplitter

      public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
    • DocumentByCharacterSplitter

      public DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
    • DocumentByCharacterSplitter

      public DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
  • Method Details