Class HierarchicalDocumentSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
All Implemented Interfaces:
DocumentSplitter
Direct Known Subclasses:
DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter

public abstract class HierarchicalDocumentSplitter extends Object implements DocumentSplitter
Base class for hierarchical document splitters.

Implements DocumentSplitter and provides machinery for sub-splitting a part of a document when that part is too long to fit into a single segment.

  • Field Details

    • maxSegmentSize

      protected final int maxSegmentSize
    • maxOverlapSize

      protected final int maxOverlapSize
    • tokenizer

      protected final Tokenizer tokenizer
    • subSplitter

      protected final DocumentSplitter subSplitter
  • Constructor Details

    • HierarchicalDocumentSplitter

      protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
      Creates a new instance of HierarchicalDocumentSplitter with sizes measured in characters.
      Parameters:
      maxSegmentSizeInChars - The maximum size of a segment in characters.
      maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
    • HierarchicalDocumentSplitter

      protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
      Creates a new instance of HierarchicalDocumentSplitter with sizes measured in characters and an explicit sub-splitter.
      Parameters:
      maxSegmentSizeInChars - The maximum size of a segment in characters.
      maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
      subSplitter - The sub-splitter to use when a single segment is too long.
    • HierarchicalDocumentSplitter

      protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
      Creates a new instance of HierarchicalDocumentSplitter with sizes measured in tokens, as estimated by the given tokenizer.
      Parameters:
      maxSegmentSizeInTokens - The maximum size of a segment in tokens.
      maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
      tokenizer - The tokenizer to use to estimate the number of tokens in a text.
    • HierarchicalDocumentSplitter

      protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
      Creates a new instance of HierarchicalDocumentSplitter with sizes measured in tokens, as estimated by the given tokenizer, and an explicit sub-splitter.
      Parameters:
      maxSegmentSizeInTokens - The maximum size of a segment in tokens.
      maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
      tokenizer - The tokenizer to use to estimate the number of tokens in a text.
      subSplitter - The sub-splitter to use when a single segment is too long.
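
The constructors above differ mainly in the unit used for maxSegmentSize and maxOverlapSize: characters or tokens. The following is a minimal, self-contained sketch of that distinction, not the library's code. SizeEstimator, estimateSize and countWhitespaceTokens are hypothetical names, a ToIntFunction<String> stands in for Tokenizer, and treating an absent tokenizer as "count characters" is an assumption.

```java
import java.util.function.ToIntFunction;

/** Hypothetical helper illustrating character-based versus token-based sizing. */
final class SizeEstimator {

    // Stand-in for a Tokenizer; null is assumed to mean "measure in characters".
    private final ToIntFunction<String> tokenCounter;

    SizeEstimator(ToIntFunction<String> tokenCounter) {
        this.tokenCounter = tokenCounter;
    }

    /** Returns the size of the text in the configured unit. */
    int estimateSize(String text) {
        return tokenCounter == null ? text.length() : tokenCounter.applyAsInt(text);
    }

    /** Toy token counter that treats whitespace-separated words as tokens. */
    static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

With this sketch, the same text has a different size depending on the unit, so a limit tuned for characters is usually not a sensible limit for tokens, and vice versa.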
  • Method Details

    • split

      protected abstract String[] split(String text)
      Splits the provided text into parts. This is an implementation hook for subclasses and is not intended to be called by clients.
      Parameters:
      text - The text to be split.
      Returns:
      An array of parts.
    • joinDelimiter

      protected abstract String joinDelimiter()
      Returns the delimiter string used to re-join the parts into a segment.
      Returns:
      The delimiter.
    • defaultSubSplitter

      protected abstract DocumentSplitter defaultSubSplitter()
      The default sub-splitter to use when a single segment is too long.
      Returns:
      The default sub-splitter.
    • split

      public List<TextSegment> split(Document document)
      Description copied from interface: DocumentSplitter
      Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.
      Specified by:
      split in interface DocumentSplitter
      Parameters:
      document - The Document to be split.
      Returns:
      A list of TextSegment objects derived from the input Document.
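
The way the split(String) and joinDelimiter() hooks cooperate with a sub-splitter can be sketched in a self-contained form. This is an illustration of the general pattern under stated assumptions, not the library's implementation: SketchSplitter, BySentence, ByWord and splitIntoSegments are hypothetical names, sizes are measured in characters only, overlap is omitted, and plain strings stand in for TextSegment. The library's actual packing and overlap logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the hierarchical pattern; not the langchain4j class. */
abstract class SketchSplitter {
    protected final int maxSegmentSize;          // measured in characters in this sketch
    protected final SketchSplitter subSplitter;  // may be null

    protected SketchSplitter(int maxSegmentSize, SketchSplitter subSplitter) {
        this.maxSegmentSize = maxSegmentSize;
        this.subSplitter = subSplitter;
    }

    /** Hook: cut the text into parts (sentences, words, ...). */
    protected abstract String[] split(String text);

    /** Hook: delimiter used when parts are packed back into one segment. */
    protected abstract String joinDelimiter();

    /** Greedily packs parts into segments; oversized parts go to the sub-splitter. */
    public List<String> splitIntoSegments(String text) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : split(text)) {
            if (part.isEmpty()) {
                continue;
            }
            int sizeIfAppended = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter().length() + part.length();
            if (sizeIfAppended <= maxSegmentSize) {
                if (current.length() > 0) {
                    current.append(joinDelimiter());
                }
                current.append(part);
                continue;
            }
            // The part does not fit: close the segment built so far.
            if (current.length() > 0) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (part.length() <= maxSegmentSize) {
                current.append(part);               // starts a new segment
            } else if (subSplitter != null) {
                segments.addAll(subSplitter.splitIntoSegments(part));
            } else {
                throw new IllegalStateException("Part of length " + part.length()
                        + " exceeds " + maxSegmentSize + " and no sub-splitter is set");
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}

/** Splits after sentence-ending punctuation followed by whitespace. */
final class BySentence extends SketchSplitter {
    BySentence(int maxSegmentSize, SketchSplitter subSplitter) {
        super(maxSegmentSize, subSplitter);
    }
    @Override protected String[] split(String text) { return text.split("(?<=[.!?])\\s+"); }
    @Override protected String joinDelimiter() { return " "; }
}

/** Splits on whitespace; has no further sub-splitter. */
final class ByWord extends SketchSplitter {
    ByWord(int maxSegmentSize) {
        super(maxSegmentSize, null);
    }
    @Override protected String[] split(String text) { return text.split("\\s+"); }
    @Override protected String joinDelimiter() { return " "; }
}
```

For example, new BySentence(20, new ByWord(20)) keeps short sentences whole and hands any sentence longer than 20 characters to the word-level splitter, which repacks its words into segments of at most 20 characters.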