dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter

dev.langchain4j.data.document.splitter.DocumentByRegexSplitter

All Implemented Interfaces: DocumentSplitter

public class DocumentByRegexSplitter extends HierarchicalDocumentSplitter

Splits the provided Document into parts using the provided regex and attempts to fit as many parts as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a TokenCountEstimator must be provided.

If multiple parts fit within maxSegmentSize, they are joined together using the provided joinDelimiter.

If a single part is too long and exceeds maxSegmentSize, the subSplitter (which should be provided) is used to split it into sub-parts and place them into multiple segments. Such segments contain only the sub-parts of the split long part.

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
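The packing behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the LangChain4j implementation: the class RegexPackingSketch, the method pack, and the use of a Function as a stand-in for the subSplitter are all hypothetical, the limit is counted in characters only, and overlap and metadata handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration class; it is not part of LangChain4j.
class RegexPackingSketch {

    // Splits text on regex, then greedily joins consecutive parts with joinDelimiter
    // while the joined length stays within maxSegmentSizeInChars.
    // subSplitter is a stand-in for the real sub-splitter and may be null.
    static List<String> pack(String text, String regex, String joinDelimiter,
                             int maxSegmentSizeInChars,
                             Function<String, List<String>> subSplitter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : text.split(regex)) {
            if (part.isEmpty()) {
                continue; // skip empty parts produced by adjacent delimiters
            }
            if (part.length() > maxSegmentSizeInChars) {
                // Close the segment being built so the long part's sub-parts stay on their own.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
                if (subSplitter == null) {
                    throw new IllegalStateException("Part exceeds the limit and no sub-splitter was given");
                }
                segments.addAll(subSplitter.apply(part));
                continue;
            }
            int joinedLength = current.length() + joinDelimiter.length() + part.length();
            if (current.length() > 0 && joinedLength > maxSegmentSizeInChars) {
                // The next part does not fit: emit the current segment and start a new one.
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(joinDelimiter);
            }
            current.append(part);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

For example, `pack("aa,bb,c c c c,dd", ",", " ", 5, p -> pack(p, " ", " ", 5, null))` yields the segments `aa bb`, `c c c`, `c` and `dd`. The position of each segment in the returned list plays the role of the "index" metadata value.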

Field Summary

Fields inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
maxOverlapSize, maxSegmentSize, subSplitter, tokenCountEstimator
Constructor Summary

Constructors

- DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars)
- DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
- DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
- DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
Method Summary

- protected DocumentSplitter defaultSubSplitter(): The default sub-splitter to use when a single segment is too long.
- String joinDelimiter(): Delimiter string to use to re-join the parts.
- String[] split(String text): Splits the provided text into parts.

Methods inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
split

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll, splitAll

Constructor Details
- DocumentByRegexSplitter
  
  public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars)
- DocumentByRegexSplitter
  
  public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
- DocumentByRegexSplitter
  
  public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
- DocumentByRegexSplitter
  
  public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
Method Details
- split
  
  public String[] split(String text)
  
  Description copied from class: HierarchicalDocumentSplitter
  
  Splits the provided text into parts. Implementation API.
  
  Specified by:
  
  split in class HierarchicalDocumentSplitter
  
  Parameters:
  
  text - The text to be split.
  
  Returns:
  
  An array of parts.
- joinDelimiter
  
  public String joinDelimiter()
  
  Description copied from class: HierarchicalDocumentSplitter
  
  Delimiter string to use to re-join the parts.
  
  Specified by:
  
  joinDelimiter in class HierarchicalDocumentSplitter
  
  Returns:
  
  The delimiter.
- defaultSubSplitter
  
  protected DocumentSplitter defaultSubSplitter()
  
  Description copied from class: HierarchicalDocumentSplitter
  
  The default sub-splitter to use when a single segment is too long.
  
  Specified by:
  
  defaultSubSplitter in class HierarchicalDocumentSplitter
  
  Returns:
  
  The default sub-splitter.
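
As a rough illustration of how the split(String text) and joinDelimiter() hooks above relate, the sketch below splits a string with a regex and re-joins the parts with a delimiter. It assumes the parts behave like the result of java.lang.String.split(regex); the source does not state how the class produces them. RegexSplitSketch, split and rejoin are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; it is not part of LangChain4j.
class RegexSplitSketch {

    // Assumption: parts are produced the way String.split(regex) produces them.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    // Re-joins parts with the given delimiter, mirroring the role of joinDelimiter().
    static String rejoin(String[] parts, String joinDelimiter) {
        return String.join(joinDelimiter, parts);
    }

    // Escapes a literal delimiter so that regex metacharacters such as "|" or "."
    // are matched verbatim rather than interpreted as regex syntax.
    static String literal(String delimiter) {
        return Pattern.quote(delimiter);
    }
}
```

Under that assumption, splitting `"first\n\nsecond\n\nthird"` with the regex `"\\n\\n"` gives three parts, and re-joining them with `"\n\n"` restores the original text. A delimiter that contains regex metacharacters needs escaping first, for example `split("a|b|c", literal("|"))`.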
