Class HierarchicalDocumentSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
- All Implemented Interfaces:
 DocumentSplitter
- Direct Known Subclasses:
 DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter
Base class for hierarchical document splitters.
Implements DocumentSplitter and provides the machinery for sub-splitting a document
when a single segment is too long.
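The idea of sub-splitting described above can be illustrated with a small, self-contained sketch. It is not the langchain4j implementation and does not use the library: the class HierarchicalSplitSketch and its methods split and byCharacters are hypothetical names chosen for illustration, sizes are measured in characters only, and overlap is ignored. The sketch splits text into parts, re-joins parts with a delimiter while they fit, and hands any part that is too long to a finer-grained sub-splitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified illustration only; NOT the langchain4j implementation.
final class HierarchicalSplitSketch {

    // Hypothetical helper: splits text by splitRegex, packs the parts into segments of at
    // most maxChars (re-joined with joinDelimiter), and delegates oversized parts to subSplitter.
    static List<String> split(String text, String splitRegex, String joinDelimiter,
                              int maxChars, Function<String, List<String>> subSplitter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : text.split(splitRegex)) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.length() > maxChars) {
                // A single part is too long: emit what we have, then sub-split the part.
                flush(current, segments);
                segments.addAll(subSplitter.apply(part));
                continue;
            }
            int needed = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter.length() + part.length();
            if (needed > maxChars) {
                flush(current, segments);
            }
            if (current.length() > 0) {
                current.append(joinDelimiter);
            }
            current.append(part);
        }
        flush(current, segments);
        return segments;
    }

    private static void flush(StringBuilder current, List<String> segments) {
        if (current.length() > 0) {
            segments.add(current.toString());
            current.setLength(0);
        }
    }

    // Hypothetical fallback sub-splitter: cuts text into fixed-size character chunks.
    static List<String> byCharacters(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += maxChars) {
            chunks.add(text.substring(i, Math.min(text.length(), i + maxChars)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "one two\n\nthree\n\nabcdefghijklmnop";
        List<String> segments =
                split(text, "\\n\\n", "\n\n", 14, part -> byCharacters(part, 14));
        segments.forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

In this sketch the first two paragraphs fit together into one segment, while the third paragraph exceeds the limit and is cut by the character-level fallback. The roles loosely correspond to split(String), joinDelimiter() and defaultSubSplitter() on this class, but the actual library behavior may differ.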
Field Summary
Fields (Modifier and Type, Field):
 protected final int maxOverlapSize
 protected final int maxSegmentSize
 protected final DocumentSplitter subSplitter
 protected final TokenCountEstimator tokenCountEstimator
Constructor Summary
Constructors (Modifier, Constructor, Description):
 protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars) - Creates a new instance of HierarchicalDocumentSplitter.
 protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter) - Creates a new instance of HierarchicalDocumentSplitter.
 protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator) - Creates a new instance of HierarchicalDocumentSplitter.
 protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter) - Creates a new instance of HierarchicalDocumentSplitter.
Method Summary
Methods (Modifier and Type, Method, Description):
 protected abstract DocumentSplitter defaultSubSplitter() - The default sub-splitter to use when a single segment is too long.
 protected abstract String joinDelimiter() - Delimiter string to use to re-join the parts.
 List<TextSegment> split(Document document) - Splits a single Document into a list of TextSegment objects.
 protected abstract String[] split(String text) - Splits the provided text into parts.
Methods inherited from class Object
 clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface DocumentSplitter
 splitAll, splitAll
Field Details
maxSegmentSize
protected final int maxSegmentSize
maxOverlapSize
protected final int maxOverlapSize
tokenCountEstimator
protected final TokenCountEstimator tokenCountEstimator
subSplitter
protected final DocumentSplitter subSplitter
Constructor Details
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
 maxSegmentSizeInChars - The maximum size of a segment in characters.
 maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
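To make the two parameters concrete, the following self-contained sketch shows what a maximum segment size and a maximum overlap mean for consecutive segments. It is an illustration of the concept only, not the library's algorithm: OverlapSketch and window are hypothetical names, the text is cut at fixed character offsets, and the requirement that the overlap be smaller than the segment size is a constraint of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual illustration of segment size and overlap; NOT the langchain4j algorithm.
final class OverlapSketch {

    // Hypothetical helper: fixed-size character windows where each segment repeats the
    // last maxOverlapSizeInChars characters of the previous one.
    static List<String> window(String text, int maxSegmentSizeInChars, int maxOverlapSizeInChars) {
        if (maxSegmentSizeInChars <= 0
                || maxOverlapSizeInChars < 0
                || maxOverlapSizeInChars >= maxSegmentSizeInChars) {
            // Sketch-specific requirement so that every step makes progress.
            throw new IllegalArgumentException("overlap must be >= 0 and smaller than the segment size");
        }
        List<String> segments = new ArrayList<>();
        int step = maxSegmentSizeInChars - maxOverlapSizeInChars;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(text.length(), start + maxSegmentSizeInChars);
            segments.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return segments;
    }
}
```

For example, window("abcdefghij", 4, 1) produces "abcd", "defg" and "ghij": no segment is longer than four characters, and each segment starts with the last character of the one before it.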
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
 maxSegmentSizeInChars - The maximum size of a segment in characters.
 maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
 subSplitter - The sub-splitter to use when a single segment is too long.
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
 maxSegmentSizeInTokens - The maximum size of a segment in tokens.
 maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
 tokenCountEstimator - The TokenCountEstimator to use to estimate the number of tokens in a text.
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
 maxSegmentSizeInTokens - The maximum size of a segment in tokens.
 maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
 tokenCountEstimator - The TokenCountEstimator to use to estimate the number of tokens in a text.
 subSplitter - The sub-splitter to use when a single segment is too long.
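The constructors come in two flavors: limits expressed in characters, and limits expressed in tokens together with an estimator. The sketch below shows one plausible way a single size limit could serve both cases. It rests on assumptions not stated on this page: that size falls back to the character count when no estimator is supplied, and that an estimator can be modeled as a function from text to an int. SegmentSizeSketch, sizeOf, fits and WORD_COUNT are hypothetical names, and the ToIntFunction is only a stand-in for TokenCountEstimator.

```java
import java.util.function.ToIntFunction;

// Sketch under stated assumptions; NOT the langchain4j implementation.
final class SegmentSizeSketch {

    // Hypothetical stand-in for a token estimator: counts whitespace-separated words.
    static final ToIntFunction<String> WORD_COUNT =
            text -> text.isBlank() ? 0 : text.trim().split("\\s+").length;

    // Assumption: with no estimator the size is the character count, otherwise the token estimate.
    static int sizeOf(String text, ToIntFunction<String> tokenEstimator) {
        return tokenEstimator == null ? text.length() : tokenEstimator.applyAsInt(text);
    }

    // True when the text fits within the limit in the chosen unit.
    static boolean fits(String text, int maxSegmentSize, ToIntFunction<String> tokenEstimator) {
        return sizeOf(text, tokenEstimator) <= maxSegmentSize;
    }
}
```

With a limit of 10, the text "hello big world" does not fit when measured in characters (15), but fits when measured with the word-count stand-in (3), which is why the unit of the limit matters when choosing a constructor.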
Method Details
split
protected abstract String[] split(String text)
Splits the provided text into parts.
joinDelimiter
protected abstract String joinDelimiter()
Delimiter string to use to re-join the parts.
- Returns:
 The delimiter.
defaultSubSplitter
protected abstract DocumentSplitter defaultSubSplitter()
The default sub-splitter to use when a single segment is too long.
- Returns:
 The default sub-splitter.
split
public List<TextSegment> split(Document document)
Description copied from interface: DocumentSplitter
Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.
- Specified by:
 split in interface DocumentSplitter
- Parameters:
 document - The Document to be split.
- Returns:
 A list of TextSegment objects derived from the input Document.
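The note about metadata being copied and enriched can be pictured with plain maps. This is a hedged sketch, not the library's Metadata or TextSegment types: SegmentMetadataSketch and enrich are hypothetical names, and the "index" key is an assumed example of segment-specific information.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration with plain maps; NOT the langchain4j Metadata API.
final class SegmentMetadataSketch {

    // Hypothetical helper: builds one metadata map per segment by copying the document's
    // metadata and adding the segment's position under the assumed key "index".
    static List<Map<String, String>> enrich(Map<String, String> documentMetadata, int segmentCount) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            Map<String, String> copy = new HashMap<>(documentMetadata); // copy, never share
            copy.put("index", String.valueOf(i));
            result.add(copy);
        }
        return result;
    }
}
```

Copying the map per segment keeps the entries independent, so enriching one segment does not leak into the others or into the original document metadata.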