Class HierarchicalDocumentSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
- All Implemented Interfaces:
DocumentSplitter
- Direct Known Subclasses:
DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter
Base class for hierarchical document splitters.
Implements DocumentSplitter and provides machinery for sub-splitting documents
when a single segment is too long.
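The sub-splitting behaviour described above can be modelled roughly as follows. This is a minimal, self-contained sketch and not the library's implementation: the class names SketchSplitter, ParagraphSketch, WordSketch and HierarchicalSplitDemo are hypothetical, sizes are measured in characters only, and overlap between segments is ignored. The three abstract hooks only mirror the abstract methods listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of hierarchical splitting (not langchain4j code).
abstract class SketchSplitter {
    protected final int maxSegmentSize; // measured in characters in this sketch

    protected SketchSplitter(int maxSegmentSize) {
        this.maxSegmentSize = maxSegmentSize;
    }

    protected abstract String[] split(String text);          // cut text into parts
    protected abstract String joinDelimiter();               // used to re-join parts
    protected abstract SketchSplitter defaultSubSplitter();  // null at the lowest level

    List<String> splitToSegments(String text) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : split(text)) {
            if (part.isEmpty()) {
                continue;
            }
            int joinedSize = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter().length() + part.length();
            if (joinedSize <= maxSegmentSize) {
                // The part still fits: append it to the segment being built.
                if (current.length() > 0) {
                    current.append(joinDelimiter());
                }
                current.append(part);
                continue;
            }
            // The part does not fit: close the segment built so far.
            if (current.length() > 0) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (part.length() <= maxSegmentSize) {
                current.append(part);
            } else {
                // A single part is too long: delegate to the finer-grained sub-splitter.
                SketchSplitter sub = defaultSubSplitter();
                if (sub == null) {
                    throw new IllegalStateException("Part exceeds maxSegmentSize and no sub-splitter is available");
                }
                segments.addAll(sub.splitToSegments(part));
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}

class ParagraphSketch extends SketchSplitter {
    ParagraphSketch(int maxSegmentSize) { super(maxSegmentSize); }
    @Override protected String[] split(String text) { return text.split("\\n\\s*\\n"); }
    @Override protected String joinDelimiter() { return "\n\n"; }
    @Override protected SketchSplitter defaultSubSplitter() { return new WordSketch(maxSegmentSize); }
}

class WordSketch extends SketchSplitter {
    WordSketch(int maxSegmentSize) { super(maxSegmentSize); }
    @Override protected String[] split(String text) { return text.trim().split("\\s+"); }
    @Override protected String joinDelimiter() { return " "; }
    @Override protected SketchSplitter defaultSubSplitter() { return null; }
}

class HierarchicalSplitDemo {
    public static void main(String[] args) {
        String text = "aa bb\n\ncc dd ee ff gg\n\nhh";
        // The middle paragraph is longer than 8 characters, so it is re-split by words.
        System.out.println(new ParagraphSketch(8).splitToSegments(text));
        // prints [aa bb, cc dd ee, ff gg, hh]
    }
}
```

In this sketch the paragraph-level splitter never needs to know how words are separated; it only hands an oversized part to whatever defaultSubSplitter() returns, which is the layering the class description suggests.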
-
Field Summary
Fields
protected final int maxOverlapSize
protected final int maxSegmentSize
protected final DocumentSplitter subSplitter
protected final TokenCountEstimator tokenCountEstimator
-
Constructor Summary
Constructors
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
    Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
    Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
    Creates a new instance of HierarchicalDocumentSplitter.
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
    Creates a new instance of HierarchicalDocumentSplitter.
-
Method Summary
protected abstract DocumentSplitter defaultSubSplitter()
    The default sub-splitter to use when a single segment is too long.
protected abstract String joinDelimiter()
    Delimiter string to use to re-join the parts.
List<TextSegment> split(Document document)
    Splits a single Document into a list of TextSegment objects.
protected abstract String[] split(String text)
    Splits the provided text into parts.
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface DocumentSplitter
splitAll, splitAll
-
Field Details
-
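maxSegmentSize
protected final int maxSegmentSize
-
maxOverlapSize
protected final int maxOverlapSize
-
tokenCountEstimator
protected final TokenCountEstimator tokenCountEstimator
-
subSplitter
protected final DocumentSplitter subSplitter
-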
-
Constructor Details
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
subSplitter - The sub-splitter to use when a single segment is too long.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenCountEstimator - The TokenCountEstimator to use to estimate the number of tokens in a text.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenCountEstimator - The TokenCountEstimator to use to estimate the number of tokens in a text.
subSplitter - The sub-splitter to use when a single segment is too long.
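The two constructor families differ only in the unit used for the size limits: characters, or tokens as counted by an estimator. A rough, self-contained sketch of that distinction is shown here. The class SizeMeasureSketch, its methods, and the word-counting WORD_COUNT estimator are hypothetical stand-ins, and treating "no estimator" as "measure in characters" is an assumption of this sketch rather than something this page states.

```java
import java.util.function.ToIntFunction;

// Hypothetical illustration of character-based versus token-based size limits.
class SizeMeasureSketch {
    // Stand-in for a token count estimator: counts whitespace-separated words.
    static final ToIntFunction<String> WORD_COUNT =
            text -> text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;

    private final int maxSegmentSize;
    private final ToIntFunction<String> estimator; // null means "measure in characters"

    // Mirrors the ...InChars constructors: no estimator is supplied.
    SizeMeasureSketch(int maxSegmentSizeInChars) {
        this(maxSegmentSizeInChars, null);
    }

    // Mirrors the ...InTokens constructors: the estimator defines the unit.
    SizeMeasureSketch(int maxSegmentSizeInTokens, ToIntFunction<String> estimator) {
        this.maxSegmentSize = maxSegmentSizeInTokens;
        this.estimator = estimator;
    }

    int sizeOf(String text) {
        return estimator == null ? text.length() : estimator.applyAsInt(text);
    }

    boolean fits(String text) {
        return sizeOf(text) <= maxSegmentSize;
    }
}
```

With a limit of 10, the text "one two three" does not fit when measured in characters (13) but does fit when measured with the word-counting stand-in (3), so the same numeric limit can mean very different segment lengths depending on the constructor used.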
-
-
Method Details
-
split
protected abstract String[] split(String text)
Splits the provided text into parts.
-
joinDelimiter
protected abstract String joinDelimiter()
Delimiter string to use to re-join the parts.
- Returns:
The delimiter.
-
defaultSubSplitter
protected abstract DocumentSplitter defaultSubSplitter()
The default sub-splitter to use when a single segment is too long.
- Returns:
The default sub-splitter.
-
split
public List<TextSegment> split(Document document)
Description copied from interface: DocumentSplitter
Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.
- Specified by:
split in interface DocumentSplitter
- Parameters:
document - The Document to be split.
- Returns:
A list of TextSegment objects derived from the input Document.
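The metadata handling described for this method can be pictured with a small, self-contained sketch. It is only an illustration under stated assumptions: the Segment class and the toSegments helper are hypothetical, plain string maps stand in for the library's metadata type, and the "index" key is assumed here as one way to record a segment's position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "copy the document metadata, then enrich it per segment".
class SegmentMetadataSketch {

    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    static List<Segment> toSegments(List<String> parts, Map<String, String> documentMetadata) {
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            // Each segment gets its own copy, so the document's metadata is never mutated.
            Map<String, String> metadata = new HashMap<>(documentMetadata);
            // Segment-specific enrichment; the "index" key is an assumption of this sketch.
            metadata.put("index", String.valueOf(i));
            segments.add(new Segment(parts.get(i), metadata));
        }
        return segments;
    }
}
```

Copying the map per segment keeps segments independent of each other, which matters if later processing adds or changes entries on a single segment.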
-