Class HierarchicalDocumentSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
- All Implemented Interfaces:
DocumentSplitter
- Direct Known Subclasses:
DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter
Base class for hierarchical document splitters.
Implements DocumentSplitter and provides machinery for sub-splitting documents when a single segment is too long.
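The sub-splitting idea can be illustrated with a minimal, standalone sketch. It is not the langchain4j implementation: the class `HierarchicalSplitSketch` and its methods are hypothetical, sizes are measured in characters only, and overlap is not modeled. Text is first split on blank lines, the parts are greedily re-joined up to the size limit, and any part that is still too long is handed to a finer-grained word-level split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for hierarchical splitting (hypothetical, not the library's code).
final class HierarchicalSplitSketch {

    // Coarse level: split on blank lines, re-join with "\n\n",
    // and fall back to the word level for oversized paragraphs.
    static List<String> splitByParagraph(String text, int maxChars) {
        return pack(text.split("\\n\\s*\\n"), "\n\n", maxChars, true);
    }

    // Fine level: split on whitespace and re-join with a single space.
    static List<String> splitByWord(String text, int maxChars) {
        return pack(text.split("\\s+"), " ", maxChars, false);
    }

    // Greedily packs parts into segments no longer than maxChars.
    private static List<String> pack(String[] parts, String delimiter,
                                     int maxChars, boolean subSplit) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : parts) {
            if (part.isEmpty()) {
                continue; // skip empty parts produced by leading delimiters
            }
            int sizeIfAdded = current.length() == 0
                    ? part.length()
                    : current.length() + delimiter.length() + part.length();
            if (sizeIfAdded <= maxChars) {
                if (current.length() > 0) {
                    current.append(delimiter);
                }
                current.append(part);
                continue;
            }
            // The part does not fit: close the segment built so far.
            if (current.length() > 0) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (part.length() <= maxChars) {
                current.append(part);            // starts the next segment
            } else if (subSplit) {
                segments.addAll(splitByWord(part, maxChars)); // sub-split
            } else {
                segments.add(part);              // no finer level in this sketch
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

In this sketch a single word longer than the limit is emitted unchanged; a fuller hierarchy would continue down to a character-level split.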
-
Field Summary
Modifier and Type | Field | Description
protected final int | maxOverlapSize |
protected final int | maxSegmentSize |
protected final DocumentSplitter | subSplitter |
protected final Tokenizer | tokenizer |
-
Constructor Summary
Modifier | Constructor | Description
protected | HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars) | Creates a new instance of HierarchicalDocumentSplitter.
protected | HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter) | Creates a new instance of HierarchicalDocumentSplitter.
protected | HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer) | Creates a new instance of HierarchicalDocumentSplitter.
protected | HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter) | Creates a new instance of HierarchicalDocumentSplitter.
-
Method Summary
Modifier and Type | Method | Description
protected abstract DocumentSplitter | defaultSubSplitter() | The default sub-splitter to use when a single segment is too long.
protected abstract String | joinDelimiter() | Delimiter string to use to re-join the parts.
List<TextSegment> | split(Document document) | Splits a single Document into a list of TextSegment objects.
protected abstract String[] | split(String text) | Splits the provided text into parts.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll
-
Field Details
-
maxSegmentSize
protected final int maxSegmentSize
-
maxOverlapSize
protected final int maxOverlapSize
-
tokenizer
protected final Tokenizer tokenizer
-
subSplitter
protected final DocumentSplitter subSplitter
-
-
Constructor Details
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, HierarchicalDocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInChars - The maximum size of a segment in characters.
maxOverlapSizeInChars - The maximum size of the overlap between segments in characters.
subSplitter - The sub-splitter to use when a single segment is too long.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenizer - The tokenizer to use to estimate the number of tokens in a text.
-
HierarchicalDocumentSplitter
protected HierarchicalDocumentSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
Creates a new instance of HierarchicalDocumentSplitter.
- Parameters:
maxSegmentSizeInTokens - The maximum size of a segment in tokens.
maxOverlapSizeInTokens - The maximum size of the overlap between segments in tokens.
tokenizer - The tokenizer to use to estimate the number of tokens in a text.
subSplitter - The sub-splitter to use when a single segment is too long.
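The constructors differ in the unit used for the size limits: characters when no tokenizer is given, tokens otherwise. The following standalone sketch shows one way such a switch could work. `SegmentSizeSketch`, `sizeOf`, `fits`, and the whitespace-based token counter are all hypothetical stand-ins and are not part of the library; a real Tokenizer would usually count model-specific tokens.

```java
import java.util.function.ToIntFunction;

// Hypothetical illustration of character-based vs. token-based size limits.
final class SegmentSizeSketch {

    private final int maxSegmentSize;
    private final ToIntFunction<String> tokenCounter; // null means "measure in characters"

    // Character-based limit: no token counter is supplied.
    SegmentSizeSketch(int maxSegmentSizeInChars) {
        this(maxSegmentSizeInChars, null);
    }

    // Token-based limit: sizes are estimated by the supplied counter.
    SegmentSizeSketch(int maxSegmentSizeInTokens, ToIntFunction<String> tokenCounter) {
        this.maxSegmentSize = maxSegmentSizeInTokens;
        this.tokenCounter = tokenCounter;
    }

    // Size of the text in whichever unit this instance was configured with.
    int sizeOf(String text) {
        return tokenCounter == null ? text.length() : tokenCounter.applyAsInt(text);
    }

    // True when the text fits into a single segment.
    boolean fits(String text) {
        return sizeOf(text) <= maxSegmentSize;
    }

    // Crude stand-in for a tokenizer: counts whitespace-separated words.
    static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

With a limit of 10, the text "hello world" is too long when measured in characters (11) but fits when measured with the word-counting stand-in (2).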
-
-
Method Details
-
split
protected abstract String[] split(String text)
Splits the provided text into parts. Implementation API.
- Parameters:
text - The text to be split.
- Returns:
An array of parts.
-
joinDelimiter
protected abstract String joinDelimiter()
Delimiter string to use to re-join the parts.
- Returns:
The delimiter.
-
defaultSubSplitter
protected abstract DocumentSplitter defaultSubSplitter()
The default sub-splitter to use when a single segment is too long.
- Returns:
The default sub-splitter.
-
split
public List<TextSegment> split(Document document)
Description copied from interface: DocumentSplitter
Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.
- Specified by:
split in interface DocumentSplitter
- Parameters:
document - The Document to be split.
- Returns:
A list of TextSegment objects derived from the input Document.
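The metadata handling described above can be pictured with a small standalone sketch. The `Segment` class, the `toSegments` helper, and the `"index"` key are assumptions made for illustration and do not use the library's Document, TextSegment, or Metadata types. Each segment receives its own copy of the document metadata plus its position in the document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of copying and enriching metadata per segment.
final class SegmentMetadataSketch {

    // Stand-in for a text segment: the text plus its own metadata map.
    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Builds one segment per part, copying the document metadata and
    // adding the segment's position under the assumed key "index".
    static List<Segment> toSegments(List<String> parts, Map<String, String> documentMetadata) {
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            // Copy so that segments never share or mutate the document's map.
            Map<String, String> metadata = new LinkedHashMap<>(documentMetadata);
            metadata.put("index", String.valueOf(i));
            segments.add(new Segment(parts.get(i), metadata));
        }
        return segments;
    }
}
```

Copying the map per segment keeps later changes to one segment's metadata from leaking into the others or back into the source document.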
-