Class DocumentByCharacterSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByCharacterSplitter
- All Implemented Interfaces:
DocumentSplitter
Splits the provided Document into characters and attempts to fit as many characters as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.
The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.
If multiple characters fit within maxSegmentSize, they are joined together without delimiters.
Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
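The character-based packing described above can be illustrated with a minimal, self-contained sketch. It is not the LangChain4j implementation: the class CharacterPackingSketch and the method packByCharacter are hypothetical names introduced only for this example. The sketch assumes a character-based limit and ignores overlap, tokenizers, and metadata. The position of each part in the returned list plays the role of the "index" metadata value.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this class is hypothetical and not part of LangChain4j.
class CharacterPackingSketch {

    // Packs the characters of the text greedily into segments of at most
    // maxSegmentSizeInChars characters, joining them without a delimiter.
    // Overlap, token-based limits, and metadata are deliberately left out.
    static List<String> packByCharacter(String text, int maxSegmentSizeInChars) {
        if (maxSegmentSizeInChars <= 0) {
            throw new IllegalArgumentException("maxSegmentSizeInChars must be positive");
        }
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // Close the current segment once it has reached the size limit.
            if (current.length() == maxSegmentSizeInChars) {
                segments.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        // Keep the final, possibly shorter, segment.
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> segments = packByCharacter("abcdefgh", 3);
        for (int index = 0; index < segments.size(); index++) {
            System.out.println(index + ": " + segments.get(index));
        }
        // prints:
        // 0: abc
        // 1: def
        // 2: gh
    }
}
```

In this sketch every segment except possibly the last is filled to exactly the limit, which is what "as many characters as possible" amounts to when the limit is counted in characters.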
-
Field Summary
Fields inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
maxOverlapSize, maxSegmentSize, subSplitter, tokenizer
-
Constructor Summary
Constructor
DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
-
Method Summary
Modifier and Type | Method | Description
protected DocumentSplitter | defaultSubSplitter | The default sub-splitter to use when a single segment is too long.
String | joinDelimiter | Delimiter string to use to re-join the parts.
String[] | split | Splits the provided text into parts.
Methods inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
split
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll
-
Constructor Details
-
DocumentByCharacterSplitter
public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
-
DocumentByCharacterSplitter
public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
-
DocumentByCharacterSplitter
public DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
-
DocumentByCharacterSplitter
public DocumentByCharacterSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
-
-
Method Details
-
split
Description copied from class: HierarchicalDocumentSplitter
Splits the provided text into parts. Implementation API.
- Specified by:
split in class HierarchicalDocumentSplitter
- Parameters:
text - The text to be split.
- Returns:
An array of parts.
-
joinDelimiter
Description copied from class: HierarchicalDocumentSplitter
Delimiter string to use to re-join the parts.
- Specified by:
joinDelimiter in class HierarchicalDocumentSplitter
- Returns:
The delimiter.
-
defaultSubSplitter
Description copied from class: HierarchicalDocumentSplitter
The default sub-splitter to use when a single segment is too long.
- Specified by:
defaultSubSplitter in class HierarchicalDocumentSplitter
- Returns:
The default sub-splitter.
-