Class DocumentByRegexSplitter
java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByRegexSplitter
- All Implemented Interfaces:
DocumentSplitter
Splits the provided
Document into parts using the provided regex and attempts to fit as many parts
as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.
The maxSegmentSize can be defined in terms of characters (default) or tokens.
For a token-based limit, a TokenCountEstimator must be provided.
If multiple parts fit within maxSegmentSize, they are joined together using the provided joinDelimiter.
If a single part is too long and exceeds maxSegmentSize, the subSplitter (which should be provided)
is used to split it into sub-parts and place them into multiple segments.
Such segments contain only the sub-parts of the split long part.
Each TextSegment inherits all metadata from the Document and includes an "index" metadata key
representing its position within the document (starting from 0).
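The packing behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the class RegexPackingSketch, the method pack, and the sizeOf parameter are hypothetical names introduced here, and oversized parts are simply kept whole instead of being handed to a subSplitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical illustration of regex splitting followed by greedy packing.
final class RegexPackingSketch {

    /**
     * Splits text on regex, then joins consecutive parts with joinDelimiter
     * for as long as the joined result stays within maxSegmentSize.
     * sizeOf measures a candidate segment (characters, tokens, ...).
     */
    static List<String> pack(String text, String regex, String joinDelimiter,
                             int maxSegmentSize, ToIntFunction<String> sizeOf) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : text.split(regex)) {
            if (part.isEmpty()) {
                continue; // skip empty parts produced by adjacent delimiters
            }
            String candidate = current.length() == 0
                    ? part
                    : current + joinDelimiter + part;
            if (sizeOf.applyAsInt(candidate) <= maxSegmentSize) {
                current.setLength(0);
                current.append(candidate);
            } else {
                if (current.length() > 0) {
                    segments.add(current.toString());
                }
                current.setLength(0);
                // Assumption: an oversized part is kept whole in this sketch;
                // the documented class would pass it to the subSplitter.
                current.append(part);
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        // Character-based limit: String::length plays the role of the size measure.
        List<String> segments = pack("alpha;beta;gamma;delta", ";", " | ", 14, String::length);
        for (int index = 0; index < segments.size(); index++) {
            System.out.println(index + ": " + segments.get(index)); // index starts from 0
        }
    }
}
```

Passing a different sizeOf function, for example a whitespace-based word counter, gives a rough analogue of the token-based limit, where a TokenCountEstimator would do the measuring in the real class.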
-
Field Summary
Fields inherited from class HierarchicalDocumentSplitter
maxOverlapSize, maxSegmentSize, subSplitter, tokenCountEstimator
-
Constructor Summary
Constructors
DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars)
DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
-
Method Summary
Modifier and Type, Method, Description
protected DocumentSplitter defaultSubSplitter(): The default sub-splitter to use when a single segment is too long.
String joinDelimiter(): Delimiter string to use to re-join the parts.
String[] split(String text): Splits the provided text into parts.
Methods inherited from class HierarchicalDocumentSplitter
split
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface DocumentSplitter
splitAll, splitAll
-
Constructor Details
-
DocumentByRegexSplitter
public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars)
-
DocumentByRegexSplitter
public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
-
DocumentByRegexSplitter
public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator)
-
DocumentByRegexSplitter
public DocumentByRegexSplitter(String regex, String joinDelimiter, int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, TokenCountEstimator tokenCountEstimator, DocumentSplitter subSplitter)
-
-
Method Details
-
split
Description copied from class: HierarchicalDocumentSplitter
Splits the provided text into parts. Implementation API.
- Specified by:
split in class HierarchicalDocumentSplitter
- Parameters:
text - The text to be split.
- Returns:
An array of parts.
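As a rough illustration of turning text into an array of parts with a regex, the sketch below uses the standard String.split call. That this class delegates to String.split is an assumption made for the example, and RegexSplitSketch and splitIntoParts are hypothetical names.

```java
// Hypothetical helper showing regex-based splitting with the standard library.
final class RegexSplitSketch {

    /** Returns the parts of text separated by matches of regex. */
    static String[] splitIntoParts(String text, String regex) {
        // String.split drops trailing empty strings when no limit is given.
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Two or more consecutive newlines act as a single paragraph separator.
        String[] parts = splitIntoParts("first\n\nsecond\n\n\nthird", "\\n{2,}");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}
```

The regex is applied to the raw text, so a pattern such as `\n{2,}` treats any run of blank lines as one boundary, while a plain `\n\n` would leave empty parts behind for longer runs.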
-
joinDelimiter
Description copied from class: HierarchicalDocumentSplitter
Delimiter string to use to re-join the parts.
- Specified by:
joinDelimiter in class HierarchicalDocumentSplitter
- Returns:
The delimiter.
-
defaultSubSplitter
Description copied from class: HierarchicalDocumentSplitter
The default sub-splitter to use when a single segment is too long.
- Specified by:
defaultSubSplitter in class HierarchicalDocumentSplitter
- Returns:
The default sub-splitter.
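To make the role of a sub-splitter concrete, here is a minimal sketch of what happens conceptually when one part exceeds the limit: the part is broken into smaller sub-parts, which are packed into segments of their own. The word-based fallback and the names SubSplitSketch and splitLongPart are assumptions for illustration only, not the behaviour of any particular DocumentSplitter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sub-splitter that falls back to word boundaries.
final class SubSplitSketch {

    /**
     * Returns the part unchanged if it fits within maxSegmentSize characters;
     * otherwise splits it on whitespace and greedily packs the words into
     * segments that contain only sub-parts of this one long part.
     */
    static List<String> splitLongPart(String part, int maxSegmentSize) {
        List<String> segments = new ArrayList<>();
        if (part.length() <= maxSegmentSize) {
            segments.add(part);
            return segments;
        }
        StringBuilder current = new StringBuilder();
        for (String word : part.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Flush the current segment if adding " " + word would overflow it.
            if (current.length() > 0
                    && current.length() + 1 + word.length() > maxSegmentSize) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            // Note: a single word longer than maxSegmentSize stays oversized here.
            current.append(word);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

For example, splitLongPart("one two three four", 9) yields the segments "one two", "three", and "four", none of which mixes in text from neighbouring parts.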
-