Class DocumentBySentenceSplitter
- All Implemented Interfaces: DocumentSplitter
Splits the provided Document into sentences and attempts to fit as many sentences as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.
The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.
Sentence boundaries are detected using the Apache OpenNLP library with the English sentence model.
If multiple sentences fit within maxSegmentSize, they are joined together using a space (" ").
If a single sentence is too long and exceeds maxSegmentSize, the subSplitter (DocumentByWordSplitter by default) is used to split it into smaller parts and place them into multiple segments. Such segments contain only the parts of the split long sentence.
Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
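The packing behaviour described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the library's implementation: the class and method names (SentencePackingSketch, greedyJoin, pack) are hypothetical, sentences are passed in already split, sizes are measured in characters, and segment overlap is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only, not the library's code: sentences arrive pre-split,
// sizes are measured in characters, and segment overlap is ignored.
class SentencePackingSketch {

    // Greedily joins parts with a single space, starting a new segment whenever
    // adding the next part would exceed max characters.
    // A part that is itself longer than max is emitted unchanged.
    static List<String> greedyJoin(List<String> parts, int max) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : parts) {
            // The +1 accounts for the joining space.
            if (current.length() > 0 && current.length() + 1 + part.length() > max) {
                out.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(part);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    // Packs sentences into segments. An oversized sentence is split into words
    // (standing in for a word-level sub-splitter) and packed into segments of
    // its own, so those segments hold only parts of that sentence.
    static List<String> pack(List<String> sentences, int maxSegmentSize) {
        List<String> segments = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentence.length() > maxSegmentSize) {
                segments.addAll(greedyJoin(run, maxSegmentSize));
                run.clear();
                List<String> words = Arrays.asList(sentence.split("\\s+"));
                segments.addAll(greedyJoin(words, maxSegmentSize));
            } else {
                run.add(sentence);
            }
        }
        segments.addAll(greedyJoin(run, maxSegmentSize));
        return segments;
    }

    public static void main(String[] args) {
        List<String> segments = pack(Arrays.asList(
                "One two.", "Three four.", "This sentence is far too long to fit."), 20);
        // The list position plays the role of the "index" metadata value.
        for (int i = 0; i < segments.size(); i++) {
            System.out.println(i + ": " + segments.get(i));
        }
    }
}
```

In this sketch the first two sentences share a segment because together with the joining space they stay within the limit, while the long sentence is broken up by words and never mixed with its neighbours.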
-
Field Summary
Fields inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
maxOverlapSize, maxSegmentSize, subSplitter, tokenizer
-
Constructor Summary
Constructor
DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
-
Method Summary
Modifier and Type, Method, Description
protected DocumentSplitter defaultSubSplitter()
The default sub-splitter to use when a single segment is too long.
String joinDelimiter()
Delimiter string to use to re-join the parts.
String[] split(String text)
Splits the provided text into parts.
Methods inherited from class dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
split
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentSplitter
splitAll
-
Constructor Details
-
DocumentBySentenceSplitter
public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
-
DocumentBySentenceSplitter
public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
-
DocumentBySentenceSplitter
public DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
-
DocumentBySentenceSplitter
public DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
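The two constructor families differ only in how a size is measured, so the same numeric limit can mean quite different things. The sketch below illustrates that with JDK types only. SizeMeasureSketch, CHARS, WORDS and fits are hypothetical names, and the whitespace word counter is merely a stand-in for a real Tokenizer, whose counts will generally differ.

```java
import java.util.function.ToIntFunction;

// Illustrative sketch only: shows how one limit behaves under two size measures.
class SizeMeasureSketch {

    // Character-based measure, corresponding to the ...InChars constructors.
    static final ToIntFunction<String> CHARS = String::length;

    // Assumption: a crude whitespace word count standing in for a Tokenizer.
    // Real tokenizers usually produce different counts.
    static final ToIntFunction<String> WORDS =
            s -> s.trim().isEmpty() ? 0 : s.trim().split("\\s+").length;

    // True if the candidate segment stays within the limit under the given measure.
    static boolean fits(String candidate, int maxSegmentSize, ToIntFunction<String> measure) {
        return measure.applyAsInt(candidate) <= maxSegmentSize;
    }

    public static void main(String[] args) {
        String text = "Hello brave new world.";
        System.out.println("fits 10 chars: " + fits(text, 10, CHARS));
        System.out.println("fits 10 words: " + fits(text, 10, WORDS));
    }
}
```

A limit chosen for characters is therefore not interchangeable with one chosen for tokens; pick the value with the intended unit in mind.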
-
-
Method Details
-
split
Description copied from class: HierarchicalDocumentSplitter
Splits the provided text into parts. Implementation API.
- Specified by: split in class HierarchicalDocumentSplitter
- Parameters: text - The text to be split.
- Returns: An array of parts.
-
joinDelimiter
Description copied from class: HierarchicalDocumentSplitter
Delimiter string to use to re-join the parts.
- Specified by: joinDelimiter in class HierarchicalDocumentSplitter
- Returns: The delimiter.
-
defaultSubSplitter
Description copied from class: HierarchicalDocumentSplitter
The default sub-splitter to use when a single segment is too long.
- Specified by: defaultSubSplitter in class HierarchicalDocumentSplitter
- Returns: The default sub-splitter.
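To make the split and joinDelimiter contracts above more concrete, here is a minimal JDK-only sketch of a sentence-level pair. It is an assumption-laden illustration: SentenceHooksSketch is a hypothetical class, and java.text.BreakIterator is swapped in for the Apache OpenNLP sentence model that this class actually uses, so boundary decisions may differ on real text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: BreakIterator replaces the OpenNLP sentence model.
class SentenceHooksSketch {

    // Counterpart of split(text): returns the sentences found in the text,
    // trimmed of surrounding whitespace.
    static String[] split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> parts = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                parts.add(sentence);
            }
        }
        return parts.toArray(new String[0]);
    }

    // Counterpart of joinDelimiter(): sentences are re-joined with one space.
    static String joinDelimiter() {
        return " ";
    }

    public static void main(String[] args) {
        String[] parts = split("First one. Second one! Third?");
        for (String part : parts) {
            System.out.println(part);
        }
        System.out.println(String.join(joinDelimiter(), parts));
    }
}
```

Splitting and then joining with the delimiter reproduces the input only when sentences were separated by single spaces; other whitespace is normalised away in this sketch.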
-