Class DocumentBySentenceSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentBySentenceSplitter
All Implemented Interfaces:
DocumentSplitter

public class DocumentBySentenceSplitter extends HierarchicalDocumentSplitter
Splits the provided Document into sentences and attempts to fit as many sentences as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.

Sentence boundaries are detected using the Apache OpenNLP library with the English sentence model.

If multiple sentences fit within maxSegmentSize, they are joined together using a space (" ").
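The packing behavior described above can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: the class name SentencePackingSketch and its methods are hypothetical, java.text.BreakIterator stands in for the Apache OpenNLP sentence detector, only the character-based limit is modeled, and overlap is ignored.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of LangChain4j.
class SentencePackingSketch {

    // Assumption: BreakIterator replaces the OpenNLP English sentence model.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Greedily appends sentences to the current segment, joined by a single
    // space, and starts a new segment when the next sentence would not fit.
    // A sentence longer than the limit is emitted unsplit here; the real
    // splitter delegates such sentences to a sub-splitter.
    static List<String> pack(List<String> sentences, int maxSegmentSizeInChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String sentence : sentences) {
            boolean fits = current.length() + 1 + sentence.length() <= maxSegmentSizeInChars;
            if (current.length() > 0 && !fits) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence. A third, somewhat longer sentence.";
        for (String segment : pack(sentences(text), 35)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

The greedy approach keeps each segment as full as the limit allows while never cutting inside a sentence that fits on its own.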

If a single sentence exceeds maxSegmentSize, the subSplitter (DocumentByWordSplitter by default) is used to split it into smaller parts, which are placed into multiple segments. Such segments contain only the parts of the split long sentence.
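The fallback for over-long sentences can be sketched as follows, again with the standard library only. The class LongSentenceFallbackSketch and the helper splitByWords are hypothetical stand-ins (the latter for the default word-level subSplitter); the sketch assumes a character-based limit and whitespace-separated words, and it does not model overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of LangChain4j.
class LongSentenceFallbackSketch {

    // Stand-in for a word-level sub-splitter: packs whitespace-separated
    // words into parts of at most maxSegmentSizeInChars characters.
    // A single word longer than the limit is kept whole in this sketch.
    static List<String> splitByWords(String sentence, int maxSegmentSizeInChars) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : sentence.trim().split("\\s+")) {
            if (current.length() > 0
                    && current.length() + 1 + word.length() > maxSegmentSizeInChars) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    // Packs sentences into segments; a sentence that exceeds the limit is
    // split by words, and its parts go into segments of their own.
    static List<String> split(List<String> sentences, int maxSegmentSizeInChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String sentence : sentences) {
            if (sentence.length() > maxSegmentSizeInChars) {
                // Flush pending sentences so the long sentence's parts are not mixed with them.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
                segments.addAll(splitByWords(sentence, maxSegmentSizeInChars));
                continue;
            }
            if (current.length() > 0
                    && current.length() + 1 + sentence.length() > maxSegmentSizeInChars) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Hi.", "The quick brown fox jumps.", "Bye.");
        split(input, 12).forEach(segment -> System.out.println("[" + segment + "]"));
    }
}
```

Flushing the pending segment before and after the long sentence is what keeps its parts isolated from neighboring sentences, matching the behavior described above.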

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).

  • Constructor Details

    • DocumentBySentenceSplitter

      public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
    • DocumentBySentenceSplitter

      public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
    • DocumentBySentenceSplitter

      public DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
    • DocumentBySentenceSplitter

      public DocumentBySentenceSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
  • Method Details