Class DocumentByParagraphSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter
All Implemented Interfaces:
DocumentSplitter

public class DocumentByParagraphSplitter extends HierarchicalDocumentSplitter
Splits the provided Document into paragraphs and attempts to fit as many paragraphs as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.

A paragraph boundary is a run of at least two newline characters ("\n\n"). Any additional whitespace before, between, or after the newlines is ignored, so the following are all valid paragraph separators: "\n\n", "\n\n\n", "\n \n", " \n \n ", and so on.
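
The boundary rule above can be approximated with a single regular expression. The following is a minimal standalone sketch that uses only the JDK. The class name, method name, and the exact pattern are illustrative assumptions and are not taken from the library's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ParagraphBoundarySketch {

    // Assumed pattern: optional whitespace, a line break, optional whitespace,
    // a second line break, optional whitespace. A single line break does not match.
    private static final Pattern PARAGRAPH_SEPARATOR = Pattern.compile("\\s*\\R\\s*\\R\\s*");

    // Splits text into paragraphs, dropping empty pieces.
    static List<String> splitIntoParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : PARAGRAPH_SEPARATOR.split(text.trim())) {
            if (!part.isEmpty()) {
                paragraphs.add(part);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph.\n \n\nThird,\nstill third.";
        // The single "\n" inside the third paragraph is not treated as a boundary.
        for (String paragraph : splitIntoParagraphs(text)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

In this sketch the sample text yields three paragraphs, because the line break inside "Third,\nstill third." is a single newline.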

If multiple paragraphs fit within maxSegmentSize, they are joined together using a double newline ("\n\n").
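
The packing behavior can be pictured as a greedy loop over paragraphs. The sketch below is a simplified illustration for the character-based limit only. It assumes that the "\n\n" delimiter counts toward the limit, that every paragraph already fits on its own, and it ignores overlap entirely. The names ParagraphPackingSketch and pack are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphPackingSketch {

    static final String JOIN_DELIMITER = "\n\n";

    // Greedily packs paragraphs into segments of at most maxSegmentSizeInChars characters.
    // Assumption: the join delimiter counts toward the limit.
    static List<String> pack(List<String> paragraphs, int maxSegmentSizeInChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : paragraphs) {
            if (paragraph.length() > maxSegmentSizeInChars) {
                // Oversized paragraphs are out of scope for this sketch.
                throw new IllegalArgumentException("Paragraph exceeds the segment limit");
            }
            int sizeIfAppended = current.length() + JOIN_DELIMITER.length() + paragraph.length();
            if (current.length() > 0 && sizeIfAppended > maxSegmentSizeInChars) {
                // The next paragraph does not fit: close the current segment.
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(JOIN_DELIMITER);
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("aaaa", "bbbb", "cc", "dddddddd");
        for (String segment : pack(paragraphs, 10)) {
            System.out.println("[" + segment.replace("\n", "\\n") + "]");
        }
    }
}
```

With a limit of 10 characters, the first two paragraphs share a segment (4 + 2 + 4 = 10), while the remaining two each get their own.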

If a single paragraph exceeds maxSegmentSize, the subSplitter (DocumentBySentenceSplitter by default) is used to split it into smaller parts, which are placed into multiple segments. These segments contain only parts of the split paragraph; they are not combined with text from neighboring paragraphs.
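
The fallback for oversized paragraphs can be sketched as follows. This is a simplified standalone illustration, not the library's implementation. The sub-splitter is modeled as a plain function, naiveSentences is a hypothetical stand-in for a sentence splitter, parts are assumed to be joined with a single space, and a part that is itself longer than the limit is emitted as is rather than split further.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class SubSplitterFallbackSketch {

    // Hypothetical stand-in for a sentence splitter: breaks after '.', '!' or '?'.
    static List<String> naiveSentences(String paragraph) {
        return Arrays.asList(paragraph.split("(?<=[.!?])\\s+"));
    }

    static List<String> split(List<String> paragraphs, int maxChars,
                              Function<String, List<String>> subSplitter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : paragraphs) {
            if (paragraph.length() > maxChars) {
                // Close the open segment so the long paragraph's parts stay separate.
                flush(current, segments);
                for (String part : subSplitter.apply(paragraph)) {
                    append(current, part, " ", maxChars, segments);
                }
                // Close again so the following paragraph starts a fresh segment.
                flush(current, segments);
            } else {
                append(current, paragraph, "\n\n", maxChars, segments);
            }
        }
        flush(current, segments);
        return segments;
    }

    // Adds a piece to the open segment, closing the segment first if the piece would not fit.
    private static void append(StringBuilder current, String piece, String delimiter,
                               int maxChars, List<String> segments) {
        if (current.length() > 0
                && current.length() + delimiter.length() + piece.length() > maxChars) {
            flush(current, segments);
        }
        if (current.length() > 0) {
            current.append(delimiter);
        }
        current.append(piece);
    }

    private static void flush(StringBuilder current, List<String> segments) {
        if (current.length() > 0) {
            segments.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> paragraphs =
                Arrays.asList("Short one.", "Alpha beta. Gamma delta. Epsilon.", "Tail.");
        for (String segment : split(paragraphs, 24, SubSplitterFallbackSketch::naiveSentences)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

In the example, the 33-character middle paragraph is split into sentences that fill two segments of their own, and neither "Short one." nor "Tail." is merged into them.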

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
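
The metadata rule can be illustrated with plain maps. This sketch uses a hypothetical Segment holder instead of the library's TextSegment and Metadata types, and it stores the index as a string, which is an assumption about representation rather than something stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SegmentIndexSketch {

    // Hypothetical stand-in for a text segment with metadata.
    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Copies the document metadata into every segment and adds a zero-based "index" entry.
    static List<Segment> toSegments(List<String> texts, Map<String, String> documentMetadata) {
        List<Segment> segments = new ArrayList<>();
        for (int index = 0; index < texts.size(); index++) {
            // Each segment gets its own copy, so the document's metadata is never mutated.
            Map<String, String> metadata = new LinkedHashMap<>(documentMetadata);
            metadata.put("index", String.valueOf(index));
            segments.add(new Segment(texts.get(index), metadata));
        }
        return segments;
    }

    public static void main(String[] args) {
        Map<String, String> documentMetadata = new LinkedHashMap<>();
        documentMetadata.put("source", "guide.txt");
        for (Segment segment : toSegments(Arrays.asList("first", "second"), documentMetadata)) {
            System.out.println(segment.text + " " + segment.metadata);
        }
    }
}
```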

  • Constructor Details

    • DocumentByParagraphSplitter

      public DocumentByParagraphSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars)
    • DocumentByParagraphSplitter

      public DocumentByParagraphSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars, DocumentSplitter subSplitter)
    • DocumentByParagraphSplitter

      public DocumentByParagraphSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer)
    • DocumentByParagraphSplitter

      public DocumentByParagraphSplitter(int maxSegmentSizeInTokens, int maxOverlapSizeInTokens, Tokenizer tokenizer, DocumentSplitter subSplitter)
  • Method Details