Class DocumentByLineSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByLineSplitter
All Implemented Interfaces:
DocumentSplitter

public class DocumentByLineSplitter extends HierarchicalDocumentSplitter
Splits the provided Document into lines and attempts to fit as many lines as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.
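The difference between the two kinds of limit can be illustrated with a small, library-independent sketch. It is not the LangChain4j implementation: SegmentSizeSketch, CHAR_COUNT, WORD_COUNT and fits are hypothetical names, and counting whitespace-separated words is only a stand-in for a real Tokenizer.

```java
import java.util.function.ToIntFunction;

class SegmentSizeSketch {

    // Character-based size: the length of the text.
    static final ToIntFunction<String> CHAR_COUNT = String::length;

    // Hypothetical stand-in for a tokenizer: counts whitespace-separated words.
    // A real tokenizer would usually produce different counts.
    static final ToIntFunction<String> WORD_COUNT =
            text -> text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;

    // True if the text fits within maxSegmentSize under the given size measure.
    static boolean fits(String text, int maxSegmentSize, ToIntFunction<String> sizeOf) {
        return sizeOf.applyAsInt(text) <= maxSegmentSize;
    }

    public static void main(String[] args) {
        String text = "one two three";
        System.out.println(fits(text, 5, CHAR_COUNT)); // 13 characters, prints false
        System.out.println(fits(text, 5, WORD_COUNT)); // 3 words, prints true
    }
}
```

The same limit value can therefore accept or reject a piece of text depending on the unit in which it is measured.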

Line boundaries are detected by a minimum of one newline character ("\n"). Any additional whitespace before or after it is ignored, so the following are all valid line separators: "\n", "\n\n", " \n", "\n " and so on.
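The separator rule above can be approximated with a regular expression. This is a sketch under assumptions, not necessarily the pattern the library uses: LineBoundarySketch, LINE_SEPARATOR_REGEX and splitIntoLines are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LineBoundarySketch {

    // Assumed pattern: at least one "\n", with any surrounding whitespace
    // (including further newlines) absorbed into the same separator.
    static final String LINE_SEPARATOR_REGEX = "\\s*\\n\\s*";

    // Splits the text at every separator matched by the assumed pattern.
    static List<String> splitIntoLines(String text) {
        return Arrays.asList(text.split(LINE_SEPARATOR_REGEX));
    }

    public static void main(String[] args) {
        // "\n\n" and " \n " are each treated as a single separator.
        System.out.println(splitIntoLines("first\n\nsecond \n third"));
        // prints [first, second, third]
    }
}
```

Note that String.split keeps a leading empty string when the text starts with a separator, so a fuller version would probably trim the input first.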

If multiple lines fit within maxSegmentSize, they are joined together using a newline ("\n").
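The packing step can be sketched as a greedy loop. This is an illustration with a character-based limit, assuming every line already fits within maxSegmentSize on its own; LinePackingSketch and pack are hypothetical names, and the library's actual implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinePackingSketch {

    // Greedily appends lines to the current segment, joined by "\n",
    // and starts a new segment when the next line would exceed the limit.
    static List<String> pack(List<String> lines, int maxSegmentSize) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            // Size of the segment if this line were added (plus 1 for the "\n" joiner).
            int needed = current.length() == 0
                    ? line.length()
                    : current.length() + 1 + line.length();
            if (current.length() > 0 && needed > maxSegmentSize) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        // With a limit of 10, "aaaa\nbbbb" (9 characters) fits; adding "cc" would not.
        for (String segment : pack(Arrays.asList("aaaa", "bbbb", "cc"), 10)) {
            System.out.println("[" + segment.replace("\n", "\\n") + "]");
        }
    }
}
```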

If a single line is too long and exceeds maxSegmentSize, the subSplitter (DocumentBySentenceSplitter by default) is used to split it into smaller parts and place them into multiple segments. Such segments contain only parts of that long line; they are not combined with neighboring lines.
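The fallback for over-long lines can be sketched as follows. This is a simplified illustration, not the library's code: LongLineFallbackSketch, splitSentences and split are hypothetical names, the sentence splitter is a naive regex stand-in for DocumentBySentenceSplitter, and each part of a long line is emitted as its own segment, whereas a real sub-splitter may pack several parts together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class LongLineFallbackSketch {

    // Naive stand-in for a sentence-level sub-splitter:
    // splits on whitespace that follows '.', '!' or '?'.
    static List<String> splitSentences(String line) {
        return Arrays.asList(line.split("(?<=[.!?])\\s+"));
    }

    static List<String> split(List<String> lines, int maxSegmentSize,
                              Function<String, List<String>> subSplitter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.length() > maxSegmentSize) {
                // Close the segment in progress so the long line's parts
                // are never mixed with neighboring lines.
                flush(current, segments);
                segments.addAll(subSplitter.apply(line));
                continue;
            }
            int needed = current.length() == 0
                    ? line.length()
                    : current.length() + 1 + line.length();
            if (needed > maxSegmentSize) {
                flush(current, segments);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);
        }
        flush(current, segments);
        return segments;
    }

    // Emits the segment in progress, if any, and resets the buffer.
    static void flush(StringBuilder current, List<String> segments) {
        if (current.length() > 0) {
            segments.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("short", "First sentence. Second one.", "tail");
        // The middle line (27 characters) exceeds the limit of 16 and is sub-split.
        System.out.println(split(lines, 16, LongLineFallbackSketch::splitSentences));
        // prints [short, First sentence., Second one., tail]
    }
}
```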

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
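The metadata rule can be pictured with plain maps. This sketch does not use the library's Metadata or TextSegment types: SegmentMetadataSketch and metadataForSegments are hypothetical names, and storing the index as a String is an assumption of the sketch, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SegmentMetadataSketch {

    // Builds one metadata map per segment: a copy of the document's metadata
    // plus an "index" entry giving the segment's zero-based position.
    static List<Map<String, String>> metadataForSegments(
            Map<String, String> documentMetadata, int segmentCount) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            Map<String, String> copy = new LinkedHashMap<>(documentMetadata);
            copy.put("index", String.valueOf(i)); // assumed String representation
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> documentMetadata = Collections.singletonMap("source", "notes.txt");
        for (Map<String, String> metadata : metadataForSegments(documentMetadata, 2)) {
            System.out.println(metadata);
        }
    }
}
```

Copying the map for each segment keeps the segments independent, so changing one segment's metadata does not affect the others or the original document.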