Class DocumentByRegexSplitter

java.lang.Object
dev.langchain4j.data.document.splitter.HierarchicalDocumentSplitter
dev.langchain4j.data.document.splitter.DocumentByRegexSplitter
All Implemented Interfaces:
DocumentSplitter

public class DocumentByRegexSplitter extends HierarchicalDocumentSplitter
Splits the provided Document into parts using the provided regex and attempts to fit as many parts as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

The maxSegmentSize can be defined in terms of characters (default) or tokens. For a token-based limit, a Tokenizer must be provided.

If multiple parts fit within maxSegmentSize, they are joined together using the provided joinDelimiter.
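The split-and-pack behavior described above can be illustrated with a minimal standalone sketch. It is not the LangChain4j implementation: the class RegexPackingSketch and its split method are hypothetical, sizes are counted in characters only, and the sketch assumes the joinDelimiter counts toward the segment size.

```java
import java.util.ArrayList;
import java.util.List;

public class RegexPackingSketch {

    // Hypothetical helper, not a LangChain4j API: splits text on the regex and
    // greedily packs consecutive parts into segments of at most maxSegmentSize
    // characters, joining packed parts with joinDelimiter.
    static List<String> split(String text, String regex, String joinDelimiter, int maxSegmentSize) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : text.split(regex)) {
            if (part.isEmpty()) {
                continue; // skip empty parts produced by adjacent matches
            }
            // Assumption of this sketch: the delimiter counts toward the size.
            int needed = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter.length() + part.length();
            if (current.length() > 0 && needed > maxSegmentSize) {
                // The part does not fit: close the current segment and start a new one.
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(joinDelimiter);
            }
            // A part longer than maxSegmentSize is kept whole here; handling it
            // is the job of a sub-splitter, which this sketch leaves out.
            current.append(part);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> segments = split("alpha\n\nbeta\n\ngamma delta", "\\n\\n", " | ", 14);
        segments.forEach(System.out::println);
        // prints:
        // alpha | beta
        // gamma delta
    }
}
```

With a limit of 14 characters, "alpha" and "beta" fit together (12 characters including the delimiter), while "gamma delta" starts a new segment.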

If a single part is too long and exceeds maxSegmentSize, the subSplitter (which should be provided) is used to split it into sub-parts and place them into multiple segments. Such segments contain only the sub-parts of the split long part.
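The fallback to a sub-splitter can be sketched in the same standalone style. Again, this is an illustration under assumptions rather than the library's code: SubSplitterSketch, split, and byWords are hypothetical names, the sub-splitter is modeled as a plain Function, and sizes are counted in characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SubSplitterSketch {

    // Hypothetical helper, not a LangChain4j API: packs parts greedily and
    // hands any part longer than maxSegmentSize to the given sub-splitter.
    static List<String> split(String text, String regex, String joinDelimiter,
                              int maxSegmentSize, Function<String, List<String>> subSplitter) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : text.split(regex)) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.length() > maxSegmentSize) {
                // Flush the buffer first so that the segments produced from the
                // long part contain only its own sub-parts.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
                segments.addAll(subSplitter.apply(part));
                continue;
            }
            int needed = current.length() == 0
                    ? part.length()
                    : current.length() + joinDelimiter.length() + part.length();
            if (needed > maxSegmentSize) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(joinDelimiter);
            }
            current.append(part);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    // Hypothetical sub-splitter: splits on single spaces and packs words;
    // a single word that is still too long is emitted unchanged.
    static Function<String, List<String>> byWords(int maxSegmentSize) {
        return longPart -> split(longPart, " ", " ", maxSegmentSize, word -> List.of(word));
    }

    public static void main(String[] args) {
        String text = "short\n\none two three four five\n\nend";
        List<String> segments = split(text, "\\n\\n", "\n\n", 10, byWords(10));
        segments.forEach(System.out::println);
        // prints: short, one two, three four, five, end (one per line)
    }
}
```

Note that "five" and "end" are not merged even though they would fit together: in this sketch, as in the description above, segments derived from the long part hold only that part's sub-parts.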

Each TextSegment inherits all metadata from the Document and includes an "index" metadata key representing its position within the document (starting from 0).
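As a rough illustration of the metadata rule, the following standalone sketch copies a document's metadata into each segment and adds the zero-based "index" key. The class SegmentMetadataSketch and the method metadataFor are hypothetical, and representing metadata as a Map of strings (with the index stored as a string) is an assumption of this sketch, not a statement about the library's Metadata type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SegmentMetadataSketch {

    // Hypothetical helper, not a LangChain4j API: builds one metadata map per
    // segment, each inheriting every document entry plus a zero-based "index".
    static List<Map<String, String>> metadataFor(Map<String, String> documentMetadata, int segmentCount) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            // Copy so that segments do not share (or mutate) the document's map.
            Map<String, String> segmentMetadata = new LinkedHashMap<>(documentMetadata);
            // Assumption of this sketch: the index is stored as a string value.
            segmentMetadata.put("index", String.valueOf(i));
            result.add(segmentMetadata);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = metadataFor(Map.of("source", "a.txt"), 2);
        all.forEach(System.out::println);
        // prints:
        // {source=a.txt, index=0}
        // {source=a.txt, index=1}
    }
}
```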