Package dev.langchain4j.data.document
Interface DocumentSplitter
All Known Implementing Classes:
DocumentByCharacterSplitter, DocumentByLineSplitter, DocumentByParagraphSplitter, DocumentByRegexSplitter, DocumentBySentenceSplitter, DocumentByWordSplitter, HierarchicalDocumentSplitter
public interface DocumentSplitter
Defines the interface for splitting a document into text segments.
This is necessary as LLMs have a limited context window, making it impossible to send the entire document at once.
Therefore, the document should first be split into segments, and only the relevant segments should be sent to the LLM.
DocumentSplitters.recursive() from the dev.langchain4j:langchain4j module is a good starting point.
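To make the idea of splitting concrete, here is a minimal sketch that cuts a string into chunks of bounded length. It uses only the JDK and plain strings; the class `NaiveSplitDemo` and the method `splitByLength` are hypothetical names invented for this illustration, and the fixed-size strategy is an assumption chosen for brevity, not the algorithm behind DocumentSplitters.recursive().

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a naive fixed-size splitter over plain strings.
// It is NOT part of langchain4j and not how DocumentSplitters.recursive() works.
class NaiveSplitDemo {

    // Cuts text into consecutive chunks of at most maxChars characters.
    static List<String> splitByLength(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> segments = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            int end = Math.min(start + maxChars, text.length());
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        String document = "LLMs have a limited context window.";
        for (String segment : splitByLength(document, 12)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

A fixed-size cut can break words and sentences in the middle, which is one reason splitters that respect paragraph, sentence, or word boundaries exist.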
Method Summary
Modifier and Type | Method | Description
List<TextSegment> | split(Document document) | Splits a single Document into a list of TextSegment objects.
default List<TextSegment> | splitAll(Document... documents) | Splits multiple Document instances into a list of TextSegment objects.
default List<TextSegment> | splitAll(List<Document> documents) | Splits a list of Documents into a list of TextSegment objects.
Method Details
split
Splits a single Document into a list of TextSegment objects. The metadata is typically copied from the document and enriched with segment-specific information, such as position in the document, page number, etc.
Parameters:
document - The Document to be split.
Returns:
A list of TextSegment objects derived from the input Document.
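The metadata behavior described above can be sketched as follows. This is a simplified illustration with stand-in types: `SegmentMetadataDemo`, its nested `Segment` class, the blank-line splitting rule, and the `"index"` and `"file_name"` metadata keys are all assumptions made for this example, not the langchain4j classes or their guaranteed key names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; NOT the langchain4j Document/TextSegment classes.
class SegmentMetadataDemo {

    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Splits on blank lines. Each segment receives a copy of the document
    // metadata plus its own "index" entry (key name chosen for this sketch).
    static List<Segment> split(String documentText, Map<String, String> documentMetadata) {
        List<Segment> segments = new ArrayList<>();
        int index = 0;
        for (String paragraph : documentText.split("\\n\\s*\\n")) {
            if (paragraph.trim().isEmpty()) {
                continue; // skip empty pieces
            }
            Map<String, String> metadata = new HashMap<>(documentMetadata);
            metadata.put("index", String.valueOf(index++));
            segments.add(new Segment(paragraph.trim(), metadata));
        }
        return segments;
    }

    public static void main(String[] args) {
        Map<String, String> docMetadata = new HashMap<>();
        docMetadata.put("file_name", "notes.txt");
        for (Segment s : split("First paragraph.\n\nSecond paragraph.", docMetadata)) {
            System.out.println(s.metadata + " -> " + s.text);
        }
    }
}
```

Copying the map per segment keeps the document's own metadata untouched while letting each segment carry its position.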
splitAll
Splits a list of Documents into a list of TextSegment objects. This is a convenience method that calls the split method for each Document in the list.
Parameters:
documents - The list of Documents to be split.
Returns:
A list of TextSegment objects derived from the input Documents.
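The delegation described above is a natural fit for a Java default method. The sketch below shows the pattern under simplifying assumptions: `SketchSplitter` is a hypothetical interface written for this illustration, and `String` stands in for both Document and TextSegment, so it is not the library's source code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: String stands in for both Document and TextSegment.
interface SketchSplitter {

    List<String> split(String document);

    // Calls split for each document and concatenates the results in order.
    default List<String> splitAll(List<String> documents) {
        List<String> segments = new ArrayList<>();
        for (String document : documents) {
            segments.addAll(split(document));
        }
        return segments;
    }
}

class SplitAllListDemo {
    public static void main(String[] args) {
        // Implementing only split is enough; splitAll comes from the default method.
        SketchSplitter byLine = document -> Arrays.asList(document.split("\n"));
        System.out.println(byLine.splitAll(Arrays.asList("a\nb", "c"))); // prints [a, b, c]
    }
}
```

Because only split is abstract, an implementation can be supplied as a lambda and still inherit the list-based convenience method.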
splitAll
Splits multiple Document instances into a list of TextSegment objects.
This is a convenience method that allows callers to pass a variable number of Document arguments (using varargs) instead of explicitly creating a list. Internally, it delegates to the splitAll(List) method by converting the varargs array into a List.
For example:
    List<TextSegment> segments = documentSplitter.splitAll(doc1, doc2, doc3);
This is equivalent to:
    List<TextSegment> segments = documentSplitter.splitAll(Arrays.asList(doc1, doc2, doc3));
Parameters:
documents - One or more Document instances to be split. If no documents are provided, an empty list is returned.
Returns:
A list of TextSegment objects derived from the input documents. The resulting list is a flat combination of all segments from all input documents.
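The varargs-to-list delegation can be illustrated with a small self-contained sketch. As an assumption for this example, `VarargsSketchSplitter` is a hypothetical interface in which `String` stands in for both Document and TextSegment; it mirrors the described behavior, including the empty result when no arguments are passed, but it is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: String stands in for both Document and TextSegment.
interface VarargsSketchSplitter {

    List<String> split(String document);

    default List<String> splitAll(List<String> documents) {
        List<String> segments = new ArrayList<>();
        for (String document : documents) {
            segments.addAll(split(document));
        }
        return segments;
    }

    // Varargs convenience overload: wraps the array in a List and delegates.
    default List<String> splitAll(String... documents) {
        return splitAll(Arrays.asList(documents));
    }
}

class SplitAllVarargsDemo {
    public static void main(String[] args) {
        VarargsSketchSplitter byWord = document -> Arrays.asList(document.split(" "));
        List<String> viaVarargs = byWord.splitAll("a b", "c d");
        List<String> viaList = byWord.splitAll(Arrays.asList("a b", "c d"));
        // Both calls go through the same List-based method.
        System.out.println(viaVarargs.equals(viaList));
        // No documents in, no segments out.
        System.out.println(byWord.splitAll().isEmpty());
    }
}
```

Keeping the real work in the List overload means the two entry points cannot drift apart: the varargs form only adapts its arguments.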