Class ApacheTikaDocumentParser
java.lang.Object
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser
- All Implemented Interfaces:
DocumentParser
Parses an InputStream into a Document using the Apache Tika library by automatically detecting the file format and extracting its textual content.
This parser supports a wide range of formats, including PDF, DOC, PPT, XLS, and many others.
Optionally, metadata can also be extracted and attached to the Document.
For a full list of supported formats, refer to the Apache Tika documentation.
-
Field Summary
Fields
- static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER
- static final Supplier<org.apache.tika.metadata.Metadata> DEFAULT_METADATA_SUPPLIER
- static final Supplier<org.apache.tika.parser.ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER
- static final Supplier<org.apache.tika.parser.Parser> DEFAULT_PARSER_SUPPLIER
-
Constructor Summary
Constructors
- ApacheTikaDocumentParser()
  Creates an instance of an ApacheTikaDocumentParser with the default Tika components.
- ApacheTikaDocumentParser(boolean includeMetadata)
  Creates an instance of an ApacheTikaDocumentParser with the default Tika components.
- ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
  Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files and specify whether to include metadata or not.
- ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier, boolean includeMetadata)
  Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components.
- ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
  Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
-
Method Summary
Methods
- Document parse(InputStream inputStream)
  Parses a given InputStream into a Document.
-
Field Details
-
DEFAULT_PARSER_SUPPLIER
-
DEFAULT_METADATA_SUPPLIER
-
DEFAULT_PARSE_CONTEXT_SUPPLIER
-
DEFAULT_CONTENT_HANDLER_SUPPLIER
-
-
Constructor Details
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser()
Creates an instance of an ApacheTikaDocumentParser with the default Tika components. It uses AutoDetectParser, BodyContentHandler without write limit, empty Metadata and empty ParseContext.
Note: By default, no metadata is added to the parsed document.
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser(boolean includeMetadata)
Creates an instance of an ApacheTikaDocumentParser with the default Tika components. It uses AutoDetectParser, BodyContentHandler without write limit, empty Metadata and empty ParseContext.
- Parameters:
includeMetadata - Whether to include metadata in the parsed document
-
ApacheTikaDocumentParser
@Deprecated(forRemoval=true)
public ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
Creates an instance of an ApacheTikaDocumentParser with the provided Tika components. If some of the components are not provided (null), the defaults will be used.
- Parameters:
parser - Tika parser to use. Default: AutoDetectParser
contentHandler - Tika content handler. Default: BodyContentHandler without write limit
metadata - Tika metadata. Default: empty Metadata
parseContext - Tika parse context. Default: empty ParseContext
-
ApacheTikaDocumentParser
@Deprecated(forRemoval=true)
public ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files and specify whether to include metadata or not.
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components. If some of the suppliers are not provided (null), the defaults will be used.
- Parameters:
parserSupplier - Supplier for Tika parser to use. Default: AutoDetectParser
contentHandlerSupplier - Supplier for Tika content handler. Default: BodyContentHandler without write limit
metadataSupplier - Supplier for Tika metadata. Default: empty Metadata
parseContextSupplier - Supplier for Tika parse context. Default: empty ParseContext
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier, boolean includeMetadata)
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components. If some of the suppliers are not provided (null), the defaults will be used.
- Parameters:
parserSupplier - Supplier for Tika parser to use. Default: AutoDetectParser
contentHandlerSupplier - Supplier for Tika content handler. Default: BodyContentHandler without write limit
metadataSupplier - Supplier for Tika metadata. Default: empty Metadata
parseContextSupplier - Supplier for Tika parse context. Default: empty ParseContext
includeMetadata - Whether to include metadata in the parsed document
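The deprecation notes above recommend the supplier-based constructors when one parser instance handles multiple files. A likely reason, stated here as an assumption rather than taken from the library source, is that components such as a content handler accumulate state during a parse, so a shared instance would carry text over from one file to the next. The sketch below illustrates that effect with JDK types only. FakeHandler and parseWith are hypothetical stand-ins and are not part of LangChain4j or Apache Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a stateful component (for example, a content handler
// that collects extracted text). Not a LangChain4j or Tika class.
class FakeHandler {
    private final StringBuilder text = new StringBuilder();

    void characters(String chunk) {
        text.append(chunk);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class SupplierSketch {
    // Hypothetical stand-in for a single parse call: obtains a handler from the
    // supplier, feeds it the content, and returns whatever the handler collected.
    static String parseWith(Supplier<FakeHandler> handlerSupplier, String content) {
        FakeHandler handler = handlerSupplier.get();
        handler.characters(content);
        return handler.toString();
    }

    // One handler instance reused for both inputs: the second result also
    // contains the text of the first input.
    static List<String> parseTwiceShared() {
        FakeHandler shared = new FakeHandler();
        List<String> results = new ArrayList<>();
        results.add(parseWith(() -> shared, "first"));
        results.add(parseWith(() -> shared, "second"));
        return results;
    }

    // A supplier that creates a new handler per call: each result is independent.
    static List<String> parseTwiceFresh() {
        List<String> results = new ArrayList<>();
        results.add(parseWith(FakeHandler::new, "first"));
        results.add(parseWith(FakeHandler::new, "second"));
        return results;
    }

    public static void main(String[] args) {
        System.out.println(parseTwiceShared()); // [first, firstsecond]
        System.out.println(parseTwiceFresh());  // [first, second]
    }
}
```

With the real class, the same idea would correspond to passing constructor references or lambdas as the four supplier arguments, so that each call to parse can obtain new components.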
-
-
Method Details
-
parse
Description copied from interface: DocumentParser
Parses a given InputStream into a Document. The specific implementation of this method will depend on the type of the document being parsed.
Note: This method does not close the provided InputStream; it is the caller's responsibility to manage the lifecycle of the stream.
- Specified by:
parse in interface DocumentParser
- Parameters:
inputStream - The InputStream that contains the content of the Document.
- Returns:
The parsed Document.
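Because parse leaves the stream open, the caller typically wraps the call in try-with-resources. The sketch below shows that pattern with JDK types only. TrackingStream and the local parse method are hypothetical stand-ins used to make the open/closed state observable; they do not reproduce the behavior of ApacheTikaDocumentParser beyond "reads the stream and does not close it".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: an in-memory stream that records whether close() was called.
class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(String content) {
        super(content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

class StreamLifecycleSketch {
    // Hypothetical stand-in for DocumentParser.parse: reads everything from the
    // stream and, like the documented method, does not close it.
    static String parse(InputStream inputStream) {
        try {
            return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Calling parse alone leaves the stream open.
    static boolean closedAfterBareParse() {
        TrackingStream in = new TrackingStream("hello");
        parse(in);
        return in.closed;
    }

    // try-with-resources closes the stream once parsing is done, even if parse throws.
    static boolean closedAfterTryWithResources() {
        TrackingStream in = new TrackingStream("hello");
        try (TrackingStream managed = in) {
            parse(managed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedAfterBareParse());        // false
        System.out.println(closedAfterTryWithResources()); // true
    }
}
```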
-