Class ApacheTikaDocumentParser
java.lang.Object
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser
- All Implemented Interfaces:
DocumentParser
Parses files into Documents using the Apache Tika library, automatically detecting the file format. This parser supports various file formats, including PDF, DOC, PPT, and XLS. For detailed information on supported formats, please refer to the Apache Tika documentation.
-
Field Summary
Modifier and Type, Field
static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER
static final Supplier<org.apache.tika.metadata.Metadata> DEFAULT_METADATA_SUPPLIER
static final Supplier<org.apache.tika.parser.ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER
static final Supplier<org.apache.tika.parser.Parser> DEFAULT_PARSER_SUPPLIER
-
Constructor Summary
Constructor, Description
ApacheTikaDocumentParser()
Creates an instance of an ApacheTikaDocumentParser with the default Tika components.
ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components.
ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
-
Method Summary
Modifier and Type, Method, Description
Document parse(InputStream inputStream)
Parses a given InputStream into a Document.
-
Field Details
-
DEFAULT_PARSER_SUPPLIER
-
DEFAULT_METADATA_SUPPLIER
-
DEFAULT_PARSE_CONTEXT_SUPPLIER
-
DEFAULT_CONTENT_HANDLER_SUPPLIER
-
-
Constructor Details
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser()
Creates an instance of an ApacheTikaDocumentParser with the default Tika components. It uses AutoDetectParser, BodyContentHandler without write limit, empty Metadata and empty ParseContext.
-
ApacheTikaDocumentParser
@Deprecated(forRemoval=true)
public ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
Deprecated, for removal: This API element is subject to removal in a future version. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
Creates an instance of an ApacheTikaDocumentParser with the provided Tika components. If some of the components are not provided (null), the defaults will be used.
- Parameters:
parser - Tika parser to use. Default: AutoDetectParser
contentHandler - Tika content handler. Default: BodyContentHandler without write limit
metadata - Tika metadata. Default: empty Metadata
parseContext - Tika parse context. Default: empty ParseContext
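The advice to prefer suppliers for multiple files can be illustrated with a small sketch. It does not use Tika or this class: `TextCollector` and `parseWith` are hypothetical stand-ins, and the sketch only assumes that a component such as a content handler keeps state between calls. Under that assumption, one shared instance mixes the output of several files, while a supplier can hand out a fresh instance per file.

```java
import java.util.function.Supplier;

public class SharedVersusSupplied {

    // Hypothetical stand-in for a stateful component (not a Tika class):
    // it buffers every piece of text it receives.
    static final class TextCollector {
        private final StringBuilder buffer = new StringBuilder();

        void characters(String text) {
            buffer.append(text);
        }

        String collected() {
            return buffer.toString();
        }
    }

    // Simulates parsing one file with a collector obtained from the supplier.
    static String parseWith(Supplier<TextCollector> collectors, String fileContent) {
        TextCollector collector = collectors.get();
        collector.characters(fileContent);
        return collector.collected();
    }

    public static void main(String[] args) {
        // One shared instance: the second result still carries the first file's text.
        TextCollector shared = new TextCollector();
        Supplier<TextCollector> sameInstance = () -> shared;
        System.out.println(parseWith(sameInstance, "first"));
        System.out.println(parseWith(sameInstance, "second"));

        // A fresh instance per call: each result holds only its own file's text.
        Supplier<TextCollector> freshInstance = TextCollector::new;
        System.out.println(parseWith(freshInstance, "first"));
        System.out.println(parseWith(freshInstance, "second"));
    }
}
```

Whether a particular Tika component behaves this way depends on the component; the sketch only shows why a per-call supplier avoids the problem when it does.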
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components. If some of the suppliers are not provided (null), the defaults will be used.
- Parameters:
parserSupplier - Supplier for Tika parser to use. Default: AutoDetectParser
contentHandlerSupplier - Supplier for Tika content handler. Default: BodyContentHandler without write limit
metadataSupplier - Supplier for Tika metadata. Default: empty Metadata
parseContextSupplier - Supplier for Tika parse context. Default: empty ParseContext
-
-
Method Details
-
parse
Description copied from interface: DocumentParser
Parses a given InputStream into a Document. The specific implementation of this method will depend on the type of the document being parsed.
Note: This method does not close the provided InputStream; it is the caller's responsibility to manage the lifecycle of the stream.
- Specified by:
parse in interface DocumentParser
- Parameters:
inputStream - The InputStream that contains the content of the Document.
- Returns:
The parsed Document.
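Because the stream is left open, the caller would typically wrap it in try-with-resources. The sketch below shows that pattern without depending on Tika: `readWithoutClosing` is a hypothetical stand-in for a parser that consumes a stream and leaves it open, and `TrackingStream` exists only to make the close call observable. It assumes Java 9 or later (`readAllBytes`, and a variable used directly as a try resource).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CallerClosesStream {

    // In-memory stream that records whether close() has been called.
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Hypothetical stand-in for a parser: reads the whole stream and never closes it.
    static String readWithoutClosing(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream("hello");
        // The caller owns the stream; try-with-resources closes it when the block ends.
        try (stream) {
            String text = readWithoutClosing(stream);
            System.out.println(text + ", closed inside block: " + stream.closed);
        }
        System.out.println("closed after block: " + stream.closed);
    }
}
```

With a real file, the same shape applies: open the stream in the try header, pass it to `parse`, and let the block close it.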
-