Class ApacheTikaDocumentParser

java.lang.Object
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser
All Implemented Interfaces:
DocumentParser

public class ApacheTikaDocumentParser extends Object implements DocumentParser
Parses files into Documents using the Apache Tika library, automatically detecting the file format. This parser supports various file formats, including PDF, DOC, PPT, and XLS. For detailed information on supported formats, please refer to the Apache Tika documentation.
  • Field Details

    • DEFAULT_PARSER_SUPPLIER

      public static final Supplier<org.apache.tika.parser.Parser> DEFAULT_PARSER_SUPPLIER
    • DEFAULT_METADATA_SUPPLIER

      public static final Supplier<org.apache.tika.metadata.Metadata> DEFAULT_METADATA_SUPPLIER
    • DEFAULT_PARSE_CONTEXT_SUPPLIER

      public static final Supplier<org.apache.tika.parser.ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER
    • DEFAULT_CONTENT_HANDLER_SUPPLIER

      public static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER
  • Constructor Details

    • ApacheTikaDocumentParser

      public ApacheTikaDocumentParser()
      Creates an instance of an ApacheTikaDocumentParser with the default Tika components. It uses AutoDetectParser, BodyContentHandler without write limit, empty Metadata and empty ParseContext.
    • ApacheTikaDocumentParser

      @Deprecated(forRemoval=true) public ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
      Deprecated, for removal: This API element is subject to removal in a future version.
      Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
      Creates an instance of an ApacheTikaDocumentParser with the provided Tika components. If some of the components are not provided (null), the defaults will be used.
      Parameters:
      parser - Tika parser to use. Default: AutoDetectParser
      contentHandler - Tika content handler. Default: BodyContentHandler without write limit
      metadata - Tika metadata. Default: empty Metadata
      parseContext - Tika parse context. Default: empty ParseContext
    • ApacheTikaDocumentParser

      public ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
      Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components. If some of the suppliers are not provided (null), the defaults will be used.
      Parameters:
      parserSupplier - Supplier for Tika parser to use. Default: AutoDetectParser
      contentHandlerSupplier - Supplier for Tika content handler. Default: BodyContentHandler without write limit
      metadataSupplier - Supplier for Tika metadata. Default: empty Metadata
      parseContextSupplier - Supplier for Tika parse context. Default: empty ParseContext
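
      The deprecation note above recommends suppliers when one parser instance handles multiple files. A likely reason is that components such as a content handler hold state between calls. The following is a minimal sketch of that idea using only the JDK. `SupplierSketch`, `parseInto`, `parseAllShared`, and `parseAllSupplied` are hypothetical names, and a `StringBuilder` stands in for a stateful content handler. It is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class SupplierSketch {

    // Hypothetical stand-in for a parse call: appends the file's text to the
    // handler and returns everything the handler has collected so far.
    static String parseInto(StringBuilder handler, String fileText) {
        handler.append(fileText);
        return handler.toString();
    }

    // One handler instance reused for every file: text from earlier files
    // leaks into later results.
    static List<String> parseAllShared(StringBuilder sharedHandler, List<String> files) {
        List<String> results = new ArrayList<>();
        for (String file : files) {
            results.add(parseInto(sharedHandler, file));
        }
        return results;
    }

    // A supplier is asked for a fresh handler per file, so results stay isolated.
    static List<String> parseAllSupplied(Supplier<StringBuilder> handlerSupplier, List<String> files) {
        List<String> results = new ArrayList<>();
        for (String file : files) {
            results.add(parseInto(handlerSupplier.get(), file));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> files = List.of("alpha", "beta");
        System.out.println(parseAllShared(new StringBuilder(), files));   // [alpha, alphabeta]
        System.out.println(parseAllSupplied(StringBuilder::new, files));  // [alpha, beta]
    }
}
```

      With the real class, the analogous choice would be to pass a supplier for each component (or null to accept the documented default) rather than a single pre-built instance.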
  • Method Details

    • parse

      public Document parse(InputStream inputStream)
      Description copied from interface: DocumentParser
      Parses a given InputStream into a Document. The specific implementation of this method will depend on the type of the document being parsed.

      Note: This method does not close the provided InputStream - it is the caller's responsibility to manage the lifecycle of the stream.

      Specified by:
      parse in interface DocumentParser
      Parameters:
      inputStream - The InputStream that contains the content of the Document.
      Returns:
      The parsed Document.
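
      Because parse leaves the stream open, the caller typically wraps the call in try-with-resources. The sketch below illustrates that contract using only the JDK. `StreamLifecycleSketch`, `TrackingStream`, and the static `parse` method are hypothetical stand-ins, not the library's code; the stand-in `parse` simply reads the stream as UTF-8 text without closing it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamLifecycleSketch {

    // In-memory stream that records whether close() was called.
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Hypothetical stand-in for DocumentParser.parse: reads the stream fully
    // but does not close it, mirroring the documented contract.
    static String parse(InputStream inputStream) {
        try {
            return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Caller-side pattern: try-with-resources closes the stream after parsing.
    static String parseAndClose(InputStream stream) {
        try (InputStream in = stream) {
            return parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream left = new TrackingStream("hello");
        System.out.println(parse(left) + " closed=" + left.closed);            // hello closed=false

        TrackingStream managed = new TrackingStream("hello");
        System.out.println(parseAndClose(managed) + " closed=" + managed.closed); // hello closed=true
    }
}
```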