Class HtmlToTextDocumentTransformer

java.lang.Object
dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer
All Implemented Interfaces:
DocumentTransformer

public class HtmlToTextDocumentTransformer extends Object implements DocumentTransformer
Extracts plain text from a given HTML document. A CSS selector can be specified to extract text only from desired HTML element(s). Also, multiple CSS selectors can be specified to extract metadata from desired HTML elements.
  • Constructor Details

    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer()
      Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer(String cssSelector)
      Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
      Parameters:
      cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
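      Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, extracts metadata using additional CSS selectors, and optionally includes links in the extracted text.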
      Parameters:
      cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
      metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
      includeLinks - Specifies whether links should be included in the extracted text.
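
The three-argument constructor can be exercised with a minimal sketch like the one below. It assumes the langchain4j-document-transformer-jsoup artifact is on the classpath and a langchain4j version in which Metadata offers getString(String). The class name HtmlToTextExample, the helper extract, and the sample HTML are hypothetical, introduced only for illustration.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.Map;

public class HtmlToTextExample {

    // Hypothetical sample page; the ids match the selectors used in the parameter descriptions.
    static final String HTML = "<html><body>"
            + "<h1 id=\"page-title\">Release notes</h1>"
            + "<nav>Home | Docs</nav>"
            + "<div id=\"page-content\"><p>Version 2 adds streaming.</p></div>"
            + "</body></html>";

    // Extracts text from #page-content and stores the text of #page-title
    // in the metadata under the key "title". Links are not included.
    static Document extract(String html) {
        HtmlToTextDocumentTransformer transformer = new HtmlToTextDocumentTransformer(
                "#page-content",
                Map.of("title", "#page-title"),
                false);
        return transformer.transform(Document.from(html));
    }

    public static void main(String[] args) {
        Document result = extract(HTML);
        System.out.println(result.text());
        // Assumption: Metadata.getString(String) is available in the version in use.
        System.out.println(result.metadata().getString("title"));
    }
}
```

Passing a narrow selector such as "#page-content" is one way to keep navigation and other page chrome out of the extracted text, while the metadata selectors can still read elements outside that region.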
  • Method Details

    • transform

      public Document transform(Document document)
      Description copied from interface: DocumentTransformer
      Transforms a provided document.
      Specified by:
      transform in interface DocumentTransformer
      Parameters:
      document - The document to be transformed.
      Returns:
      The transformed document, or null if the document should be filtered out.
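
Because the DocumentTransformer contract allows transform to return null, callers may want to guard against it. The sketch below is one possible way to do so. It assumes DocumentTransformer lives in the dev.langchain4j.data.document package and has transform as its only abstract method. The class TransformNullSafe and its helper are hypothetical names, not part of the library, and whether HtmlToTextDocumentTransformer itself ever returns null is not stated here.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;

import java.util.ArrayList;
import java.util.List;

public class TransformNullSafe {

    // Applies the transformer to each input and keeps only non-null results,
    // treating null as "this document was filtered out".
    static List<Document> transformKeepingNonNull(DocumentTransformer transformer,
                                                  List<Document> inputs) {
        List<Document> kept = new ArrayList<>();
        for (Document input : inputs) {
            Document output = transformer.transform(input);
            if (output != null) {
                kept.add(output);
            }
        }
        return kept;
    }
}
```

A call such as transformKeepingNonNull(new HtmlToTextDocumentTransformer(), documents) would then yield only the documents that survived the transformation.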