dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer

All Implemented Interfaces: DocumentTransformer

public class HtmlToTextDocumentTransformer extends Object implements DocumentTransformer

Extracts plain text from a Document containing HTML. A CSS selector can be specified to restrict extraction to the matching HTML element(s). Additional CSS selectors, each mapped to a metadata key, can be specified to extract metadata from specific HTML elements.

Constructor Summary

| Constructor | Description |
| --- | --- |
| HtmlToTextDocumentTransformer() | Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML. |
| HtmlToTextDocumentTransformer(String cssSelector) | Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector. |
| HtmlToTextDocumentTransformer(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks) | Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, optionally extracting metadata and including links. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| Document | transform(Document document) | Transforms a provided document. |

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface DocumentTransformer
transformAll

Constructor Details
- HtmlToTextDocumentTransformer
  
  public HtmlToTextDocumentTransformer()
  
  Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
- HtmlToTextDocumentTransformer
  
  public HtmlToTextDocumentTransformer(String cssSelector)
  
  Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
  
  Parameters:
  
  cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
- HtmlToTextDocumentTransformer
  
  public HtmlToTextDocumentTransformer(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
  
  Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, optionally extracting metadata and including links.
  
  Parameters:
  
  cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
  
  metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
  
  includeLinks - Specifies whether links should be included in the extracted text.
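
The three-argument constructor can be illustrated with a minimal sketch. It assumes the langchain4j jsoup document-transformer module is on the classpath and that `Document.from(String)` and `Metadata.getString(String)` are available in the version in use. The sample HTML and the class name `HtmlToTextExample` are made up for illustration.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.Map;

public class HtmlToTextExample {

    // Hypothetical sample page; not taken from the library documentation.
    static final String HTML =
            "<html><body>"
            + "<nav><a href=\"/home\">Home</a></nav>"
            + "<h1 id=\"page-title\">Release notes</h1>"
            + "<div id=\"page-content\"><p>Version 2 adds streaming support.</p></div>"
            + "</body></html>";

    static Document extract() {
        HtmlToTextDocumentTransformer transformer = new HtmlToTextDocumentTransformer(
                "#page-content",                 // cssSelector: only this element's text is extracted
                Map.of("title", "#page-title"),  // metadataCssSelectors: metadata key -> CSS selector
                false);                          // includeLinks: leave links out of the text

        // Assumption: Document.from(String) wraps raw HTML as the document text.
        return transformer.transform(Document.from(HTML));
    }

    public static void main(String[] args) {
        Document result = extract();
        System.out.println(result.text());
        // Assumption: Metadata.getString(String) reads a stored metadata value.
        System.out.println(result.metadata().getString("title"));
    }
}
```

Keeping the body selector and the metadata selectors separate means the title is stored as metadata even though it sits outside the element whose text is extracted.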

Method Details
- transform
  
  public Document transform(Document document)
  
  Description copied from interface: DocumentTransformer
  
  Transforms a provided document.
  
  Specified by:
  
  transform in interface DocumentTransformer
  
  Parameters:
  
  document - The document to be transformed.
  
  Returns:
  
  The transformed document, or null if the document should be filtered out.
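
For several documents at once, the inherited `transformAll` method can be used. The sketch below assumes the no-argument constructor and `transformAll(List<Document>)` behave as summarized on this page, with `transformAll` applying `transform` to each element. The class name `TransformAllExample` and the sample pages are hypothetical.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.Arrays;
import java.util.List;

public class TransformAllExample {

    static List<Document> extractAll() {
        // No CSS selector: all text in each HTML document is extracted.
        DocumentTransformer transformer = new HtmlToTextDocumentTransformer();

        // Hypothetical sample pages, made up for this sketch.
        List<Document> pages = Arrays.asList(
                Document.from("<html><body><p>First page.</p></body></html>"),
                Document.from("<html><body><p>Second page.</p></body></html>"));

        return transformer.transformAll(pages);
    }

    public static void main(String[] args) {
        for (Document document : extractAll()) {
            System.out.println(document.text());
        }
    }
}
```

Because the `transform` contract allows a null return for documents that should be filtered out, callers that invoke `transform` directly may want a null check before using the result.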
