Class HtmlToTextDocumentTransformer
java.lang.Object
dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer
All Implemented Interfaces:
DocumentTransformer
Extracts plain text from a given HTML document.
A CSS selector can be specified to extract text only from desired HTML element(s).
Also, multiple CSS selectors can be specified to extract metadata from desired HTML elements.
Constructor Summary
HtmlToTextDocumentTransformer()
- Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
HtmlToTextDocumentTransformer(String cssSelector)
- Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
HtmlToTextDocumentTransformer(String cssSelector, Map<String, String> metadataCssSelectors, boolean includeLinks)
- Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, with optional metadata extraction and link inclusion.
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentTransformer
transformAll
Constructor Details
HtmlToTextDocumentTransformer
public HtmlToTextDocumentTransformer()
Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
HtmlToTextDocumentTransformer
public HtmlToTextDocumentTransformer(String cssSelector)
Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
Parameters:
cssSelector
- A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
HtmlToTextDocumentTransformer
public HtmlToTextDocumentTransformer(String cssSelector, Map<String, String> metadataCssSelectors, boolean includeLinks)
Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
Parameters:
cssSelector
- A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
metadataCssSelectors
- A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
includeLinks
- Specifies whether links should be included in the extracted text.
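The three-argument constructor might be used roughly as follows. This is a minimal sketch, not an excerpt from the library: it assumes the langchain4j core and jsoup document-transformer modules are on the classpath, that Document.from(String) and Metadata.getString(String) are available in the version in use, and the HtmlToTextExample class, the sample HTML, and its element ids are made up for illustration.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.Map;

public class HtmlToTextExample {

    // Illustrative HTML; the ids match the selectors used below.
    static final String HTML = """
            <html><body>
              <h1 id="page-title">Release notes</h1>
              <div id="nav">Home | About</div>
              <div id="page-content">
                <p>Version 2 adds <a href="https://example.com/docs">documentation</a>.</p>
              </div>
            </body></html>
            """;

    static Document extract(String html) {
        DocumentTransformer transformer = new HtmlToTextDocumentTransformer(
                "#page-content",                 // text is taken only from this element
                Map.of("title", "#page-title"),  // metadata key -> CSS selector
                true);                           // include links in the extracted text
        return transformer.transform(Document.from(html));
    }

    public static void main(String[] args) {
        Document result = extract(HTML);
        System.out.println(result.text());
        // Assumes Metadata.getString(String) exists in the version in use.
        System.out.println(result.metadata().getString("title"));
    }
}
```

The navigation block sits outside "#page-content", so it should not appear in the extracted text, while the heading text is expected to be stored in the metadata under the key "title".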
Method Details
transform
public Document transform(Document document)
Description copied from interface: DocumentTransformer
Transforms a provided document.
Specified by:
transform in interface DocumentTransformer
Parameters:
document
- The document to be transformed.
Returns:
The transformed document, or null if the document should be filtered out.
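The null return convention can be illustrated with a self-contained sketch. The Doc record and Transformer interface below are hypothetical stand-ins, not the library's Document and DocumentTransformer types, and the batch method assumes that null results are simply skipped, as the return contract suggests.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NullFilterSketch {

    // Hypothetical stand-in for a document: just its text.
    record Doc(String text) {}

    // Hypothetical stand-in for the transformer contract described above.
    interface Transformer {
        // Returns the transformed document, or null to filter it out.
        Doc transform(Doc document);

        // Assumed batch behaviour: transform each document and skip null results.
        default List<Doc> transformAll(List<Doc> documents) {
            return documents.stream()
                    .map(this::transform)
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());
        }
    }

    // Example transformer: strips whitespace and filters out documents left empty.
    static final Transformer TRIM_AND_DROP_EMPTY = document -> {
        String trimmed = document.text().strip();
        return trimmed.isEmpty() ? null : new Doc(trimmed);
    };

    public static void main(String[] args) {
        List<Doc> result = TRIM_AND_DROP_EMPTY.transformAll(
                List.of(new Doc("  hello  "), new Doc("   "), new Doc("world")));
        System.out.println(result.size()); // the whitespace-only document is dropped
    }
}
```

Callers that invoke transform directly, rather than a batch method, need their own null check before using the result.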