Comparison of HTML parsers


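HTML parsers are software for automated parsing of Hypertext Markup Language (HTML). They have two main purposes: traversing HTML, by giving programmers an interface to read and modify a document's markup, and cleaning HTML, by repairing invalid markup and tidying the layout of the result.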

| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| HTML Tidy | W3C license | ANSI C | 2021-07-17[2] | Yes[3] | Yes | Yes[3] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2023-10-31[4] | Yes | ? | No | No |
| Beautiful Soup | MIT License | Python | 2023-04-07[5] | Yes | Yes | ? | No |
| jsoup | MIT License | Java | 2023-12-29[6] | Yes | Yes | Yes | Yes |

* Date of the latest release with significant changes.
** Sanitizes HTML (generates a standards-compliant web page, reduces spam, etc.) and cleans it (strips surplus presentational tags, removes XSS code, etc.).
*** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with a centering style).
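
As a rough illustration of the "update" operation described in the last note above, the sketch below rewrites CENTER elements as DIV elements with a centering style. It uses Python's standard-library html.parser module rather than any parser listed in the table, and the names CenterToDiv and update_center are hypothetical, chosen only for this example. It is not how HTML Tidy or jsoup implement the feature: comments and doctype declarations are dropped, and an existing style attribute on CENTER is not merged.

```python
from html import escape
from html.parser import HTMLParser


class CenterToDiv(HTMLParser):
    """Re-emit the input, turning <center> into a styled <div> (illustrative only)."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, end=">"):
        # Tag names arrive lowercased; swap the deprecated element for a DIV.
        if tag == "center":
            tag, attrs = "div", attrs + [("style", "text-align:center")]
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}{end}")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, end=" />")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def update_center(html_text):
    """Return html_text with CENTER elements replaced by centered DIVs."""
    parser = CenterToDiv()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(update_center("<CENTER>Hello &amp; welcome</CENTER>"))
# prints: <div style="text-align:center">Hello &amp; welcome</div>
```

A dedicated tool would also repair malformed nesting and preserve the parts of the document this sketch discards, which is the kind of work the "Clean HTML" and "Update HTML" columns refer to.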

References

  1. 12.2 Parsing HTML documents, HTML Standard. Archived 2013-01-16 at the Wayback Machine.
  2. HTML Tidy release 5.8.0
  3. What is Tidy?
  4. HtmlUnit 3.7.0
  5. Beautiful Soup release 4.10
  6. jsoup Java HTML Parser release 1.17.2