Review HTML element list and ensure complete XML conversion coverage by eyupcanakman · Pull Request #802 · adbar/trafilatura

eyupcanakman · 2025-06-08T22:49:31Z

Closes #720: Review HTML element list and conversion.

Ensured all MDN HTML elements are accounted for and correctly mapped to XML.

Created html_elements_reference.py snapshot including all 95+ MDN elements (modern, legacy, deprecated).
Preserved explicit semantic mappings (headings → head, lists → list).
Added default identity mappings (tag → tag) ensuring no elements are lost.
Explicit mappings defined for semantic elements.
Automatic identity mappings provided as defaults.
New tests verify complete MDN coverage.

Copilot

Pull Request Overview

This PR ensures that all MDN HTML elements are correctly accounted for and mapped to XML, thereby resolving issue #720.

Introduces an explicit conversion mapping (HTML_EL_TO_XML_EL) for HTML-to-XML element conversions.
Adds a loop that fills any missing mapping with an identity rule based on MDN_ELEMENTS.
Provides new tests to validate both explicit and default identity mappings.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
trafilatura/htmlprocessing.py	Added new conversion map with explicit mappings and identity mappings.
trafilatura/html_elements_reference.py	Added a frozen snapshot of MDN HTML element names.
tests/test_html_elements.py	Added tests to verify complete mapping coverage against MDN elements.

Comments suppressed due to low confidence (1)

trafilatura/htmlprocessing.py:111

[nitpick] Using the variable name '_tag' might suggest that the variable is unused. Consider renaming it to 'tag' for improved clarity.

for _tag in MDN_ELEMENTS:

codecov · 2025-06-10T18:07:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.29%. Comparing base (badd594) to head (a4c6730).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #802   +/-   ##
=======================================
  Coverage   99.29%   99.29%           
=======================================
  Files          21       22    +1     
  Lines        3664     3680   +16     
=======================================
+ Hits         3638     3654   +16     
  Misses         26       26

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

adbar · 2025-06-10T18:12:04Z

Hi @eyupcanakman, the idea looks good but as it stands your code isn't actually used during the extraction. So it's hard to tell what would be the benefit here.

adbar · 2025-07-14T10:24:07Z

@eyupcanakman Your PR doesn't change anything in the way documents are processed, I will close it if you don't integrate it into the actual code.

eyupcanakman · 2025-07-15T08:13:40Z

Hi @adbar

Thanks for your feedback and the reminder. Sorry for the late reply.

I have pushed a new update. The new HTML mapping is now fully connected to the convert_tags function, so the logic is being used as you suggested.

Changes:

All tags are included: All HTML tags from the MDN list are now processed. They are either converted to a new tag or kept the same (for example, "search" stays "search").
Works with old rules: The new code does not conflict with existing special rules for handling tables, lists, or headings.
New tests: I added new tests to ensure that the new logic works correctly and doesn't cause any issues.

I believe this update resolves the issue you pointed out. It is ready for your review. I look forward to your feedback!

adbar · 2025-07-25T12:55:18Z

@eyupcanakman It works but it doesn't make much sense to keep both conversions active, or am I getting it wrong?

for elem in tree.iter(CONVERSIONS.keys()):
for elem in tree.iter(*_ALL_TAGS_TO_CONVERT):

The code is slower with both (obviously).

eyupcanakman · 2025-07-25T19:30:51Z

@adbar You're right the code was processing the DOM twice which doesn't make sense. I just pushed a fix for that.

…dled HTML elements to XML counterparts (adbar#720)

adbar · 2025-09-12T13:36:21Z

@eyupcanakman The last change looks good but I still need to think about the PR. There is a small negative impact on the benchmark.
You get more coverage if you use all the elements but it may the cause of unexpected behavior further down the line as the elements are converted to the output format.

htmlprocessing: ensure complete MDN HTML element coverage (adbar#720)

80c8588

Copilot AI review requested due to automatic review settings June 8, 2025 22:49

Copilot AI reviewed Jun 8, 2025

View reviewed changes

eyupcanakman force-pushed the feat/html-element-audit-720 branch from 2a28678 to a4c6730 Compare July 15, 2025 06:10

eyupcanakman force-pushed the feat/html-element-audit-720 branch from a4c6730 to a12f09f Compare July 25, 2025 19:29

Ensure complete MDN HTML element coverage by mapping previously unhan…

2824d95

…dled HTML elements to XML counterparts (adbar#720)

eyupcanakman force-pushed the feat/html-element-audit-720 branch from a12f09f to 2824d95 Compare August 12, 2025 12:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Review HTML element list and ensure complete XML conversion coverage#802

Review HTML element list and ensure complete XML conversion coverage#802
eyupcanakman wants to merge 2 commits intoadbar:masterfrom
eyupcanakman:feat/html-element-audit-720

eyupcanakman commented Jun 8, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

codecov bot commented Jun 10, 2025 •

edited

Loading

Uh oh!

adbar commented Jun 10, 2025

Uh oh!

adbar commented Jul 14, 2025

Uh oh!

eyupcanakman commented Jul 15, 2025

Uh oh!

adbar commented Jul 25, 2025

Uh oh!

eyupcanakman commented Jul 25, 2025

Uh oh!

adbar commented Sep 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

eyupcanakman commented Jun 8, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

codecov bot commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

adbar commented Jun 10, 2025

Uh oh!

adbar commented Jul 14, 2025

Uh oh!

eyupcanakman commented Jul 15, 2025

Uh oh!

adbar commented Jul 25, 2025

Uh oh!

eyupcanakman commented Jul 25, 2025

Uh oh!

adbar commented Sep 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Jun 10, 2025 •

edited

Loading