Review HTML element list and ensure complete XML conversion coverage #802
eyupcanakman wants to merge 2 commits into adbar:master
Pull Request Overview
This PR ensures that all MDN HTML elements are correctly accounted for and mapped to XML, thereby resolving issue #720.
- Introduces an explicit conversion mapping (HTML_EL_TO_XML_EL) for HTML-to-XML element conversions.
- Adds a loop that fills any missing mapping with an identity rule based on MDN_ELEMENTS.
- Provides new tests to validate both explicit and default identity mappings.
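The mapping-plus-fallback approach listed above can be sketched roughly as follows. The names HTML_EL_TO_XML_EL and MDN_ELEMENTS come from the PR; the entries and the small element set shown here are illustrative assumptions, not the PR's actual contents.

```python
# Illustrative subset only (assumption): the real snapshot is described as
# living in trafilatura/html_elements_reference.py and is much larger.
MDN_ELEMENTS = frozenset({"h1", "h2", "ul", "ol", "p", "span", "video"})

# Explicit conversions. These entries are assumed for illustration.
HTML_EL_TO_XML_EL = {
    "h1": "head",
    "h2": "head",
    "ul": "list",
    "ol": "list",
}

# Give every element without an explicit rule an identity mapping,
# leaving the explicit rules untouched.
for tag in MDN_ELEMENTS:
    HTML_EL_TO_XML_EL.setdefault(tag, tag)
```

Using setdefault keeps the explicit rules authoritative: the loop only adds entries for elements that have none.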
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| trafilatura/htmlprocessing.py | Added new conversion map with explicit mappings and identity mappings. |
| trafilatura/html_elements_reference.py | Added a frozen snapshot of MDN HTML element names. |
| tests/test_html_elements.py | Added tests to verify complete mapping coverage against MDN elements. |
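A coverage test of the kind described for tests/test_html_elements.py might look something like the sketch below. The helper find_unmapped and the ELEMENTS and MAPPING stand-ins are hypothetical and only mimic the PR's MDN_ELEMENTS and HTML_EL_TO_XML_EL; the actual tests in the PR are not shown here.

```python
def find_unmapped(elements, mapping):
    """Return the element names that have no entry in the mapping, sorted."""
    return sorted(set(elements) - set(mapping))


# Hypothetical stand-ins for the PR's MDN_ELEMENTS and HTML_EL_TO_XML_EL.
ELEMENTS = frozenset({"h1", "ul", "p", "video"})
MAPPING = {"h1": "head", "ul": "list", "p": "p", "video": "video"}


def test_every_element_is_mapped():
    # Complete coverage means no element is left without a rule.
    assert find_unmapped(ELEMENTS, MAPPING) == []


def test_explicit_rule_is_kept():
    assert MAPPING["h1"] == "head"


def test_identity_fallback():
    assert MAPPING["video"] == "video"
```

Reporting the missing names as a sorted list, rather than a bare boolean, makes a failing test show exactly which elements lack a rule.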
Comments suppressed due to low confidence (1)
trafilatura/htmlprocessing.py:111
- [nitpick] Using the variable name '_tag' might suggest that the variable is unused. Consider renaming it to 'tag' for improved clarity.
for _tag in MDN_ELEMENTS:
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
| Coverage Diff | master | #802 | +/- |
|---|---|---|---|
| Coverage | 99.29% | 99.29% | |
| Files | 21 | 22 | +1 |
| Lines | 3664 | 3680 | +16 |
| Hits | 3638 | 3654 | +16 |
| Misses | 26 | 26 | |
Hi @eyupcanakman, the idea looks good, but as it stands your code isn't actually used during extraction, so it's hard to tell what the benefit would be.
@eyupcanakman Your PR doesn't change anything in the way documents are processed. I will close it if you don't integrate it into the actual code.
Force-pushed from 2a28678 to a4c6730.
Hi @adbar, thanks for your feedback and the reminder, and sorry for the late reply. I have pushed a new update. The new HTML mapping is now fully connected to the convert_tags function, so the logic is being used as you suggested. Changes:
I believe this update resolves the issue you pointed out. It is ready for your review. I look forward to your feedback!
@eyupcanakman It works, but it doesn't make much sense to keep both conversions active, or am I getting it wrong?
The code is slower with both (obviously).
Force-pushed from a4c6730 to a12f09f.
@adbar You're right, the code was processing the DOM twice, which doesn't make sense. I just pushed a fix for that.
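To illustrate the point about not processing the DOM twice, here is a minimal sketch of a single-pass tag conversion. It uses the standard library's xml.etree.ElementTree rather than the parser the project uses, and both TAG_MAP and convert_tags_once are hypothetical names chosen for this example; it is not the PR's convert_tags implementation.

```python
import xml.etree.ElementTree as ET

# Assumed mapping for illustration; unknown tags fall back to themselves.
TAG_MAP = {"h1": "head", "ul": "list", "li": "item"}


def convert_tags_once(root):
    """Rename elements in place during a single traversal of the tree."""
    for elem in root.iter():
        # One dictionary lookup per element, so no second pass is needed.
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return root


doc = ET.fromstring("<div><h1>Title</h1><ul><li>one</li></ul><p>text</p></div>")
convert_tags_once(doc)
result = ET.tostring(doc, encoding="unicode")
# result == "<div><head>Title</head><list><item>one</item></list><p>text</p></div>"
```

Because the identity fallback is folded into the same lookup, explicit and default rules are applied in one walk of the tree instead of two.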
…dled HTML elements to XML counterparts (adbar#720)
Force-pushed from a12f09f to 2824d95.
@eyupcanakman The last change looks good, but I still need to think about the PR. There is a small negative impact on the benchmark.
Closes #720: Review HTML element list and conversion.
Ensured all MDN HTML elements are accounted for and correctly mapped to XML.
- html_elements_reference.py snapshot including all 95+ MDN elements (modern, legacy, deprecated).
- […] head, lists → list).
- […] tag → tag) ensuring no elements are lost.