Fix table data px dimension parsing bug#1788
Conversation
andersonhc
left a comment
There was a problem hiding this comment.
Changes requested:
- Fix indentation
- Add a test in
test/html/test_html.pyexercising your change - Add a CHANGELOG entry
If you need clarification or assistance in any of these please let us know.
| for dimension in ("width", "height"): | ||
| if dimension in self.td_th and self.td_th[dimension] is not None: | ||
| self.td_th[dimension] = self.td_th[dimension].replace("px", "").strip() |
There was a problem hiding this comment.
| for dimension in ("width", "height"): | |
| if dimension in self.td_th and self.td_th[dimension] is not None: | |
| self.td_th[dimension] = self.td_th[dimension].replace("px", "").strip() | |
| for dimension in ("width", "height"): | |
| if dimension in self.td_th and self.td_th[dimension] is not None: | |
| self.td_th[dimension] = self.td_th[dimension].replace("px", "").strip() |
Pawansingh3889
left a comment
There was a problem hiding this comment.
Good catch on stripping from HTML table dimensions. Two observations:
- Indentation issue: The and lines need to be indented one level deeper to be inside the loop:
As written, only the line runs in the loop, and the / runs once with set to (the last value).
- Other units: HTML also supports , , , in dimension attributes. You might want to handle those too, or at minimum strip non-numeric characters:
Otherwise the approach is solid. The stripping is the most common case users will hit.
Pawansingh3889
left a comment
There was a problem hiding this comment.
Thanks for tackling this — I hit similar dimension parsing issues when generating PDF audit reports from HTML tables.
I notice an indentation issue in the diff: the if dimension in self.td_th line appears to be at the same indent level as the for loop rather than inside it. This would cause a syntax error or incorrect behavior — the if block needs to be indented one level deeper under the for dimension in ... loop.
Also worth considering:
- Should this also handle other CSS units like
em,rem,pt,%? At minimum a regex likere.sub(r"[a-z%]+$", "", value)would be more robust. - Edge case: what happens with
width="auto"? The.replace("px", "")would pass it through unchanged and it may fail downstream when parsed as a number.
A test case for HTML like <td width="120px"> would help confirm the fix works end-to-end.
|
Hi @HrsLizo Just checking in to see if you saw and the review and if you're still interested in completing this PR? |
e.g. Fixes #0
Checklist:
A unit test is covering the code added / modified by this PR N/A
In case of a new feature, docstrings have been added, with also some documentation in the
docs/folder N/AA mention of the change is present in
CHANGELOG.mdN/AThis PR is ready to be merged
Added a .replace("px", "") strip to safely handle pixel dimensions during HTML table parsing.
By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.