Clean title come from meta tags #824

Open
cikay wants to merge 1 commit into adbar:master from cikay:title-extraction

cikay commented Feb 13, 2026

Apply the HTMLTITLE_REGEX cleanup to titles extracted from og:title, twitter:title, meta name="title", and itemprop="headline". Previously, extract() with with_metadata=True returned titles that still carried site name suffixes from the meta tags, while extract_title() correctly returned clean titles from h1 tags.

  • Add clean_title() helper function to remove site name suffix/prefix
  • Apply clean_title() in extract_opengraph() for og:title
  • Apply clean_title() in examine_meta() for meta name titles and itemprop headlines
  • Add tests for clean_title() and title cleaning in metadata extraction
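
As a rough illustration of the helper described above, here is a minimal sketch of what a clean_title() function could look like. It is not the code from this PR: `TITLE_SEPARATOR` is a hypothetical stand-in for HTMLTITLE_REGEX, and the "keep the longer part" rule is an assumption made for the example.

```python
import re

# Hypothetical stand-in for HTMLTITLE_REGEX; the real pattern in the
# library may cover a different set of separator characters.
TITLE_SEPARATOR = re.compile(r"^(.+?)\s+[–—|·•:-]\s+(.+)$")


def clean_title(title: str) -> str:
    """Drop a site name prefix or suffix, keeping the longer part."""
    title = title.strip()
    match = TITLE_SEPARATOR.match(title)
    if match is None:
        # No whitespace-delimited separator found: nothing to clean.
        return title
    first, second = match.group(1).strip(), match.group(2).strip()
    # Assumption: the article headline is longer than the site name.
    return first if len(first) >= len(second) else second


og_title = "Gelo Jîyan dê li Sala 2050yan Çawa Be? - Nûhev Co. %100 Kurdî"
print(clean_title(og_title))
```

With the og:title value reported in this PR, the sketch keeps only the headline part before the " - " separator. The length heuristic would fail for sites whose name is longer than the headline, so a real implementation may need a different rule.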

The issue occurs on the following website:
https://www.nuhev.com

Example URL: https://www.nuhev.com/gelo-jiyan-de-li-sala-2050yan-cawa-be/

Title from meta: <meta property="og:title" content="Gelo Jîyan dê li Sala 2050yan Çawa Be? - Nûhev Co. %100 Kurdî">
Title from h1: <h1 class="jeg_post_title">Gelo Jîyan dê li Sala 2050yan Çawa Be?</h1>
