Unicode tags trimming error #281

@spo0okie

In some cases all tags on a page disappear if they contain UTF-8 (non-ASCII) characters, and they reappear if I change the tag order.
I don't understand the purpose of stripping \xe2\x80\x8b and removing /[\x00-\x1F\x7F]/u, but this is the root of the problem.
syntax/tag.php:

    /**
     * Handle matches of the tag syntax
     *
     * @param string $match The match of the syntax
     * @param int    $state The state of the handler
     * @param int    $pos The position in the document
     * @param Doku_Handler    $handler The handler
     * @return array|false Data for the renderer
     */
    function handle($match, $state, $pos, Doku_Handler $handler) {
        //echo $match: {{tag>bitrix 4 портал опросы }}
        $tags = trim(substr($match, 6, -2));     // strip markup & whitespace
        //echo $tags: bitrix 4 портал опросы
        $tags = trim($tags, "\xe2\x80\x8b"); // strip word/wordpad breaklines
        //echo $tags: bitrix 4 портал опрос.
        //one tag content changed!  ^^^^^^
        $tags = preg_replace(['/[[:blank:]]+/', '/\s+/'], " ", $tags);    // replace linebreaks and multiple spaces with one space character
        //echo $tags: bitrix 4 портал опрос.
        $tags = preg_replace('/[\x00-\x1F\x7F]/u', '', $tags); // strip unprintable ascii code out of utf-8 coded string
        //echo $tags: 
        //ALL TAGS DISAPPEARED ^^^^^^

        if (!$tags) return false;

        // load the helper_plugin_tag
        /** @var helper_plugin_tag $helper */
        if (!$helper = $this->loadHelper('tag')) {
            return false;
        }

        // split tags and returns for renderer
        return $helper->parseTagList($tags);
    }

PHP version: 7.4
Plugin version: 2023-10-17
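
The behaviour above can be reproduced outside DokuWiki. The sketch below is a standalone script, not plugin code; its variable names are made up for illustration. It relies on two documented PHP behaviours: trim() treats its second argument as a list of single bytes rather than as one UTF-8 character, and preg_replace() with the /u modifier returns null when the subject is not valid UTF-8.

```php
<?php
// Standalone reproduction sketch; not part of the plugin.
$tags = "bitrix 4 портал опросы";

// trim() reads "\xe2\x80\x8b" as three separate bytes to strip from both ends.
// "ы" is encoded as D1 8B, so its trailing 8B byte is removed.
$trimmed = trim($tags, "\xe2\x80\x8b");

$tailBefore = bin2hex(substr($tags, -2));    // last two bytes of the input: "d18b"
$tailAfter  = bin2hex(substr($trimmed, -1)); // last byte after trim(): "d1"

// The dangling D1 byte makes the string invalid UTF-8.
$isValidUtf8 = (bool) preg_match('//u', $trimmed);

// With /u and an invalid UTF-8 subject, preg_replace() returns null,
// which is why every tag disappears at this step.
$stripped = preg_replace('/[\x00-\x1F\x7F]/u', '', $trimmed);

var_dump($tailBefore, $tailAfter, $isValidUtf8, $stripped);
```

This would also explain why changing the tag order helps: the damage only happens when the first or last byte of the whole tag string happens to be E2, 80 or 8B.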

If I comment out this line:

        $tags = preg_replace('/[\x00-\x1F\x7F]/u', '', $tags); // strip unprintable ascii code out of utf-8 coded string

then the tags are just missing one letter:
bitrix 4 портал опрос

If instead (with the previous line restored) I comment out this line:

        $tags = trim($tags, "\xe2\x80\x8b"); // strip word/wordpad breaklines

then everything works as expected:
bitrix 4 портал опросы
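
If the intent of that line is to remove zero-width spaces (U+200B) pasted in from a word processor, one possible replacement is a UTF-8 aware regular expression that strips only whole U+200B characters. This is a sketch under that assumption, not the maintainers' fix; `stripZeroWidthSpaces` is a hypothetical helper name.

```php
<?php
// Hypothetical helper, not part of the plugin: removes whole U+200B
// (zero-width space) characters from both ends of a UTF-8 string.
function stripZeroWidthSpaces(string $text): string
{
    // \x{200B} with the /u modifier matches the character, not its bytes.
    $result = preg_replace('/^\x{200B}+|\x{200B}+$/u', '', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $text;
}

$clean = stripZeroWidthSpaces("\u{200B}bitrix 4 портал опросы\u{200B}");
$untouched = stripZeroWidthSpaces("опросы");
```

Inside handle() this would stand in for the `trim($tags, "\xe2\x80\x8b")` call, leaving the remaining lines unchanged.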
