Should String Elements and UTF-8 Elements be handled differently? #134

@FreezyLemon

EBML differentiates between String Elements and UTF-8 Elements.

They are almost exactly the same, except for this:

  • "String" only allows printable ASCII (0x20 to 0x7E) and terminator bytes.
  • "UTF-8" only allows valid UTF-8 and terminator bytes.

Since UTF-8 is compatible with ASCII, we can just treat both Elements like UTF-8 without any parsing failures. And it is what the library currently does:

impl<'a> EbmlParsable<'a> for String {
    fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> {
        String::from_utf8(data.to_vec()).map_err(|_| ErrorKind::StringNotUtf8)
    }
}

However, this is technically not spec-compliant: for a String Element we should reject any byte that is neither ASCII nor a terminator, while such bytes may be perfectly valid in a UTF-8 Element.
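For illustration, a strict check for String Elements could look roughly like the sketch below. `AsciiString`, `AsciiError` and `from_ascii` are hypothetical names, not part of the library, and the terminator handling (cutting the value at the first null octet) is my reading of the spec rather than what the current `try_parse` does.

```rust
/// Hypothetical error type standing in for the library's `ErrorKind`.
#[derive(Debug, PartialEq)]
enum AsciiError {
    NotPrintableAscii,
}

/// Hypothetical new type for String Elements (printable ASCII only).
#[derive(Debug, PartialEq)]
struct AsciiString(String);

/// Sketch of a strict parser for String Elements.
/// Assumption: the value ends at the first null octet, and everything
/// from that octet onwards is terminator/padding.
fn from_ascii(data: &[u8]) -> Result<AsciiString, AsciiError> {
    let end = data.iter().position(|&b| b == 0).unwrap_or(data.len());
    let value = &data[..end];

    // Reject anything outside the printable ASCII range 0x20..=0x7E.
    if !value.iter().all(|b| (0x20..=0x7E).contains(b)) {
        return Err(AsciiError::NotPrintableAscii);
    }

    // Every byte is ASCII here, so each byte maps to exactly one char.
    Ok(AsciiString(value.iter().map(|&b| char::from(b)).collect()))
}
```

With this, `from_ascii(b"webm\0\0")` would yield `AsciiString("webm")`, while input containing a tab or any multi-byte UTF-8 sequence would be rejected.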

A small overview:

| String = UTF-8 | String != UTF-8 |
|---|---|
| only String | String + new type |
| from_utf8 (std) | from_utf8 (std) + from_ascii (custom, maybe faster?) |
| mostly compliant | compliant |

Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?
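To make the memory point concrete, here is a tiny sketch comparing character count and byte length of a `&str`. `char_and_byte_len` is a throwaway helper name used only for this example.

```rust
/// Returns (number of chars, number of bytes) for a string slice.
fn char_and_byte_len(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}
```

For an ASCII-only value such as `"matroska"` both numbers are 8, whereas `"\u{e9}"` (a single `é`) is 1 char but 2 bytes. So a dedicated ASCII type would not shrink ASCII-only values any further.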
