EBML differentiates between String Elements and UTF-8 Elements.
They are almost exactly the same, except for this:
- "String" only allows ASCII and terminator bytes.
- "UTF-8" only allows UTF-8 and terminator bytes.
Since UTF-8 is backward-compatible with ASCII, we can treat both Elements as UTF-8 without any parsing failures. This is what the library currently does:
```rust
impl<'a> EbmlParsable<'a> for String {
    fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> {
        String::from_utf8(data.to_vec()).map_err(|_| ErrorKind::StringNotUtf8)
    }
}
```
However, this is technically not spec-compliant: for a String Element we should reject any byte that is neither ASCII nor a terminator, while such bytes may be perfectly valid in a UTF-8 Element.
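To make the gap concrete, here is a minimal sketch of the two checks side by side. Both function names are made up for illustration and are not part of the library. The sketch only applies the rule as described in this issue (ASCII plus terminator bytes) and does not attempt any stricter range check the spec may require.

```rust
/// Hypothetical helper: true if `data` would be acceptable for a String Element
/// under the rule described above. The terminator byte 0x00 is itself ASCII,
/// so a plain ASCII check covers it.
fn is_valid_string_element(data: &[u8]) -> bool {
    data.is_ascii()
}

/// Hypothetical helper: true if `data` is valid UTF-8, which is the only
/// check the current implementation applies to both Element types.
fn is_valid_utf8_element(data: &[u8]) -> bool {
    std::str::from_utf8(data).is_ok()
}
```

With these, the bytes of `"é"` pass the UTF-8 check but fail the String check, which is exactly the input the current implementation accepts for a String Element even though it should not.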
A small overview:
| String = UTF-8 | String != UTF-8 |
| --- | --- |
| only `String` | `String` + new type |
| `from_utf8` (std) | `from_utf8` (std) + `from_ascii` (custom, maybe faster?) |
| mostly compliant | compliant |
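For reference, the "String != UTF-8" column could look roughly like the sketch below. Everything here is an assumption made for illustration: `AsciiString` and the `StringNotAscii` variant are hypothetical names, and the `ErrorKind` enum and `EbmlParsable` trait are local stand-ins that only mirror the signature quoted above, not the library's real definitions.

```rust
/// Stand-in for the library's error type; only the variants this sketch needs.
#[derive(Debug, PartialEq)]
enum ErrorKind {
    StringNotUtf8,
    /// Hypothetical variant, not confirmed to exist in the library.
    StringNotAscii,
}

/// Stand-in for the library's trait, mirroring the signature quoted above.
trait EbmlParsable<'a>: Sized {
    fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind>;
}

/// Hypothetical new type for String Elements (ASCII only).
#[derive(Debug, PartialEq)]
struct AsciiString(String);

/// UTF-8 Elements keep the existing behaviour.
impl<'a> EbmlParsable<'a> for String {
    fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> {
        String::from_utf8(data.to_vec()).map_err(|_| ErrorKind::StringNotUtf8)
    }
}

/// String Elements additionally reject every non-ASCII byte.
impl<'a> EbmlParsable<'a> for AsciiString {
    fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> {
        if !data.is_ascii() {
            return Err(ErrorKind::StringNotAscii);
        }
        // ASCII is a subset of UTF-8, so this conversion cannot fail after the check.
        String::from_utf8(data.to_vec())
            .map(AsciiString)
            .map_err(|_| ErrorKind::StringNotAscii)
    }
}
```

The cost of this design is a second public type that callers have to unwrap, in exchange for rejecting non-ASCII input at parse time.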
Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?
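A quick way to check the memory point, as a sketch with a hypothetical helper name: `String::len` reports the length in bytes, so an ASCII-only value stored as a regular `String` already takes exactly one byte per character.

```rust
/// Hypothetical helper: number of bytes a `String` built from `data` uses for
/// its text, or `None` if `data` is not valid UTF-8.
fn stored_len(data: &[u8]) -> Option<usize> {
    String::from_utf8(data.to_vec()).ok().map(|s| s.len())
}
```

For example, `stored_len(b"matroska")` is `Some(8)` for eight ASCII characters, so a dedicated ASCII type would not shrink the stored text.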
Code referenced above: `matroska/src/ebml/parse.rs`, lines 80 to 84 in 7f15b7c.