Replies: 23 comments
-
|
Following the example works:
use icu::locale::langid;
use icu::segmenter::options::WordBreakOptions;
...
let mut options = WordBreakOptions::default();
let langid = &langid!("en");
options.content_locale = Some(langid);
let segmenter = WordSegmenter::try_new_auto(options).unwrap();
But why the (gratuitous?) inconsistency?
https://docs.rs/icu/2.1.1/icu/segmenter/struct.WordSegmenter.html#method.try_new_auto
pub fn try_new_auto(
options: WordBreakOptions<'_>,
) -> Result<WordSegmenter, DataError>
pub struct WordBreakOptions<'a> {
pub content_locale: Option<&'a LanguageIdentifier>,
pub invariant_options: WordBreakInvariantOptions,
}
https://docs.rs/icu/2.1.1/icu/datetime/struct.DateTimeFormatter.html#method.try_new
pub fn try_new(
prefs: DateTimeFormatterPreferences,
field_set_with_options: FSet,
) -> Result<DateTimeFormatter<FSet>, DateTimeFormatterLoadError>
pub struct DateTimeFormatterPreferences {
pub locale_preferences: LocalePreferences,
pub numbering_system: Option<NumberingSystem>,
pub hour_cycle: Option<HourCycle>,
pub calendar_algorithm: Option<CalendarAlgorithm>,
}
pub struct LocalePreferences { /* private fields */ }
What I would expect:
Because the ideal would be that once you know how to do one i18n operation, you know them all (more or less).
-
|
"options" and "preferences" are inherently different:
Some APIs accept both options and preferences: https://docs.rs/icu/latest/icu/collator/struct.Collator.html#method.try_new |
-
The names don't communicate anything about that, they are just generic.
That's completely inconsistent with everything else. And it is not really the language of the content, it is the language of the segmenter. Last, even if you want to look at it as the language of the content, it is clunky to use different types ( I understand why ICU4X created 2 different types. There might be technical / implementation justifications for all of this.
-
|
This design is the result of an extensive discussion about what "locale" means for text-oriented components like segmenter. You can read the discussion here: #3284 I just recently merged some improved docs: #7136 The observed error is doing its job of informing the client that the user locale should not be used when configuring a segmenter. The locale should instead be a hint derived from the text content. |
-
|
@mihnita, can you verify if the improved docs in #7136 clarify this for you, in case you hadn't seen them yet (they weren't included in the 2.1 release)? |
-
No, it does not, not much. It is good as documentation, but I should not have to read it. In my opinion the ideal collection of APIs allows me to move between various functional areas without carefully reading pages and pages of documentation. And the difference in APIs might make sense for implementation, but it is gratuitous for a user.
Ultimately in the current design
But LDML defines extensions that affect segmentation:
https://www.unicode.org/reports/tr35/#Key_And_Type_Definitions_
It is very-very convenient to pass a locale to segmenter and "magically" have all preferences honored.
Let's look at
pub struct LineBreakOptions<'a> {
pub strictness: Option<LineBreakStrictness>,
pub word_option: Option<LineBreakWordOption>,
pub content_locale: Option<&'a LanguageIdentifier>,
}
If I get a locale (that's usually what you get, think a request to a server) and I want to do line breaking I have to take that locale, split the
That is clunky and is not forward compatible. If (for example) LDML adds another extension that affects segmentation, and I start getting that in requests, my server using icu4x can't "magically" honor it.
If there would be a way to create a
But as it is the APIs are inconsistent, and seem to expose decisions taken for implementation reasons. Why not allow segmenters to take a
I did. And I see that both Mark and Markus made the same argument for "Add it later" is an easy way out for a library that can't make decisions looking at the big picture. It comes at a cost for the developers: they must change their code to get correct functionality (see above about "forward compatible"). Nitpick: in segmenter(s) the field is named |
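To make the forward-compatibility concern concrete, here is a minimal std-only sketch of the manual step described above: splitting a locale tag into a language identifier plus its `-u-` keywords and mapping the LDML `lb` keyword by hand. Everything in it (`ToyStrictness`, `split_locale`, `strictness_from`) is hypothetical and is not an ICU4X API; the parser is deliberately simplified and assumes one value subtag per keyword.

```rust
use std::collections::BTreeMap;

/// Hypothetical line-break strictness, mirroring the LDML `lb` values.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ToyStrictness {
    Loose,
    Normal,
    Strict,
}

/// Splits a tag into its language-identifier part and its `-u-` keywords.
/// Simplified: assumes every keyword has exactly one value subtag and that
/// no other extension follows the `-u-` section.
fn split_locale(tag: &str) -> (String, BTreeMap<String, String>) {
    let (langid, ext) = tag.split_once("-u-").unwrap_or((tag, ""));
    let mut keywords = BTreeMap::new();
    let mut subtags = ext.split('-').filter(|s| !s.is_empty());
    while let (Some(key), Some(value)) = (subtags.next(), subtags.next()) {
        keywords.insert(key.to_string(), value.to_string());
    }
    (langid.to_string(), keywords)
}

/// The hand-written mapping the caller has to maintain: only `lb` is known.
fn strictness_from(keywords: &BTreeMap<String, String>) -> Option<ToyStrictness> {
    match keywords.get("lb").map(String::as_str) {
        Some("loose") => Some(ToyStrictness::Loose),
        Some("normal") => Some(ToyStrictness::Normal),
        Some("strict") => Some(ToyStrictness::Strict),
        _ => None,
    }
}

fn main() {
    let (langid, keywords) = split_locale("ja-JP-u-lb-strict-lw-keepall");
    println!("langid = {langid}");
    println!("strictness = {:?}", strictness_from(&keywords));
    // `lw` was parsed, but nothing consumes it until the caller adds a mapping.
    println!("unhandled lw = {:?}", keywords.get("lw"));
}
```

In this sketch any keyword the caller has not mapped by hand (here `lw`) is carried along but never honored, which is the situation the comment objects to.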
-
|
Thank you for the comments. Much appreciated. My reply below attempts to provide background on how ICU4X arrived at the current design, without taking a position on whether the design is "right" or "wrong".
As chair of the TC, I take great pride in our open decision-making process by consensus. I reviewed the thread, and I don't see an example of Mark or Markus being "overridden without much explanation". There were 10 people who contributed to the discussion, from multiple Unicode groups. I encourage you to share feedback privately on how this decision-making process can be improved.
This is true. This is a point that was not raised in the thread. My attempt at an explanation, which may or may not be satisfactory: ICU4X and ECMA-402 have both decided to support a subset of the locale extension keywords. They do not support those that are more about developer preference over user preference. The two line break options ICU4X supports were determined to be more about the developer preference, and therefore associating them with the user locale would have been the wrong design. As for why
Sure, maybe My personal opinion: I've seen enough evidence in ECMA-402 that the Intl.Segmenter locale option is often used the wrong way, passing a user locale instead of a text content locale. (Intl.Collator has the same problem.) I believe that |
-
I strongly doubt that. It is just a name...
Even that's weird. And it really depends how you look at it. When I create a segmenter("th") and I try to segment Japanese I will get crap results. When I created it it loaded Thai segmentation rules. It IS a Thai segmenter.
If this is the locale of the content then that would be passed with the content:
let segmenter = WordSegmenter::new();
let seg = segmenter.segment(text, locale);
The thing is, if decisions are taken piece-meal, one at the time, no matter how reasonable in isolation, the resulting API is a hot mess. Let's take a look:
**CaseMapper**
let cm = CaseMapper::new();
cm.uppercase_to_string("hello world", &langid!("en"))
There is a langid (not a locale), and it is passed with the content, not with the
**Collator**
pub fn try_new(
prefs: CollatorPreferences,
options: CollatorOptions,
) -> Result<CollatorBorrowed<'static>, DataError>
So it takes Preferences and Options. These terms are so close that the separation is basically meaningless. I choose an option or another based on my preferences :-) But let's go on:
pub struct CollatorPreferences {
pub locale_preferences: LocalePreferences,
pub collation_type: Option<CollationType>,
pub case_first: Option<CollationCaseFirst>,
pub numeric_ordering: Option<CollationNumericOrdering>,
}
So here we have a
**ListFormatter**
pub fn try_new_and(
prefs: ListFormatterPreferences,
options: ListFormatterOptions,
) -> Result<ListFormatter, DataError>
pub struct ListFormatterPreferences {
pub locale_preferences: LocalePreferences,
}
So this is consistent with Collator.
**PluralRules**
pub fn try_new(
prefs: PluralRulesPreferences,
options: PluralRulesOptions,
) -> Result<PluralRules, DataError>
pub struct PluralRulesPreferences {
pub locale_preferences: LocalePreferences,
}
So this is consistent with Collator and ListFormatter.
---
**DecimalFormatter**
```rust
pub fn try_new(
    prefs: DecimalFormatterPreferences,
    options: DecimalFormatterOptions,
) -> Result<DecimalFormatter, DataError>
pub struct DecimalFormatterPreferences {
    pub locale_preferences: LocalePreferences,
    pub numbering_system: Option<NumberingSystem>,
}
```
So this is consistent with Collator, ListFormatter, and PluralRules.
pub fn try_new_auto(
options: WordBreakOptions<'_>,
) -> Result<WordSegmenter, DataError>
pub struct WordBreakOptions<'a> {
pub content_locale: Option<&'a LanguageIdentifier>,
pub invariant_options: WordBreakInvariantOptions,
}
So this one has no
What's the difference between segmenters and collator, to justify these big differences?
Like everything else in software (and not only) there is no perfect solution, it is all a matter of compromises. Consistency also helps with ease of use. The documentation for icu4x is really good, with lots of examples. But I don't want to read it unless I don't understand how to use something. Ideally, if I am already familiar with a library, I should understand how to use a new piece of that library without reading docs or history.
-
I reviewed (again) that thread. Everybody said "locale", in all comments. So in the end that discussion didn't help clarify the reasons for the API as it is. |
-
|
"Preferences" is a user locale with structured fields, introduced in ICU4X 2.0. "Options" are things that should be set by the developer based on application requirements, not user preferences. This naming is used consistently across all components. The content locale is in the constructor of Segmenter because it impacts data loading, and in the terminal function of CaseMapper because it does not impact data loading. This did result in an unfortunate inconsistency. #3234 (comment) In Collator, the user locale matters sometimes; for example, you might have a contact list with names from various different languages, but you sort them according to your language. #6033 I put this issue into the 3.0 milestone when we can re-evaluate these questions. |
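The data-loading argument can be illustrated with a small std-only sketch. It uses made-up types (`ToyProvider`, `ToySegmenter`, `ToyCaseMapper`), not the real ICU4X ones, and assumes that the segmenter needs a per-language dictionary while the case mapper only needs a cheap per-call branch.

```rust
use std::cell::Cell;

/// Hypothetical data source that counts how many times data is loaded.
struct ToyProvider {
    loads: Cell<usize>,
}

impl ToyProvider {
    fn new() -> Self {
        Self { loads: Cell::new(0) }
    }

    /// Pretends to load a per-language dictionary (the expensive step).
    fn load_dictionary(&self, lang: &str) -> Vec<&'static str> {
        self.loads.set(self.loads.get() + 1);
        match lang {
            "th" => vec!["ภาษา", "ไทย"],
            _ => Vec::new(),
        }
    }
}

/// Content language fixed at construction: data is loaded exactly once.
struct ToySegmenter {
    dictionary: Vec<&'static str>,
}

impl ToySegmenter {
    fn new(provider: &ToyProvider, content_lang: &str) -> Self {
        Self { dictionary: provider.load_dictionary(content_lang) }
    }

    fn is_known_word(&self, word: &str) -> bool {
        self.dictionary.iter().any(|w| *w == word)
    }
}

/// Content language passed per call: no per-language data to load up front.
struct ToyCaseMapper;

impl ToyCaseMapper {
    fn uppercase(&self, text: &str, lang: &str) -> String {
        if lang == "tr" {
            // Turkish maps dotted `i` to dotted capital `İ`.
            text.replace('i', "İ").to_uppercase()
        } else {
            text.to_uppercase()
        }
    }
}

fn main() {
    let provider = ToyProvider::new();
    let segmenter = ToySegmenter::new(&provider, "th");
    println!("{}", segmenter.is_known_word("ไทย"));
    println!("{}", segmenter.is_known_word("日本"));
    println!("dictionary loads: {}", provider.loads.get());

    let mapper = ToyCaseMapper;
    println!("{}", mapper.uppercase("istanbul", "tr"));
    println!("{}", mapper.uppercase("istanbul", "en"));
}
```

The trade-off in this toy version is that the segmenter pays the load cost once but is tied to one language, while the case mapper can be reused across languages without being recreated.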
-
|
Segmenter and case mapper are two exceptions where we use content languages instead of user locales, because it makes sense in those contexts (they should not be sensitive to user preferences). Yes the field should have probably been called
We've spent a lot of time designing this, and the API we have come up with is principled and satisfies our constraints. You were not part of any of those discussions, so you don't get to say things like this. |
-
If anything the name
You say that they are exceptions, and are kind-of the same. This is the CaseMapper:
let cm = CaseMapper::new();
cm.uppercase_to_string("hello world", &langid!("en"))
That makes sense, the CaseMapper is locale neutral, and the content is passed on with a locale. If the segmenter was the same, it would have looked like this:
let segmenter = WordSegmenter::new();
let seg = segmenter.segment(text, locale);
But it does not. The locale (or langid) is not passed with the content. It is passed at construction.
Let's put it differently...
The CaseMapping API:
What is the color of the ball (the language of the content)?
Now the Segmenter API:
What is the color of the ball (the language of the content)? Well, we have no clue!
OK, let's twist it a bit more and say that the parameter we pass to the box creation is the color of the balls.
What is the color of the ball (the language of the content)?
Because "Thai" is a property of the collator / segmenter, not of the content.
Conceptually the Collator is like the segmenter as like the CaseMapper. Would you agree with that statement? But the 3 APIs are all different.
-
|
I appreciate the feedback on the inconsistency. As I noted in #7261 (comment), the inconsistency is the result of technical constraints, due to differences in how Segmenter and CaseMapper load their data, as well as choices that were made regarding the locale extension keywords. We can explore ways to make the API design more intuitive in the next major release (3.0). |
-
Please don't take a small text fragment out of context to make it sound worse than it is. The full text is "The thing is, if decisions are taken piece-meal, one at the time, no matter how reasonable in isolation, the resulting API is a hot mess." This is 100% true, taken without context, not necessarily about this library or these APIs.
Not being part of the discussions makes me a regular user. If the attitude is "you don't get to say anything because you were not part of the discussions", you will not hear anything from me from now on, about this or anything else.
-
And in our design this doesn't matter. We do not subscribe to the philosophy of encoding every possible argument in a single magic string/locale. We differentiate between preferences, which are globally controlled by the user/OS, and are encoded as Segmentation options, when implemented, will be typed fields on
I think we already explained that this is due to data loading. We could have made it a constructor argument for the case mapper, but then people would have to unnecessarily recreate case mappers. It's a consistency trade-off that we took.
As a user you can share constructive criticism, but you can do this without name calling. I'm failing to see the constructive part here though. Getting rid of the distinction between preferences and options is not going to happen, this is a core part of ICU4X design. Making the content locale be |
-
I did clearly not say "you don't get to say anything". The attitude is "you don't get to insult other people's work, only your own", specifically referring to the "hot mess" statement, which I very much read as applying to our API, even if you wanted to use it generally. |
-
|
(as a service to everyone involved, I will lock this thread if the conversation continues to be heated.) |
-
|
Thank you Shane! |
-
|
I would attribute some of this heated discussion to the fact that we all want to see icu4x be the best it can be. I understand Robert's passion, as one of the owners. And if I didn't care I would have left this thread long ago, instead of trying to get my point across.
Mix that with cultural differences. As an Eastern European I would have said "this API is a hot mess", if that is what I meant :-) As an American I would have made it a "praise sandwich", with a highly toned-down negative in the middle :-)
Not being face to face (even on video or audio) does not help either (not hearing the tone, not seeing the face). Sure, face to face might also make it worse sometimes :-)
So maybe let's restart? If it helps, these are some of the "clues" I try to use: when I sprinkle "I think" or "in my opinion" it is not because I don't know, but as a "tone down". I also try to add smileys and winks when I don't mean something seriously.
I will start with the bigger picture, before going back to the concrete APIs. As we all know, writing software is a juggle between many priorities, a give and take, of compromises. In some of my previous projects we saved a lot of arguments and grief by writing them down, explicitly. Like: correctness comes first. Then our users (developers). Then size, or performance, then x, y, z. Whatever. It also helps in these kinds of discussions. Instead of pointing to a github issue with many people chiming in, one can point to "the principles". Does icu4x have such a thing?
For example, I don't remember ever seeing an API that takes options and preferences in the same method. To me, as a non-native English speaker, a function (constructor) would take options, which I, as a developer (sometimes channeling the user), set based on my preferences.
-
High level / principles
Fair enough. Since this is not a pattern I've seen before, it seems like a special icu4x innovation / improvement.
A "principles" document would also help with this.
Yes, it would. And that's actually something that bothers me in the current ECMAScript Intl. There are things that can be specified both as a preference, and as an extension on the locale. BUT! Then no API should take a locale. I am tempted to say "locale everywhere", with all the extensions honored. This is a Unicode project, the locale is Unicode spec (LDML). Extensions and all. And I can explain use cases in detail. But the "magic string" in locale is very-very handy for communication between otherwise separated layers. Example: client calling a server. Java / Dart / Python calling Rust. It also helps with forward compatibility.
There might be no such OS today, but there might be tomorrow. For example Android already encodes some of the user preferences as locale extensions (nu, mu, ms, fw, proposed hc)
OK, maybe. But maybe it's a chicken and egg problem? Can we know that? |
-
About these APIs
Once there are clear principles, one would expect that all the APIs can be explained by the principles. There are no exceptions, because if performance is above API consistency in the principles, then the inconsistency between case conversion / segmenter is explained by that.
So let's get to an example that I asked about more than once:
Both require a lot more data than case mapping. And both are about the language of the content to be segmented / compared. Would there be any reason for the APIs to be different?
-
https://docs.rs/icu_locale_core/latest/icu_locale_core/preferences/index.html
And none do! All our APIs take typed preference objects, which can be constructed from locales. Each domain-specific preference type parses out the relevant parts of the locale, but the strings are not persisted.
That's why preferences convert from locales. But you as a dev don't want your segmentation to change behaviour because a locale outside of your control changed and the underlying library started reading some flag from that. The behaviour of a segmenter should be fully specified by the developer, because they have built a complex text pipeline on top of it.
And all of those are preferences in ICU4X.
No. I expect my Android phone book's sorting to respect my system settings, such as locale and collation settings (if available). I expect my Android text rendering to not change based on system locale. |
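As a rough illustration of "preferences convert from locales", the following std-only sketch shows a domain-specific preference type that keeps only the parts of a locale tag it cares about. `ToyDecimalPreferences` and its parsing are hypothetical and much simpler than the real `icu_locale_core` preferences machinery; the only assumption borrowed from LDML is that `nu` is the numbering-system keyword.

```rust
/// Hypothetical, simplified stand-in for a domain-specific preferences type.
#[derive(Debug, Default, PartialEq)]
struct ToyDecimalPreferences {
    language: String,
    numbering_system: Option<String>,
}

impl From<&str> for ToyDecimalPreferences {
    /// Keeps the language subtag and the `nu` keyword; every other
    /// keyword in the tag is dropped and the tag itself is not stored.
    fn from(tag: &str) -> Self {
        let (langid, ext) = tag.split_once("-u-").unwrap_or((tag, ""));
        let language = langid.split('-').next().unwrap_or("").to_string();
        let subtags: Vec<&str> = ext.split('-').collect();
        let numbering_system = subtags
            .chunks(2)
            .find(|pair| pair.len() == 2 && pair[0] == "nu")
            .map(|pair| pair[1].to_string());
        Self { language, numbering_system }
    }
}

fn main() {
    // `lb-strict` is present in the tag but irrelevant to this domain.
    let prefs = ToyDecimalPreferences::from("ar-DZ-u-nu-latn-lb-strict");
    println!("{prefs:?}");
}
```

In this sketch a change to an unrelated keyword in the incoming tag cannot alter the resulting preferences, which is the isolation property the comment argues for.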
-
Thank you very much, that is a good read. It does not mean I agree with the take :-), but it is a fair position to take. Even if I don't agree, I am also not saying it is wrong :-) It is a decision informed by a "philosophical" position on what icu4x is. So any kind of differentiation between OS / developer / user preference is lost. So a generic library has no way to make these kind of distinctions.
OK, why not? I argue that it should be. The "knob" I use to pass that info should be a locale, so that the info about
Actually, that is exactly what many developers want! When I use "ar-DZ" I expect that the proper digits are used for that country, and I expect that to change if a country decides to do that. And one can make the same argument about other
EXACTLY! Why those, and not others? As a user of the library that seems inconsistent and random.
As I explained above, the distinction between "system setting" and other such setting is not relevant at library level. And in fact both examples are incorrect. The collation does not usually change based on some kind of system setting, it is determined by use case. And text rendering changes based on the content locale, not system locale. One of the changes we did in Android N was to change how the font fallback was done. Until N if my system was set to English, any Traditional Chinese or Japanese text (without locale info) was rendered using a Simplified Chinese font. With N one can specify more than one locale. And this change was most welcomed by the millions of users that were forced to see Simplified Chinese for their language, even if not appropriate. Are there situations where one does not want this behavior? Yes. But it is not the role of a low level library to decide that. If a |
-
I would expect the code below to work:
But it fails with