In this case, the tidbit is: "grapheme clusters exist and they are useful".
The misleading part is that the article draws a false equivalence between what the author calls "UTF-32 code units" and UTF-16 code units.
UTF-32 code units are Unicode code points. This is a general Unicode concept that exists in all Unicode encodings. UTF-16 code units, on the other hand, are an implementation detail of UTF-16. It is wrong to present them as equally arbitrary concepts.
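To make that distinction concrete, here is a small Java sketch of my own (not from the article): it counts the same facepalm emoji as UTF-16 code units, as code points, and as extended grapheme clusters via the `\X` regex construct available since JDK 9. On a recent JDK the grapheme count should come out as 1.

import java.util.regex.Pattern;

public class Counts {
    public static void main(String[] args) {
        // U+1F926 U+1F3FC U+200D U+2642 U+FE0F, written as UTF-16 escapes
        String s = "\uD83E\uDD26\uD83C\uDFFC\u200D\u2642\uFE0F";
        System.out.println("UTF-16 code units: " + s.length());                      // 7
        System.out.println("Code points:       " + s.codePointCount(0, s.length())); // 5
        System.out.println("Grapheme clusters: " + Pattern.compile("\\X").matcher(s).results().count()); // 1
    }
}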
https://ruby-doc.org/3.2.2/String.html#class-String-label-Me...
5 is the number of code points, and 17 is the number of bytes. Both are reasonable answers.
7 is the number of UTF-16 code units. Seems like the least useful option.
void main() {
    // The facepalm emoji (U+1F926 U+1F3FC U+200D U+2642 U+FE0F), as UTF-16 escapes so it survives HN
    String x = "\uD83E\uDD26\uD83C\uDFFC\u200D\u2642\uFE0F";
    System.out.println("Chars: " + x.length());
    System.out.println("Codepoints: " + x.codePointCount(0, x.length()));
    System.out.println("As stream of chars (= UTF16-esque with surrogate pairs):");
    x.chars().forEach(System.out::println);
    System.out.println("As a stream of codepoints:");
    x.codePoints().forEach(System.out::println);
}
This ends up printing:
Chars: 7
Codepoints: 5
As stream of chars (= UTF16-esque with surrogate pairs):
55358
56614
55356
57340
8205
9794
65039
As a stream of codepoints:
129318
127996
8205
9794
65039
NB: Apparently many Hacker News readers know Java but don't use it all that often day-to-day. The provided Java snippet is valid as-is and can be executed with `java ThatFile.java` (no need to compile it first), though it does use preview features. The fact that the codepoint counter is a very awkward `codePointCount` call has the dubious benefit of highlighting that this method loops through the string and would therefore be quite slow on very large strings.
However, such an API would be pretty cumbersome, because for all non-edge cases (read: a Western language and a reasonable encoding of that language - which, looking at world demographics, is a very narrow way of saying non-edge case) we just want to ignore all that fancy stuff, assume it's Latin-1/ASCII, use "Length", and get on with it, usually accepting that it doesn't work for many scripts or emoji.
So almost every API I have encountered has both the dangerous or ambiguous "length" and any number of the more specific counts. Good? No. But good enough, I guess.
A much worse related API that exists everywhere is the one for parsing and formatting numbers to and from text. How that's done "depends", but most languages I have seen - unfortunately - offer a "default way". In the worst examples - looking at you, .NET - this default uses the system environment and assumes formatting and parsing numbers should use the OS locale. A horrible, horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
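Java has the same trap, for what it's worth. A minimal sketch (not .NET, just the same pitfall reproduced in the language used elsewhere in this thread):

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleTrap {
    public static void main(String[] args) throws ParseException {
        // Implicit default locale: the output depends on the machine this runs on.
        System.out.println(String.format("The size is %.1f", 3.5));

        // Explicit locales make the difference visible.
        System.out.println(String.format(Locale.US, "The size is %.1f", 3.5));      // The size is 3.5
        System.out.println(String.format(Locale.GERMANY, "The size is %.1f", 3.5)); // The size is 3,5

        // Parsing is just as locale-sensitive: "3,5" is 3.5 in Germany and 35 in the US.
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).parse("3,5"));  // 3.5
        System.out.println(NumberFormat.getInstance(Locale.US).parse("3,5"));       // 35
    }
}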
Apparently the thing to use is a library with a very strange name, which does glyph placement. I’ll go look for it.
EDIT: harfbuzz https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.ht...
https://unicode-x-ray.com/?t=%F0%9F%A4%A6%F0%9F%8F%BC%E2%80%... (sorry if link looks scary, that's just the URL encoding of this emoji)
> =LEFT(F280,2) & LEFT(F281,2) & LEFT(F282,2) & LEFT(F283,2)
Since the emojis actually occupy two UTF-16 code units each.
Then upon opening the post I was 100% ready to believe that js has three different string length functions that all handle Unicode differently.
Even for limiting input field sizes, byte count is much better, as otherwise you are opening yourself up to Unicode denial of service. I think the game Minecraft has such an exploit where you can fit in absurd amounts of UTF-8 data (to the point of data corruption in multiplayer games) because the limit is on visual length.
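A rough sketch of what that looks like in Java (the 256-byte limit and the names are made up for illustration):

import java.nio.charset.StandardCharsets;

public class ByteLimit {
    static final int MAX_BYTES = 256;

    // Validate by encoded size, not by what the user perceives as "length".
    static boolean fitsLimit(String input) {
        return input.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES;
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(200);
        // The facepalm emoji is 17 bytes in UTF-8, so 200 of them is 3400 bytes.
        String emoji = "\uD83E\uDD26\uD83C\uDFFC\u200D\u2642\uFE0F".repeat(200);

        System.out.println(fitsLimit(ascii)); // true
        System.out.println(fitsLimit(emoji)); // false, even though it "looks like" 200 characters
    }
}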
My personal favorite way of dealing with UTF-8: pretend it's ASCII and assume everything above 128 is an alphabetic character. It just works. For 99% of use cases it doesn't matter whether the content is emojis, families of emojis, or ancient Sumerian script. You can parse JSON and most other formats this way without caring about code points at all. The trend of Unicode-izing everything was a mistake; just treat strings as bytes and parse them as UTF-8 only when you really need to (like when building a text editor or a browser engine from scratch).
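As a sketch of what that trick looks like (my own illustration, ignoring the escape handling a real tokenizer would need): every structural character in JSON is ASCII, so any byte with the high bit set can only ever be payload inside a string and can be passed through without decoding it.

import java.nio.charset.StandardCharsets;

public class ByteScan {
    // Only ASCII bytes can be JSON syntax; a byte with the high bit set
    // (negative in Java's signed bytes) is part of some non-ASCII character.
    static boolean isStructural(byte b) {
        if (b < 0) return false;
        char c = (char) b;
        return c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',' || c == '"';
    }

    public static void main(String[] args) {
        byte[] doc = "{\"name\": \"t\u00EAte \uD83E\uDD26\"}".getBytes(StandardCharsets.UTF_8);
        for (byte b : doc) {
            if (isStructural(b)) System.out.print((char) b);
        }
        System.out.println(); // prints {"":""} - the structure, found without ever decoding UTF-8
    }
}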
mysql> WITH chars AS (SELECT ' ' c)
-> SELECT LENGTH(c), CHAR_LENGTH(c) FROM chars;
+-----------+----------------+
| LENGTH(c) | CHAR_LENGTH(c) |
+-----------+----------------+
| 17 | 5 |
+-----------+----------------+
1 row in set (0.01 sec)
Note that the emoji doesn't seem to render in preformatted text on HN. This should be easier to reproduce:
mysql> WITH chars AS (SELECT 0xF09FA4A6F09F8FBCE2808DE29982EFB88F c)
-> SELECT CONVERT(c USING utf8mb4), LENGTH(c), CHAR_LENGTH(c) FROM chars;
+--------------------------+-----------+----------------+
| CONVERT(c USING utf8mb4) | LENGTH(c) | CHAR_LENGTH(c) |
+--------------------------+-----------+----------------+
| | 17 | 17 |
+--------------------------+-----------+----------------+
1 row in set (0.00 sec)
perl -e 'use utf8; print length(""). "\n";'
1
iex(3)> String.length(" ")
1
Edit: looks like HN doesn't support that emoji in code blocks, at least.
For all intents and purposes, a user will count it as one character. Truncating the string without including the whole cluster would change its meaning, and is not an operation anyone would do as a general-purpose thing, any more than someone would want to replace the last character with random letters.
It looks like one character. I'd rather APIs let us continue pretending it is one character.
String = List ( Char )
Chars don't have a length, just as a number doesn't have a length - unless you are talking about the number of bits. If you are working with strings, stick with strings. The length of a string holding a single character should be 1. Just enforce proper typing. Anything else is not consistent.
If you want to do Unicode string manipulation and length counting, then use specific functions for that - but the base internal .length function should just output bytes.
If the language default was anything other than this, THAT WOULD BE WRONG and unexpected. I would prefer the default to be the dumb, fast thing. Then, if I want the slow, fancy thing, I can import some first- or third-party package.
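For illustration, a hypothetical wrapper along those lines (all names invented), where the byte count is the plain default and the Unicode-aware counts are separate, explicitly named calls you opt into:

import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class Text {
    private final String value;
    Text(String value) { this.value = value; }

    // The dumb default: size of the UTF-8 encoding in bytes.
    int length() {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Opt-in, slower: walks the string counting code points.
    int codePointLength() {
        return value.codePointCount(0, value.length());
    }

    // Opt-in, slowest: extended grapheme cluster segmentation (\X, JDK 9+).
    long graphemeLength() {
        return Pattern.compile("\\X").matcher(value).results().count();
    }
}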
[..." "].length === 5 // in JS
Length = 5
Size: depends on the encoding
Width = 1
Also please make sure to read the first heading after the title, which summarizes the whole point of this essay.
Did I miss the part where he explains this take? It's made up of 5 valid Unicode code points. For a language where you're not supposed to need to know the byte-size semantics, the correct length should be 5. What am I missing?
A close second is 17, the length in bytes, which is another fine way to represent this data, e.g. what a successful write of some sort, to a network or a file, would look like.
I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data with how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.