grayhatter
> Python 3’s approach is unambiguously the worst one, though.

Did I miss the part where he explains this take? It's made up of 5 valid Unicode code points. For a language where you're not supposed to need to know the byte-level semantics, the correct length should be 5. What am I missing?

The close second is 17, the length in bytes. That's another fine way to represent this data, e.g. what a successful write of some sort would look like, to a network or a file.

I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data with how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.
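
For what it's worth, a quick Python 3 sketch of those two counts (the emoji written as escapes, since HN strips it):

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # man facepalming, medium-light skin tone
  print(len(s))                  # 5  - code points
  print(len(s.encode("utf-8")))  # 17 - bytes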

xeeeeeeeeeeenu
I'm not a fan of "everything you know about X is wrong" articles. Very often they try to present some little tidbit of knowledge as a revelation and mislead the reader in the process.

In this case, the tidbit is: "grapheme clusters exist and they are useful".

The misleading part is that the article draws a false equivalence between what the author calls "UTF-32 code units" and UTF-16 code units.

UTF-32 code units are Unicode code points. This is a general Unicode concept that exists in all Unicode encodings. UTF-16 code units, on the other hand, are an implementation detail of UTF-16. It is wrong to present them as equally arbitrary concepts.
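
A small Python 3 sketch of that distinction (same emoji, written as escapes): the code point count doesn't change with the encoding, while code-unit counts do.

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
  print(len(s))                           # 5 code points, regardless of encoding
  print(len(s.encode("utf-32-le")) // 4)  # 5 UTF-32 code units == code points
  print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units, an encoding detail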

omoikane
Ruby lets you choose which of these to iterate over, via `each_byte`, `each_char`, `each_codepoint`, or `each_grapheme_cluster`.

https://ruby-doc.org/3.2.2/String.html#class-String-label-Me...

dahfizz
Maybe not wrong, but it's the worst option.

5 is the number of code points, and 17 is the number of bytes. Both are reasonable answers.

7 is the number of code units for utf-16. Seems like the least useful option.

rzwitserloot
Java loaded full Unicode code point semantics into its standard `java.lang.String` class. These _are not guaranteed_ to have `O(1)` performance characteristics, because the underlying storage format is dynamically either a UTF-16-esque variant (with surrogate pairs for characters that don't fit in 16 bits) or a single-byte-per-char format if the string contains nothing outside Latin-1. This has the advantage of being very, very slightly more obvious, given that both methods exist and are documented:

  void main() {
    String x = "(that emoji here)";
    System.out.println("Chars: " + x.length());
    System.out.println("Codepoints: " + x.codePointCount(0, x.length()));
    System.out.println("As stream of chars (= UTF16-esque with surrogate pairs):");
    x.chars().forEach(System.out::println);
    System.out.println("As a stream of codepoints:");
    x.codePoints().forEach(System.out::println);
  }
This ends up printing:

  Chars: 7
  Codepoints: 5
  As stream of chars (= UTF16-esque with surrogate pairs):
  55358
  56614
  55356
  57340
  8205
  9794
  65039
  As a stream of codepoints:
  129318
  127996
  8205
  9794
  65039
NB: Apparently many Hacker News readers know Java but don't use it all that often day-to-day. The provided Java snippet is valid as-is and can be executed with `java ThatFile.java` (no need to compile it first), though the classless `void main()` form does use preview features.

The fact that the codepoint counter is a very awkward `codePointCount` call has the dubious benefit of highlighting that this method loops through the string and would therefore be quite slow on very large strings.

frou_dh
I encountered some real-world Unicode/emoji breakage recently. I set my surname in a webapp to an emoji country flag because I needed a way to communicate where I was. Elsewhere in the app, it showed surnames as just their initial, e.g. "John S". There, mine showed as a featureless black flag rather than the flag I set. Presumably because that is the first codepoint of the several that make up the flag.
hgs3
Python 3's approach is the most correct: Unicode defines text as a sequence of code points. UTF-whatever is an implementation detail.
alkonaut
Really, the correct way to design string APIs would be to not have an ambiguous "length" at all, but to always require specifying whether you want UTF-8 bytes, memory bytes, code points, graphemes, whatever.

However, such an API would be pretty cumbersome, because for all non-edge cases (read: a Western language and a reasonable encoding for that language - which, looking at world demographics, is a very narrow way of saying non-edge case) we just want to ignore all that fancy stuff, assume it's Latin-1/ASCII, use "Length" and get on with it, usually accepting that it doesn't work for many scripts or emoji.

So almost every api I have encountered has both the dangerous or ambiguous "length" and any number of the more specific counts. Good? No. But good enough, I guess.

A much worse related API that exists everywhere is the one for parsing and formatting numbers to and from text. How that's done "depends", but most languages I have seen - unfortunately - offer a "default way". In the worst examples - looking at you, .NET - this default uses the system environment and assumes formatting and parsing numbers should use the OS locale. Horrible, horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
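
For contrast, Python only applies the OS locale when you explicitly opt in via the locale module. A rough sketch (assuming a German locale such as de_DE.UTF-8 is installed on the system):

  import locale

  print(f"The size is {3.5}")                     # always "The size is 3.5" - f-strings ignore the locale
  locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")  # pretend the process locale is German
  print(locale.str(3.5))                          # "3,5" - output now depends on the process locale
  print(locale.atof("3,5"))                       # 3.5  - and parsing flips with it

Keeping the default locale-independent and making the locale-aware path explicit seems like the saner design.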

sillysaurusx
Measuring the length of text is really hard. Font fallback is hard. All of these things, you take for granted till you write your own game engine.

Apparently the thing to use is a library with a very strange name, which does glyph placement. I’ll go look for it.

EDIT: harfbuzz https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.ht...

Jach
Unsurprising that (at least some implementation of) Swift does the least wrong thing in returning 1. I think it's also one of the few languages that will return a count of 1 for the madness that is country flag emojis https://docs.swift.org/swift-book/documentation/the-swift-pr...
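
For languages without that default, the grapheme count is usually a library away; e.g. a small Python sketch using the third-party regex module's \X (extended grapheme cluster) pattern, with the emoji written as escapes:

  import regex  # third-party: pip install regex

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
  print(len(regex.findall(r"\X", s)))  # 1, given reasonably current Unicode data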
namaria
I read somewhere that you should learn 2 or 3 programming languages from the get-go. If you learn only one, you run the risk of letting its shape dictate how you mentally model computation. At some point, someone who learned a dynamically typed programming language first is bound to find out why data types matter.
secret-noun
Related: I wrote a little web app that lets you see the codepoints for text like this

https://unicode-x-ray.com/?t=%F0%9F%A4%A6%F0%9F%8F%BC%E2%80%... (sorry if link looks scary, that's just the URL encoding of this emoji)

paddw
I think this is really a naming-convention issue. Len() is ambiguous; you really want either num_chars() or utfxx_len(). Of course, the issue of what counts as a character is confusing in its own right...
hateful
The wife and I have a Google Sheet that we use for our shared calendar - we put an emoji before each "event", and in the top row of each day I show the emojis for that day's entries. But I need to do:

> =LEFT(F280,2) & LEFT(F281,2) & LEFT(F282,2) & LEFT(F283,2)

Since those emojis actually count as 2 "characters" (UTF-16 code units) there.

julik
So many of these conversations could be easier if there were no `length()` functions, but `length_in_<what_exactly>()` functions instead.
eviks
Very good and informative article, though it still doesn't convince me that nudging the shortest "len" command toward the human-readable size - the grapheme cluster count, like in Swift - isn't the best design approach; all the non-intuitive sizes should be the special cases.
farhanhubble
Why should length in programming be devoid of units? Why can't we have length be (8, UTF32_CODEPOINTS) <class UTF32_CODEPOINT_SZ(8)>?
jlebar
I was 100% prepared to believe that the length of the empty string in JS is 7.

Then upon opening the post I was 100% ready to believe that JS has three different string length functions that all handle Unicode differently.

PhilipRoman
I cannot think of a single common case where grapheme cluster count is important. If you want to print them aligned in a terminal - guess what, double-width characters exist, so the only reliable way is to print them first, measure the cursor movement using escape sequences, calculate the length, and erase the originally printed data.

Even for limiting input field sizes, byte count is much better, as otherwise you are opening yourself up to Unicode denial of service. I think the game Minecraft has such an exploit where you can fit in absurd amounts of UTF-8 data (to the point of data corruption in multiplayer games) since it's limited by visual length.

My personal favorite way of dealing with UTF-8: pretend it's ASCII and assume everything above 128 is an alphabetic character. It just works. For 99% of use cases it doesn't matter if the content is emojis, families of emojis, or ancient Sumerian scripts. You can parse JSON and most other formats this way without caring about code points at all. The trend of unicodizing everything was a mistake; just treat strings as bytes and parse them as UTF-8 only when you really need it (like when building a text editor or a browser engine from scratch).
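
A rough Python sketch of that byte-oriented approach (the separator and field names are just made-up examples): because UTF-8 never reuses bytes below 0x80 inside a multi-byte sequence, you can scan for ASCII delimiters without decoding anything.

  def split_fields(line: bytes, sep: bytes = b",") -> list[bytes]:
      # Safe on UTF-8 input without decoding: every byte of a multi-byte
      # sequence is >= 0x80, so it can never collide with an ASCII separator.
      return line.split(sep)

  row = "name,\U0001F926\U0001F3FC\u200D\u2642\uFE0F,done".encode("utf-8")
  print(split_fields(row))  # three fields; the emoji's 17 bytes pass through untouched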

dveeden2
MySQL has two different length functions...

  mysql> WITH chars AS (SELECT ' ' c) 
      -> SELECT LENGTH(c), CHAR_LENGTH(c) FROM chars;
  +-----------+----------------+
  | LENGTH(c) | CHAR_LENGTH(c) |
  +-----------+----------------+
  |        17 |              5 |
  +-----------+----------------+
  1 row in set (0.01 sec)
Note that the emoji doesn't seem to render in preformatted text on HN.

This should be easier to reproduce (note that CHAR_LENGTH comes out as 17 here because the hex literal is a binary string; CHAR_LENGTH(CONVERT(c USING utf8mb4)) should give 5 again):

  mysql> WITH chars AS (SELECT 0xF09FA4A6F09F8FBCE2808DE29982EFB88F c)
      -> SELECT CONVERT(c USING utf8mb4), LENGTH(c), CHAR_LENGTH(c) FROM chars;
  +--------------------------+-----------+----------------+
  | CONVERT(c USING utf8mb4) | LENGTH(c) | CHAR_LENGTH(c) |
  +--------------------------+-----------+----------------+
  |                        |        17 |             17 |
  +--------------------------+-----------+----------------+
  1 row in set (0.00 sec)
justinator

    perl -e 'use utf8; print length(""). "\n";'
    1
chefandy
It's 1 in Elixir, which measures graphemes by default.

  iex(3)> String.length(" ")
  1
Edit: looks like HN doesn't support that emoji in code blocks, at least.
eternityforest
1 seems like the only acceptable answer.

For all intents and purposes, a user will count it as one character. Truncating the string without including the whole cluster would change its meaning, and isn't an operation anyone would want as a general-purpose thing, any more than someone would want to replace the last character with random letters.

It looks like one character. I'd rather APIs let us continue pretending it is one character.

kolibril13
In Python, with `emoji` holding this string, `print("-".join(f"{ord(c):x}" for c in emoji))` gives `1f926-1f3fc-200d-2642-fe0f`. That can be useful for requesting emoji assets, see https://github.com/hfg-gmuend/openmoji/blob/ad588c8fb4b028d7...
tylergetsay
Interesting; emojis also make SMS really weird. According to Twilio, including an emoji changes the per-message character limit from 160 to 70, because the whole message then has to be encoded as UCS-2 instead of GSM-7.
Obscurity4340
Can anyone comment on whether there are any problems with using emojis to enhance the entropy of passwords? For passwords you only ever autofill and never actually type, I feel like it would be an easy way to augment passwords, but I don't know whether it would translate cleanly to every situation.
foxes
All these abominations are because of non-strict typing.

String = List(Char)

Chars don't have a length, just like a number doesn't have a length - unless you're talking about the number of bits. If you are working with strings, stick with strings. The length of a single-character string should be 1. Just enforce proper typing. Anything else is not consistent.

ars
Am I wrong for assuming that .length should return a length in bytes? If you want 32-bit units, divide that output by 4.

If you want to do Unicode string manipulation and length counting, then use specific functions for that - but the base internal .length function should just output bytes.

neallindsay
Until reading this I had never heard of UTF-32. It doesn't seem like a good way to encode strings.
quaintdev
How the hell did you insert an emoji in the title? AFAIK we can't use emoji on this website.
bitwize
Once again, strings are not simple sequences of characters. For the same reason, it's also useless to "index" into a string without specifying what you're indexing by.
2h
this is one of those things that people point to when comparing languages, but in reality it rarely matters. with Go, you just get the number of bytes, which is the correct default thing to do:

https://godocs.io/builtin#len

if the language default was anything other than this, THAT WOULD BE WRONG and unexpected. I would prefer the default to be the dumb, fast thing. then if I want the slow, fancy thing, I can import some first- or third-party package.

ceeam
These emoticons should never have been a part of Unicode in the first place. Second big mistake of that org after the Unihan fiasco.
iandanforth
Life imitates The Hitchhiker's Guide to the Galaxy. Sure, the answer is 7, but do you know what the question is?
joshspankit
My RSS reader showed the title as “It’s not wrong that “ “.length == 7” and I had to click on it
11235813213455

    [..." "].length === 5 // in JS
JohnFen
It's things like this that make me hate unicode so much.
geoffpado
(2019), for what it's worth.
duxup
Is there a situation where I am going to try to get the length of an emoji and I care about the outcome?
rurban
Naming, folks.

Length = 5

Size = depends on the encoding

Width = 1

lifthrasiir
HN discards emojis in the title. The original emoji was https://emojipedia.org/man-facepalming-medium-light-skin-ton... which consists of 5 Unicode code points.

Also please make sure to read the first heading after the title, which summarizes the whole point of this essay.

asimpletune
What does Swift or Rust return?