Well then, I suppose we need to talk about escaping.

This is going to be extremely impressionistic because I’m not an historian. Apologies for all the wrong details to follow. It is easy for people of my age – the age of ASCII – to forget that string escaping has usually been complicated and weird. In the ASCII world a byte was a character, a list of characters was a string, and there were two exceptions that you could keep in your head without a worry.

It wasn’t always so. Before ASCII, in the IBM world, came EBCDIC, and in EBCDIC most bytes weren’t characters. For a while a byte corresponded to a keyboard entry (a key, or a key plus modifiers), and bytes were as many bits long as it took to cover the keyboard; the inevitable proliferation of modifier keys called “buckybits” led to 9-bit and even 12-bit bytes, which is part of why standards prefer “octet” to “byte” for clarity.

(I had a former coworker who was fond of making keyboard shortcuts that used weird buckybits – hyper-super-meta-greek-shift-L or what-have-you – that only he could type on his old Symbolics keyboard. Emacs let you do that back in those days.)

Really we have always had ugly strings – we left the state of grace when we left three-finger octal and let no one tell you otherwise.

Now we have unicode and so forth and nobody younger than, say, Return of the Jedi is ever going to think that strings are easy. Easy strings were a temporal island of privilege in a sea of complexity.

How Escaping Work{ed|s}

A string is a list of numbers or numerically indexed tokens chosen from a (usually finite, but see later) set, often called the alphabet. An encoding is a way of writing a string in alphabet A1 in a different alphabet A2, such that every unique A1 string has a unique A2 string (not necessarily the other way around).
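
To make that concrete, here is a toy sketch in Rust (the alphabet and the mapping are invented for illustration): a three-symbol alphabet A1 encoded into bytes. Every A1 string gets its own byte string, but most byte strings decode back to nothing at all.

    // A toy alphabet A1 = {Dot, Dash, Gap} encoded into bytes (A2 = the 256 byte values).
    // Injective: distinct A1 strings map to distinct byte strings.
    // Not surjective: most byte strings (e.g. [0x07]) are nobody's encoding.
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum Sym { Dot, Dash, Gap }

    fn encode(s: &[Sym]) -> Vec<u8> {
        s.iter()
            .map(|sym| match sym {
                Sym::Dot => b'.',
                Sym::Dash => b'-',
                Sym::Gap => b' ',
            })
            .collect()
    }

    fn decode(bytes: &[u8]) -> Option<Vec<Sym>> {
        bytes.iter()
            .map(|b| match b {
                b'.' => Some(Sym::Dot),
                b'-' => Some(Sym::Dash),
                b' ' => Some(Sym::Gap),
                _ => None, // a byte that encodes no A1 symbol
            })
            .collect()
    }

    fn main() {
        let s = vec![Sym::Dot, Sym::Dash, Sym::Gap, Sym::Dot];
        let bytes = encode(&s);
        assert_eq!(decode(&bytes), Some(s)); // unique and recoverable
        assert_eq!(decode(&[0x07]), None);   // but not every byte string decodes
    }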

Start with the easiest case. The “invariant subset” of EBCDIC is an alphabet of 162 symbols (give or take). Most of the other possible binary combinations were omitted for important reasons like “if we punch too many holes in a row on this punchcard the cardboard will tear”. So if you wanted to store a string of EBCDIC symbols in 8-bit RAM, your representation could just be “copy over those binary bits.” A hundred-ish combinations of bits were left over (symbols in 8-bit RAM that didn’t correspond to any symbol in EBCDIC), so you had plenty of room to put in weird control tokens like “end of string” or “raise a particular electrical connection on your serial port.” Sure there were weird problems like “j” not coming after “i”, or people randomly filling in the unused bit combinations with mathematical symbols that you couldn’t represent with punchcards. But other than that … life was good!

Then people got wise to binary, stopped using punch cards, and all heck broke loose. ASCII had 127 symbols, and then 128, and then suddenly 256 (this is a lie), and people started to want “character” and “byte” to mean the same thing. You have an alphabet of 256 symbols, you have a computer that works in 8-bit units, and so life sucks. Life sucks because you always need at least one spare representation, and more often two or three, so that you can also encode metadata like “this is the end of the string.”

Because when your encoding of A1-strings to A2-strings has A1 and A2 of equal size, you quickly find out that you didn’t quite get A1 right after all.

There were tricks of course. If you know where a string starts, you can make the first byte be its length. If it’s longer than 255, you can make the second byte hold more of its length? Opinions vary. Cleaner: Reserve one byte value as “not a real character” and make that the end-of-string mark. But then, oops, people suddenly have the clever idea of using the end-of-string mark to indicate other things, like the end of strings inside of your strings. Look up the -0 flag to xargs. Cry a bit.

In the final analysis, an encoding must always reserve space either along the length of the encoded string (a “how long is this string” position) or in the width of the alphabet (one or two reserved symbols – usually more). A great programming language war of the past is memorialized in the names: The former is a “Pascal String”; the latter a “C String”. C could actually use both, which is part of why we don’t have Pascal to kick around any more.
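
Here is what the two reservations look like, sketched in Rust (the function names are mine, not anybody’s API):

    // "Pascal string": reserve space along the length axis. The first byte
    // holds the length, so the payload may contain any byte value at all.
    fn pascal_string(payload: &[u8]) -> Vec<u8> {
        assert!(payload.len() <= 255, "one length byte only goes so far");
        let mut out = vec![payload.len() as u8];
        out.extend_from_slice(payload);
        out
    }

    // "C string": reserve space in the width of the alphabet. Byte 0x00 is
    // spent as a terminator, so the payload must never contain it.
    fn c_string(payload: &[u8]) -> Vec<u8> {
        assert!(!payload.contains(&0), "0x00 is no longer a real character");
        let mut out = payload.to_vec();
        out.push(0);
        out
    }

    fn main() {
        assert_eq!(pascal_string(b"hi"), vec![2, b'h', b'i']);
        assert_eq!(c_string(b"hi"), vec![b'h', b'i', 0]);
    }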

So there’s an answer. Instead of reserving a separate symbol for each weird thing you want to do, reserve a symbol for “treat the next symbol as me doing something weird.” Today everyone except haskell and podcast sponsors uses backslash for this.

(Old pre-1980 editors used to have modes, usually an “append” mode and an “edit” mode at least. Going between modes was called “escaping” and used the “escape” key. Since keyboard keys and characters were co-extensive, there had to be a way to get the escape key’s character. So the trick to get a literal escape character was called escaping. Or at least that’s how I heard it. I used SuperText back on the Apple 2; it was secretly an Apple port of vi, I think.)

Originally the idea was, “backslash means treat the next character as itself and not as any sort of control command.” So backslash-nul represented nul and not end-of-string. Joy! But nobody wants to type nul (there’s no nul key on your keyboard) so how about backslash-zero? And of course backslash-backslash means backslash so you haven’t actually given up any expressive power. But you can’t just give people that power… there are lots of characters that are inconvenient to type. So backslash-p could be “the number of characters left in the string, expressed as a byte”! And backslash-whitespace could mean “remove all the contiguous whitespace after here” so you can indent your code nicely! And backslash-n could mean “however this machine represents a newline” and backslash-r could mean “the other way of representing a newline that machines other than this one use, those dirty commies.”
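
The core of the trick fits in a dozen lines. A minimal sketch, handling only a handful of escapes (this is illustrative, not any particular language’s actual rules):

    // Minimal unescaper: backslash means "the next character is special".
    fn unescape(input: &str) -> Option<String> {
        let mut out = String::new();
        let mut chars = input.chars();
        while let Some(c) = chars.next() {
            if c != '\\' {
                out.push(c);
                continue;
            }
            match chars.next()? {
                '\\' => out.push('\\'), // backslash-backslash is a backslash
                '0' => out.push('\0'),  // nobody wants to type nul
                'n' => out.push('\n'),  // "however this machine does newlines"
                'r' => out.push('\r'),  // the other machines' newline
                _ => return None,       // unknown escape: refuse politely
            }
        }
        Some(out)
    }

    fn main() {
        assert_eq!(unescape(r"a\nb"), Some("a\nb".to_string()));
        assert_eq!(unescape(r"c:\\temp"), Some(r"c:\temp".to_string()));
        assert_eq!(unescape(r"\q"), None);
    }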

Pretty soon you have a whole backslash menagerie. Never give programmers an inch, is the moral here.

Unicode. Dear god.

In the days of EBCDIC, there was a very simple rule: All languages get Roman-alphabet letters plus as many extras as can fit in a line of a punch card. Foreigners had to buy extra-sturdy punch cards because the cards got flimsy if you punched out too many spots. Foreigners with more letters than fit in the 90 or so extra letters left over… nope. Don’t ask. China was still Red China at the time and IBM couldn’t sell very many computers there.

Actually people came up with lots of clever tricks to handle this. TROFF encoding worked like old typewriters where you could write á by writing a-backspace-apostrophe. It’s still used in a few places, naturally.

ASCII rationalized all of this by standardizing on 128 characters that normal people use, with one extra bit left over for parity (7-o-1 ride-or-die IYKYK, UART 4eva), and then, never mind parity, use that spare bit to hide all the weird foreign things in “upper ASCII.”

Well it turns out that there are foreign languages with more than 128 characters. Don’t sweat the details. You could write a new system where there’s a table of every symbol that has ever been used for human writing, and it has some bytes associated with it, and then some rules for how you can skip bytes that repeat too much. They got the size wrong and had to do it a second time.

The second time they had a simpler rule: Every symbol ever used in human writing (including simplified Chinese, which they messed up the first time) gets a number. A string is a list of numbers. The numbers can be represented in memory or on disk in a bunch of different ways, as long as we agree that it’s a list of numbers – numbers that might be quite large.
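
In Rust terms, just to make “a string is a list of numbers” concrete:

    fn main() {
        // A string is a list of numbers, some of them quite large.
        let s = "naïve 文字";
        let numbers: Vec<u32> = s.chars().map(|c| c as u32).collect();
        println!("{:?}", numbers); // [110, 97, 239, 118, 101, 32, 25991, 23383]

        // Those numbers can be laid out in memory in several ways; UTF-8 is one.
        // A symbol may take one, two, or three bytes here (up to four in general).
        for c in s.chars() {
            println!("{:?} -> {} byte(s) in UTF-8", c, c.len_utf8());
        }
    }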

I am not going to write an explanation of Unicode 2; lots of people have written those already. “Every character has a story” was a great blog.

Anyway the numbers are all magic. Some of them make the other letters nearby work differently. Some of them change the way that bits and bytes work. Some of them make time run backwards. It’s a funny world.

The important thing is that the old idea of “characters are written as bytes; there are a couple of exceptions but you can write them down” was dead dead dead at this point. Java put the nail in the coffin by saying that you don’t just put user display text in unicode, your program is unicode. Name your variables poop-emoji; it’s fine.

Wait, really?

No, of course not. Unicode programming languages were a catastrophe. Simple questions like “what characters end a comment” imploded into impossible complexity. Nobody could write parsers. It was madness. All of the high-throughput filesystem operations that count bytes could unknowingly read fractions of a symbol. It’s like nobody ever learned anything from stateful serial protocols and trying to get a serial mouse to work with a PS/2 connection, but at epic scale.
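
Rust at least makes the fraction-of-a-symbol failure loud rather than silent, which shows the shape of the problem (a small sketch):

    fn main() {
        let s = "café"; // four characters, five bytes: 'é' takes two bytes in UTF-8
        assert_eq!(s.chars().count(), 4);
        assert_eq!(s.len(), 5); // len() counts bytes, not characters

        // Byte offset 4 lands in the middle of 'é': not a character boundary.
        assert!(!s.is_char_boundary(4));

        // Anything that blindly counts bytes is asking to split a symbol in
        // half. Slicing with &s[..4] would panic; get() declines politely.
        assert_eq!(s.get(..4), None);
        assert_eq!(s.get(..3), Some("caf"));
    }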

However, the case for unicode was compelling, and people muddled through. The principle remained:

  • You write code in some alphabet. Your code has things that aren’t literal strings, so the things you write in your code are not identical to the things they represent.
  • So some things in your alphabet have to represent more than one thing. Like quotation marks (see the sketch after this list). Quine Quine Quine Hofstadter Quine Quine.
  • So we still need a magical character to “escape” from literal “raw” strings.
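
Rust’s literal syntax shows both halves of that bargain: ordinary literals keep the escape character, and “raw” literals give it up but then need a fancier closing delimiter (a small illustration):

    fn main() {
        // Inside an ordinary literal, the quote character has two jobs, so
        // one of them needs the escape character.
        let quoted = "she said \"hello\"";

        // A raw literal gives up backslash escapes, but it still has to answer
        // "how does this string end", so the delimiter grows hash marks instead.
        let raw = r#"she said "hello""#;

        assert_eq!(quoted, raw);
    }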

Unicode escaping has more edge cases than Triplette Competition Arms. It’s far beyond the realm of what any one person can handle. It’s so bad that, as formally specified, unicode regular expressions can’t be matched by regular grammars.

But in the end, there are characters, which are numbers, and are different from bytes. Most character numbers aren’t bytes, not all bytes are character numbers, and not all numbers are characters. Some lists of bytes are not strings, and strings turn into lists of bytes via encodings that you have to choose between. And it’s not nice, but it could never have been any other way. (Then emoji happened, and that story will make your hair catch fire. Not here or now.)
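
For the “encodings that you have to choose between” part, one last sketch: the same list of character numbers, two different lists of bytes.

    fn main() {
        let s = "héllo";

        // One string, two encodings, two different byte lists.
        let utf8: Vec<u8> = s.bytes().collect();
        let utf16: Vec<u8> = s
            .encode_utf16()
            .flat_map(|unit| unit.to_le_bytes())
            .collect();

        assert_eq!(utf8.len(), 6);   // 'é' costs two bytes in UTF-8
        assert_eq!(utf16.len(), 10); // every character here costs two bytes in UTF-16

        // Neither byte list is "the string"; each is one encoding of it.
        assert_eq!(String::from_utf8(utf8).unwrap(), s);
    }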


Today’s PR will attempt to fix the literal string round trip in the grammar by using a builtin string-escaping feature of nom. It’s not good enough, but it’s just about good enough for now.
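
I won’t reproduce the PR here, but the rough shape of nom’s built-in escaping support looks like this. Treat it as a sketch against nom 7’s escaped_transform combinator with an invented escape set, not the grammar’s actual code:

    use nom::branch::alt;
    use nom::bytes::complete::{escaped_transform, is_not, tag};
    use nom::combinator::value;
    use nom::IResult;

    // Sketch (nom 7): parse the body of a quoted literal, turning backslash
    // escapes back into the characters they stand for.
    fn string_body(input: &str) -> IResult<&str, String> {
        escaped_transform(
            is_not("\\\""),             // "normal" characters: anything but \ and "
            '\\',                       // the escape character
            alt((
                value("\\", tag("\\")), // \\ -> backslash
                value("\"", tag("\"")), // \" -> quote
                value("\n", tag("n")),  // \n -> newline
            )),
        )(input)
    }

    fn main() {
        let (rest, parsed) = string_body(r#"a\nb\"c"#).unwrap();
        assert_eq!(parsed, "a\nb\"c");
        assert!(rest.is_empty());
    }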

Worth noting here is that I’ve added some more aggressive round-trip testing. Handling vacant cases (nans, empty strings, Nones if you are cursed) always introduces opportunities for mischief, so an actual semantic test was in order. Revisit my earlier post on round-trip testing for the gist.
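
For flavor, here is the round-trip property in its generic shape, as a proptest sketch. The render and parse functions are placeholders standing in for the real printer and parser, not the repository’s actual code:

    use proptest::prelude::*;

    // Placeholder "printer": emit a quoted literal with backslash escapes.
    fn render(s: &str) -> String {
        let mut out = String::from("\"");
        for c in s.chars() {
            match c {
                '"' => out.push_str("\\\""),
                '\\' => out.push_str("\\\\"),
                _ => out.push(c),
            }
        }
        out.push('"');
        out
    }

    // Placeholder "parser": invert render, vacant cases included.
    fn parse(src: &str) -> Option<String> {
        let body = src.strip_prefix('"')?.strip_suffix('"')?;
        let mut out = String::new();
        let mut chars = body.chars();
        while let Some(c) = chars.next() {
            if c == '\\' { out.push(chars.next()?); } else { out.push(c); }
        }
        Some(out)
    }

    proptest! {
        // The semantic test: printing then parsing gives back the original,
        // empty strings and awkward characters included.
        #[test]
        fn literal_round_trips(s in ".*") {
            prop_assert_eq!(parse(&render(&s)), Some(s));
        }
    }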

This blog post corresponds to repository state post_08

Lunar metadata: This is a contraction phase; the density of the codebase grows. Note that we have taken multiple contractions in a row; the state of our code is not propitious, and it yearns to grow.