Serialization Formats are not Toys
There was a great video back in the day with that title.
Most of the issues it identified have long since been fixed, and it didn’t really cover the essentials, but the title is gospel: it gave a solid flavor of how badly wrong you can go in the simple business of turning structured data into sequential bytes and back.
I estimate without evidence that 90% of as-yet undiscovered security bugs are parser/serializer mismatches, and that’s fine because nobody is actually good enough to reliably identify and exploit them.
However, we would be wise not to introduce more.
Representation and Round-Trip
Say we’re writing json. I mean, we will be writing json, but we aren’t now, and won’t be soon; we will be in this example, though.
Of course, the on-disk stored json doesn’t match the user-input json, because json objects are not ordered (RFC 7159). This is the first of what will be many instances where semantically identical objects (objects that would pass their own equality test) have syntactically distinct representations. In some cases these representations are even nondeterministic – for instance, objects are often hashed by their pointer address, which differs from run to run. (json has similar issues with floating point numbers; floating point is a whole bag of tragedy that will get its own post or two later.)
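To make that concrete, here’s a hypothetical illustration using serde_json (which this series isn’t actually using yet): two syntactically distinct documents that parse to semantically equal values.

```rust
use serde_json::Value;

fn main() {
    // Two syntactically distinct representations of one object.
    let a: Value = serde_json::from_str(r#"{"x": 1, "y": 2}"#).unwrap();
    let b: Value = serde_json::from_str(r#"{"y": 2, "x": 1}"#).unwrap();

    // Semantically equal: object member order is not significant.
    assert_eq!(a, b);

    // Whether re-serializing them agrees byte-for-byte depends on the
    // library's map implementation (serde_json sorts keys by default;
    // its "preserve_order" feature keeps insertion order instead).
    println!("{a} vs {b}");
}
```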
This problem is especially acute with set-like datasets, where the absence of ordering and the presence of cross-element constraints form a foul sludge of ambiguity. Databases have a lot of those, so we’ll be living in that world.
There are lots of things that we can do to mitigate this – defining hash and equality functions, for instance, greatly reduces the frequency of this problem (but again has trouble with floats, which lack total order and equality testing; one workaround is sketched below). But the underlying problem is intrinsic: representations are not one-to-one with semantics.
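As a sketch of the hash-and-equality dodge just mentioned (the wrapper name BitEqF64 is mine, not the repo’s), one common move is to compare floats by bit pattern rather than by IEEE semantics:

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper giving f64 the Eq and Hash it natively lacks,
/// by comparing bit patterns instead of IEEE semantics. Note the trade:
/// NaN == NaN here (when the bits match), and 0.0 != -0.0.
#[derive(Debug, Clone, Copy)]
struct BitEqF64(f64);

impl PartialEq for BitEqF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.to_bits() == other.0.to_bits()
    }
}
impl Eq for BitEqF64 {}

impl Hash for BitEqF64 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}
```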
All of this is especially true when we are layering data formats – Unicode has a complex relationship with uniqueness – and when we are doing cross-platform coding, where data types can have different sizes and where collation rules may not always be identical on all supported planets.
In short, trying to enforce rigid data specifications is a fool’s errand for a small organization; even large standards bodies typically fail in some awful edge case or another. You are much, much better off focusing on the invariants of parsing and serialization than on bitwise-identical objects and representations.
Because of this, it is very important that we say ahead of time exactly what we require of a representation – that we call our shots.
Eight ball, corner pocket
I’m no mathematician, even less so a computational linguist, so some details below are wrong.
The minimal requirements of a representation are:
- Total Serializability: Every semantically valid object is representable, without exception.
- Unique Serialization: If two objects have the same serialized form, they are semantically equal.
This implies:
- Unique Parsing: Every representation either represents a single semantically unique object or else is an error.
If a representation meets these properties, we have the corollary:
- Semantic Round Trip: The parse of the serialization of an object is semantically equal to the original object (sketched as a test below).
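Here is a minimal sketch of the semantic round trip as a test, assuming hypothetical parse and serialize functions and a Doc type whose PartialEq is the semantic comparison; none of these names are from the actual repo.

```rust
#[derive(Debug, PartialEq)]
struct Doc { /* fields elided */ }

#[derive(Debug)]
struct ParseError;

fn serialize(_doc: &Doc) -> String { unimplemented!() }
fn parse(_s: &str) -> Result<Doc, ParseError> { unimplemented!() }

#[test]
fn semantic_round_trip() {
    let original = Doc { /* ... */ };
    let serialized = serialize(&original);
    // Total Serializability + Unique Parsing: this must not be an error.
    let reparsed = parse(&serialized).expect("serializer output must parse");
    // Semantic, not syntactic, equality: representation drift is allowed.
    assert_eq!(original, reparsed);
}
```

The important choice is that the final assertion uses the semantic equality on Doc, not a byte comparison of serializations.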
Running the table
In some cases, we have a canonical serialization of an object. A representation is canonical if:
- Canonicity: Any two semantically equal objects have identical serializations.
If we have a canonical serialization, then we obtain two corollaries:
- Syntactic Round Trip: The canonical serialization of an object, when parsed and re-serialized, gives the original serialization.
- Double Round Trip: For any input, if you apply the sequence parse, serialize, parse, serialize, then either the first parse is an error or the two serialize steps return the same result (see the sketch below).
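A sketch of the double round trip, with the same hypothetical parse and serialize as above. Its charm is that it needs no known-good object, so it will happily eat arbitrary (even fuzzer-generated) input:

```rust
fn double_round_trip(input: &str) {
    // An error on the first parse is an acceptable outcome.
    let Ok(first) = parse(input) else { return };
    let once = serialize(&first);
    let second = parse(&once).expect("our own output must parse");
    let twice = serialize(&second);
    // With a canonical serializer, the two serializations are identical.
    assert_eq!(once, twice);
}
```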
Each of the properties above hints at a category of testing that we can pursue. I have a personal fondness for the semantic round trip as a test; other experienced engineers I know have different preferences, but most folks seem to quickly develop a habit of preferring one of these as their default procedure for testing a format.
This PR
This PR backfills reasonably sound test coverage around the previous PR’s parser example. Unsurprisingly it finds a whole bunch of problems, and fixes them.
It’s worth a look at the special cases in the Float case. Floating point numbers’ human-oriented serializations are not 1:1 on floats, and NaN != NaN, to start with. Both of these can be worked around (using Debug instead of Display as the base of serialization; special-casing the comparison). There are further issues with NaN being a category of values rather than a single unique value that will have to wait for later.
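A sketch of the comparison special-case (the function name is illustrative, not the repo’s):

```rust
/// Treat any NaN as equal to any NaN, deferring the fact that NaN is
/// really a whole category of bit patterns rather than a single value.
fn float_semantically_eq(a: f64, b: f64) -> bool {
    (a.is_nan() && b.is_nan()) || a == b
}

fn main() {
    assert!(f64::NAN != f64::NAN); // IEEE comparison is not reflexive
    assert!(float_semantically_eq(f64::NAN, f64::NAN));
    assert!(float_semantically_eq(1.5, 1.5));
}
```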
However string parsing and serialization is too complex to finish in this PR, so its test gets an xfail – it’s the right test code but it can’t pass as the code is currently written. In such cases it’s best to write the test and flip its sense to expect failure; that keeps the test code live for when you need it next.
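In Rust’s test harness, the flip can be a should_panic attribute; here is a sketch, reusing the hypothetical names from the earlier sketches:

```rust
#[test]
#[should_panic] // xfail: delete this line once string escaping works
fn string_round_trip() {
    // The real test, written now so it's ready the moment the code is.
    // Doc::with_string is a hypothetical constructor, not the repo's API.
    let original = Doc::with_string("a\"b\\c\n");
    let reparsed = parse(&serialize(&original)).expect("must parse");
    assert_eq!(original, reparsed);
}
```

Nicely, should_panic keeps the sense flipped in both directions: if the test starts passing, the harness reports a failure, which is exactly the nudge to remove the xfail.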
This blog post corresponds to repository state post_07
Lunar metadata: This is a contraction phase; the density of the codebase grows.