Starting Our Engine
Okay, time to begin in earnest. The simplest entity that exists at each of the front, middle, and back-end is the literal – a literal representation of data.
I’ll get to data model issues later, but for now we are going to start with json as our data model because, frankly, every system these days ends up needing json support somewhere. We are going to need more types than json supports (we’ll need a set type and a special case of sets, the relation type; eventually we’ll also want some gnarlier things like elias-encoded bignums, tensors, gensyms, and atoms) but we start here.
So for our first thread through the system, we say that the front end has a
real parser (because it is meant to have human-friendly features); the middle
end uses vanilla Vector and HashMap representation of that data (eventually we
will want less-vanilla conctree lists, and maybe some sort of headwords list);
the back end will store that data back as something dumb (in the near term
json, a format simple enough that it doesn’t need a “real” parser).
Obviously this is useless; that’s fine, we’re not trying to be useful yet.
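To make the middle and back ends of that thread concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the names Literal, to_json, and quote are hypothetical, "Vector" is taken to mean std's Vec, and the variants are simply json's six value kinds. It is not the code in the repository, and it leaves the front-end parser out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical tagged union of json-shaped literals (middle-end representation).
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Literal>),
    Object(HashMap<String, Literal>),
}

impl Literal {
    /// Hypothetical "dumb" back end: write the value back out as json text.
    /// Caveat: NaN and infinities have no json spelling and are not handled here.
    fn to_json(&self) -> String {
        match self {
            Literal::Null => "null".to_string(),
            Literal::Bool(b) => b.to_string(),
            Literal::Number(n) => n.to_string(),
            Literal::String(s) => quote(s),
            Literal::Array(items) => {
                let parts: Vec<String> = items.iter().map(Literal::to_json).collect();
                format!("[{}]", parts.join(","))
            }
            Literal::Object(fields) => {
                // HashMap iteration order is unspecified, so sort keys for stable output.
                let mut keys: Vec<&String> = fields.keys().collect();
                keys.sort();
                let parts: Vec<String> = keys
                    .iter()
                    .map(|k| format!("{}:{}", quote(k), fields[*k].to_json()))
                    .collect();
                format!("{{{}}}", parts.join(","))
            }
        }
    }
}

/// Wrap a string in double quotes, escaping the characters json requires.
fn quote(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Sorting the keys is a choice made for this sketch so that the output is deterministic; nothing in the post says the real back end does this.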
Never write your own parser
If you have classical CS training, you spent a chunk of time on computational grammars: LR(1) and LALR and maybe, if you are a young person, packrat parsing or something. And as part of this training you had to write a parser, or a parser-lexer stack, or a semantic actions interpreter, or something.
Due to your level of experience at this point in your career you have met Zalgo at least once, and have at least some vague notion that the horror of that encounter had something to do with the Chomsky hierarchy. Possibly you know one or more jokes about Nim Chimpsky, or clever facts (like that the language of regular expressions is not itself regular) that seemed much more interesting when you were in an atmosphere with a measurable THC[1] content.
In short, you are intellectually qualified to write a parser. DO NOT DO THIS THING.
The point of learning to write a parser was never to write parsers. Smart, neurologically Lovecraftian individuals have written parser generators and parser combinators so that those of us whose neural spices come in “medium verbal” or milder don’t have to.
The point of learning how to write a parser is that it is the only way to understand parser error messages. When a parser generator tells you that you have a serial repetition ambiguity or something, your experience writing a parser will activate a long-dormant neuron that will tell you, oh, that’s the thing that happens because order of operations, or maybe that’s the thing that order of operations happens because of, or something, and thereby you will save yourself from a level of frustration typically associated with snapping and driving an armored bulldozer over your neighbor’s oleaceae forsythiae.
Every programming language that doesn’t suck has a parser generator or (more modernly) parser combinator or (if they’re kinky) a parsing expression grammar library of which everyone says, “oh, just use that thing, even though its authors are scary and make bizarre lifestyle and/or build system choices.” Use that one.
Rust has not quite reached that level of maturity, but it’s closing in on it:
serde is overwhelmingly canonical for serialization/deserialization, and nom (a
parser combinator library) has about twice as many users as its nearest
competitor for parsing.
We will use nom for now, and although we won’t need sophisticated
deserialization for some time we’ll plan to use serde when the time comes.
One small complexity appears instantly: nom changes its API regularly,
including changing the return types of existing functions. So we need to
version-pin nom. For the moment we’ll just pin its major version and hope
that it respects semver.
cargo add nom@~8 serde
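For anyone who doesn’t read version requirements fluently: in Cargo, a tilde requirement with only a major version (~8) admits any release at or above 8.0.0 and below 9.0.0. The sketch below spells that rule out. The helper names parse_version and satisfies_tilde_major are hypothetical, it only understands plain major.minor.patch strings (no pre-release or build tags), and it is not how Cargo itself resolves versions.

```rust
/// Parse "major.minor.patch" into numbers; returns None for anything else.
fn parse_version(s: &str) -> Option<(u64, u64, u64)> {
    let mut parts = s.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// True if `version` lies in the range a `~MAJOR` requirement allows:
/// at least MAJOR.0.0 and strictly below (MAJOR + 1).0.0.
fn satisfies_tilde_major(major: u64, version: &str) -> bool {
    matches!(parse_version(version), Some((m, _, _)) if m == major)
}
```

So a hypothetical 8.3.1 release would be picked up automatically, while 9.0.0 would not, which is exactly the bet on semver described above.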
A brief note
It is a depressingly common convention in query languages that their output
grammars are not in their input grammars; for instance, the result of an SQL
query is typically not valid SQL. There’s a (bad) reason for this, which I’ll
probably get to when I get on my query language soapbox in a later post; for
now understand that that’s why I’m not concerned about nom’s lack of
serialization.
We will pay a large price for this later, but we can’t avoid it.
This PR
Introduces serde and nom into the codebase around a trivial tagged union of
literals. No new or interesting system-level behaviour.
This PR is considerably more than 50 lines, so it certainly contains serious defects, as we will see in the next post. I have deliberately run implementation a bit ahead of testing and comments for illustrative purposes, so the code quality at this revision is somewhat poor.
This blog post corresponds to repository state post_06
Lunar metadata: This is an expansion phase; the scope of the codebase grows.
[1] Note for modern readers: “THC” is the chemical people used to recover from staring into the NP-abyss after LSD-25 went out of fashion but before spironolactone and fursuits became widely available.