Newline-delimited JSON

There are a couple of opinions on line-delimited JSON. There’s even an officious but totally [needs citation] Wikipedia article. Pretty much everybody agrees on how it’s done (the only difference is LF vs. CR vs. CRLF), but not what to call it. Tools like jq already handle this format by default, and it makes a lot of sense for most streaming purposes. Anyway, the main dispute is which MIME type to use. Here are some:

  • application/x-ldjson or application/ldjson
  • application/x-json-stream or application/json-stream
  • application/json; boundary=LF
  • application/json; boundary=CRLF
  • application/json; boundary=CR
  • application/json; boundary=NL
  • application/json; boundary=EOL

Another tricky bit is that JSON can contain newlines, whether in strings or from pretty-printing, but it doesn’t have to. You don’t have to pretty-print it, and you can use one of the escape sequences, \n or \u000A, wherever you might have used an actual newline.
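In JavaScript this works out conveniently: JSON.stringify (without an indent argument) always escapes literal newlines inside strings, so a compactly serialized object never contains a raw 0x0A byte. A quick check:

```javascript
// JSON.stringify escapes any literal newline in string values,
// so a compact (non-pretty-printed) object is always newline-free.
const obj = { note: "line one\nline two" };
const line = JSON.stringify(obj); // '{"note":"line one\\nline two"}'

// No raw 0x0A byte appears in the serialized form,
// so the line is safe to newline-delimit.
const containsRawNewline = line.includes("\n");
```

Parsing the line back with JSON.parse restores the real newline inside the string, so nothing is lost by the escaping.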

Twitter also has a cute delimited=length option, where the JSON objects in the stream are preceded by decimal numbers indicating how many of the subsequent bytes the JSON object occupies. However, this is not even remotely valid JSON, and if you lose track of where you are in the stream, you either have to start parsing JSON to figure out if you’re currently inside or outside an object, or restart the stream (which is fine for Twitter, but not all streaming use-cases).

My synopsis

The boundary= variations are based on the multipart/ MIME types, and I like these the best. By saying boundary=LF, you are promising:

  • there are no 0x0A (\n) characters inside your individual JSON objects
  • each individual JSON object is separated by precisely one literal 0x0A character

A parser for newline-delimited JSON is also written around the idea of a boundary. You collect a stream of bytes into a buffer, then whenever you hit the boundary character, you use the generic JSON.parse(...) on whatever is in the buffer. Pretty simple and you don’t have to write an entire JSON parser.
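A minimal sketch of that idea (makeLineParser and feed are hypothetical names, not from any library): accumulate chunks in a buffer, and every time the 0x0A boundary appears, hand the completed line to the generic JSON.parse.

```javascript
// Buffer incoming chunks; whenever the 0x0A boundary appears,
// JSON.parse whatever has accumulated before it.
function makeLineParser(onObject) {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    let index;
    while ((index = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, index);
      buffer = buffer.slice(index + 1);
      if (line.length > 0) onObject(JSON.parse(line));
    }
  };
}

// Chunks can split an object anywhere; the boundary resynchronizes parsing.
const parsed = [];
const feed = makeLineParser((obj) => parsed.push(obj));
feed('{"a":1}\n{"b"');
feed(':2}\n');
```

Note that the chunk boundaries and the object boundaries are independent: the second object arrives split across two chunks, and the parser still recovers both.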

One more option could be to output a proper JSON array for clients that aren’t expecting multiple JSON objects, but with special handling around the commas separating the objects in that array. I.e., the stream would always start off with a [ character; then, as objects are streamed to it, it serializes them to the stream output, prefixing all but the first with a comma, and indicates the end of the stream with a literal ] character. E.g.:

[{"first":"Chris","last":"Brown"}
,{"first":"Publius","last":"Maximus"}
,{"first":"Optimus","last":"Prime"}
]

That’s a valid JSON representation of a JSON array, but it’s also somewhat easy to parse as a stream, and even easier to stringify as a stream. There is no boundary, and the “streaming” aspect would be purely by convention.
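The stringify side of that convention can be sketched in a few lines (streamArray is a hypothetical name; write stands in for whatever sink the stream uses):

```javascript
// Emit "[", then each object on its own line with a leading ","
// for all but the first, then a closing "]". The concatenated
// output is one ordinary, valid JSON array.
function streamArray(objects, write) {
  write("[");
  objects.forEach((obj, i) => {
    write((i === 0 ? "" : "\n,") + JSON.stringify(obj));
  });
  write("\n]");
}

let out = "";
streamArray([{ first: "Chris" }, { first: "Publius" }], (s) => { out += s; });
```

A client that knows the convention can split on the "\n," prefixes as objects arrive; a client that doesn’t can wait for the whole thing and hand it to a plain JSON.parse.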

What’s wrong with the novel MIME types?

  • application/json-stream doesn’t really say anything new. Typical JSON is a stream if you parse it like one.
  • application/ldjson doesn’t specify what sort of line endings will be used. CR/LF/CRLF is still an issue. And what if you wanted to use 0x00 (NUL) characters as delimiters?
  • IANA would have to approve anything new. I don’t know if the boundary=XYZ suffix counts as new, but I imagine it could be standardized more easily.

File extension

The file extension is also a contentious point. Some say plain .json, some say .ldjson, or even .ldj. I personally like .njson, but that’s no less arbitrary than the other options.

Judgment calls

  1. Use application/json; boundary=LF when you really need to return newline-delimited JSON, like when your client doesn’t know how to process an array as a stream.
  2. Use plain application/json with the convention of newline placement described above when your client doesn’t necessarily know how to process newline-separated JSON, but some clients may want the streaming cues.
  3. What about the file extension? I don’t know. File extensions aren’t very descriptive in the first place. Maybe use .json and then annotate your data in a README somewhere saying it’s boundary=LF?