Analysis of a Rant on JSON

I was linked to this rant about JSON being a minefield earlier today. I decided to give it a good-faith once-over.

The conclusions are interesting, but I dispute that JSON is truly a minefield in practice.

What the article got right

The article was right about one thing: parsing JSON in 2014 was a disaster on wheels. It was awful. Parsers were all broken, ECMA-404 and RFC4627 ruled the land, and the two differed slightly in important ways. RFC4627 even made horrible suggestions such as "you can use eval() to parse JSON in JavaScript under some conditions!" (Do not do this, ever! This recommendation was later withdrawn via an erratum.) It was never guaranteed that what parsed correctly with one person's parser would parse correctly with yours, due to incompatible extensions. Handling of null values was especially annoying (a lot of parsers did stupid things like turn them into empty strings, which is obviously wrong). And then there was the dreaded ECMA-262, the ECMAScript standard, which is what browsers followed.

Given there were three competing standards, it's no wonder parsing JSON was a mess.

Wheel in the sky keeps on turning

However, times have changed. Parsers have improved. ECMA-404 is officially in harmony with RFC8259 and explicitly references it thus:

This specification, ECMA-404, replaces those earlier definitions of the JSON syntax. Concurrently, the IETF published RFC 7158/7159 and in 2017 RFC 8259 as updates to RFC 4627. The JSON syntax specified by this specification and by RFC 8259 are intended to be identical.

Furthermore:

This specification and RFC 8259 both provide specifications of the JSON grammar but do so using different formalisms. The intent is that both specifications define the same syntactic language. If a difference is found between them, Ecma International and the IETF will work together to update both documents. If an error is found with either document, the other should be examined to see if it has a similar error, and fixed if possible. If either document is changed in the future, Ecma International and the IETF will work together to ensure that the two documents stay aligned through the change.

The only major point of contention:

RFC 8259 also defines various semantic restrictions on the use of the JSON syntax. Those restrictions are not normative for this specification.

The semantics referenced are essentially security considerations; they have nothing to do with the syntax, and ECMA-404 simply does not include them.

Better yet, newer versions of ECMA-262 explicitly defer to ECMA-404:

The JSON interchange format used in this specification is exactly that described by ECMA-404.

Which... goes back to RFC8259. No more contention there.

The tests

His tests probe corner cases in JSON and behaviour which should be undefined or disallowed. That's not a bad smoke test in and of itself, and ensuring the security and standards compliance of a parser is never a bad thing. He makes valid points about the handling of scalars (something often not done right), whitespace (although I'd argue that accepting any Unicode whitespace character as an extension is acceptable, if inadvisable), nested structures, and a few other things. Much of this has not been addressed in the latest specification in any way, nor should it be; these are clearly bugs in people's parsers, and not bloating the standard with useless and little-used features is a goal of JSON.

I strongly believe in Postel's Law: be liberal in what you accept and conservative in what you send. I don't consider accepting technically malformed but obviously unambiguous JSON to be a bad thing (unless it would be a security risk), but I strongly believe emitters should produce only standards-compliant JSON.

I sincerely hope most of the parsers in this table have had their crashing bugs fixed since this was published, because many of these problems are gaping DoS bugs.
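
As a minimal probe in the spirit of his nesting tests, here is a sketch using Python's standard json module as a stand-in for whatever parser you ship; the point is that pathological nesting should fail gracefully, not take the process down:

    import json

    # Deeply nested input is a classic parser stressor: a naive recursive
    # descent parser can exhaust the stack and crash outright. CPython's
    # json module raises RecursionError instead, which is the graceful
    # failure mode you want to see.
    pathological = "[" * 100_000 + "]" * 100_000
    try:
        json.loads(pathological)
    except RecursionError:
        print("rejected safely instead of crashing")

A parser that segfaults or hangs on input like this is exactly the kind of gaping DoS bug I mean.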

Numbers

A few of the number-related bugs he explicitly mentioned have since been fixed. One is numerical precision; RFC8259 explicitly says:

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

And finally, a sane recommendation:

Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.
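
To see why that bound matters, here is a quick sketch of what a binary64 consumer (a JavaScript engine, say) does to an integer just one past that range; Python keeps exact arbitrary-precision integers, so the loss only shows when the value is forced through a double:

    import json

    n = 2**53 + 1                 # 9007199254740993, one past the interoperable range
    as_double = float(n)          # what a binary64-based parser would store
    print(n, int(as_double))      # 9007199254740993 9007199254740992 -- precision silently lost
    print(json.loads(json.dumps(n)) == n)  # True here; a binary64 parser would disagree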

And then there is the resolution of the messy and error-prone grammar for exponents:

The representation of numbers is similar to that used in most programming languages. A number is represented in base 10 using decimal digits. It contains an integer component that may be prefixed with an optional minus sign, which may be followed by a fraction part and/or an exponent part. Leading zeros are not allowed. A fraction part is a decimal point followed by one or more digits. An exponent part begins with the letter E in uppercase or lowercase, which may be followed by a plus or minus sign. The E and optional sign are followed by one or more digits. Numeric values that cannot be represented in the grammar ... (such as Infinity and NaN) are not permitted.

This does trip up a lot of parsers which rely on the values Infinity and NaN being allowed. In my opinion, it was a mistake not to include them. Nonetheless, this can be worked around with strings.
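
Here is a sketch of that workaround in Python; the encode_special/decode_special helpers are hypothetical names for illustration, and both ends have to agree on the convention:

    import json, math

    # Python's json module emits the non-standard tokens NaN/Infinity by
    # default; allow_nan=False enforces the RFC grammar and raises instead.
    try:
        json.dumps(math.nan, allow_nan=False)
    except ValueError:
        pass  # NaN has no representation in the RFC 8259 grammar

    # Workaround: carry the special values as strings and restore on load.
    def encode_special(x):
        return str(x) if isinstance(x, float) and not math.isfinite(x) else x

    def decode_special(x):
        return float(x) if x in ("nan", "inf", "-inf") else x

    wire = json.dumps([encode_special(v) for v in [1.5, math.inf, math.nan]])
    print(wire)                                           # [1.5, "inf", "nan"]
    print([decode_special(v) for v in json.loads(wire)])  # [1.5, inf, nan]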

Variant encodings

UTF-16 and UTF-32 are still permitted in the newest standards, but RFC8259 notes:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8

So the reality is, these encodings don't matter: anything that touches the Internet must use UTF-8. Developers will not tolerate UTF-16 or UTF-32 usage on the Internet. Don't do it; people will hate you and locusts will eat your crops.
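
In practical terms that is a one-liner in most languages; a sketch in Python:

    import json

    doc = {"greeting": "héllo"}    # non-ASCII content is fine; UTF-8 handles it
    wire = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    print(wire)                    # b'{"greeting": "h\xc3\xa9llo"}' -- UTF-8 on the wire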

Escaped invalid characters

RFC8259 now requires escaped UTF-16 code units to form valid surrogate pairs where pairing is required, fixing the bugs he mentioned. I reckon most parsers already do the right thing by now, especially those in languages with good Unicode support.
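
For example, a character outside the Basic Multilingual Plane must be escaped as a UTF-16 surrogate pair, which a conforming parser reassembles; a quick check in Python:

    import json

    # U+1D11E (MUSICAL SYMBOL G CLEF), escaped as the surrogate pair \uD834\uDD1E
    print(json.loads('"\\ud834\\udd1e"'))  # prints the single character 𝄞

    # A lone surrogate such as \uD800 has no valid decoding; whether a parser
    # accepts, replaces, or rejects it remains parser-specific territory.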

Note about extensions

RFC8259 places more emphasis on interoperability. Although many extensions are allowed, many are now explicitly prohibited, and implementations using them are no longer in compliance with RFC8259. In practice, I don't believe these extensions create a major problem unless parsers are actively emitting them (I highly doubt this, as it would break interoperability to an unacceptable degree for all but the most conservative extensions). I believe no one should use anything but the base RFC8259 profile on the Internet, with all interoperability recommendations followed.
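
As an illustration, Python's own parser accepts the Infinity/NaN extension by default, but its parse_constant hook can pin it to a strict profile:

    import json

    def reject_constant(name):
        raise ValueError("non-standard token: " + name)

    print(json.loads("[Infinity]"))  # [inf] -- the extension, accepted by default
    try:
        json.loads("[Infinity]", parse_constant=reject_constant)
    except ValueError as e:
        print(e)                     # non-standard token: Infinity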

Disagreement with his conclusion

Many of his results are now out of date due to these recent changes. I also believe he has not tested enough implementations, in enough languages, on enough platforms, or across enough libraries (there are many). Although I appreciate his rigour, I believe very little of what he presented is a real-world issue, aside from the crashing bugs. The fact of the matter is, JSON is more widely used now than ever, with a wide variety of servers emitting and parsing it, and a lot of the kinks have been worked out because of this. Real-world testing is the ultimate bug-finder ;).

He describes JSON as "being complex" but offers no alternatives; what are we supposed to use, ASCII fields and Unix cut like it's 1980? I'm sure that's adequate for some terminal junkies, researchers, and sysadmins, but it is not adequate for programmers who have to deal with this every day, and with the even worse complexity of bespoke formats. And it's certainly loads better than XML, whose parsers in many languages still have not been fixed.

I agree JSON parsing can be fraught with danger if done incorrectly, but the same can be said of any format. I believe its flexibility and simplicity are good things, not bad. I disagree that it is a complicated format, at least compared to many of its contemporaries. I also don't think it's going to be replaced anytime soon. I see it used more and more on the web, which is just fine with me.
