aboutsummaryrefslogtreecommitdiff
path: root/doc/json_parser_design.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/json_parser_design.md')
-rw-r--r--doc/json_parser_design.md138
1 files changed, 138 insertions, 0 deletions
diff --git a/doc/json_parser_design.md b/doc/json_parser_design.md
new file mode 100644
index 0000000..3d7cc6a
--- /dev/null
+++ b/doc/json_parser_design.md
@@ -0,0 +1,138 @@
+# JSON Parser Design
+
+The overall principle of the json parser is as follows:
+
+`flatcc/flatcc_json.h` contains functions to parse json primitives
+and type conversions, and a generic json parser to skip unrecognized
+fields.
+
+For each table all known fields are sorted and a trie is constructed.
+After recognizing a json object, each member name is read partially
+and matched against the trie, and more of the field name is read as
+demanded by the trie and a match or rejection is decided.
+
+The member name is read as a single 64 bit word read in big endian
+format (zero padded at the end of file). Big endian makes the first
+character in the name most significant and allows the word to contain
+trailing "garbage". The efficiency depends on unaligned 64-bit reads
+with a fast byteswapping operation, but there is a fall-back that
+works in all cases via an unaligned read support function.
+
+The trie is constructed by sorting all field names and taking a
+median split. If the split name has preceeding name which is a strict
+prefix, that name is chosen as split instead. This is important
+because words are read with trailing "garbage". If the median name is
+longer than a 64 bit word, it is treated specially where a match
+triggers reading more data and repeating the process.
+
+The median is tested for less than, but not equality. Each side if
+the choice uses a new median like above, until there is only one
+option left. This option is then tested for exact match by masking
+out any trailing data. If the match succeeds the input member name
+must be checked for closing `"` or other valid termination if
+allowing unquoted names.
+
+Note that this match only visits data once, unlike a hash table where
+a lookup still requires matching the name. Since we expect the number
+of field names per table to be reasonable, this approach is likely
+faster than a hash table.
+
+If the match fails, then an error can be generated with a "unknown
+field" error, or the field can be consumed by the generic json
+parser.
+
+When a member name is successfully matched againts a known field, the
+member value is expected to be of a primitive type such as a string
+or an integer, the value is parsed using a json support parser and
+type converter which may issue overflow and format errors. For
+example, integers won't accept decimal points and ubyte fields won't
+accept integers of value 256 or higher. If the parse was successful
+the member is added the current flatbuffer table opened at start of
+the json object. In fact, the value is parsed directly into the
+flatbuffer builders entry for the field.
+
+If a scalar field fails to parse its value, a second attempt is
+done parsing the value as a symbolic value. This parse is similar
+to parsing member names but uses a global set of known constants
+from enumerations. If this also fails, the field is invalid and
+an error is generated.
+
+When the field is parsed, comma is detected and the process is
+repeated. It may fail on duplicate fields because the builder will
+complain.
+
+Once a closing bracket is matched, the table is closed in the
+builder.
+
+In the above we discussed tables, but structs work the same with the
+exception that any fields not present are simply set to zero and it
+is not checked. Nor are duplicate fields checked. This is to avoid
+allocating extra space for something that isn't very important.
+
+Complex member types require more consideration. We wish to avoid a
+recursive descent parser because it imposes limits on nesting depth,
+but we want to have a composable parser. This leads to solutions such
+as a trampoline, returning a parser function for the matched field
+which is called by a driver function, before resuming the current
+parser. We wish to avoid this as it adds overhead, especially in
+inlining.
+
+To avoid trampolines we compile all tables and structs into one large
+function where each type can be reached by a goto label or a switch
+statement. We can then use a variant of Simon Tathams famous
+co-routine by duff device to abort and resume the current parse. We
+still need a stack to track return state.
+<http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>
+
+Outside the giant state machine we provide wrappers so we can parse
+each type from a clean call interface to parse root type.
+
+The above solution has one major problem: we apparently have to
+include types from included schema as well. But we don't have to it
+turns out: Each included schema is closed and can never have a member
+type in our current schema. Therefore mutual recursion is not
+possible. This means we can afford to use function call recursion
+into included parsers with recursion depth bounded by the number of
+included schema.
+
+Spaces are optimized by first testing to see if there is a character
+above space and exit immediately if so. Otherwise, if unaligned
+access is valid and the buffer holds at least 16 bytes, we descend
+hierarchically to test ever small unligned loads against a word of
+spaces. This loops as long as 2x64 bit words can be matched to a word
+of pure spaces (ASCII 0x20). If there is any control character, or
+if we reach near the end or, or if unaligned access is not available,
+we bail out to a standard character parsing loop that also does line
+counting.
+
+A final, and major issue, is how to handle unions:
+
+Googles `flatc` compiler makes the assumtion that the union type
+field appear before the union table field. However, this order cannot
+be trusted in standard JSON. If we were to sort the fields first
+or prescan for type, it would complete ruin our parsing strategy.
+
+To handle unions efficiently we therefore require either that union
+fields appear before the union table, and that the associated
+union table is present before another unions type field, or we
+require that the member name is tagged with its type. We prefer
+the tagged approach, but it is not compliant with `flatc v1.2`.
+
+By using a tag such as `"test_as_Monster"` instead of just "test",
+the parser can treat the field as any other table field. Of course
+there is only allowed to be one instance, so `"test_as_Weapon"`
+in the same object would not be allowed. To the parser it just
+triggers a duplicate field error.
+
+In addition, the `"test_type"` field is allowed to appear anywhere,
+or to be absent. If it is present, it must have a type that
+matches the tagged member, unless it is of type NONE. NONE may
+also be represented as `"test_as_NONE": null`.
+
+While the tagged approach has no use of the type field, it may still
+be very useful to other JSON consumers so they can know what tagged
+field to look up in the JSON object.
+
+The parser handles the type field by setting the type field in the
+flatcc builder table and checking if it is already set when
+seeing a tagged or untagged union table field.