aboutsummaryrefslogtreecommitdiff
path: root/flatcc/doc/json_parser_design.md
blob: 3d7cc6a5b2f7903e939d2b68638b7ce19dc1df57 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# JSON Parser Design

The overall principle of the json parser is as follows:

`flatcc/flatcc_json.h` contains functions to parse json primitives
and type conversions, and a generic json parser to skip unrecognized
fields.

For each table all known fields are sorted and a trie is constructed.
After recognizing a json object, each member name is read partially
and matched against the trie, and more of the field name is read as
demanded by the trie and a match or rejection is decided.

The member name is read as a single 64 bit word read in big endian
format (zero padded at the end of file). Big endian makes the first
character in the name most significant and allows the word to contain
trailing "garbage". The efficiency depends on unaligned 64-bit reads
with a fast byteswapping operation, but there is a fall-back that
works in all cases via an unaligned read support function.

The trie is constructed by sorting all field names and taking a
median split. If the split name has preceeding name which is a strict
prefix, that name is chosen as split instead. This is important
because words are read with trailing "garbage". If the median name is
longer than a 64 bit word, it is treated specially where a match
triggers reading more data and repeating the process.

The median is tested for less than, but not equality. Each side if
the choice uses a new median like above, until there is only one
option left. This option is then tested for exact match by masking
out any trailing data. If the match succeeds the input member name
must be checked for closing `"` or other valid termination if
allowing unquoted names.

Note that this match only visits data once, unlike a hash table where
a lookup still requires matching the name. Since we expect the number
of field names per table to be reasonable, this approach is likely
faster than a hash table.

If the match fails, then an error can be generated with a "unknown
field" error, or the field can be consumed by the generic json
parser.

When a member name is successfully matched againts a known field, the
member value is expected to be of a primitive type such as a string
or an integer, the value is parsed using a json support parser and
type converter which may issue overflow and format errors. For
example, integers won't accept decimal points and ubyte fields won't
accept integers of value 256 or higher. If the parse was successful
the member is added the current flatbuffer table opened at start of
the json object. In fact, the value is parsed directly into the
flatbuffer builders entry for the field.

If a scalar field fails to parse its value, a second attempt is
done parsing the value as a symbolic value. This parse is similar
to parsing member names but uses a global set of known constants
from enumerations. If this also fails, the field is invalid and
an error is generated.

When the field is parsed, comma is detected and the process is
repeated. It may fail on duplicate fields because the builder will
complain.

Once a closing bracket is matched, the table is closed in the
builder.

In the above we discussed tables, but structs work the same with the
exception that any fields not present are simply set to zero and it
is not checked. Nor are duplicate fields checked. This is to avoid
allocating extra space for something that isn't very important.

Complex member types require more consideration. We wish to avoid a
recursive descent parser because it imposes limits on nesting depth,
but we want to have a composable parser. This leads to solutions such
as a trampoline, returning a parser function for the matched field
which is called by a driver function, before resuming the current
parser. We wish to avoid this as it adds overhead, especially in
inlining.

To avoid trampolines we compile all tables and structs into one large
function where each type can be reached by a goto label or a switch
statement. We can then use a variant of Simon Tathams famous
co-routine by duff device to abort and resume the current parse. We
still need a stack to track return state.
<http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>

Outside the giant state machine we provide wrappers so we can parse
each type from a clean call interface to parse root type.

The above solution has one major problem: we apparently have to
include types from included schema as well. But we don't have to it
turns out: Each included schema is closed and can never have a member
type in our current schema. Therefore mutual recursion is not
possible. This means we can afford to use function call recursion
into included parsers with recursion depth bounded by the number of
included schema.

Spaces are optimized by first testing to see if there is a character
above space and exit immediately if so.  Otherwise, if unaligned
access is valid and the buffer holds at least 16 bytes, we descend
hierarchically to test ever small unligned loads against a word of
spaces. This loops as long as 2x64 bit words can be matched to a word
of pure spaces (ASCII 0x20).  If there is any control character, or
if we reach near the end or, or if unaligned access is not available,
we bail out to a standard character parsing loop that also does line
counting.

A final, and major issue, is how to handle unions:

Googles `flatc` compiler makes the assumtion that the union type
field appear before the union table field. However, this order cannot
be trusted in standard JSON. If we were to sort the fields first
or prescan for type, it would complete ruin our parsing strategy.

To handle unions efficiently we therefore require either that union
fields appear before the union table, and that the associated
union table is present before another unions type field, or we
require that the member name is tagged with its type. We prefer
the tagged approach, but it is not compliant with `flatc v1.2`.

By using a tag such as `"test_as_Monster"` instead of just "test",
the parser can treat the field as any other table field. Of course
there is only allowed to be one instance, so `"test_as_Weapon"`
in the same object would not be allowed. To the parser it just
triggers a duplicate field error.

In addition, the `"test_type"` field is allowed to appear anywhere,
or to be absent. If it is present, it must have a type that
matches the tagged member, unless it is of type NONE. NONE may
also be represented as `"test_as_NONE": null`.

While the tagged approach has no use of the type field, it may still
be very useful to other JSON consumers so they can know what tagged
field to look up in the JSON object.

The parser handles the type field by setting the type field in the
flatcc builder table and checking if it is already set when
seeing a tagged or untagged union table field.