## Introduction to tuple_vector

`tuple_vector` is a data container that is designed to abstract and simplify the handling of a "structure of arrays" layout of data in memory. In particular, it mimics the interface of `vector`, including functionality for inserts, erases, push_backs, and random access. It also provides a `RandomAccessIterator` and corresponding functionality, making it compatible with most STL (and STL-esque) algorithms such as ranged-for loops, `find_if`, `remove_if`, or `sort`.

When used appropriately, this container can improve the performance of some algorithms through better cache utilization and by allowing for sensible SIMD programming, while keeping the structure of a single container, so that a developer can continue to use existing algorithms from the STL and the like.

## Review of "Structure of arrays" data layouts

When trying to improve the performance of some code, it can be desirable to transform how the data is stored in memory from an "array of structures" into a "structure of arrays". That is, instead of storing a series of objects as a single contiguous chunk of memory, one or more data members are stored as separate chunks of memory that are handled and accessed in parallel with each other. This can be beneficial in two primary respects:

1) To improve the cache utilization of the data accesses, e.g. by using more of the data that is loaded per cache line, and thereby reducing the amount of time spent waiting on memory accesses from off-CPU memory. This presentation from Mike Acton touches on this, among other things: https://www.youtube.com/watch?v=rX0ItVEVjHc

2) To allow the data to be more easily loaded and used by SIMD kernels, by being able to load memory directly into a SIMD register. This is touched on in this presentation from Andreas Fredriksson about writing code with SIMD intrinsics: http://www.gdcvault.com/play/1022249/SIMD-at-Insomniac-Games-How ...as well as in this guide for writing performant ISPC kernels: https://ispc.github.io/perfguide.html

## How TupleVecImpl works

`tuple_vector` inherits from `TupleVecImpl`, which provides the bulk of the container's functionality. It manages the allocated memory, marshals data members to each array of memory, generates the necessary iterators, and so on.

A `tuple_vector` is declared with a list of types, or "tuple elements", indicating what data to store in the container, similar to how `tuple` operates. `TupleVecImpl` uses this list of tuple elements to inherit from a series of `TupleVecLeaf` structures, each of which holds a pointer to an array of its corresponding type in memory. When the container is dereferenced, either to fetch a tuple of references or just to fetch the raw data pointers, it is these pointers that are used.

While each `TupleVecLeaf` contains a pointer to its own block of memory, these are not individual memory allocations. When `TupleVecImpl` needs to grow its capacity, it calculates the total size needed for a single allocation, taking into account the number of objects in the container, the size of each tuple element's type, and the alignment requirements of each type. Pointers into the allocation for each tuple element are determined at the same time and passed to the corresponding `TupleVecLeaf`.
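As a rough illustration of this layout, here is a simplified, self-contained sketch using `std::tuple`; the names are invented for exposition and this is not the actual EASTL implementation:

```
#include <cstddef>
#include <tuple>
#include <utility>

// One "leaf" per tuple element, each holding a pointer that is later aimed at
// that element's slice of a single shared allocation.
template <std::size_t I, typename T>
struct LeafSketch
{
	T* mpData = nullptr;
};

template <typename Indices, typename... Ts>
struct SoAImplSketch;

// Inherit from one leaf per element type, indexed so duplicate types are allowed.
template <std::size_t... Is, typename... Ts>
struct SoAImplSketch<std::index_sequence<Is...>, Ts...> : LeafSketch<Is, Ts>...
{
	// Dereferencing expands a parameter pack over every parent leaf and yields
	// a tuple of references, one per element array, at the requested index.
	std::tuple<Ts&...> at(std::size_t n)
	{
		return std::tuple<Ts&...>(LeafSketch<Is, Ts>::mpData[n]...);
	}
};

template <typename... Ts>
using SoASketch = SoAImplSketch<std::index_sequence_for<Ts...>, Ts...>;
```

For a `tuple_vector<bool, float, Vec3>`, this corresponds to three leaves holding a `bool*`, a `float*`, and a `Vec3*`, all carved out of the same allocation; the real `TupleVecImpl` additionally performs the sizing and alignment work described above.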
From there, many of the interactions with `TupleVecImpl` that modify or access members of the container reference each `TupleVecLeaf`'s data pointer in series, using parameter packs to repeat each operation for each parent `TupleVecLeaf`.

## How tuple_vector's iterator works

`TupleVecImpl` provides a definition of an iterator type, `TupleVecIter`. As mentioned above, `TupleVecIter` provides all of the functionality needed to operate as a `RandomAccessIterator`. When it is dereferenced, it provides a tuple of references, similar to `at()` or `operator[]` on `TupleVecImpl`, as opposed to a reference to a single value. As well, a customization of `move_iterator` for `TupleVecIter` is provided, which returns a tuple of rvalue references.

Internally, `TupleVecIter` tracks an index into the container, as well as a copy of all of the `TupleVecImpl`'s `TupleVecLeaf` pointers, taken at the time of the iterator's construction. As a result, modifying the iterator involves just changing the index, and dereferencing the iterator into a tuple of references involves dereferencing each pointer at an offset specified by that index.

Of the various ways of handling the multitude of references, this tended to provide the best code generation. For example, having a tuple of pointers that are collectively modified with each iterator modification resulted in the compiler not being able to accurately determine which pointers were relevant to the final output of some function, creating many redundant operations. Similarly, having the iterator refer back to the source `TupleVecImpl` for the series of pointers often resulted in extra, unnecessary data hops to the `TupleVecImpl` to repeatedly fetch data that was theoretically, but not practically, mutable. While this solution is the heaviest in terms of storage, the resulting assembly tends to be competitive with traditional structure-of-arrays setups.

## How to work with tuple_vector, and where to use it

Put simply, `tuple_vector` can be used as a replacement for `vector`. For example, instead of declaring a structure and vector as:

```
struct Entity
{
	bool active;
	float lifetime;
	Vec3 position;
};

vector<Entity> entityVec;
```

...the `tuple_vector` equivalent of this can be defined as:

```
tuple_vector<bool, float, Vec3> entityVec;
```

In terms of how `tuple_vector` is modified and accessed, it has a similar feature set to `vector`, except that where `vector` would accept or return a single value, `tuple_vector` accepts or returns a tuple of values, or an unstructured series of equivalent arguments. For example, the following functions can be used to access the data, either by fetching a tuple of references to a series of specific values, or the data pointers to the tuple elements:

```
tuple<bool&, float&, Vec3&> operator[](size_type)
tuple<bool&, float&, Vec3&> at(size_type)
tuple<bool&, float&, Vec3&> iterator::operator*()
tuple<bool&&, float&&, Vec3&&> move_iterator::operator*()
tuple<bool*, float*, Vec3*> data()

// extract the Ith tuple element pointer from the tuple_vector
template <size_t I>
T* get<I>() // e.g. bool* get<0>(), float* get<1>(), and Vec3* get<2>()

// extract the tuple element pointer of type T from the tuple_vector
// note that this function can only be used if there is one instance
// of type T in the tuple_vector's elements
template <typename T>
T* get<T>() // e.g. bool* get<bool>(), float* get<float>(), and Vec3* get<Vec3>()
```
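As a brief usage sketch tying these together (a hypothetical snippet, reusing the assumed `Vec3` type from the example above and assuming it is default-constructible; requires `<EASTL/tuple_vector.h>`):

```
tuple_vector<bool, float, Vec3> entityVec;
entityVec.push_back();                  // append one default-constructed entry

auto refs = entityVec[0];               // tuple<bool&, float&, Vec3&>
eastl::get<0>(refs) = true;             // write the bool element through its reference
eastl::get<1>(refs) = 10.0f;            // write the float element

float* lifetimes = entityVec.get<1>();  // pointer to the contiguous float array
bool* actives = entityVec.get<bool>();  // by type, since bool occurs only once
```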
And `push_back(...)` has the following overloads, accepting either values or tuples as needed:

```
tuple<bool&, float&, Vec3&> push_back()
push_back(const bool&, const float&, const Vec3&)
push_back(const tuple<bool, float, Vec3>&)
push_back(bool&&, float&&, Vec3&&)
push_back(tuple<bool, float, Vec3>&&)
```

...and so on, and so forth, for others like the constructor, `insert(...)`, `emplace(...)`, `emplace_back(...)`, `assign(...)`, and `resize(...)`.

As well, note that the tuple types that `tuple_vector` accepts or returns have typedefs available, for cases where automatic type deduction is not desired:

```
typedef eastl::tuple<bool, float, Vec3> value_tuple;
typedef eastl::tuple<bool&, float&, Vec3&> reference_tuple;
typedef eastl::tuple<const bool&, const float&, const Vec3&> const_reference_tuple;
typedef eastl::tuple<bool*, float*, Vec3*> ptr_tuple;
typedef eastl::tuple<const bool*, const float*, const Vec3*> const_ptr_tuple;
typedef eastl::tuple<bool&&, float&&, Vec3&&> rvalue_tuple;
```

With this, and the fact that the iterator type satisfies the `RandomAccessIterator` requirements, it is possible to use `tuple_vector` in most of the ways that `vector` was previously used, with few structural differences.

However, even when not using it strictly as a replacement for `vector`, it is still useful as a tool for simplifying the management of a traditional structure of arrays. That is, it is possible to use `tuple_vector` to perform a single large memory allocation instead of a series of smaller memory allocations, by sizing the `tuple_vector` as needed, fetching the necessary pointers with `data()` or `get<...>()`, and carrying on normally.

One example where this can be utilized is with ISPC integration. Given the following ISPC function definition:

```
export void simple(uniform float vin[], uniform float vfactors[], uniform float vout[], uniform int size);
```

...which generates the following function prototype for C/C++ usage:

```
extern void simple(float* vin, float* vfactors, float* vout, int32_t size);
```

...this can be utilized with some raw float arrays:

```
float* vin = new float[NumElements];
float* vfactors = new float[NumElements];
float* vout = new float[NumElements];

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);

delete[] vin;
delete[] vfactors;
delete[] vout;
```

...or, with `tuple_vector`:

```
tuple_vector<float, float, float> simpleData(NumElements);
float* vin = simpleData.get<0>();
float* vfactors = simpleData.get<1>();
float* vout = simpleData.get<2>();

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);
```

`simpleData` here has only a single memory allocation during its construction, instead of the three in the first example, and it also automatically releases the memory when it falls out of scope.

In some circumstances, it is possible to skip the memory allocation entirely. EASTL provides "fixed" counterparts of many data containers, which allow a container to have an inlined buffer of memory. For example, `eastl::vector` has the following counterpart:

```
eastl::fixed_vector<typename T, size_t nodeCount, bool enableOverflow = true>
```

This buffer provides enough space to hold `nodeCount` objects of type `T`, skipping any memory allocation at all until the requested size becomes greater than `nodeCount` - assuming `enableOverflow` is true. There is a similar counterpart to `eastl::tuple_vector` available as well:

```
eastl::fixed_tuple_vector<size_t nodeCount, bool enableOverflow, typename... Ts>
```

This does similar legwork in creating an inlined buffer, and all of the other functionality of `tuple_vector` is supported.
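As a quick sketch of declaring one (the node count, overflow flag, and element types here are assumptions carried over from the earlier `Entity` example, with `Vec3` assumed default-constructible; requires `<EASTL/fixed_tuple_vector.h>`):

```
// Inline storage for 16 entries of each element array; a heap allocation only
// occurs if the container grows past 16 entries, since overflow is enabled.
eastl::fixed_tuple_vector<16, true, bool, float, Vec3> fixedEntityVec;
fixedEntityVec.push_back(); // stays within the inline buffer, no allocation
```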
Note the slight difference in declaration, though: `nodeCount` and `enableOverflow` are defined first, and `enableOverflow` is not a default parameter. This arises from restrictions surrounding variadic templates: the parameter pack must be declared last, and it cannot be mixed with default template parameters.

Lastly, `eastl::vector` and other EASTL data containers support custom memory allocator types through their template parameters. For example, `eastl::vector`'s full declaration is actually:

```
eastl::vector<typename T, typename Allocator = EASTLAllocatorType>
```

However, because such a default template parameter cannot be used with variadic templates, a separate type for `tuple_vector` is required for such a definition:

```
eastl::tuple_vector_alloc<typename Allocator, typename... Ts>
```

Note that `tuple_vector` uses `EASTLAllocatorType` as its allocator.

## Performance comparisons/discussion

A small benchmark suite for `tuple_vector` is included in the EASTLBenchmarks project. It provides the following output on a Core i7-3770K at 3.5 GHz, with DDR3-1600 memory. The `tuple_vector` benchmark cases compare the total execution time of similar algorithms run against `eastl::tuple_vector` and `std::vector`, such as erasing or inserting elements, iterating through the array to find a specific element, summing all of the elements together via `operator[]` access, or running `eastl::sort` on the containers. More information about the EASTLBenchmarks suite can be found in EASTL/doc/EASTL Benchmarks.html.

Benchmark | STD execution time | EASTL execution time | Ratio
--------- | ------------------ | -------------------- | -----
`tuple_vector/erase` | 1.7 ms | 1.7 ms | 1.00
`tuple_vector/erase` | 104.6 ms | 106.3 ms | 0.98
`tuple_vector/reallocate` | 1.3 ms | 1.7 ms | 0.77 -
| | |
`tuple_vector/erase` | 3.4 ms | 3.5 ms | 0.98
`tuple_vector/insert` | 3.4 ms | 3.4 ms | 0.99
`tuple_vector/iteration` | 56.3 us | 81.4 us | 0.69 -
`tuple_vector/operator[]` | 67.4 us | 61.8 us | 1.09
`tuple_vector/push_back` | 1.3 ms | 818.3 us | 1.53 +
`tuple_vector/sort` | 5.8 ms | 7.3 ms | 0.80
| | |
`tuple_vector/erase` | 34.7 ms | 32.9 ms | 1.05
`tuple_vector/insert` | 41.0 ms | 32.6 ms | 1.26
`tuple_vector/iteration` | 247.1 us | 80.5 us | 3.07 +
`tuple_vector/operator[]` | 695.7 us | 81.1 us | 8.58 +
`tuple_vector/push_back` | 10.0 ms | 6.0 ms | 1.67 +
`tuple_vector/sort` | 8.2 ms | 10.1 ms | 0.81
| | |
`vector/erase` | 1.3 ms | 1.2 ms | 1.05
`vector/erase` | 104.4 ms | 109.4 ms | 0.95
`vector/reallocate` | 1.5 ms | 1.5 ms | 0.95
| | |
`vector/erase` | 4.3 ms | 3.6 ms | 1.20
`vector/insert` | 4.8 ms | 4.8 ms | 1.01
`vector/iteration` | 71.5 us | 77.3 us | 0.92
`vector/operator[]` | 90.7 us | 87.2 us | 1.04
`vector/push_back` | 1.6 ms | 1.2 ms | 1.38 +
`vector/sort` | 7.7 ms | 8.2 ms | 0.93

First off, in the single-element cases, `tuple_vector`'s performance versus `std::vector` is comparable, as expected, as `tuple_vector`'s management of one type becomes very similar to that of a regular vector. The major notable exception is the iteration case, which runs `eastl::find_if`. This performance difference is a consequence of the iterator design: because it works with indices rather than a direct pointer, the code generation suffers slightly in this compute-bound scenario. This is worth noting as a demonstration of a case where falling back to pointer-based iteration, by fetching the begin and end pointers of the relevant tuple element, may be preferable to using the iterator constructs.
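For instance, a hedged sketch of that fallback, reusing the assumed `tuple_vector<bool, float, Vec3>` example from earlier (element 1 being the `float` "lifetime" array; the predicate is purely illustrative):

```
// Iterate one element's array directly through begin/end pointers rather than
// through the tuple-of-references iterator (requires <EASTL/algorithm.h>).
float* lifetimeBegin = entityVec.get<1>();
float* lifetimeEnd = lifetimeBegin + entityVec.size();
float* expired = eastl::find_if(lifetimeBegin, lifetimeEnd,
                                [](float lifetime) { return lifetime <= 0.0f; });
```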
The last group of `tuple_vector` results is more interesting. This is a comparison between a single `std::vector` of a structure containing a `uint64` and 56 bytes of padding, and a `tuple_vector` with two elements: one for the `uint64` and one for the 56 bytes of padding. The erase, insert, push_back, and sort cases all perform at a similar relative rate as they did in the single-element tests, demonstrating that operations which have to touch all of the elements do not see a significant change in performance. However, iteration and operator[] are very different, because those tests only access the `uint64` member of both the `vector` and the `tuple_vector` to run some operation. The iteration test now runs 3x faster, whereas before it ran 0.7x as fast, and operator[] runs 8.5x faster, instead of 1.1x. This demonstrates some of the utility of `tuple_vector`: these algorithms end up being limited by the CPU's compute capabilities, as opposed to being limited by how fast memory can be loaded from DRAM.

In a series of other tests, generally speaking, `tuple_vector` tends to perform on par with manual management of multiple arrays in many algorithms and operations, often even generating the same code.

It should be noted that significant degrees of inlining and optimization are required to get the most out of `tuple_vector`. Compared to accessing a series of arrays or vectors, `tuple_vector` performs a multitude of extra trivial function calls internally in order to manage the various elements and interact with `eastl::tuple` through its interface, so debug configurations can run significantly slower in some cases, e.g. sometimes at 0.2x the speed of `vector`.

## The problem of referencing tuple elements

This will be experienced shortly after using `tuple_vector` in almost any capacity: the most significant drawback is that there is no way to **symbolically** reference each tuple element of the `tuple_vector` - much the same as with `tuple`. For example, if translating a struct such as...

```
struct Entity
{
	float x, y, z;
	float lifetime;
};
```

...to `tuple_vector`, it will exist as:

```
tuple_vector<float, float, float, float> entityVec;
```

...and the `lifetime` member can only be accessed in a manner like `entityVec.get<3>()`. With existing tools, the only good alternatives are to encapsulate each float in a separate struct to give it a unique typename...

```
struct entityX { float val; };
struct entityY { float val; };
struct entityZ { float val; };
struct entityLifetime { float val; };

tuple_vector<entityX, entityY, entityZ, entityLifetime> entityVec;
```

...and then access each tuple element by typename, like `entityVec.get<entityLifetime>()`; or to create enumerated values to replace the indices...

```
enum EntityTypeEnum
{
	entityX = 0,
	entityY = 1,
	entityZ = 2,
	entityLifetime = 3
};

tuple_vector<float, float, float, float> entityVec;
```

...and then access each tuple element by the enumerated value: `entityVec.get<entityLifetime>()`.

Either way, there is a fairly significant maintenance and readability issue around this. This is arguably more severe than with `tuple` on its own, because `tuple` is generally not intended for structures with a long lifetime. Ideally, if the language could be mutated to accommodate such a thing, it would be good to have some combination of typenames and symbolic names in the declaration, e.g. something like:

```
tuple_vector<float x, float y, float z, float lifetime> entityVec;
```

...and be able to reference the tuple elements not just by typename or index, but through their corresponding symbol, like `entityVec.get<lifetime>()`.
Or, it may be interesting if the necessary `get` functions could even be automatically generated through a reflection system, e.g. `entityVec.get_lifetime()`. All of this remains a pipe dream for now.