Commit 128cb5d8 authored by EXT Arnaud Clère's avatar EXT Arnaud Clère

Documented requirements and design (WIP)

parent d038b698
@@ -57,4 +57,6 @@ HEADERS += \
QXml_impl.h
DISTFILES += \
README.md \
design.md \
persons.proto
# QBind, a convenient and efficient (de)serialization mechanism on top of Qt
QBind was developed to demonstrate the feasibility and advantages of a novel (de)serialization mechanism on top of
existing Qt data formats (Settings, QDataStream, Json, Cbor, Xml) and generic data types (containers, QVariant, QMetaObject, etc.).
Its requirements are convenient for most (de)serialization use cases, but become necessary to analyse how software
is used in the field and to diagnose complex issues:
- QDebug is conveniently used for debugging, but it may not be sufficiently time- or space-efficient for software running
in the field, and may require too much log-parsing work to detect issues spanning multiple tracepoints or traces.
- LTTng/ETW (or the Qt tracegen tool) allow analysing performance using a few statically-defined tracepoints with
high performance and structured data. But analysing unexpected uses and issues in the field requires far more tracepoints
than can conveniently be defined statically.
## Requirements
### Serialization (Write)
W1. Fast: not as fast as, but comparable to, protobuf/lttng/etl; potentially without dynamic memory allocation or write-thread locking
W2. Easily extensible to user-defined and third-party types
W3. Support almost all features of simple Qt data (QJson*, QDataStream, QSettings) and most features of complex Qt data (QCbor*, QXml*, QMetaObject, QModel*)
W4. Provide most well-formedness guarantees with low impact on performance
W5. Keep the data format separate from the serialization code (a compiled tracepoint in a library must be able to generate Json or Cbor as desired by the library user)
W6. Fully compatible with existing QDebug << statements
### Deserialization (Read)
R1. Encourage standard and explicit formats like Json/Cbor to:
- avoid mismatches between actual data and the "schema" at hand (be it a data schema or just code)
- favor interoperability
R2. Support simple, type-by-type, data-schema evolution like:
- adding, deleting or moving named items in a record
- changing an item from required to optional (or adding it) with a default value (in a sequence)
- changing from optional or required to repeated an item (provided it is not itself a sequence) (TBD)
R3. Allow reporting all errors and mismatches between what was expected and what was read (unless the data format is implicit, as below)
R4. Allow implicit formats like QDataStream when the reader knows exactly what to read
R5. Avoid writing redundant code for simple read/write cases (as required with QDataStream << and >>)
R6. Allow optional metadata to inform translation to complex formats (CBOR tags, XML tags and attributes, QModel* columnNames, etc.)
### Common Data Model
QDebug and QDataStream translate all data to a "flat" sequence of characters/bytes whose structure becomes implicit and can only
be determined for sure by reading the code that produced it. But R1, R2 and R3 require that we describe our data in a little more
detail. Moreover, W5 and W3 require a careful compromise between data format features.
Our proposal models data as either:
* A `sequence` of adjacent `data` `item`s
* A `record` of named `data` `item`s
* A `null` value (meaning no `data` available)
* Natively supported values with an explicit textual representation (and optional optimized representations) among:
- text encodings (utf16/utf8)
- absence of information (null)
- boolean (true/false)
- numbers (integral/floating-point, unsigned/signed, 8/16/32/64 bits)
- ...
* Generically supported T values for which a specialized QBind<T>::bind() is defined
Unlike boost::serialization, pickle and other libraries, this model requires graphs to be encoded using some kind of "reference"
value. But this makes the model translatable to data formats without native support for references, like Json. Our proposal supports metadata
as an optional way to encode such special values for data formats that support them, like [CBOR value sharing tags](http://cbor.schmorp.de/value-sharing)
and XML.
Although this model looks like a high-level literal description of data, it is better understood as a high-level traversal
of the data (without backtracking or cycles): the traversal may be partial (e.g. it only 'traverses' the parts of the
data that the user is interested in), and the same traversal may be used to read/write data files or to visit/build generic data structures.
We argue that this data model allows lossless conversions between all supported formats.
We argue that the addition of optional metadata (R6) can address most peculiarities of the supported formats, although it may not
conveniently address formats that have no standard way of representing data structures, such as XML (e.g. binding the Person type
with enough metadata to make the result conform to the [xCard schema](https://tools.ietf.org/html/rfc6351) would be cumbersome).
## Examples
TBD
## Benchmark
TBD
## Conclusion
QBind is a (de)serialization mechanism that can be used on top of Qt, enabling convenient and efficient data transfers
among many Qt data types without loss (other than metadata and unexpected data types). It can leverage registered
QMetaObject stored properties or QDataStream << and >> operators, or replace a lot of format-specific code with a few
T::bind() or QBind<T>::bind() methods (where T can be defined by the user, Qt, or a third party).
Integrated into Qt, QBind would enable a new kind of tracing facility, as dynamic and convenient as QDebug,
and practically as efficient as the Qt tracegen tool. Many existing QDebug tracepoints could be switched transparently
to the new facility. This would allow going beyond debugging or performance analysis, and tackling complex software issues
in the field involving multiple tracepoints or traces, with adequate tools such as Python or the more academic
"Parametric Trace Properties" language [ParTraP](http://vasco.imag.fr/tools/partrap/).
# Design
The common data model is formally described by the following recursive automaton:
```dot
digraph Out {
Value -> sequence() -> Sequence
Sequence -> item() -> Value<Sequence>
Sequence -> out() -> Out
Value -> record() -> Record
Record -> item(QName) -> Value<Record>
Record -> out() -> Out
Value -> BindNative :IBind<T>::bind() -> Out
Value -> BindGeneric:QBind<T>::bind() -> Out
}
```
- Nested boxes represent the possible states when traversing the data; the automaton is always in a single valid state
- Edges represent legal state transitions that translate to specific data format read/write actions
## Well-formedness guarantees with minimum performance overhead
The automaton is implemented as follows:
- Cursor instances are non-owning, almost-unique pointers to an IBind interface whose implementations translate the data traversal
into format-specific read/write actions
- The Val<_>, Rec<_> and Seq<_> types implement the possible states, where _ denotes the type of the outer state.
These types expose public methods restricted to the legal state transitions; each returns the type representing the destination
state with a valid Cursor instance
- The initial state is Val<Cursor> and the final state is Cursor
- Returning a Rec or Seq type automatically converts it to the top-level Cursor type, invoking as many out() transitions as required
Thanks to this design, the compiler ensures that the only way to obtain a Cursor from a Val<Cursor> is to traverse
the data without backtracking, calling only, and all, the necessary IBind virtual methods. Since Cursor, Val, Rec and Seq have no
data members other than outer types and the IBind pointer, calling their methods can be optimized down to just the following
operations:
1. test the IBind pointer validity
2. call the IBind virtual method corresponding to the legal transitions
3. return the resulting Cursor, Val, Rec or Seq with a valid or invalid Cursor depending on IBind method success or failure
Compared to manually calling non-virtual, format-specific implementations, the overhead of always testing the validity of the IBind*
and calling virtual methods is around 20% in our benchmark, with a maximum of 100% for trivial format-specific implementations
such as copying a single char to a pre-allocated buffer. This cost is usually dwarfed by unoptimized user code and by the ability to
select a data format that performs well for the data at hand: for instance, choosing Cbor for a mixture of booleans, character
literals and numbers makes the overhead negligible.
A convenient side-effect of encoding the common data model in the type system is that smart C++ editors offer data-model-aware code
completion for this fluent interface: after typing `Value(myImpl).`, the editor proposes to either `bind(myData.item)` or to
construct a `sequence()`, `record()` or `null()` value.
## Format extensibility
IBind is a generic abstract base class that translates fluent interface calls into format-specific read or write actions.
The fluent interface guarantees that IBind virtual methods are always called at appropriate times, so IBind implementations do not
have to check well-formedness. IBind also defines the set of BindNative types, plus default textual representations and encodings for
non-textual native types, further simplifying implementations (see the TextWriter example).
Note: the BindNative types could be extended by specializing the BindSupport<TImpl,T> trait, but then (de)serialization code must be specialized
for each TImpl. For instance, QBind<QColor,QDataStream> may be implemented differently from QBind<QColor,IBind>, but the QBind<QColor>
example shows that meta() can also be used to avoid specialized serialization code that would break requirement W5. If meta() is deemed
sufficient, the BindSupport trait and TImpl template parameters can be removed.
## Type extensibility