Structured Data

Jan Burse, created Oct 17. 2011 Jason needed a ship to reach the end of the world in his quest for the Golden Fleece. Today we need vessels to ship data. Relational database management systems do not provide us with such vessels, since the first normal form implies scalar column values. Following strictly this normal form would impose a number of restrictions on the implementation and maintenance of our sales system. The difficulty here is to find a solution that is highly portable across database management systems via the JDBC standard. What does make first normal form database designs inefficient? The example we have encountered was a little list of VAT summaries that is attached to each order entry. First normal form would require us to shred the little list into a new table that is linked to the order entry via a foreign key. Accessing the order with the VAT summaries would thus require an additional join. The more efficient solution would be to use complex data types. Although the JDBC interface supports complex data types via Array and Struct, the problem here is that they are not provided by every database management system vendor. What does make first normal form databases hard to maintain? In the sales system there is the need to model sub types. Take for an example the notion of a customer. This could basically be a person or a company. If it is a company we might like to store the type of the company plus a registrar number. These attributes are not available for a person. Relational database theory suggests different methods to model sub types. We might store the union of the attributes in one table and make use of null values. Or we might use a table per sub type. The result will be in both cases some database schema bloat. Some database management systems provide extra support for sub types. But this would not reduce the database schema bloat. The database schema bloat is especially annoying since our development process is based on the Matula database layer. Database schema bloat would slow down our development process since this layer requires code generation from the database schema. We would need to recompile and redeploy the application for every little subtype enhancement. To model sub types and complex objects we therefore decided to look for an alternative to shredding the data. What first comes to mind as an alternative are XML valued attributes. XML is able to represent complex objects and sub types. The validation of XML is a step that needs not necessarily be carried out by the database management system and can be performed elsewhere, and so we might evade schema bloat. By adopting XML valued attributes we would face similar problems as with complex data types. With the latest version of JDBC there is also a support of an XML data type. But the XML related functionality inside the database management system varies again heavily from vendor to vendor. Therefore an XML valued attribute would provide us no more as a type name for character large object (CLOB). As the JDBC API shows it would be our responsibility to choose the method of parsing and un-parsing, respectively matching. We therefore adopted the idea of a CLOB but tried something else what concerns the coding of structured data. Our point of departure was the un-parsing of the Java classes Vector and Hashtable via the method toString(). But our decision was not to use quotes for strings. Instead we started using the XML entities. We also switched from the punctuation characters “=” and “,” to the more natural punctuation characters “:” and “;”. These CLOBs provides us now with many advantages. Our JSON without quotes format does not only reduce the storage footprint. They automatically support us in full text search over multiple attribute values. We simply search the CLOB with the given string pattern. Most of the time it is not necessary to include the attribute name in the string pattern. Reason is that different attribute domains most often occupy different lexical spaces. For example a person is hardly named after a city, so searching for the word “Zurich” only finds a match in city attributes. On the other hand searching for the word “Müller” only finds a match in last name attributes. But the method would be too slow for important selects on certain attributes. For these attributes we were force to add traditional columns and indexes to the database. The CLOBs and the JSON without quotes format gives us a further advantage. We can now extend the sales system by new attributes during runtime. For this purpose we added a couple of tables to our sales system that serve as a schema declaration for the sales system itself. The schema declarations make use of JSON without quotes itself and allow us to automatically derive forms for the sales system. Currently the schema declarations are quite simple, for example we do not have a type declaration for the scalar types. We have planned a couple of extensions for the future. The data representation and schema declaration are also instrumental in our output system, which we plan to cover in a subsequent blog entry.