Introducing KafkaDB

Albert Cervera i Areny Nov 7, 2011

We're happy to introduce you KafkaDB, our new baby which has just been published on bitbucket. KafkaDB is a tool that will greatly simplify database migrations between OpenERP versions but could also be extended to allow migration between versions of other PostgreSQL-based applications. At the project home page, you can find detailed information on its design and how to use it, but here I wanted to give some details on how we got there.  

The requirements

Since OpenERP SA announced migration tools would not be part of the software we started thinking how those could be implemented in a reusable way. Some companies searched for more short term solutions, simply trying to solve the problem with one or two customers that wanted to move from 4.2 or 5.0 to 6.0, postponing the search for a real solution. We were convinced that the same way that inheritance had allowed OpenERP have hundreds of modules with lots of developers making them freely available we could achieve the same for the migration infrastructure. So basically we had to implement something that solved the following requirements:
  • Modular: it had to provide a mechanism by which we could reuse the data transformations implemented for one database on another one.
  • Fast: given the size of the databases of some of our customers, we knew we had to move things at database level.
  • Shareable: apart from being modular, we wanted to ensure that it would be easy for everyone to share their transforms. Given that the migration information would not be in the modules, it needed to be easy to share, as well as find which transforms there were available.

Research & development

It took us some time till we reached what KafkaDB has ended up being. The process started with a small proof of concept using Python and openetl, an ETL created by OpenERP SA, but abandoned on June 2011. We discarded this option, not only because it had been abandoned, but also because the API was not very intuitive. Also, although we had not made any tests, it seemed the design could make it relatively slow. The first alternative was to look for another python ETL. This time the candidate was Brewery. This one has a nicer API but didn't have some basic features we needed and although we could have contributed to the project, we needed to focus on solving the issues that we had, not implement a full ETL from scratch. We really wanted to keep it in python, specially for making it easier for new people to contribute but we started searching what was available in other languages. Scriptella was the first candidate and bases its configuration in an XML file so that seemed attractive. We already knew that whatever system we choosed we would end up with a configuration file, and at the first glance it seemed that this could be the option because the system already relied on it. However, we were not convinced by the default behaviour of the system for some options, apart from the fact that the verbosity of XML seemed too much for a process that would easily involve 400 tables. So Àngel, one of my partners and our expert in Kettle and large data migrations, started toying with Kettle's API and see what could be achieved with custom transforms and some magic. He soon realized, that not only it was possible to achieve all the requirements that we had, but also we allowed non-developers to migrate their database. For example, we have several customers that could do that themselves if they wanted! Also, Kettle is probably the de facto standard ETL and specially in the OpenERP community (thanks to Terminatooor, a Kettle plugin created by Akretion, specifically designed to work with OpenERP). All in all, although KafkaDB is not finished yet, we're pretty sure it constitutes a good basis for the flexible migration system that we need, not only for migrating between OpenERP versions, but also between different applications, as long as one needs to reuse the process for several databases with similar structures.  

Featured Links