Easy PostgreSQL Migrations with pgloader

As part of a recent project, I’ve been dealing with a difficult database migration from MySQL to PostgreSQL. The migration involves moving many tables, and several Gigabytes of data, which must all be moved without any loss. With a move like this, there is a lot to do besides just transferring the data from one base to another, such as refactoring all the raw queries in the backend of the application which must now use SQL syntax compatible with PostgreSQL, changing model definitions in the backend to use PostgreSQL data types, and finding any inconsistencies in the ORM (Object-relational mapping). With all of this and more to do, it is important to find a simple and fast solution for migrating the actual data from one database to another. This is where pgloader comes in.

pgloader is an open source tool, created by Dimitri Fontaine, author of “Mastering PostgreSQL in Application Development”. The tool has the ability to do migrations to and from PostgreSQL and MySQL, MSSQL, SQLite, and even Citrus databases. It can also transfer data from one PostgreSQL database to another. The tool can load data from files in CSV, DBF, and IBX format. pgloader is quite versatile, and in my opinion is the best option for any type of PostgreSQL data transfer. There are other tools out there, but most are not free, and none of the other options that I saw looked as well-developed as pgloader.

The ease of the learning curve for the tool is probably the reason that it has become popular. By simply installing the tool from the command line, and copying and modiying an example of a configuration file from pgloader’s documentation, we can start migrating data. This file connects to both the source and target database, and migrates the data. The file is also full of customizable choices, such as whether to drop and recreate the tables, create the schema, deal with foreign keys, index, and sequences, and most importantly, cast certain data types in a certain way.

In my opinion, great applications are made by great default values and choices which streamline the user’s most common navigation path. These defaults must also be easily overriden and customized to capture the portion of the user base which has a specific use case scenario. pgloader does that well. It has default casting rules for each type of database, meaning that for most users performing a migration to PostgreSQL, the tool will take care of all data type definition and conversion. In the case that specific casting rules are needed, it is easy to put those rules in the load file.

pgloader is also very fast. The tool could probably migrate two or three times the amount of data that pg_restore (the native PostgreSQL command for filling a database) could upload in the same amount of time. This is because the tool is written in Common Lisp, which can be compiled to machine code and use parallel computing and threads. Related to this is perhaps the only real drawback to pgloader that I found. Because of the Common Lisp compilation needs, it is difficult to get the dependencies for pgloader to all play together nicely on a Windows environment. Most notably, Make and Steel Bank Common Lisp will have difficulty compiling. Common solutions are to use Windows Subsystem for Linux, or Docker.

As a side note, it is a good thing that pgloader was able to painlessly migrate the desired data from MySQL to PostgreSQL, because all the other parts of this type of migration turned out to be anything but painless. All functions for dates had to be translated from one syntax of SQL to the other. Simply concatenating strings is different, as is returning the first non-null value among several choices. References to tables names and aliases must be made differently, as are several other things. Even the ORM used for the project had different behavior in a couple of special cases.

For most professional projects, I would recommend moving away from MySQL. Though it is an easy database to get started with, and has great supporting tools, complex projects suffer from issues with data types, and unforeseen background issues. For example, I recently ran into a problem when I loaded a backup of a database and found that the timestamp values I loaded were different from the source because of a timezone difference in my local system. This would normally be fine if timestamps were compared to other typestamps, but when the timestamp is compared to a datetime, all havoc occurs as data is misplaced and statistics are marred when one of the values is a few hours off. Instead of having to deal with that, I would recommend moving to PostgreSQL if another SQL flavor is required. PostgreSQL has useful data types such as json, which can be used to enhance performance if modeled correctly. When you make the move, I recommend pgloader as your guide.

Updated: