
How my team and I migrated a Postgres database on GCP with minimal downtime

It’s been six months since I joined Wayfair, and I was given an interesting task: migrate a Postgres database that lived in a different GCP project, bound by that project’s policies, to a decoupled GCP project. I’ll get to what decoupled means here in a minute. Wayfair was traditionally built on a legacy stack, and infrastructure such as databases was managed by a central team in a shared Terraform project. This slowed teams down: they had little control over the infrastructure they owned and couldn’t easily leverage newer technologies like Datastream and BigQuery integrations. The company decided to give teams more power and control, allowing each team to manage the infrastructure it owns in its own GCP project, restricted only by global policies set by the org. That is what a decoupled environment means here; it also lets the central platform team control, via those same global policies, which services teams are allowed to use. Now, let’s get to the meat of the ask.

Our team is responsible for storing and serving facility information via strict interfaces/APIs so that other teams/services can use that information to move products from the facilities all the way to the customer’s doorstep. We are an integral unit in the supply chain pipeline; any unavailability of the system could mean orders not reaching customers on time. At a high level, this is what the architecture looks like:

To tackle this problem, we came up with a design/approach doc that laid out the pros and cons of the different approaches we could take for this migration. At a high level, they fall into two categories:

  1. Migration with downtime.

    1.0 In this approach, we would delete the entire Kubernetes deployment, blocking both reads and writes to the application, perform a backup and restore using pg_dump and pg_restore, and then run a data integrity check to ensure correctness. Once that is done, we would deploy a change request that points the new version of the application at the new database.


    1.1 Another way is to restrict all writes to the application, both via APIs and via messages consumed from the message bus (Kafka here), while still allowing reads. Writes would still suffer during the migration. Once the new database is provisioned, we use the Postgres tools pg_dump and pg_restore, plus a data integrity check (a bash script that computes checksums of the tables in the old and new databases and compares them), to ensure data integrity is maintained (a sketch of the dump/restore step follows this list). Once that is done, we create a change request to deploy the application pointing to the new database.

  2. Migration without downtime.

    2.0 In this approach, we would use a CDC-based solution offered by GCP (Datastream) that reads change events from the database’s write-ahead log and applies an initial load plus a continuous stream to the newly provisioned database. Once the stream has caught up, writes to the application are blocked for a few seconds, the Kubernetes deployment is deleted, and a change request is deployed that points to the new database.
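To make the dump/restore step from approaches 1.0 and 1.1 concrete, here is a minimal sketch of how it could be scripted. This is not our actual playbook: the hosts, database name, user, and parallelism below are placeholders, and the sketch assumes pg_dump and pg_restore are installed and that passwords are supplied via ~/.pgpass or the PGPASSWORD environment variable.

```python
import subprocess

# Placeholder connection details -- not the real instances.
OLD_DB = {"host": "old-db.internal", "port": "5432", "user": "migrator", "dbname": "facilities"}
NEW_DB = {"host": "new-db.internal", "port": "5432", "user": "migrator", "dbname": "facilities"}
DUMP_FILE = "facilities.dump"


def dump(db: dict, dump_file: str) -> None:
    """Take a custom-format dump of the source database."""
    subprocess.run(
        [
            "pg_dump",
            "-h", db["host"],
            "-p", db["port"],
            "-U", db["user"],
            "-d", db["dbname"],
            "-Fc",          # custom format, needed for parallel pg_restore
            "-f", dump_file,
        ],
        check=True,
    )


def restore(db: dict, dump_file: str) -> None:
    """Restore the dump into the freshly provisioned database."""
    subprocess.run(
        [
            "pg_restore",
            "-h", db["host"],
            "-p", db["port"],
            "-U", db["user"],
            "-d", db["dbname"],
            "--no-owner",   # roles differ between the two GCP projects
            "-j", "4",      # parallel restore jobs
            dump_file,
        ],
        check=True,
    )


if __name__ == "__main__":
    dump(OLD_DB, DUMP_FILE)
    restore(NEW_DB, DUMP_FILE)
```

In approach 1.1, this step runs while writes are blocked but reads continue to be served from the old database.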

The next step was a discussion with the tech lead and senior staff engineers to weigh the approaches and decide which course to take given the time/effort trade-offs. We concluded that we would take 1.1, since we could move faster and not invest time in setting up CDC.

With approval on the course of action, we set up a meeting with our engineering manager to derive a task breakdown and estimate the time required to bring this to completion. We took a stage-based approach: measure the time taken to perform each step in the dev/stage environments, then extrapolate the time it would take in production.

We completed the dev migration, moving data from the old database to the new one, in less than 10 minutes using the pg_dump and pg_restore step. We had also prepared a playbook to make sure we followed the steps in order. Here is a snapshot of the playbook below.

When we actually migrated production, it took noticeably longer, as there were some unexpected turns along the way: the dataset was close to 14 GB, and pg_dump and pg_restore took almost 45 minutes. Running the data checksum across the old and new databases to ensure both datasets matched took about 30 minutes, again owing to the size of the dataset. There were other operational issues during the migration, but we acted as a team and mitigated them as they came.
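Our integrity check was a bash script that compared per-table checksums; here is a rough Python equivalent of the same idea, purely as an illustration. The table list, the id ordering column, and the connection strings are assumptions for the sketch, not our real schema, and the actual script covered every table in the database.

```python
import psycopg2

# Placeholder DSNs -- replace with the real old/new instances.
OLD_DSN = "host=old-db.internal dbname=facilities user=migrator"
NEW_DSN = "host=new-db.internal dbname=facilities user=migrator"

# Tables to compare; assumes each has a primary-key column named "id".
TABLES = ["facility", "facility_address"]


def table_checksum(dsn: str, table: str) -> str:
    """Compute an MD5 digest over all rows of a table, in primary-key order."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Hash each row server-side, then aggregate into one table-level digest.
        # Table names come from the trusted list above, so f-string interpolation
        # is acceptable here (identifiers cannot be bound as query parameters).
        cur.execute(
            f"SELECT coalesce(md5(string_agg(md5(t::text), '' ORDER BY t.id)), '') "
            f"FROM {table} t"
        )
        return cur.fetchone()[0]


def compare(tables):
    mismatches = []
    for table in tables:
        old_sum = table_checksum(OLD_DSN, table)
        new_sum = table_checksum(NEW_DSN, table)
        print(f"{table}: {'OK' if old_sum == new_sum else 'MISMATCH'}")
        if old_sum != new_sum:
            mismatches.append(table)
    return mismatches


if __name__ == "__main__":
    raise SystemExit(1 if compare(TABLES) else 0)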

With this migration done, we have full control and the ability to manage our own infrastructure in our own Terraform project, allowing developers to make infrastructure changes within team boundaries.

Once the migration was complete, we notified the wider teams that the database had been migrated successfully. We also had some fun along the way: we ordered pizzas and played some nice music to ease the tension of the many operations going on 🙂

With this new model, we can manage and control our infrastructure (the database) and leverage new technologies like Google Datastream and BigQuery integrations, which will help the team move and release features faster.

This migration would not have been possible without these folks:

https://www.linkedin.com/in/jainvishal1/
https://www.linkedin.com/in/mahantys
https://www.linkedin.com/in/saurabh45/
