Our customer is a company in the “healthtech” field developing a SaMD (Software as a Medical Device) to perform studies on different rare diseases. As such, they need a plethora of environments to work with, not only for developers to test their features but also for those features to be tested and validated through rigorous processes.
Moreover, the customer also needs isolated environments to host a data analysis platform that processes the studies’ results. All of these constraints result in 9 environments, each with its own AWS account, for this project alone, and each hosting hundreds of AWS resources, such as Amazon EKS clusters or AWS Lambda functions.
On top of that, except for the development environments, all of them need to comply with the HDS (Hébergement de Données de Santé, the French Health Data Hosting standard). This standard imposes strict requirements and controls on data: how it is handled at rest and in transit, and who has access to it.
At Padok, we then engaged with the customer and used AWS best practices to perform the migration.
There is one word to summarize our goal with this migration: centralization. There are multiple ways to centralize your Cloud infrastructure, but the one we chose was to switch the project to a hub-and-spoke architecture.
Basically, the hub-and-spoke architecture is a take on how your AWS Organization’s networking is managed. This architecture model says two simple things: a single central account handles ingress and egress traffic to and from the Internet, and any service (AWS-managed or your own) that can be shared across multiple environments should be.
To implement this model, the plan was actually pretty straightforward. The star of the show here is the Transit Gateway service from AWS. Simply speaking, Transit Gateway, as the name implies, is a network gateway that lets you interconnect multiple VPCs across multiple AWS accounts. It does so through attachments to your subnets and route tables that orchestrate your organization’s traffic with fine granularity.
Thanks to this, we can safely enable communication between environments through a secure private link that is controlled, logged, and, when needed, inspected, thanks to VPC flow logs for example.
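To make this more concrete, here is a minimal Terraform sketch of the hub side, assuming the gateway is shared with the spoke accounts through AWS RAM; resource names are illustrative, not the customer’s actual configuration:

```hcl
# Central Transit Gateway living in the hub (networking) account
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Central gateway interconnecting all environments"
  default_route_table_association = "disable" # route tables are managed explicitly
  default_route_table_propagation = "disable"
}

# Share the gateway with the spoke accounts of the AWS Organization via AWS RAM
resource "aws_ram_resource_share" "tgw" {
  name                      = "transit-gateway"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.hub.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}
```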
As I said earlier, this migration is a first step into the hub-and-spoke model, so while the theory is about managing ingress and egress traffic, our first implementation of the model was only targeted at inter-VPC and egress traffic.
While security and control at the network level are already awesome features in themselves, our attention was of course drawn toward the centralization possibilities. As the environments could communicate with each other, we decided that multiple tools that were historically deployed in each environment could now be merged into single units.
After some discussion, we settled on monitoring, logging, and CI/CD as perfect candidates for centralization. As a picture is worth a thousand words, let me summarize the migration in a simple diagram:
It might look easy at first glance, but the migration we performed was not so much of a walk in the park. The key element driving the difficulty up was that the legacy VPCs, which the customer was using to deploy services in the different environments, had overlapping CIDRs.
Quite problematic when the goal is to make VPCs able to communicate freely without NAT as a “global” private network, isn’t it?
As such, we were facing a new challenge: migrating all VPC-attached resources in each account from those legacy VPCs to new VPCs that were attached to the brand-new Transit Gateway and had no overlapping CIDRs.
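On the spoke side, each environment gets a new VPC with a CIDR allocated from a single, non-overlapping addressing plan and attached to the shared gateway. A hedged sketch, with placeholder CIDRs and a hypothetical `transit_gateway_id` variable:

```hcl
variable "transit_gateway_id" {
  type        = string
  description = "ID of the Transit Gateway shared from the hub account"
}

# New spoke VPC with a CIDR that is unique across the whole organization
resource "aws_vpc" "spoke" {
  cidr_block = "10.42.0.0/16" # allocated from a global, non-overlapping plan
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.spoke.id
  cidr_block = "10.42.1.0/24"
}

# Attach the new VPC to the shared Transit Gateway
resource "aws_ec2_transit_gateway_vpc_attachment" "spoke" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = aws_vpc.spoke.id
  subnet_ids         = [aws_subnet.private.id]
}

# Route traffic destined to the other environments through the Transit Gateway
resource "aws_route" "to_tgw" {
  route_table_id         = aws_vpc.spoke.main_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}
```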
Unfortunately, it is not that easy. Easier than with an on-premises system, for sure, but not trivial either. Three types of services deployed in the customer’s infrastructure needed maintenance to migrate from one VPC to another: some AWS Lambda functions, Amazon RDS (Relational Database Service) instances, and Amazon EKS (Elastic Kubernetes Service) clusters.
Changing the VPC configuration on VPC-attached Lambda functions is quite easy, especially when you manage your infrastructure with an IaC (Infrastructure-as-Code) tool like Terraform. Indeed, changing the function’s network configuration in the code is enough: Terraform handles the migration automatically!
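As a sketch, assuming a hypothetical function and a module exposing the new VPC’s private subnets, the change boils down to updating the `vpc_config` block:

```hcl
# VPC-attached function: pointing it at the new VPC's subnets and security group
# is enough, Terraform replaces the underlying network interfaces on apply
resource "aws_lambda_function" "ingest" {
  function_name = "study-ingest" # hypothetical name
  role          = aws_iam_role.lambda.arn
  handler       = "main.handler"
  runtime       = "python3.12"
  filename      = "ingest.zip"

  vpc_config {
    subnet_ids         = module.new_vpc.private_subnet_ids # was: legacy VPC subnets
    security_group_ids = [aws_security_group.lambda.id]
  }
}
```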
It is a bit trickier for Amazon RDS instances though, as applying the same procedure would create database subnet groups whose names conflict with the existing ones in the same account.
To remediate this problem, we had to manually declare a new database subnet group in the codebase and feed it as input to the existing Amazon RDS instance, then use a move block to clean up the migration process. Speaking of move blocks, we recently released tfautomv, which helps you handle them very easily; do not hesitate to check it out!
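Here is a hedged sketch of the end state of that workflow; the resource names, group name, and instance settings are hypothetical, and the `moved` block illustrates the final cleanup once the new group has replaced the legacy one:

```hcl
# The subnet group now lives in the new VPC; its name must differ from the
# legacy group's name to avoid the conflict within the account
resource "aws_db_subnet_group" "main" {
  name       = "studies-db-new-vpc"
  subnet_ids = module.new_vpc.private_subnet_ids
}

# Existing instance, fed with the new subnet group
resource "aws_db_instance" "main" {
  identifier           = "studies-db" # hypothetical
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 50
  username             = "app"
  password             = var.db_password
  db_subnet_group_name = aws_db_subnet_group.main.name
  skip_final_snapshot  = true
}

# During the migration, the new group was declared alongside the legacy one under
# a temporary address; this move block reconciles the Terraform state afterwards
moved {
  from = aws_db_subnet_group.main_new_vpc
  to   = aws_db_subnet_group.main
}
```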
Migrating the customer’s Amazon EKS clusters was the most challenging part of the migration. Indeed, it is not possible to just tell AWS to change the VPC where the cluster is deployed and wait for 30 minutes. The only solution when migrating a cluster to another VPC is to… create another cluster in the new VPC!
Doing this implies recreating everything from scratch: cluster configuration, deployments, load balancers, etc. Thankfully, the customer’s clusters were managed via a CD (Continuous Delivery) framework, and the underlying AWS infrastructure was handled by Terraform, so it was relatively quick for us to re-deploy a new cluster with an identical configuration.
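To illustrate, a minimal sketch of the replacement cluster; the name and Kubernetes version are hypothetical, and the real configuration carried over everything else (logging, encryption, node groups) from the legacy cluster:

```hcl
# Replacement cluster: same configuration as the legacy one, only the subnets
# (and therefore the VPC) change
resource "aws_eks_cluster" "new" {
  name     = "studies-cluster-v2" # hypothetical
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.27"

  vpc_config {
    subnet_ids = module.new_vpc.private_subnet_ids
  }
}
```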
We also had to think about how to migrate the data stored in persistent volumes. In reality, persistent volumes in Amazon EKS clusters are simply volumes from the Amazon EBS (Elastic Block Store) service. Personally, I think remapping every volume so that it matches the volume claims made in the new cluster is too much of a hassle.
As the customer was fine with losing their logs in non-production environments (which were retained for only a month), we decided not to bother. For the production environment, however, the question remained. We decided to use our monitoring tools’ own solutions to transfer data, in our case Prometheus and Elasticsearch, instead of relying on a patchwork of remapped volumes.
Speaking of the production environment, we still had a problem when trying to perform the migration smoothly, with no downtime. Indeed, you lose connectivity between components as soon as you migrate one of them to the new VPC, so how could the RDS instance, for example, communicate with the EKS cluster that was now in another VPC? Well, this is where VPC Peering comes to the rescue.
This AWS feature allows you to interconnect two VPCs privately, and it even lets you reference security groups across the peered VPCs in your rules! We could therefore use security group rules to authorize the EKS cluster to reach the RDS instance across networks in the most secure way possible. Perfect, isn’t it?
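As an illustration, a hedged sketch of that temporary setup with placeholder IDs: the peering connects the legacy and new VPCs, and the ingress rule references the EKS nodes’ security group from the peered VPC.

```hcl
# Temporary peering between the legacy VPC and the new, Transit Gateway-attached VPC
resource "aws_vpc_peering_connection" "legacy_to_new" {
  vpc_id      = aws_vpc.legacy.id
  peer_vpc_id = aws_vpc.spoke.id
  auto_accept = true # both VPCs live in the same account and region here
}

# Allow the EKS nodes' security group, which lives in the peered VPC, to reach
# PostgreSQL on the RDS instance's security group
resource "aws_security_group_rule" "eks_to_rds" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds.id
  source_security_group_id = "sg-0123456789abcdef0" # EKS nodes SG in the peered VPC
}
```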
Of course, the VPC Peering was only kept for a short period of time while our migration was in progress.
The outcomes of this migration to the hub-and-spoke model exceeded our initial expectations. At first, it was a decision by the customer’s Operations and Security team that we helped put in place, and it turned out to be a real launchpad to improve security and centralize tools in a way we could not have before.
As a first stepping stone, we managed to centralize our ELK (Elasticsearch & Kibana) stack in a single account, with data coming from all environments without any traffic going over the Internet.
Our next step, at the time of writing, is to centralize our metrics monitoring with a federated Prometheus stack! Using Transit Gateway capabilities, it would also be possible to handle public entry points in a single place, to centrally manage ingress traffic as well. It is only the beginning!
In this article, I helped you understand why and how you would make the switch to a hub-and-spoke model, through our experience with our customer’s case. It is not a one-click action and it comes with its challenges, as the Cloud is not just some magic from another world, but this model is a great enabler for centralization, making management, configuration, and maintenance simpler. I am sure you can do it too!
This article has been written in collaboration with Pragnesh Shah from AWS.