At Padok, we use Terraform on a daily basis. We manage growing codebases that provision long-lived infrastructure. While the tool is genuinely handy and an unavoidable reference in its domain, it still comes with some issues that it does not manage out of the box.
If you read our blog, you know that we develop open-source tools to help with such issues, like tfautomv for refactoring.
We currently encounter two main issues when managing Terraform at scale.
A drift is a difference between the actual state of your (cloud) resources and the state stored in your configuration. It can occur for numerous reasons: a modification made manually and never reflected in your codebase, a resource failure, or simply someone forgetting to apply their latest changes.
Either way, drift is something you do not want. And the only way to detect it is to run terraform plan in each and every one of your layers. Unfortunately, the more your codebase grows, the more you expose yourself to drift, especially when using Terragrunt with the context pattern.
With one of our clients, we reached up to 338 layers. Imagine needing to run a plan 338 times in a row 😱 If we could simply apply the GitOps pattern to Terraform, it would be so much easier.
It is painful and time-consuming to build a clean and efficient CI/CD pipeline for Terraform. Moreover, whatever good practices you learn, you have to learn them all over again when switching CI/CD providers. We would prefer a more agnostic implementation.
Also, if you follow our blog regularly, you know that, at Padok, we are big fans of ArgoCD. With it, you never need to write CI/CD for Helm again. Why can't we have this for Terraform?
Of course, we are not the only ones who have stumbled across these issues. That's how the concept of TACoS (Terraform Automation and Collaboration Software) emerged. This category of tool offers a variety of features.
When managing multiple Terraform layers, you might need to run different Terraform runtimes (Terraform versions, Terragrunt…). Also, some of your layers may need environment variables and secrets to run properly, and you want to keep those compartmentalized between layers.
A TACoS should provide a way to manage this variety of runtimes and environments.
Not everyone in your organization should have the right to create, modify, or apply a layer. Also, a Terraform layer needs a set of permissions on your infrastructure, which you should keep as small as possible, following the principle of least privilege.
A TACoS should provide fine-grained access control.
To handle drift detection, we need an easy way to check whether a layer needs to be planned or applied, whether through a simple command or a nice web UI.
A TACoS should provide an interface to easily access your Terraform layers' status.
Your Terraform CI/CD workflow should be close to a software release workflow. When you push modifications to your main branch, they should be applied. When you open a PR/MR, your Terraform code should be planned.
Moreover, the result of this plan should be easily accessible (in a comment on your PR/MR, for instance) so the reviewer can comment on the output.
A TACoS should provide an out-of-the-box workflow integration.
The very concept of TACoS is not new. Several existing tools already try to address the issues and requirements I just presented to you.
N.B. This list is not exhaustive.
Imagine an open-source Kubernetes Operator that is a fully functional TACoS. Well, that is what we are aiming to build with burrito.
The operator is built around two CRDs: TerraformRepository and TerraformLayer.
The former defines a repository containing Terraform code and its authentication method. You can also define parameters there that the latter inherits by default.
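As a sketch, a TerraformRepository declaration could look something like this; the API group, field names, and values are illustrative assumptions and may differ from burrito's actual schema:

```yaml
# Hypothetical TerraformRepository declaration; field names are assumptions.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformRepository
metadata:
  name: my-repository
  namespace: burrito
spec:
  repository:
    url: https://github.com/my-org/my-terraform-code
    secretName: my-repository-credentials  # authentication for a private repository
  terraform:
    version: "1.3.1"  # runtime version inherited by this repository's layers
```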
The TerraformLayer defines a path and branch inside a TerraformRepository, which together constitute a Terraform layer. Each layer can only be in one of three states at any given time.
Those states are inferred from different conditions that are checked upon each controller's reconciliation.
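For illustration, a TerraformLayer declaration might look like the following sketch; again, the field names are assumptions rather than the definitive CRD schema:

```yaml
# Hypothetical TerraformLayer declaration; field names are assumptions.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformLayer
metadata:
  name: network-production
  namespace: burrito
spec:
  path: layers/network/production  # path of the layer inside the repository
  branch: main                     # branch to watch for changes
  repository:
    name: my-repository            # reference to a TerraformRepository
    namespace: burrito
```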
We also ensure that a layer cannot start several runners at the same time by generating a Kubernetes Lease each time a workflow starts. Before creating a new runner, we check that no other runner holds a Lease on the layer.
We could have used Terraform's own locking mechanism directly, but it would constantly create failing pods, which is not what we want.
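The Lease itself is a standard Kubernetes coordination.k8s.io/v1 object; what the controller creates could look like this sketch (the name and holder identity are illustrative):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: lock-network-production  # hypothetical name derived from the layer
  namespace: burrito
spec:
  holderIdentity: runner-network-production-x7kq2  # runner currently working on the layer
  leaseDurationSeconds: 300  # the lock is considered expired after this delay
```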
You can, of course, configure many parameters within each CRD declaration or within the controller itself: the remediation strategy (should the controller only run plans, or applies as well?), the period between two plans for drift detection, and so on.
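As an example, these parameters could be expressed as follows; the field names and values are assumptions for illustration, not burrito's definitive configuration schema:

```yaml
# Hypothetical parameters on a TerraformRepository, inherited by its layers.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformRepository
metadata:
  name: my-repository
spec:
  remediationStrategy: dry  # only run plans; an "autoApply" strategy would also run applies
---
# Hypothetical controller-level setting for drift detection.
controller:
  timers:
    driftDetection: 20m  # period between two plans of the same layer
```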
And as we say: a good sketch is better than a long speech. Or, in this case, a GIF…
If you read this article carefully, you will have noticed that burrito is not (yet) a complete TACoS. But we plan to continue its development. Feel free to try it out, open issues and feature requests, and contribute to the project yourself!
Reach out to burrito's maintainers on Twitter: @spoukke and @LonguetAlan. We would love to hear your feedback about burrito!