At Padok, we use Terraform on a daily basis. We manage growing codebases that provision long-lived infrastructure. While the tool is genuinely handy and an unavoidable reference in its domain, it still comes with some issues that it does not manage out of the box.
If you read our blog, you know that we develop open-source tools to help with such issues, like tfautomv for refactoring.
We currently encounter two main issues when managing Terraform at scale.
A drift is a difference between the actual state of your (cloud) resources and the state stored in your configuration. It can occur for numerous reasons: a modification made manually and never reflected in your codebase, a resource failure, or simply someone forgetting to apply their latest changes.
Either way, drift is something you do not want. And the only way to detect it is to run terraform plan in each and every one of your layers. Unfortunately, the more your codebase grows, the more you expose yourself to drift, especially when using Terragrunt with the context pattern.
With one of our clients, we reached up to 338 layers. Imagine needing to run a plan 338 times in a row 😱 If we could simply apply the GitOps pattern to Terraform, it would be so much easier.
It is painful and time-consuming to build a clean and efficient CI/CD pipeline for Terraform. Moreover, whatever good practices you learn, you have to learn them all over again when switching CI/CD providers. We would prefer a more agnostic implementation.
Also, if you follow our blog regularly, you know that, at Padok, we are big fans of ArgoCD. With it, you never need to write CI/CD for Helm again. Why can't we have this for Terraform?
Of course, we are not the only ones who have stumbled across these issues. That's how the concept of TACoS (Terraform Automation and Collaboration Software) emerged. This category of tool offers a variety of features.
When managing multiple Terraform layers, you might need to run different Terraform runtimes (Terraform versions, Terragrunt…). Also, some of your layers may need environment variables and secrets to run properly, and you want to keep those compartmentalized between layers.
A TACoS should provide a way to manage this variety of runtimes and environments.
Not everyone in your organization should have the right to create, modify, or apply a layer. Also, a Terraform layer needs a set of permissions on your infrastructure, which you should keep as small as possible, following the principle of least privilege.
A TACoS should provide fine-grained access control.
To handle drift detection, we need an easy way to check whether a layer needs to be planned or applied, whether through a simple command or a nice web UI.
A TACoS should provide an interface to easily access your Terraform layers' status.
Your Terraform CI/CD workflow should be close to a software release workflow. When you push modifications to your main branch, they should be applied. When you open a PR/MR, your Terraform code should be planned.
Moreover, the result of this plan should be easily accessible (in a comment on your PR/MR, for instance) so the reviewer can comment on the output.
A TACoS should provide an out-of-the-box workflow integration.
The very concept of TACoS is not new. Several existing tools already try to address the issues and requirements I just presented to you.
N.B. This list is not exhaustive.
Imagine an open-source Kubernetes Operator that is a fully functional TACoS. Well, that is what we are aiming to build with burrito.
The operator is built around two CRDs: TerraformRepository and TerraformLayer.
The former defines a repository containing Terraform code and its authentication method. You can also define parameters there that the latter inherits by default.
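As a sketch, a TerraformRepository declaration could look something like this; the API group, field names, and values are illustrative assumptions and may differ from burrito's actual schema:

```yaml
# Hypothetical TerraformRepository declaration; field names are assumptions.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformRepository
metadata:
  name: my-repository
  namespace: burrito
spec:
  repository:
    url: https://github.com/my-org/my-terraform-code
    secretName: my-repository-credentials  # authentication for a private repository
  terraform:
    version: "1.3.1"  # runtime version inherited by this repository's layers
```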
The TerraformLayer defines a path and branch inside a TerraformRepository, which together constitute a Terraform layer. Each layer can only be in one of three states at any given time.
Those states are inferred from different conditions that are checked upon each controller's reconciliation.
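For illustration, a TerraformLayer declaration might look like the following sketch; again, the field names are assumptions rather than the definitive CRD schema:

```yaml
# Hypothetical TerraformLayer declaration; field names are assumptions.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformLayer
metadata:
  name: network-production
  namespace: burrito
spec:
  path: layers/network/production  # path of the layer inside the repository
  branch: main                     # branch to watch for changes
  repository:
    name: my-repository            # reference to a TerraformRepository
    namespace: burrito
```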
We also ensure that a layer cannot start several runners at the same time by generating a Kubernetes Lease each time a workflow starts. Before creating a new runner, we check that no other runner holds a Lease on the layer.
We could have used Terraform's own locking mechanism directly, but it would constantly create failing pods, which is not what we want.
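The Lease itself is a standard Kubernetes coordination.k8s.io/v1 object; what the controller creates could look like this sketch (the name and holder identity are illustrative):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: lock-network-production  # hypothetical name derived from the layer
  namespace: burrito
spec:
  holderIdentity: runner-network-production-x7kq2  # runner currently working on the layer
  leaseDurationSeconds: 300  # the lock is considered expired after this delay
```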
You can, of course, configure many parameters within each CRD declaration or within the controller itself: the remediation strategy (should the controller only run plans, or applies as well?), the period between two plans for drift detection, and so on.
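As an example, these parameters could be expressed as follows; the field names and values are assumptions for illustration, not burrito's definitive configuration schema:

```yaml
# Hypothetical parameters on a TerraformRepository, inherited by its layers.
apiVersion: config.terraform.padok.cloud/v1alpha1
kind: TerraformRepository
metadata:
  name: my-repository
spec:
  remediationStrategy: dry  # only run plans; an "autoApply" strategy would also run applies
---
# Hypothetical controller-level setting for drift detection.
controller:
  timers:
    driftDetection: 20m  # period between two plans of the same layer
```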
And as we say: a good sketch is better than a long speech. Or, in this case, a GIF…
If you read this article carefully, you will have noticed that burrito is not (yet) a complete TACoS. But we plan to continue its development. Feel free to try it out, open issues and feature requests, and contribute to the project yourself!
Reach out to burrito's maintainers on Twitter: @spoukke and @LonguetAlan. We would love to hear your feedback about burrito!