DevOps Blog

Fixing “Essential container in task exited” error in ECS | Padok

Written by Callum Hemsley | 16-Mar-2023 14:14:10

This article assumes some basic knowledge of Docker and containerisation. If you need a refresher, take a look at what a container is and how to containerise an app before continuing.

An exceedingly generic error

Your application is dockerized and running locally. You have successfully pushed your tagged image to a container registry (e.g. ECR). You have then meticulously configured ECS to pull the image, start a container, and run your application. All the hard work is done, or so you thought.

Seconds after starting to run, the container exits with this generic and non-descriptive error. In this post, we will examine the meaning of this error and explore methods for debugging it to determine a fix. Additionally, we shall dispel the illusion of the often-cited mantra of “build the image once, run it everywhere.”

What does the error mean?

To gain a better understanding of the error message, let's take a step back and review our use of ECS. ECS is a container management service that allows us to run containers on a cluster, providing an alternative to other platforms like Kubernetes. To help differentiate between the two, we've provided a brief comparison of ECS versus EKS in a previous post.

ECS has two important concepts to understand: Tasks and Services.

  • Tasks in ECS are like jobs that tell the computer what to do. For example, you could have a task that says "run this specific program in a container.”
  • Services in ECS make sure there's always a container running your program, so you don't have to keep starting it yourself.
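To make the distinction concrete, here is a minimal Terraform sketch of a service that keeps one copy of a task running. The resource label, the service name and the three input variables are hypothetical; they stand in for a cluster, task definition and subnets defined elsewhere.

```hcl
# Hypothetical inputs: replace with references to your own resources.
variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_ecs_service" "app" {
  name            = "callums-node-service" # hypothetical name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  launch_type     = "FARGATE"

  # The service starts replacement tasks to keep this many running.
  desired_count = 1

  # Fargate tasks use the awsvpc network mode, so subnets must be given.
  network_configuration {
    subnets = var.subnet_ids
  }
}
```

If the task's essential container keeps exiting, the service will keep launching replacements, which is why the same error can show up repeatedly in the stopped tasks list.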

And finally, an analogy given by none other than ChatGPT when asked to explain these concepts as if I were a child: “A task in ECS could be a chef cooking a meal in a restaurant kitchen, while a service would be like ensuring that there is always a steady supply of that meal available to customers, even during peak hours.”

So, what does the error message state? Essentially, the container running from a task exited, and since it was marked as "essential", the task can't continue.
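As an illustration of the "essential" flag, the sketch below defines two containers in one task. The container names and image placeholders are hypothetical. ECS treats a container as essential when the flag is omitted, so a single-container task is always affected by this error when its container exits.

```hcl
locals {
  # Hypothetical container list for illustration only.
  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "<image-tag>"
      essential = true # if this container exits, ECS stops the whole task
    },
    {
      name      = "metrics-sidecar"
      image     = "<sidecar-image-tag>"
      essential = false # this container may exit without stopping the task
    }
  ])
}
```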

Okay, now we have the context of what is failing, but why is the container failing? I switched to the container logs in hopes of finding the reason, only to find no logs at all!

Debugging the error

As noted on Stack Overflow, ECS does not capture container logs by default; the task definition needs an explicit log configuration. In my case, as I was provisioning ECS with Terraform, I had to add the following configuration to the container_definitions of my aws_ecs_task_definition resource.

logConfiguration = {
  logDriver = "awslogs",
  options = {
    awslogs-create-group  = "true",
    awslogs-group         = "callums-node-logs",
    awslogs-region        = "eu-west-3",
    awslogs-stream-prefix = "awslogs-example"
  }
},

By specifying the 'awslogs' driver in the log configuration, we send the container's output to CloudWatch Logs, where we can read it. I had to set awslogs-create-group so that the log group is created if it doesn’t already exist, and explicitly give the log group a name with awslogs-group. There are several other options for configuring AWS logs.
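For context, this is roughly where that log configuration sits inside a complete task definition. It is a sketch under assumptions rather than my exact setup: the resource label, family name, CPU and memory sizes, and the two input variables are hypothetical.

```hcl
variable "image_uri" {
  type        = string
  description = "Image to run, e.g. an ECR image URI (hypothetical input)"
}

variable "execution_role_arn" {
  type        = string
  description = "ARN of the ECS task execution role (hypothetical input)"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "callums-node-app" # hypothetical family name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # required for Fargate tasks
  cpu                      = "256"    # assumed size
  memory                   = "512"    # assumed size
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.image_uri
      essential = true

      # Send stdout/stderr of this container to CloudWatch Logs.
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-create-group"  = "true"
          "awslogs-group"         = "callums-node-logs"
          "awslogs-region"        = "eu-west-3"
          "awslogs-stream-prefix" = "awslogs-example"
        }
      }
    }
  ])
}
```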

However, as always in AWS, just enabling the feature isn’t enough. By default, your IAM role won’t have the required permissions. On EC2-backed clusters, ensure that the IAM role for your Amazon ECS container instance has the logs:CreateLogStream and logs:PutLogEvents permissions. Since I was using Fargate, I had to grant these permissions to my ECS task execution IAM role. Note that the awslogs-create-group option additionally requires the logs:CreateLogGroup permission.
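A minimal Terraform sketch of those permissions, attached as an inline policy to the task execution role, could look like this. The resource label, policy name and the role-name variable are hypothetical. Because awslogs-create-group is set to "true" in the log configuration, the sketch also includes logs:CreateLogGroup, which AWS documents as required for that option.

```hcl
variable "execution_role_name" {
  type        = string
  description = "Name of the ECS task execution role (hypothetical input)"
}

resource "aws_iam_role_policy" "ecs_logs" {
  name = "ecs-awslogs-access" # hypothetical policy name
  role = var.execution_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup", # only needed with awslogs-create-group
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        # Broad for brevity; scope this to your log group's ARN in real use.
        Resource = "*"
      }
    ]
  })
}
```

If you create the log group yourself in Terraform and drop awslogs-create-group, the AWS-managed AmazonECSTaskExecutionRolePolicy already covers the other two actions.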

And finally, the logs provided me with an error to investigate:

What was the error?

exec /usr/local/bin/docker-entrypoint.sh: exec format error

A brief search revealed the cause: I had built the image on an M1 Mac, and an image built that way cannot run on the x86 hosts ECS was using. This realisation highlighted a substantial gap in my understanding of the intricacies of Docker.

One of the major benefits of Docker for me was reproducibility, in other words solving the commonly cited problem of "it works on my machine, but not on my colleague's". While this holds when all team members use the same architecture, I had misinterpreted it to mean "build once, run anywhere.”

Architecture refers to the instruction set used by a computer CPU. The most prevalent ones are x86 and ARM (such as M1). Differences in computer chips lead to differences in program execution. Docker builds images for the current platform, as noted in its documentation.

I built my image on an M1 Mac, then pushed it to ECR (AWS’s version of Docker Hub) for running via ECS, which was using an x86 architecture.

To summarise the problem, Docker images built for the M1 chip won't run on x86 machines because the two chips use different instruction sets, just as a puzzle piece made for a square puzzle won't fit in a round puzzle.

The Docker image has been made to fit one type of puzzle (M1 chip), but it won't fit in the other type of puzzle (x86 machine). The Padok blog has a more in-depth look at Docker images with ARM and other architectures.

Fixing the issue

There are several ways to fix this issue on ECS. However, some have interesting trade-offs, so I have listed the ones I considered below.

Use an image that is for an x86 chip set explicitly


By default, when you build on an ARM computer, Docker pulls the ARM variant of the base image and produces an ARM image. However, we can change this by explicitly requesting the x86 variant in the Dockerfile. For me, this meant replacing the FROM node:11.15 line with FROM --platform=linux/amd64 node:11.15.

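However, be careful with this approach! This worked for me because I’m using Node, an interpreted language. When using a compiled language like Go, Docker needs to be instructed to build everything as if it were x86. This can be done by specifying the platform in the docker build command: docker build --platform linux/amd64 -t <image-tag> . Passing a comma-separated list such as linux/amd64,linux/arm64 instead requests a multi-platform image, which needs a builder that supports multi-platform builds (for example, via Buildx).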

Use an arm64 based VM for Fargate


To avoid building for x86, we can run the task on arm64 Fargate capacity (based on AWS Graviton). This has an added bonus: AWS advertises up to 40% better price-performance for Graviton2-based Fargate compared with x86.

Thanks to Fargate abstracting away lots of implementation details, all we need to do is specify ARM64 as the CPU architecture in the ECS task definition. This can be done in Terraform by modifying your aws_ecs_task_definition resource to include the following:

runtime_platform {
  operating_system_family = "LINUX"
  cpu_architecture = "ARM64"
}

Use an x86 chip to build the image


The final option is to build the image on an x86 machine. For example, it’s not uncommon for the image to be built as part of the CI/CD process on an x86 runner.

Conclusion

The key lesson I learned was to check container logs as soon as issues arise in ECS. With logging enabled, you will almost always get useful information to debug and resolve the problem at hand. If you're looking for general tips on debugging issues with ECS, I found this article by Nathan Peck, a Senior Developer Advocate at Amazon, to be very insightful.