AWS EKS Deep Dive: From Manual Setup to Production-Ready Kubernetes Clusters
When I first looked at AWS’s container orchestration options, I felt like I was staring at a menu in a foreign language. ECS? EKS? Fargate? EC2? What’s the difference, and which one should I actually use?
After weeks of hands-on experimentation—creating clusters manually through the console, wrestling with IAM roles, debugging the infamous cluster autoscaler, and eventually discovering the magic of eksctl—I finally have clarity. Today, I’m sharing everything I learned about deploying production-ready Kubernetes on AWS, complete with the mistakes I made so you don’t have to.
Understanding AWS’s Container Landscape: A High-Level Overview
Before diving into EKS, let’s map out the entire AWS container ecosystem. Think of this as choosing your adventure based on your needs:
The Four Main Paths
1. ECS + EC2: The AWS-Native Approach
Amazon Elastic Container Service (ECS) with EC2 instances.
What it is: AWS’s proprietary container orchestration service running on EC2 instances you manage.
When to use:
- You want a simpler, AWS-native solution (no Kubernetes complexity)
- Your team doesn’t need Kubernetes expertise
- You’re already heavily invested in AWS ecosystem
Trade-offs:
- Not portable to other clouds or on-premises
- Proprietary API (not Kubernetes)
- You manage the EC2 instances (patching, scaling)
2. ECS + Fargate: The Serverless Container Dream
ECS with Fargate compute engine.
What it is: Run containers without managing any servers. AWS handles all infrastructure.
When to use:
- You want zero infrastructure management
- Workloads have unpredictable traffic patterns
- Team wants to focus 100% on applications
Trade-offs:
- Higher cost per workload
- Less control over underlying infrastructure
- Still proprietary to AWS
3. EKS (Elastic Kubernetes Service): The Kubernetes Standard
Managed Kubernetes control plane with self-managed or managed worker nodes.
What it is: Fully managed Kubernetes control plane + your choice of worker node management.
When to use:
- You need Kubernetes (for portability, ecosystem, or skills)
- You want flexibility in how you manage worker nodes
- You need hybrid/multi-cloud capabilities
Management levels:
- Self-managed nodes: You provision and manage EC2 instances yourself (maximum control)
- Managed Node Groups: AWS helps with node lifecycle management (recommended balance)
- EKS + Fargate: Fully serverless Kubernetes (zero node management)
Trade-offs:
- Steeper learning curve than ECS
- More configuration required
- You pay for the control plane ($0.10/hour)
4. ECR (Elastic Container Registry): Your Private Docker Hub
AWS’s Docker image registry.
What it is: Secure, scalable, and reliable registry to store and manage your container images.
Why you need it: Whether you use ECS or EKS, you’ll push your Docker images to ECR and pull them during deployment. Think of it as your team’s private Docker Hub.
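For reference, a typical push flow looks something like this. The repository name my-app and the account ID are placeholders for illustration, not values from this guide:
# Create a repository and authenticate Docker to your private registry
aws ecr create-repository --repository-name my-app
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
# Tag and push the image; your cluster pulls it from the same URI at deploy time
docker tag my-app:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-app:latest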
My Recommendation
For learning and production Kubernetes deployments, EKS with Managed Node Groups is the sweet spot:
- You get full Kubernetes capabilities
- AWS handles control plane complexity
- Managed node groups simplify node lifecycle
- You retain flexibility to optimize costs and performance
Now let’s build one.
Creating an EKS Cluster Manually (The Console Way)
Before we use automation tools, it’s crucial to understand what’s happening under the hood. Let’s create an EKS cluster through the AWS Management Console.
Step 1: Create an IAM Role for the EKS Cluster
The EKS control plane needs permissions to manage AWS resources on your behalf (creating load balancers, managing ENIs, etc.).
In the AWS Console:
- Navigate to IAM → Roles → Create role
- Select AWS Service → EKS → EKS - Cluster
- AWS automatically attaches the required policy: AmazonEKSClusterPolicy
- Name it: EKS-Cluster-Role
- Create the role
What this role does: Allows EKS to make API calls to AWS services like EC2, Elastic Load Balancing, and CloudWatch on your behalf.
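If you prefer the CLI to the console, here is a minimal sketch of the same role setup, assuming the same role name as above. The trust policy simply lets the EKS service assume the role:
# Trust policy that lets the EKS service assume the role
cat > eks-cluster-trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "eks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role \
  --role-name EKS-Cluster-Role \
  --assume-role-policy-document file://eks-cluster-trust.json
aws iam attach-role-policy \
  --role-name EKS-Cluster-Role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy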
Step 2: Create a VPC for Your Cluster
EKS requires a VPC with specific networking configurations (public and private subnets, route tables, NAT gateways). Rather than creating this manually, AWS provides a CloudFormation template.
Why CloudFormation? Creating a production-ready VPC manually involves:
- Multiple subnets (public and private across AZs)
- Internet Gateway for public subnet
- NAT Gateway for private subnet
- Route tables and security groups
- Proper tagging for Kubernetes
That’s 30+ resources. CloudFormation does this in 5 minutes.
Steps:
- Go to CloudFormation → Create stack → With new resources
- For the template URL, use the official AWS EKS VPC template:
  https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
  Or find the latest at: AWS EKS VPC CloudFormation Templates
- Stack name: eks-vpc-stack
- Create the stack
Check the Outputs tab after creation—you’ll need the VPC ID and subnet IDs for cluster creation.
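The same stack can also be launched and inspected from the CLI. A sketch using the template URL above:
# Launch the VPC stack from the official EKS template
aws cloudformation create-stack \
  --stack-name eks-vpc-stack \
  --template-url https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
# Wait for completion, then read the outputs (VPC ID, subnet IDs)
aws cloudformation wait stack-create-complete --stack-name eks-vpc-stack
aws cloudformation describe-stacks \
  --stack-name eks-vpc-stack \
  --query "Stacks[0].Outputs"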
Public vs. Private Subnets:
- Public subnets: For internet-facing resources like LoadBalancers
- Private subnets: For worker nodes (secure, no direct internet access)
This is a best practice architecture—worker nodes in private subnets access the internet via NAT Gateway for pulling images, while LoadBalancers in public subnets serve traffic.
Step 3: Create the EKS Cluster
- Navigate to EKS → Clusters → Create cluster
- Configuration:
  - Cluster name: my-eks-cluster
  - Kubernetes version: 1.33 (use the latest stable version)
  - Cluster service role: Select the EKS-Cluster-Role you created
- Networking:
  - VPC: Select the VPC from your CloudFormation stack
  - Subnets: Select all subnets (both public and private)
  - Security groups: Use the default security group created
  - Cluster endpoint access:
    - Public and Private (recommended): Control plane accessible from both the internet (for your laptop) and within the VPC (for nodes)
    - Public only: Less secure
    - Private only: Very secure but requires a VPN/bastion host to manage the cluster
- Add-ons (optional but recommended):
  - CoreDNS: DNS service for Kubernetes
  - kube-proxy: Network proxy running on each node
  - VPC CNI: Networking plugin for pod IP assignment
  These are essential components. Install them unless you have specific reasons not to.
- Create the cluster (takes 10-15 minutes)
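If you’d rather script this step, the cluster can also be created from the AWS CLI. A minimal sketch, assuming the role ARN and subnet IDs from your CloudFormation outputs (the account ID and subnet IDs below are placeholders):
# Substitute your account ID and the subnet IDs from the eks-vpc-stack outputs
aws eks create-cluster \
  --name my-eks-cluster \
  --kubernetes-version 1.33 \
  --role-arn arn:aws:iam::<account-id>:role/EKS-Cluster-Role \
  --resources-vpc-config subnetIds=subnet-aaa,subnet-bbb,subnet-ccc,subnet-ddd
# Wait until the control plane reports ACTIVE (typically 10-15 minutes)
aws eks wait cluster-active --name my-eks-cluster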
Step 4: Connect to Your Cluster
Once the cluster is active, configure kubectl to connect:
# Verify AWS CLI is configured
aws configure list
# Update kubeconfig to include your new cluster
aws eks update-kubeconfig --name my-eks-cluster --region us-east-1
# Verify connection
kubectl cluster-info
kubectl get svc
You should see the Kubernetes API server endpoint. At this point, your control plane is ready, but you have zero worker nodes—no place to run workloads yet.
Creating Worker Nodes (Managed Node Groups)
The control plane is the brain; worker nodes are the muscles. Let’s add compute capacity.
Step 1: Create an IAM Role for Worker Nodes
Worker nodes need permissions to:
- Join the EKS cluster
- Pull images from ECR
- Manage networking (ENIs for pod IPs)
In IAM Console:
- Create role → AWS Service → EC2
- Attach these three policies:
  - AmazonEKSWorkerNodePolicy (core EKS permissions)
  - AmazonEC2ContainerRegistryReadOnly (pull images from ECR)
  - AmazonEKS_CNI_Policy (networking for pods)
- Name: EKS-Worker-Node-Role
- Create role
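As with the cluster role, here is a hedged CLI sketch of the same setup. This time EC2 is the trusted service, since the role is assumed by the worker instances:
# Trust policy letting EC2 instances assume the role
cat > node-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role \
  --role-name EKS-Worker-Node-Role \
  --assume-role-policy-document file://node-trust-policy.json
# Attach the three managed policies listed above
for policy in AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly AmazonEKS_CNI_Policy; do
  aws iam attach-role-policy \
    --role-name EKS-Worker-Node-Role \
    --policy-arn arn:aws:iam::aws:policy/$policy
done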
Step 2: Create a Managed Node Group
In your EKS cluster:
- Compute tab → Add node group
- Configuration:
  - Node group name: eks-nodegroup-1
  - Node IAM role: EKS-Worker-Node-Role
- Compute configuration:
  - AMI type: Amazon Linux 2 (optimized for EKS)
  - Instance type: t3.medium (2 vCPU, 4 GB RAM - a good starting point)
  - Disk size: 20 GB
- Scaling configuration:
  - Desired size: 2 nodes
  - Minimum size: 1 node
  - Maximum size: 4 nodes
- Remote access (optional but recommended for troubleshooting):
- Enable SSH access
- Select your EC2 key pair
- Specify allowed SSH source (your IP)
- Create node group
Wait 5-10 minutes. Verify with:
kubectl get nodes
You should see 2 nodes in Ready state. Congratulations! You now have a functional EKS cluster.
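For repeatable setups, the same managed node group can also be created from the AWS CLI. A sketch assuming the role ARN and subnet IDs from the earlier steps (placeholders shown, not real IDs):
aws eks create-nodegroup \
  --cluster-name my-eks-cluster \
  --nodegroup-name eks-nodegroup-1 \
  --node-role arn:aws:iam::<account-id>:role/EKS-Worker-Node-Role \
  --subnets subnet-aaa subnet-bbb \
  --instance-types t3.medium \
  --disk-size 20 \
  --scaling-config minSize=1,maxSize=4,desiredSize=2
# Wait for the nodes to register with the cluster
aws eks wait nodegroup-active --cluster-name my-eks-cluster --nodegroup-name eks-nodegroup-1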
Auto Scaling: Teaching Your Cluster to Grow and Shrink
Static node counts are fine for learning, but production workloads need dynamic scaling. When traffic spikes, you need more nodes. When traffic drops, you want to save money by scaling down.
Enter the Kubernetes Cluster Autoscaler.
How It Works
- You deploy a pod that requires more resources than available
- Pod remains in Pending state
- Autoscaler calls AWS Auto Scaling Group API to increase desired capacity
- New EC2 instance joins cluster
- Pod gets scheduled
Reverse happens when nodes are underutilized for 10+ minutes.
The Architecture: IRSA (IAM Roles for Service Accounts)
Here’s where it gets sophisticated. The Cluster Autoscaler pod needs AWS permissions to modify Auto Scaling Groups. Instead of giving every pod on the node these permissions (overly broad), we use IRSA:
- OIDC Provider: Links your EKS cluster to AWS IAM
- IAM Role: Has permissions to modify Auto Scaling Groups
- Kubernetes Service Account: Gets annotated with the IAM role ARN
- Pod: Uses the service account and assumes the IAM role automatically
This is secure, granular, and follows the principle of least privilege.
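Once the autoscaler is deployed (full setup in the next section), you can sanity-check the IRSA wiring with a couple of commands. The environment variable names are what the EKS pod identity webhook injects into pods that use an annotated service account:
# The service account should carry the role-arn annotation
kubectl describe serviceaccount cluster-autoscaler -n kube-system
# The pod spec should show AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE,
# which the AWS SDK uses for sts:AssumeRoleWithWebIdentity
kubectl describe pod -n kube-system -l app=cluster-autoscaler | grep AWS_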
Setup Guide (Step-by-Step)
I’m going to share the complete setup that finally worked after hours of debugging, based on the troubleshooting notes I pieced together along the way.
1. Get Cluster Information
# Set variables (replace with your values)
export CLUSTER_NAME="my-eks-cluster"
export AWS_REGION="us-east-1"
# Get AWS account ID
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "AWS Account ID: $AWS_ACCOUNT_ID"
# Get OIDC provider URL
export OIDC_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --region $AWS_REGION --query "cluster.identity.oidc.issuer" --output text)
echo "OIDC URL: $OIDC_URL"
# Extract OIDC ID
export OIDC_ID=$(echo $OIDC_URL | cut -d '/' -f 5)
echo "OIDC ID: $OIDC_ID"
2. Create OIDC Provider (If Not Exists)
# Check if it exists
aws iam list-open-id-connect-providers | grep $OIDC_ID
# If not, create it
aws iam create-open-id-connect-provider \
--url $OIDC_URL \
--client-id-list sts.amazonaws.com \
--thumbprint-list 9e99a48a9960b14926bb7f3b02e22da2b0ab7280
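If you already have eksctl installed (covered later in this post), the same association can be done in one command instead of supplying the thumbprint manually:
eksctl utils associate-iam-oidc-provider \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION \
  --approve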
3. Create IAM Policy for Autoscaler
cat > cluster-autoscaler-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:DescribeTags",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": ["*"]
}
]
}
EOF
# Create the policy
aws iam create-policy \
--policy-name AmazonEKSClusterAutoscalerPolicy \
--policy-document file://cluster-autoscaler-policy.json
4. Create IAM Role with Trust Policy
cat > trust-policy.json << EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:cluster-autoscaler",
"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com"
}
}
}
]
}
EOF
# Create the role
aws iam create-role \
--role-name EKSClusterAutoscalerRole \
--assume-role-policy-document file://trust-policy.json
# Attach the policy
aws iam attach-role-policy \
--role-name EKSClusterAutoscalerRole \
--policy-arn arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AmazonEKSClusterAutoscalerPolicy
5. Download and Customize the Manifest
# Download official manifest
wget https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
Edit the file and make these critical changes:
A. Add IAM role annotation to ServiceAccount:
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
name: cluster-autoscaler
namespace: kube-system
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::YOUR_ACCOUNT_ID:role/EKSClusterAutoscalerRole
B. Add pod annotation to prevent eviction:
spec:
template:
metadata:
labels:
app: cluster-autoscaler
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
C. Add AWS region environment variable:
containers:
- image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.33.0
name: cluster-autoscaler
env:
- name: AWS_REGION
value: "us-east-1"
D. Update command arguments:
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-eks-cluster
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
E. Use the correct image version:
Match your Kubernetes version:
| Kubernetes | Autoscaler Image |
|---|---|
| 1.33 | v1.33.0 |
| 1.32 | v1.32.0 |
| 1.31 | v1.31.0 |
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.33.0
6. Tag Your Auto Scaling Groups
# Find your ASG name
ASG_NAME=$(aws autoscaling describe-auto-scaling-groups \
--query "AutoScalingGroups[?contains(Tags[?Key=='eks:cluster-name'].Value, '$CLUSTER_NAME')].AutoScalingGroupName" \
--output text)
# Add required tags
aws autoscaling create-or-update-tags \
--tags \
ResourceId=$ASG_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false \
ResourceId=$ASG_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/$CLUSTER_NAME,Value=owned,PropagateAtLaunch=false
These tags are how the autoscaler discovers which Auto Scaling Groups it can modify.
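To confirm the tags landed, you can query the ASG directly. The JMESPath filter below is just one way to narrow the output to the autoscaler tags:
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query "AutoScalingGroups[0].Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler')]"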
7. Deploy the Autoscaler
kubectl apply -f cluster-autoscaler-autodiscover.yaml
# Verify it's running
kubectl get pods -n kube-system -l app=cluster-autoscaler
# Check logs for success messages
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
Success indicators in logs:
- “Starting cluster autoscaler”
- “Successfully loaded EC2 instance types”
- “Discovered X Auto Scaling Groups”
Testing the Autoscaler
Deploy a workload that needs more resources than available:
# test-autoscaling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
spec:
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
resources:
requests:
cpu: 500m
memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
name: nginx
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 80
selector:
app: nginx
# Deploy
kubectl apply -f test-autoscaling.yaml
# Get LoadBalancer URL
kubectl get svc nginx
# Wait for EXTERNAL-IP, then visit it in browser
# Scale up to trigger autoscaling
kubectl scale deployment nginx --replicas=10
# Watch the autoscaler in action
kubectl logs -n kube-system -l app=cluster-autoscaler -f
# Watch nodes being added
kubectl get nodes -w
You’ll see new nodes join the cluster within 2-3 minutes!
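To watch scale-down as well, drop the replica count back and wait; as noted earlier, the autoscaler considers nodes for removal after roughly 10 minutes of low utilization:
# Scale the test deployment back down
kubectl scale deployment nginx --replicas=1
# Watch nodes drain and disappear over the next ~10+ minutes
kubectl get nodes -w
# The autoscaler logs its scale-down decisions as well
kubectl logs -n kube-system -l app=cluster-autoscaler -f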
Common Errors I Encountered (And How to Fix Them)
Error 1: NoCredentialProviders
Symptom: CrashLoopBackOff, logs show “NoCredentialProviders: no valid providers”
Cause: OIDC provider not created or service account annotation missing
Fix: Verify OIDC provider exists and service account has the IAM role annotation
Error 2: ImagePullBackOff
Symptom: Pod won’t start, “image pull error”
Cause: Wrong autoscaler version for your Kubernetes version
Fix: Check Kubernetes version with kubectl version and use matching autoscaler image
Error 3: No Auto Scaling Groups Found
Symptom: Logs show “0 ASG found”
Cause: Missing ASG tags or wrong cluster name in command arguments
Fix: Ensure ASG has both required tags and cluster name matches exactly
EKS with Fargate: The Serverless Kubernetes Experience
Want to run Kubernetes pods without managing any EC2 instances? That’s Fargate.
How Fargate Works with EKS
- You define a Fargate Profile specifying which pods should run on Fargate (by namespace or labels)
- When you deploy a pod matching the profile, AWS automatically provisions a Fargate task (right-sized compute)
- When the pod terminates, the Fargate task is deleted
- You pay only for the vCPU and memory resources your pods use
When to Use Fargate
Good fit:
- Batch jobs or cron jobs
- Microservices with variable traffic
- Development/staging environments
- Workloads where you want zero node management
Not ideal:
- Stateful applications requiring persistent storage (limited support)
- GPU workloads (not supported)
- Workloads needing node-level customization
- Cost-sensitive production (EC2 is cheaper at scale)
Setting Up EKS with Fargate
1. Create IAM Role for Fargate
In IAM Console:
- Create role → AWS Service → EKS → EKS - Fargate Pod
- AWS attaches the policy: AmazonEKSFargatePodExecutionRolePolicy
- Name: EKS-Fargate-Pod-Role
2. Create a Fargate Profile
In your EKS cluster:
- Compute tab → Fargate profiles → Create profile
- Configuration:
  - Name: fargate-profile-1
  - Pod execution role: EKS-Fargate-Pod-Role
  - Subnets: Select private subnets only (Fargate requires private subnets)
- Pod selectors:
  - Namespace: fargate
  - Labels (optional): Can match specific label selectors
This means: “Any pod deployed to the fargate namespace will run on Fargate.”
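The equivalent CLI call, if you want to script it (the account ID and private subnet IDs are placeholders):
aws eks create-fargate-profile \
  --cluster-name my-eks-cluster \
  --fargate-profile-name fargate-profile-1 \
  --pod-execution-role-arn arn:aws:iam::<account-id>:role/EKS-Fargate-Pod-Role \
  --subnets subnet-private-a subnet-private-b \
  --selectors namespace=fargate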
3. Deploy a Workload to Fargate
# Create the namespace
kubectl create namespace fargate
# nginx-fargate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-fg
namespace: fargate
spec:
replicas: 1
selector:
matchLabels:
app: nginx-fg
template:
metadata:
labels:
app: nginx-fg
spec:
containers:
- name: nginx-fg
image: nginx
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: nginx-fg
namespace: fargate
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 80
selector:
app: nginx-fg
kubectl apply -f nginx-fargate.yaml
# Watch the pod start
kubectl get pods -n fargate -w
# Get LoadBalancer URL
kubectl get svc -n fargate nginx-fg
Notice the pod startup is slightly slower (30-60 seconds) because AWS is provisioning the Fargate task.
Verify it’s on Fargate:
kubectl get pod -n fargate <pod-name> -o wide
You won’t see a traditional node name—it’ll show a Fargate node identifier.
The Fast Track: Creating EKS Clusters with eksctl
After manually creating clusters twice, I discovered eksctl—a CLI tool that does in one command what took us 45 minutes through the console.
What is eksctl?
An official CLI tool for EKS created by Weaveworks and AWS. It’s like kubectl for cluster creation—declarative, simple, and powerful.
Installing eksctl
macOS:
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
Linux:
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
Windows (Chocolatey):
choco install eksctl
Verify:
eksctl version
Creating a Cluster with eksctl
eksctl create cluster \
--name eksctl-demo-k8s \
--version 1.33 \
--region us-east-1 \
--nodegroup-name eksctl-demo-ngr \
--node-type t3.medium \
--nodes 2 \
--nodes-min 1 \
--nodes-max 4
That’s it. This single command:
- Creates the IAM roles (cluster + worker nodes)
- Creates a VPC with public/private subnets
- Creates the EKS cluster
- Creates a managed node group
- Configures your kubeconfig automatically
- Enables all add-ons (CoreDNS, kube-proxy, VPC CNI)
Time: 15-20 minutes. Manual clicks: Zero.
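The same cluster can also be described declaratively in a config file and created with eksctl create cluster -f. A sketch mirroring the flags above:
# cluster.yaml - declarative equivalent of the command above
cat > cluster.yaml << 'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eksctl-demo-k8s
  region: us-east-1
  version: "1.33"
managedNodeGroups:
  - name: eksctl-demo-ngr
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
EOF
eksctl create cluster -f cluster.yaml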
Verify Everything
# Cluster info
kubectl cluster-info
# Nodes
kubectl get nodes
# Test deployment
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
kubectl get svc
Deleting the Cluster
eksctl delete cluster \
--name eksctl-demo-k8s \
--region us-east-1 \
--wait
This deletes everything: cluster, node group, VPC, IAM roles, CloudFormation stacks. Clean and thorough.
Cleaning Up Resources (Important!)
EKS clusters cost $0.10/hour for the control plane, plus EC2/Fargate costs. Don’t forget to delete your resources when you’re done learning.
Manual Cleanup Order
- Delete Fargate profiles (if any)
- Delete Node groups
- Delete Cluster
- Delete CloudFormation stack (VPC)
- Delete IAM roles (if you don’t need them)
eksctl Cleanup
eksctl delete cluster --name <cluster-name> --region <region> --wait
Done. One command.
Key Takeaways & Production Checklist
After this deep dive, here’s what clicked for me:
✅ EKS abstracts the control plane complexity but you still need to understand IAM, VPC, and networking
✅ Managed Node Groups are the sweet spot for most use cases—balance of control and convenience
✅ Cluster Autoscaler requires IRSA setup—take time to understand OIDC providers and IAM trust policies
✅ Fargate is magical for variable workloads but not always cost-effective at scale
✅ eksctl is the fastest way to learn—start here, then dive into manual setup to understand internals
Production Readiness Checklist
Before going live, ensure:
- VPC has private subnets for nodes and public subnets for LoadBalancers
- Cluster endpoint is private + public or private only (with VPN)
- Node groups use multiple AZs for high availability
- Cluster Autoscaler is configured with proper IAM permissions
- All pods have resource requests and limits
- Critical workloads use Pod Disruption Budgets
- Monitoring and logging are enabled (CloudWatch Container Insights)
- Secrets are managed with AWS Secrets Manager or Parameter Store (not plain ConfigMaps)
- Images are scanned for vulnerabilities (ECR image scanning)
- Kubernetes version is within 2 minor versions of latest
- Regular backup strategy for etcd (or rely on AWS’s automatic backups)
What’s Next?
Now that you have a production-grade EKS cluster, here’s what to explore:
- Set up CI/CD: Integrate with GitHub Actions or GitLab CI to auto-deploy to EKS
- Implement monitoring: Deploy Prometheus and Grafana for observability
- Add ingress controller: Use AWS Load Balancer Controller or NGINX Ingress
- Explore service mesh: Try AWS App Mesh or Istio for advanced traffic management
- Experiment with EKS Add-ons: AWS released new capabilities in 2025 including built-in Argo CD and Kube Resource Orchestrator (KRO)
Resources
- Official EKS Docs: docs.aws.amazon.com/eks
- eksctl Docs: eksctl.io
- Cluster Autoscaler: GitHub Repository
- EKS Workshop: eksworkshop.com
- My GitHub: [Sample EKS configurations and troubleshooting guides]
Have you struggled with EKS setup? What tripped you up—IAM roles, networking, or the autoscaler? Drop a comment, and let’s troubleshoot together!
And if this guide saved you hours of debugging (like it would have saved me), bookmark it for your team. Future engineers will thank you. 🚀
Happy Kubernetes-ing on AWS!