From Chaos to GitOps

How We Migrated 40 Microservices to GitOps in 90 Days Without a Single Outage

Cloudvoyance Engineering

Platform & AI Team

January 24, 2026
8 min read
GitOps · Kubernetes · ArgoCD · Platform Engineering · DevOps

Day one of the engagement: 40 microservices, 3 environments (dev/staging/prod), approximately 14 engineers who each knew how to deploy "their" services and almost nobody else's. Deployment documentation that hadn't been updated since 2023. A shared staging environment with no resource quotas where one team's load test could (and did) bring down everyone else's work. This is where we started.

Ninety days later, every service was deployed via ArgoCD from a monorepo of Helm charts, every environment was identical by definition, and the engineering team was shipping to production three times per day with confidence. Here's exactly how we did it.

Key Principle

GitOps is the practice of using Git as the single source of truth for your entire system state. The cluster always converges toward what is described in Git. This is not just a deployment pattern — it is a complete operational philosophy.

The 90-Day Roadmap

Weeks 1–2: Discovery and the "Archaeology Phase"

Before we wrote a single line of Helm chart YAML, we spent two weeks mapping the landscape. We documented every service's: runtime requirements, environment variables and secret dependencies, inter-service dependencies, databases touched, and external API integrations. This "archaeology phase" is where most migrations fail — teams skip it and discover missing dependencies in production at 2 PM on a Friday.
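To make that concrete, here is a sketch of the kind of per-service inventory record we kept during archaeology. The service name, fields, and secret paths are all hypothetical; the point is that every service got one of these before any chart work began.

```yaml
# inventory/payments-api.yaml — hypothetical per-service archaeology record
service: payments-api
runtime: jvm-17
env_vars:
  - DATABASE_URL        # sourced from secret: prod/payments/db
  - STRIPE_API_KEY      # sourced from secret: prod/payments/stripe
depends_on:
  - user-service
  - ledger-service
databases:
  - postgres: payments
external_apis:
  - stripe.com
manual_steps:
  - "flyway migrate, run by hand before deploy"  # later became a Kubernetes Job
```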

Weeks 3–5: Helm Chart Standardization

We created a "golden chart" template that encoded organizational standards: resource requests/limits, pod disruption budgets, probes, NetworkPolicy defaults, and a ServiceMonitor for Prometheus. Every service used this template; engineers overrode only what their service specifically needed.

```yaml
# helm/templates/deployment.yaml — Golden Chart Template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "app.fullname" . }}
  annotations:
    deployment.kubernetes.io/git-sha: {{ .Values.image.tag | quote }}
spec:
  replicas: {{ .Values.replicaCount | default 2 }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime by default
  selector:
    matchLabels: {{- include "app.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels: {{- include "app.selectorLabels" . | nindent 8 }}  # must match the selector above
    spec:
      # Security context — hardened by default
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault

      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: Always

          # Resource requests are MANDATORY
          resources:
            requests:
              cpu: {{ required "resources.requests.cpu is required" .Values.resources.requests.cpu }}
              memory: {{ required "resources.requests.memory is required" .Values.resources.requests.memory }}
            limits:
              cpu: {{ .Values.resources.limits.cpu | default "2" }}
              memory: {{ .Values.resources.limits.memory | default "512Mi" }}

          # Probes — sensible defaults, service can override
          livenessProbe:
            httpGet:
              path: {{ .Values.health.path | default "/health" }}
              port: {{ .Values.service.port | default 8080 }}
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: {{ .Values.health.readyPath | default "/ready" }}
              port: {{ .Values.service.port | default 8080 }}
            initialDelaySeconds: 5
            periodSeconds: 5
```
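With the golden chart carrying the defaults, a service's own values file stays small. A hypothetical per-service overlay (names and registry are illustrative) might be the entirety of what a team maintains:

```yaml
# services/payments-api/values-prod.yaml — hypothetical per-service override
replicaCount: 4
image:
  repository: registry.example.com/payments-api
  tag: "a1b2c3d"
resources:
  requests:
    cpu: 250m
    memory: 256Mi
health:
  path: /healthz   # overrides the /health default from the golden chart
```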

Weeks 6–8: ArgoCD and the App-of-Apps Pattern

We used ArgoCD's App-of-Apps pattern to manage the entire platform from a single root Application. This gives you a single pane of glass for all 40 services, with sync health, resource diffs, and rollback available through one UI. The root app points to a directory of ArgoCD Application manifests, each pointing to an environment-specific values overlay.
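A minimal sketch of the pattern (repo URL, paths, and service names here are illustrative): the root Application syncs a directory that itself contains one child Application per service.

```yaml
# argocd/root-app.yaml — root of the App-of-Apps tree
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: argocd/apps/prod        # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                 # delete resources removed from Git
      selfHeal: true              # revert manual drift back to the Git state
---
# argocd/apps/prod/payments-api.yaml — one child Application per service
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: helm                    # the shared golden chart
    helm:
      valueFiles:
        - ../services/payments-api/values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With `selfHeal` enabled, a manual `kubectl edit` in the cluster is reverted on the next sync, which is what makes Git the source of truth in practice rather than just in policy.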

Weeks 9–10: Secrets Migration

Secrets are where GitOps gets complicated. You cannot put plaintext secrets in Git. We used the External Secrets Operator with AWS Secrets Manager as the backend. Every secret in the cluster is a reference to a secret in Secrets Manager, not the secret itself. Rotation is automatic. Audit trails are real. Engineers never see production secret values.
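The shape of that reference looks like this (store name and secret paths are illustrative): Git holds only the pointer, and the operator materializes the actual Kubernetes Secret from Secrets Manager.

```yaml
# helm/templates/externalsecret.yaml — only a reference lives in Git
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-api-secrets
spec:
  refreshInterval: 1h                  # re-fetch so rotations propagate automatically
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager          # cluster-wide store configured by platform team
  target:
    name: payments-api-secrets         # Kubernetes Secret created by the operator
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments/db          # path in AWS Secrets Manager
        property: url
```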

Weeks 11–13: Cutover and the Parallel-Run Approach

We ran old deployments and new GitOps-managed deployments in parallel for two weeks per service. Traffic was shifted gradually via weighted Ingress rules. If anything behaved differently in the GitOps-managed version, we had the old version to fall back to immediately. This "parallel run" approach is slower but essentially eliminates migration-induced outages.
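Assuming an ingress-nginx controller (other controllers have equivalent mechanisms), the weighted shift can be expressed as a second Ingress flagged as a canary; hostnames and service names here are illustrative:

```yaml
# Canary Ingress sending 10% of traffic to the GitOps-managed deployment
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api-gitops-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # raise gradually: 10, 50, 100
spec:
  ingressClassName: nginx
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-api-gitops   # new GitOps-managed Service
                port:
                  number: 8080
```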

Key Insight

The biggest win from GitOps is not deployment automation — it's environment parity. When every environment is described in code, the question "why does this work in staging but not production?" almost entirely disappears. The environments are identical by definition.

The Three Things That Almost Derailed Us

  • Undocumented init containers: three services had manual database migration steps that were "run by hand before deploy." We discovered these during archaeology. They became Kubernetes Job objects in the charts.
  • Resource request inflation: engineers had set absurdly large resource requests to "be safe." Normalizing these freed up 40% of cluster capacity.
  • ServiceAccount proliferation: every service had been running as the default ServiceAccount with cluster-wide access. The RBAC normalization took two weeks and broke four services.
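For the undocumented migrations, a sketch of what those Jobs looked like in the golden chart, using ArgoCD's PreSync hook (the image and migration command are illustrative):

```yaml
# helm/templates/migrate-job.yaml — migration runs before each sync
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync            # run before the Deployment syncs
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["./migrate", "up"]          # illustrative migration command
```

If the hook Job fails, the sync stops and the old Deployment keeps running, which is exactly the behavior a "run by hand before deploy" step never guaranteed.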
Written by Cloudvoyance Engineering