Everything started on Azure Pipelines. A year ago we moved PR validation to GitHub Actions, but deployments and E2E testing stayed on Azure. I designed the final migration to bring everything into one place. Here's the architecture and what went wrong along the way.
After the first partial migration, the repository split CI/CD across two systems. GitHub Actions handled the lightweight stuff — linting, static analysis, unit tests, Docker builds. Azure Pipelines still owned the heavy stuff — pushing images to three cloud registries (ECR, GAR, ACR), deploying to three E2E testing stacks, running ~30 PHPUnit test suites, manual production approval, and production deployment.
This split was painful in practice. Context-switching between two systems with different UIs, different secret management, different debugging tools. Azure's variable UI for managing hundreds of test secrets was clunky. And you couldn't test deploy workflows on feature branches — any change to the pipeline had to go straight to master to be verified.
The goal was simple: consolidate everything into GitHub Actions and use its native features — environments with protection rules, matrix strategies, reusable workflows, OIDC authentication.
The core design decision was a single orchestrator workflow (pipeline.yml) with a resolve job that classifies every GitHub event and emits a routing plan. PR opened? Run CI only. Canary tag pushed? CI + canary deploy. Master push? CI + E2E deploy + production deploy (gated by approval).
# EVENT/REF/REF_NAME come from the GitHub context
# (github.event_name, github.ref, github.ref_name).
case "$EVENT" in
  pull_request) IS_PR=true ;;
  merge_group)  IS_MERGE_GROUP=true ;;
  push)
    if [[ "$REF" == "refs/heads/master" ]]; then IS_MASTER=true
    elif [[ "$REF_NAME" == canary-* ]]; then IS_CANARY=true
    elif [[ "$REF_NAME" == dev-* ]]; then IS_DEV_TAG=true
    fi ;;
esac
The resolve job outputs a JSON plan — pipeline type, deploy mode, image tag, routing flags — and every downstream job reads from it.
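A minimal sketch of that handoff, assuming the plan is built with jq (preinstalled on GitHub-hosted runners); the field names here (pipeline, image_tag, deploy_e2e) are illustrative, not the real plan schema:

```yaml
jobs:
  resolve:
    runs-on: ubuntu-latest
    outputs:
      plan: ${{ steps.route.outputs.plan }}
    steps:
      - id: route
        run: |
          # Classify the event, then emit one compact JSON plan.
          PLAN=$(jq -cn --arg tag "sha-${GITHUB_SHA::7}" \
            '{pipeline: "ci", image_tag: $tag, deploy_e2e: false}')
          echo "plan=$PLAN" >> "$GITHUB_OUTPUT"

  e2e-deploy:
    needs: resolve
    if: fromJSON(needs.resolve.outputs.plan).deploy_e2e
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploying ${{ fromJSON(needs.resolve.outputs.plan).image_tag }}"
```

Downstream jobs never re-derive routing logic; they only read fields off the plan with fromJSON.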
The key insight here: everything uses workflow_call (reusable workflows), not workflow_run. The difference matters. workflow_run only triggers from the default branch, which means you can't test your deploy pipeline from a feature branch. With workflow_call, the orchestrator calls child workflows directly, and the whole thing works from any branch.
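For illustration, the calling side might look roughly like this (file and input names are made up). Because the child is referenced by path, a feature branch exercises its own copy of the deploy workflow:

```yaml
# pipeline.yml (orchestrator): calls the child from the current ref.
jobs:
  deploy-e2e:
    needs: resolve
    uses: ./.github/workflows/deploy-e2e.yml
    with:
      image-tag: ${{ needs.resolve.outputs.image-tag }}
    secrets: inherit
---
# deploy-e2e.yml (child): callable only, no standalone trigger needed.
on:
  workflow_call:
    inputs:
      image-tag:
        required: true
        type: string
```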
This was the core of the migration. The old Azure setup grouped ~30 test suites into 4 batches, each running 6-8 suites via GNU parallel on a single runner. If one suite in a batch failed, you had to dig through the batch log to figure out which one. Re-running meant re-running the entire batch.
The new setup: one matrix job per suite. Individual timeouts, individual pass/fail, individual re-runs.
strategy:
  fail-fast: false
  max-parallel: 30
  matrix:
    include:
      - { suite: api-tests, stack: aws, processes: 1 }
      - { suite: integration-part-1, stack: aws, processes: 5 }
      - { suite: integration-part-2, testsuite: integration, stack: gcp, processes: 9 }
      - { suite: storage-azure, stack: azure, processes: 1 }
      # ~30 suites across aws/azure/gcp
Notice the testsuite field on the third entry? That was the first surprise. The suite name (used for display and secret lookup) isn't always the same as the PHPUnit --testsuite value. Some suites had been renamed over the years in the config but not in phpunit.xml.dist, or vice versa. The docker run command handles this with a fallback:
docker run --rm \
  --env-file "$ENV_FILE" \
  app_e2e \
  ./bin/parallel-retry.php \
  --processes=${{ matrix.processes || 1 }} \
  --testsuite=${{ matrix.testsuite || matrix.suite }}
If testsuite is defined in the matrix, use it; otherwise, fall back to suite. It's legacy naming drift we'll eventually clean up, but for now the fallback keeps it manageable; it only took a couple of iterations to map all the suites correctly.
One of the wins of the migration: OIDC authentication for cloud registries. No more static credentials for pushing Docker images to AWS ECR or GCP Artifact Registry. The workflow authenticates using GitHub's OIDC token, and the cloud provider trusts the token based on the repository identity.
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT_ID:role/ecr-push-role
    aws-region: us-east-1
The role ARN is hardcoded in the workflow — it's a public resource identifier, not a secret. Same for GCP's workload identity provider. Zero static credentials for these two providers.
The easy-to-forget gotcha: id-token: write permission. Without it, the OIDC token request silently fails. I forgot it at least twice in new workflows.
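For reference, the grant is a single line in the workflow-level (or job-level) permissions block:

```yaml
permissions:
  contents: read   # keep whatever other default grants the job needs
  id-token: write  # allows the job to request an OIDC token
```

Note that declaring any explicit permissions block drops the defaults, so list everything the job still relies on.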
The exception was Azure Container Registry. Azure's OIDC implementation can't scope access to individual container repositories within a registry — it's all or nothing at the registry level. The workaround: scope map tokens (essentially a username/password pair scoped to specific repos). Not ideal, but Azure gave us no alternative for repo-level granularity.
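As a sketch, such a token can be minted with the Azure CLI; the registry, repository, scope-map, and token names below are placeholders, not our real ones:

```shell
# Create a scope map limited to specific repositories,
# then a token bound to that scope map.
az acr scope-map create \
  --name e2e-push --registry myregistry \
  --repository app/e2e content/read content/write

az acr token create \
  --name github-e2e-push --registry myregistry \
  --scope-map e2e-push
```

The generated token credentials then go into GitHub secrets like any other username/password pair.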
Each E2E suite needs a bunch of environment variables — API tokens, storage URLs, backend credentials. Across 30 suites and 3 cloud stacks, that's roughly 200 secrets. Azure Pipelines stored these as pipeline variables. Where do they go in GitHub Actions?
Not in GitHub Environment variables — there are count limits, and our generator tool updates them frequently. Instead: GCP Secret Manager as an external store.
The naming convention is self-describing: <prefix>--<stack>--<suite>--<VAR_NAME>. For example: e2e-tests--aws--common--API_TOKEN. The helper script discovers secrets at runtime by listing everything matching the pattern {prefix}--{stack}--{suite}--*, extracts the variable name from the last segment, and builds a Docker --env-file.
No static config file needed. Add a secret to GCP, the suite picks it up automatically.
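A bash sketch of that discovery-and-env-file step, with the gcloud calls stubbed out (the secret names and "dummy" values here are illustrative; the real script lists and reads secrets from GCP Secret Manager):

```shell
#!/usr/bin/env bash
set -euo pipefail

PREFIX="e2e-tests"; STACK="aws"; SUITE="common"
ENV_FILE="$(mktemp)"

# Stand-in for: gcloud secrets list --filter="name:${PREFIX}--${STACK}--${SUITE}--"
SECRET_NAMES="${PREFIX}--${STACK}--${SUITE}--API_TOKEN
${PREFIX}--${STACK}--${SUITE}--STORAGE_URL"

for name in $SECRET_NAMES; do
  var="${name##*--}"   # the last "--" segment is the variable name
  # Stand-in for: value=$(gcloud secrets versions access latest --secret="$name")
  value="dummy"
  echo "${var}=${value}" >> "$ENV_FILE"
done

cat "$ENV_FILE"
```

The resulting file is handed straight to docker run via --env-file.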
The fun part was the parser. The legacy Azure variable names used __ as a delimiter — SOME_VAR__API_TOKEN__AWS. But variable names themselves can contain __. The parser had to split from the right to handle this correctly. Getting this wrong meant silent mismatches where a suite would load secrets from the wrong stack.
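The right-split itself is two parameter expansions in bash (the example name is illustrative): peel the stack suffix off from the right first, so a variable name that itself contains "__" survives intact.

```shell
name="SOME_VAR__API_TOKEN__AWS"

stack="${name##*__}"   # longest match from the left: everything after the LAST __
var="${name%__*}"      # shortest match from the right: drops only the final __AWS

echo "$var -> $stack"  # SOME_VAR__API_TOKEN -> AWS
```

A naive left-split would have produced SOME_VAR as the variable name and silently attributed the rest to the wrong stack.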
Azure Pipelines used ManualValidation@0 for production approval gates. It worked, but the runner sat occupied during the entire wait — up to 24 hours if nobody approved quickly.
GitHub Environments solve this cleanly. The production environment has required reviewers and a branch restriction (master only). The deploy job targets this environment, and GitHub pauses the job until someone approves. No runner is allocated during the wait. Built-in audit trail of who approved what and when.
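In workflow terms the gate is just an environment reference on the deploy job (names here are illustrative); the required reviewers and the master-only branch rule live in the repository's environment settings, not in YAML:

```yaml
deploy-production:
  needs: e2e
  runs-on: ubuntu-latest
  environment:
    name: production   # required reviewers + branch restriction configured on the environment
  steps:
    - run: ./deploy.sh production   # placeholder deploy step
```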
I wrote a detailed RFC before starting — architecture diagrams, task checklists, phased rollout plan. It helped enormously for the big decisions (orchestrator pattern, matrix strategy, secret store choice). But the checklist drifted almost immediately once implementation started.
The naming mismatches between matrix config, PHPUnit testsuites, and legacy Azure variables weren't in the RFC. The id-token: write permission gotcha wasn't in the RFC. The fact that workflow_run only triggers from the default branch — I discovered that the hard way after the first approach failed.
Planning documents are maps, not territory. The map got me to the right continent, but navigating the streets required actually walking them.