For years, our Azure Policy changes happened the way they happen in most enterprises: somebody made a careful change in the portal, copied a few notes into a ticket, and hoped nobody else touched the same thing the next day.

It worked until it didn’t. The day you have to explain why a new policy broke a critical deployment, or why ten subscriptions drifted from the baseline, the portal stops feeling like a control plane and starts feeling like a shared Google Doc with no track changes.

We moved policy to pull requests. This is what changed, what got better, and the parts that surprised us.

Who this is for

  • Platform engineering, cloud ops, and governance teams running Azure at scale

  • Anyone managing policy across management groups and many subscriptions

  • Teams that want guardrails without turning every change into a meeting

Before: portal-first policy felt fast, but it was fragile

The portal made policy feel easy. You could tweak an effect, hit Save, and see it live right away.

The problem is that speed without process is just risk moving faster. 

Here is what we kept running into:

  • No reliable review. A change could be technically valid but operationally dangerous.

  • Hard to answer basic questions later: Who changed it? Why? What else was updated?

  • Inconsistent rollouts. Some scopes got updated, others lagged behind, and drift became normal.

  • Exceptions were a mess. Some lived in email, some in tickets, some in people’s heads.

  • Testing was informal. Most validation happened after a deployment failed.

We were not missing tools. We were missing a repeatable workflow that made the safe path the default.

After: policy moved to PRs, and the workflow did the heavy lifting

Moving policy to PRs is not just “store JSON in Git.” The real win is that the PR becomes the unit of change: reviewable, testable, and auditable.

What we changed

  • A single repo became the source of truth for policy definitions, initiatives, and assignments.

  • Every change required a PR with an owner, a reason, and a rollout plan.

  • Automated gates validated policy structure, parameters, and scope impact before merge.

  • Exemptions became explicit objects with metadata, expiry dates, and ownership.

  • Rollout moved to rings: canary first, then broader scope once signals stayed clean.
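The "owner, reason, rollout plan" rule is easy to enforce mechanically. Here is a minimal sketch of such a gate, assuming the PR description uses simple "Name: value" lines; the section names and the function are hypothetical, not taken from our actual template.

```python
import re

# Hypothetical section names; the rule itself only says a PR needs
# an owner, a reason, and a rollout plan.
REQUIRED_SECTIONS = ("Owner", "Reason", "Rollout plan")


def missing_sections(pr_body: str) -> list[str]:
    """Return the required sections that are absent or left empty."""
    missing = []
    for name in REQUIRED_SECTIONS:
        # Match "Owner: some text" at the start of a line. [ \t]* keeps the
        # match on one line, so an empty "Owner:" does not borrow the next line.
        pattern = rf"^{re.escape(name)}:[ \t]*\S.*$"
        if not re.search(pattern, pr_body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing
```

A pipeline step could fail the PR whenever this list is non-empty and print the missing names back to the author.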

The new shape of work

Here is how the flow looked once it settled down. Notice how little of it depends on hero knowledge.

  1. Author opens a PR with the policy change and a short rationale written for operators, not auditors.

  2. PR template forces the basics: scope, effect changes, blast radius, and a backout plan.

  3. Automated checks run: JSON schema validation, naming standards, and “what changed” diffs for initiatives and assignments.

  4. A test deployment targets a safe scope (a dev management group or a known subscription): first a What-If preview, then an actual apply.

  5. Reviewers sign off using Code Owners rules so the right teams see the PR (platform, security, networking, FinOps).

  6. Merge triggers a controlled rollout: canary, pause, expand, repeat.

  7. If something goes sideways, rollback is a commit. Not a scramble.
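The "what changed" diff in step 3 can be as simple as comparing the effect of each definition before and after the PR. The sketch below assumes definitions are loaded as dicts in the standard policy definition shape (properties.policyRule.then.effect). It ignores assignment-level parameter overrides, and the "enforcing" grouping is our own shorthand, not an Azure term.

```python
# Effects we treat as "enforcing" for review purposes (our own grouping).
ENFORCING = {"deny", "modify", "deployifnotexists"}


def effect_of(definition: dict) -> str:
    """Read the effect of a definition, resolving a parameterized effect to its default."""
    props = definition["properties"]
    effect = props["policyRule"]["then"]["effect"]
    prefix, suffix = "[parameters('", "')]"
    if effect.startswith(prefix) and effect.endswith(suffix):
        name = effect[len(prefix):-len(suffix)]
        effect = props.get("parameters", {}).get(name, {}).get("defaultValue", "unknown")
    return effect


def summarize_effect_changes(old: dict, new: dict) -> list[str]:
    """Compare {policy name: definition} maps and describe effect changes."""
    lines = []
    for name in sorted(old.keys() | new.keys()):
        before = effect_of(old[name]) if name in old else None
        after = effect_of(new[name]) if name in new else None
        if before == after:
            continue
        if before is None:
            lines.append(f"ADDED {name}: {after}")
        elif after is None:
            lines.append(f"REMOVED {name}: was {before}")
        else:
            escalated = after.lower() in ENFORCING and before.lower() not in ENFORCING
            note = " (now enforcing)" if escalated else ""
            lines.append(f"CHANGED {name}: {before} -> {after}{note}")
    return lines
```

Posting these lines as a PR comment gives reviewers the risk summary without opening the raw JSON.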

Before vs after, in one table

The PR gates that mattered most

We tried a lot of checks. Only a few were truly non-negotiable. 

These are the ones that kept us out of trouble:

  • Policy schema and parameter validation. No “it deployed but behaves weird” surprises.

  • Diff-aware review. The pipeline summarizes effect changes so reviewers do not have to read raw JSON to understand risk.

  • What-If against the target scope. If the plan is scary, you see it before it is live.

  • Ring-based deployment controls. Canary scope first, then a pause before broad rollout.

  • Automated identity checks on policy assignments: an assignment that needs a managed identity must have one, with the required role assignments and a location.
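The identity gate is a good example of a check that is cheap to automate. Assignments whose policies use deployIfNotExists or modify effects need a managed identity, and an assignment with a managed identity needs a location. A rough sketch, assuming the assignment is a dict in the ARM resource shape and that the pipeline has already collected the required and granted role IDs (the function and its inputs are hypothetical):

```python
# Effects that require the assignment to carry a managed identity.
IDENTITY_EFFECTS = {"deployifnotexists", "modify"}


def identity_problems(
    assignment: dict,
    effects: set[str],
    required_roles: set[str],
    granted_roles: set[str],
) -> list[str]:
    """Return human-readable problems with an assignment's identity setup."""
    if not any(e.lower() in IDENTITY_EFFECTS for e in effects):
        return []  # No remediation-style effects, so no identity is needed.

    problems = []
    identity_type = assignment.get("identity", {}).get("type", "None")
    if identity_type == "None":
        problems.append("assignment needs a managed identity")
    if not assignment.get("location"):
        problems.append("assignment with a managed identity needs a location")
    # Roles the policy definitions ask for that the identity does not hold yet.
    for role in sorted(required_roles - granted_roles):
        problems.append(f"identity is missing role {role}")
    return problems
```

In practice the required roles would come from the roleDefinitionIds listed in each definition's effect details. How you look up the granted roles depends on your tooling.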

Exemptions stopped being a loophole and became a control

If you operate at scale, you will have exceptions. The question is whether you can explain them, defend them, and delete them later.

The PR-based workflow gave us a clean pattern:

  • Every exemption is a file with the owner, reason, scope, and expiry.

  • No expiry means it fails review. Permanence needs a real argument.

  • Exemptions live next to the policy they override, so context is never lost.

  • Regular review is easy: search for exemptions expiring in the next 30 days and clean them up.
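Both the "no expiry fails review" rule and the expiry search fit in a few lines. This is a sketch, assuming each exemption file has been parsed into a dict with hypothetical keys owner, reason, scope, and expires_on (an ISO date string). Your file format and field names may differ.

```python
from datetime import date, timedelta

# Hypothetical field names for an exemption file.
REQUIRED_FIELDS = ("owner", "reason", "scope", "expires_on")


def validate_exemption(exemption: dict) -> list[str]:
    """Return one message per required field that is missing or empty."""
    return [f"missing {field}" for field in REQUIRED_FIELDS if not exemption.get(field)]


def expiring_soon(exemptions: list[dict], today: date, window_days: int = 30) -> list[dict]:
    """Return exemptions that expire within the window, including ones already expired."""
    cutoff = today + timedelta(days=window_days)
    return [
        e
        for e in exemptions
        if e.get("expires_on") and date.fromisoformat(e["expires_on"]) <= cutoff
    ]
```

The first function runs as a PR gate. The second can run on a schedule and open a cleanup issue for each owner.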

Results: fewer surprises, cleaner ownership, and faster safe changes

The biggest shift was cultural: policy stopped being “something the platform team does to you” and became “a change request you can see and review.”

We also saw practical benefits almost immediately:

  • Audit trail became automatic. The PR captured intent, discussion, approvals, and rollout notes.

  • Drift dropped because we had one source of truth and repeatable deployments.

  • Rollbacks got boring. That is a compliment.

  • Review quality improved because the right people were pulled in consistently via Code Owners.

Moving from portal changes to pipeline-based deployments also shortened our policy deployment cycle, because review and rollout became repeatable steps instead of being reinvented every time.

What surprised us

  • People stopped fearing policy changes once they could see the blast radius in the PR.

  • The first month felt slower. Then the firefights went away, and overall throughput went up.

  • The repo structure matters more than the pipeline. If files are hard to find, everything suffers.

  • Exemption hygiene is the difference between “governed” and “policy theater.”

If you want to try this: a quick operator checklist

You can start small. Pick one initiative, one scope, and one PR gate. Make the workflow real, then widen it.

  1. Choose your first scope. A management group is ideal for consistency, but start with a safe ring if your org is nervous.

  2. Create a repo layout that separates definitions, initiatives, assignments, and exemptions.

  3. Write a PR template that forces scope, reason, effect changes, rollout ring, and rollback plan.

  4. Add the minimum gates: schema validation and What-If. Anything else is optional at first.

  5. Define Code Owners and require review from the platform plus the domain team most impacted (security, networking, FinOps).

  6. Ship canary rollouts with a pause. Treat policy like production code.

  7. Make exemptions time-bound and review them on a cadence.
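For step 6, the core of a ring rollout is a loop that stops at the first unhealthy ring. The sketch below takes the deploy and health-check steps as plain callables. Both are placeholders for whatever your pipeline does (for example, an apply step followed by a timed pause or a manual approval), and the ring names are illustrative.

```python
from typing import Callable


def roll_out(
    rings: list[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], bool]:
    """Deploy ring by ring; stop expanding as soon as a ring looks unhealthy.

    Returns the rings that were deployed and whether the rollout completed.
    """
    deployed = []
    for ring in rings:
        deploy(ring)
        deployed.append(ring)
        # The "pause": wait for signals from this ring before widening scope.
        if not is_healthy(ring):
            return deployed, False
    return deployed, True


if __name__ == "__main__":
    done, ok = roll_out(
        ["canary", "dev", "prod"],
        deploy=lambda ring: print(f"deploying to {ring}"),
        is_healthy=lambda ring: True,
    )
    print(done, ok)
```

If the rollout stops early, the list of deployed rings tells you exactly which scopes the revert commit needs to reach.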

Closing thought

Moving policy to PRs did not make governance stricter. It made governance clearer. That clarity is what makes teams faster, because nobody has to guess what will happen after a policy change.
