Senior Interview Practice

Advanced DevOps interview questions on system design, incident management, and leadership for engineers with 5+ years of experience.

32 questions

App-of-Apps Pattern vs ApplicationSets
Explain the app-of-apps pattern in Argo CD. When would you pick it over ApplicationSets?
advanced · GitOps
Promoting Changes from Dev to Staging to Prod with Argo CD
Walk me through how a change moves from dev to staging to prod in an Argo CD setup. How do you stop something from slipping into prod that should not be there?
advanced · GitOps
Scaling a Manifests Repo Across Many Services, Environments, and Clusters
You have 15 microservices, 4 environments (dev, staging, preprod, prod), and prod runs in 3 regional clusters. That's potentially 90 Application CRs. How do you structure the manifests repo so it doesn't fall apart?
advanced · GitOps
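
The 90 in the question above comes from 15 services times 6 deployment targets (dev, staging, preprod, plus 3 prod clusters). A minimal sketch of that arithmetic follows; the service and cluster names are hypothetical placeholders, and the one-Application-per-target mapping is an assumption about how the repo is laid out, not something the question states.

```python
# Hypothetical service names; the question only says there are 15 of them.
SERVICES = [f"service-{i:02d}" for i in range(1, 16)]

# Each environment maps to the clusters it runs on (cluster names are assumed).
ENV_CLUSTERS = {
    "dev": ["dev"],
    "staging": ["staging"],
    "preprod": ["preprod"],
    "prod": ["prod-region-a", "prod-region-b", "prod-region-c"],
}


def application_targets(services, env_clusters):
    """Return one (service, env, cluster) tuple per Application CR."""
    return [
        (service, env, cluster)
        for service in services
        for env, clusters in env_clusters.items()
        for cluster in clusters
    ]


targets = application_targets(SERVICES, ENV_CLUSTERS)
print(len(targets))  # 15 services x (3 + 3) targets
```

Generating the list from two small inputs rather than hand-writing each entry is the same idea a templated or generated approach relies on: the count grows multiplicatively, while the inputs grow additively.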
Capacity Planning and Scaling
How do you approach capacity planning for a growing production system? What metrics and strategies do you use?
advanced · SRE
Chaos Engineering Practices
What is chaos engineering and how would you implement it safely in a production environment?
advanced · SRE
Chargeback Design for Shared and Untaggable Costs
You are designing a chargeback system for 200 teams running across AWS, GCP, and Azure. How do you handle the 15 to 25 percent of the bill that cannot be tagged to a single team, like inter-AZ data transfer, support plans, NAT gateways, and shared Kubernetes clusters?
advanced · FinOps
Cloud Cost Optimization
How do you approach cloud cost optimization? What strategies and tools would you use?
advanced · Cloud
Compliance and Governance in Cloud
How do you implement compliance and governance controls in a cloud-native environment?
advanced · Security
Disaster Recovery Planning
How do you design a disaster recovery strategy? Explain RPO, RTO, and different DR approaches.
advanced · Architecture
FinOps and Cloud Cost Management
How do you implement FinOps practices to optimize and manage cloud costs at scale?
advanced · Cloud
Measuring IDP Success and Adoption
You have spent six months building an internal developer platform. Your VP of Engineering asks: 'Is this thing actually working? How do we know it was worth the investment?' What do you show them?
advanced · Platform Engineering
Platform API and Orchestration Layer
You are designing the orchestration layer for your IDP. A developer clicks 'Create New Service' and behind the scenes it needs to create a GitHub repo, provision a database, set up a CI/CD pipeline, configure monitoring, and register the service in the catalog. How do you architect this?
advanced · Platform Engineering
Incident Postmortems
Describe a production incident you handled and how you structured the postmortem. What makes a good blameless postmortem?
advanced · SRE
Istio Circuit Breakers and Outlier Detection
How do you implement a circuit breaker in Istio? Explain the difference between the connection pool limits and outlier detection.
advanced · Service Mesh
Debugging an Istio Traffic Policy That Isn't Working
You applied a VirtualService that splits traffic 80/20 between v1 and v2 of a service, but in production all traffic still goes to v1. Walk me through how you'd debug it.
advanced · Service Mesh
Linux Process Debugging
A process is consuming 100% CPU on a Linux server. Walk me through how you would identify and troubleshoot this issue.
advanced · Linux
From One Experiment to Continuous Chaos at Scale
You have proven a single pod-delete works. Leadership now wants chaos as an ongoing practice across dozens of services and several clusters, not a one-off demo. What changes, and what would you do differently at scale?
advanced · Chaos Engineering
Scoping Litmus Safely: RBAC and Blast Radius
Your security team sees that Litmus can delete pods and inject network faults cluster-wide, and they want it gone. How do you scope Litmus so you can still run chaos in production without handing it the keys to the cluster?
advanced · Chaos Engineering
Multi-Cloud Architecture
When would you recommend a multi-cloud strategy? What are the challenges and how do you address them?
advanced · Architecture
Why Run an OpenTelemetry Collector
Why run an OpenTelemetry Collector at all instead of having every application export directly to your tracing backend? And how would you deploy it in Kubernetes?
advanced · Observability
Sampling Strategies at Scale
Your platform handles 50,000 requests per second and tracing every one of them is blowing up the observability bill. How do you approach sampling, and what is the tradeoff between head and tail sampling?
advanced · Observability
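
To make the head-sampling side of the question above concrete, here is a rough sketch of a deterministic head sampler. The 1% keep rate and the hashing scheme are illustrative assumptions, and `head_sample` is a hypothetical helper, not the API of any tracing library.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, from the trace ID alone, whether to keep the trace."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


REQUESTS_PER_SECOND = 50_000  # from the question
SAMPLE_RATE = 0.01            # assumed keep rate, chosen only for illustration

kept_per_second = REQUESTS_PER_SECOND * SAMPLE_RATE
print(kept_per_second)
print(head_sample("trace-abc", SAMPLE_RATE))
```

Because the decision depends only on the trace ID, every service that sees the same ID reaches the same verdict without coordination. The cost is that the decision is made before the outcome is known, so failed or slow traces are dropped at the same rate as healthy ones; tail sampling decides after the trace completes, which requires buffering spans until then.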
Platform Team Scaling and Processes
How do you scale a platform/DevOps team to support a growing engineering organization?
advanced · DevOps
Feature Flag Architecture at Scale
Your team has 200+ microservices and wants to adopt feature flags across all of them. How would you design the feature flag infrastructure?
advanced · CI/CD
Progressive Delivery Rollback Strategy
Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.
advanced · CI/CD
Reducing On-Call Alert Fatigue
Your on-call engineers are burning out. They're getting 40 to 50 pages a shift and they tell you most of it is noise they just ack and ignore. How do you fix this?
advanced · Incident Management
Scaling an On-Call Program Across Many Teams
You've been asked to design the on-call program for an org that grew from one team to fifteen in a year. Right now it's a free-for-all. What does a healthy on-call program look like at that scale, and how would you measure whether it's working?
advanced · Incident Management
Security Architecture and DevSecOps
How do you integrate security into the DevOps pipeline? Describe the key components of a secure architecture.
advanced · Security
Error Budget Burn Investigation
It's Monday morning. You check the dashboard and see that your service burned 80% of its monthly error budget over the weekend. Walk me through how you'd investigate this and what you'd do next.
advanced · SRE
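
One way to size the scenario above is to turn "80% of the budget over the weekend" into a burn rate. The sketch below assumes a 30-day SLO window and a 48-hour weekend; neither figure is given in the question, and `burn_rate` is a hypothetical helper.

```python
def burn_rate(budget_fraction_consumed, elapsed_hours, window_hours):
    """Budget consumption speed relative to spending it evenly over the window."""
    return budget_fraction_consumed / (elapsed_hours / window_hours)


WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window
WEEKEND_HOURS = 48      # assumed length of the weekend

rate = burn_rate(0.80, WEEKEND_HOURS, WINDOW_HOURS)

# At this rate, how long would the remaining 20% of the budget last?
remaining_hours = (1 - 0.80) * WINDOW_HOURS / rate

print(round(rate, 1), round(remaining_hours, 1))
```

Under these assumptions the service burned budget at roughly 12 times the sustainable pace, and the remaining budget would last about half a day if the burn continued, which is why the first question in the investigation is usually whether the burn is still ongoing.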
SLO-Based Alerting and Burn Rates
Traditional alerting fires when error rate crosses a static threshold, like 'alert if errors > 1%'. What's wrong with that approach, and how would you set up SLO-based alerting instead?
advanced · SRE
Search Autocomplete System Design
Design the backend for a search autocomplete system that returns suggestions within 100ms as the user types.
advanced · System Design
System Design for Reliability
How would you design a highly available web application? What components and patterns would you use?
advanced · Architecture
Zero Trust Architecture
What is Zero Trust Architecture and how do you implement it in a modern infrastructure?
advanced · Security
