Senior Interview Practice
Advanced DevOps interview questions on system design, incident management, and leadership for engineers with 5+ years of experience.
32 questions
App-of-Apps Pattern vs ApplicationSets
Explain the app-of-apps pattern in Argo CD. When would you pick it over ApplicationSets?
advanced · GitOps
Promoting Changes from Dev to Staging to Prod with Argo CD
Walk me through how a change moves from dev to staging to prod in an Argo CD setup. How do you stop something from slipping into prod that should not be there?
advanced · GitOps
Scaling a Manifests Repo Across Many Services, Environments, and Clusters
You have 15 microservices, 4 environments (dev, staging, preprod, prod), and prod runs in 3 regional clusters. That's potentially 90 Application CRs. How do you structure the manifests repo so it doesn't fall apart?
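The figure of 90 in the question follows if every non-prod environment runs on a single cluster while prod fans out to its regional clusters. A minimal sketch of that arithmetic, assuming one Application CR per service per deployment target (the function and parameter names are hypothetical, chosen only for illustration):

```python
def application_count(services: int, single_cluster_envs: int, prod_clusters: int) -> int:
    """Count Application CRs, assuming one CR per service per deployment target.

    Assumption: dev, staging, and preprod each run on one cluster,
    while prod is deployed once per regional cluster.
    """
    deployment_targets = single_cluster_envs + prod_clusters
    return services * deployment_targets


# 15 services x (3 single-cluster environments + 3 prod clusters)
total = application_count(services=15, single_cluster_envs=3, prod_clusters=3)
print(total)  # 90
```

Under a different topology, for example every environment running in all three regions, the count would be higher, which is why the question says "potentially".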
advanced · GitOps
Capacity Planning and Scaling
How do you approach capacity planning for a growing production system? What metrics and strategies do you use?
advanced · SRE
Chaos Engineering Practices
What is chaos engineering and how would you implement it safely in a production environment?
advanced · SRE
Chargeback Design for Shared and Untaggable Costs
You are designing a chargeback system for 200 teams running across AWS, GCP, and Azure. How do you handle the 15 to 25 percent of the bill that cannot be tagged to a single team, like inter-AZ data transfer, support plans, NAT gateways, and shared Kubernetes clusters?
advanced · FinOps
Cloud Cost Optimization
How do you approach cloud cost optimization? What strategies and tools would you use?
advanced · Cloud
Compliance and Governance in Cloud
How do you implement compliance and governance controls in a cloud-native environment?
advanced · Security
Disaster Recovery Planning
How do you design a disaster recovery strategy? Explain RPO, RTO, and different DR approaches.
advanced · Architecture
FinOps and Cloud Cost Management
How do you implement FinOps practices to optimize and manage cloud costs at scale?
advanced · Cloud
Measuring IDP Success and Adoption
You have spent six months building an internal developer platform. Your VP of Engineering asks: 'Is this thing actually working? How do we know it was worth the investment?' What do you show them?
advanced · Platform Engineering
Platform API and Orchestration Layer
You are designing the orchestration layer for your IDP. A developer clicks 'Create New Service' and behind the scenes it needs to create a GitHub repo, provision a database, set up a CI/CD pipeline, configure monitoring, and register the service in the catalog. How do you architect this?
advanced · Platform Engineering
Incident Postmortems
Describe a production incident you handled and how you structured the postmortem. What makes a good blameless postmortem?
advanced · SRE
Istio Circuit Breakers and Outlier Detection
How do you implement a circuit breaker in Istio? Explain the difference between the connection pool limits and outlier detection.
advanced · Service Mesh
Debugging an Istio Traffic Policy That Isn't Working
You applied a VirtualService that splits traffic 80/20 between v1 and v2 of a service, but in production all traffic still goes to v1. Walk me through how you'd debug it.
advanced · Service Mesh
Linux Process Debugging
A process is consuming 100% CPU on a Linux server. Walk me through how you would identify and troubleshoot this issue.
advanced · Linux
From One Experiment to Continuous Chaos at Scale
You have proven a single pod-delete works. Leadership now wants chaos as an ongoing practice across dozens of services and several clusters, not a one-off demo. What changes, and what would you do differently at scale?
advanced · Chaos Engineering
Scoping Litmus Safely: RBAC and Blast Radius
Your security team sees that Litmus can delete pods and inject network faults cluster-wide, and they want it gone. How do you scope Litmus so you can still run chaos in production without handing it the keys to the cluster?
advanced · Chaos Engineering
Multi-Cloud Architecture
When would you recommend a multi-cloud strategy? What are the challenges and how do you address them?
advanced · Architecture
Why Run an OpenTelemetry Collector
Why run an OpenTelemetry Collector at all instead of having every application export directly to your tracing backend? And how would you deploy it in Kubernetes?
advanced · Observability
Sampling Strategies at Scale
Your platform handles 50,000 requests per second and tracing every one of them is blowing up the observability bill. How do you approach sampling, and what is the tradeoff between head and tail sampling?
advanced · Observability
Platform Team Scaling and Processes
How do you scale a platform/DevOps team to support a growing engineering organization?
advanced · DevOps
Feature Flag Architecture at Scale
Your team has 200+ microservices and wants to adopt feature flags across all of them. How would you design the feature flag infrastructure?
advanced · CI/CD
Progressive Delivery Rollback Strategy
Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.
advanced · CI/CD
Reducing On-Call Alert Fatigue
Your on-call engineers are burning out. They're getting 40 to 50 pages a shift and they tell you most of it is noise they just ack and ignore. How do you fix this?
advanced · Incident Management
Scaling an On-Call Program Across Many Teams
You've been asked to design the on-call program for an org that grew from one team to fifteen in a year. Right now it's a free-for-all. What does a healthy on-call program look like at that scale, and how would you measure whether it's working?
advanced · Incident Management
Security Architecture and DevSecOps
How do you integrate security into the DevOps pipeline? Describe the key components of a secure architecture.
advanced · Security
Error Budget Burn Investigation
It's Monday morning. You check the dashboard and see that your service burned 80% of its monthly error budget over the weekend. Walk me through how you'd investigate this and what you'd do next.
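To put the scenario in numbers, a burn rate compares how fast the budget was spent with the pace that would spend exactly 100% over the whole window. The sketch below assumes a 30-day SLO window and a 48-hour weekend; neither figure is stated in the question, and the function names are hypothetical.

```python
def burn_rate(budget_fraction_spent: float, elapsed_hours: float, window_hours: float) -> float:
    """Budget consumption speed relative to an even spend over the window.

    A value of 1.0 means the budget would last exactly the full window.
    """
    return budget_fraction_spent / (elapsed_hours / window_hours)


def hours_until_exhausted(budget_fraction_spent: float, elapsed_hours: float) -> float:
    """Hours left before the budget hits zero if the same pace continues."""
    spend_per_hour = budget_fraction_spent / elapsed_hours
    return (1.0 - budget_fraction_spent) / spend_per_hour


WINDOW_HOURS = 30 * 24   # assumed 30-day SLO window
WEEKEND_HOURS = 48       # assumed Saturday plus Sunday

rate = burn_rate(0.80, WEEKEND_HOURS, WINDOW_HOURS)
remaining = hours_until_exhausted(0.80, WEEKEND_HOURS)
print(f"burn rate: about {rate:.0f}x, budget gone in about {remaining:.0f} more hours")
```

Under these assumptions the weekend burn works out to roughly 12 times the sustainable pace, with about half a day of budget left if nothing changes.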
advanced · SRE
SLO-Based Alerting and Burn Rates
Traditional alerting fires when error rate crosses a static threshold, like 'alert if errors > 1%'. What's wrong with that approach, and how would you set up SLO-based alerting instead?
advanced · SRE
Search Autocomplete System Design
Design the backend for a search autocomplete system that returns suggestions within 100ms as the user types.
advanced · System Design
System Design for Reliability
How would you design a highly available web application? What components and patterns would you use?
advanced · Architecture
Zero Trust Architecture
What is Zero Trust Architecture and how do you implement it in a modern infrastructure?
advanced · Security