Senior Software Engineer, Site Reliability Engineering
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Software Engineer, Site Reliability Engineering, based in the United States.
This role sits at the core of building and maintaining highly reliable, scalable, and secure infrastructure that powers large-scale consumer-facing products. As a Senior Site Reliability Engineer, you will design and operate distributed systems that ensure high availability, performance, and resilience across complex cloud environments. You will work closely with product engineering, platform, and developer experience teams to improve system reliability while reducing operational toil and improving developer productivity. This is a high-impact engineering role where your work directly influences system stability, user experience, and engineering velocity across the organization. You will be responsible for shaping platform architecture, evolving observability practices, and ensuring production systems run efficiently at scale. The environment is highly collaborative, fast-paced, and deeply focused on engineering excellence and continuous improvement.
Accountabilities:
- Design, build, and maintain scalable and highly available infrastructure and systems that support large-scale distributed applications.
- Define and influence architectural direction for platform services, ensuring resilience, performance, and scalability across systems.
- Develop tools and automation for deployment, monitoring, configuration management, and infrastructure operations.
- Troubleshoot and resolve complex production issues across distributed systems, ensuring minimal downtime and rapid recovery.
- Improve observability, monitoring, and alerting systems to enhance system visibility and reliability.
- Participate in capacity planning, performance tuning, and forecasting to proactively address scaling challenges.
- Collaborate with engineering teams to improve developer experience and reduce operational toil through automation and platform improvements.
- Participate in on-call rotations and provide incident response support for critical systems.
Requirements:
- 5+ years of experience in Site Reliability Engineering, infrastructure engineering, or distributed systems roles.
- Strong expertise in AWS and Linux-based environments.
- Proficiency in programming languages such as Python, Go, JavaScript, or similar for automation and system development.
- Deep understanding of distributed systems and networking protocols including DNS, HTTP/S, TLS, and TCP/IP.
- Hands-on experience operating, monitoring, and debugging large-scale microservices architectures in production environments.
- Strong problem-solving skills with the ability to break down complex system challenges and evaluate technical trade-offs.
- Excellent communication skills with the ability to collaborate across engineering and non-engineering stakeholders.
- Strong focus on system reliability, scalability, and reducing operational overhead.
Benefits:
- Competitive base salary range aligned with experience and location
- Equity participation in a high-growth technology organization
- Comprehensive medical, dental, and vision insurance coverage
- 401(k) retirement plan and financial wellbeing support
- Flexible remote work options within North America
- Flexible paid time off and parental leave policies
- Professional development support and learning opportunities
- Inclusive, engineering-driven culture focused on reliability and innovation