Sr. DevOps Engineer

TrueML · Remote in USA

Why TrueML?

TrueML is a mission-driven financial software company that aims to create better customer experiences for distressed borrowers. Consumers today want personal, digital-first experiences that align with their lifestyles, especially when it comes to managing finances. TrueML’s approach uses machine learning to engage each customer digitally and adjust strategies in real time in response to their interactions.

The TrueML team includes inspired data scientists, financial services industry experts, and customer experience fanatics. Together they build technology that serves people in a way that recognizes their unique needs and preferences as human beings, and they work to ensure nobody gets locked out of the financial system.

TrueML Products is seeking a highly experienced Sr. DevOps Engineer I to serve as a core contributor to our infrastructure and platform engineering efforts. This role is critical to executing our cloud architecture, establishing robust CI/CD pipelines, and ensuring the scalability, security, and reliability of our products.

Reporting to the Sr. Manager, DevOps, you will drive the day-to-day evolution of our internal developer platform and infrastructure-as-code (IaC) architecture. The ideal candidate is a deeply technical, hands-on engineer with a "systems-thinking" mindset. We are looking for a practitioner who thrives on solving complex distributed systems challenges and for whom using GenAI and AIOps tooling to optimize system performance, monitoring, and automation is second nature.

What You'll Do (Technical Execution & Architecture):

Implement the technical roadmap for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
Design, develop, and maintain self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.
Act as a core steward for cloud spend (AWS), proactively identifying and driving cost-optimization initiatives across our infrastructure.
Build and maintain infrastructure architecture that supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols across multiple regions.
Implement and evolve comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.

What You'll Do (Deep-dive Hands-On Engineering):

Write and review high-quality, production-grade code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.

Drive hands-on development of robust Terraform Infrastructure as Code for reliable resource provisioning.

Directly architect, optimize, and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis) to maximize build-and-deploy speed and reliability.

Proactively manage, fine-tune, and scale container orchestration environments, including hands-on configuration of Ingress controllers and declarative GitOps workflows.

Manage the technical integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (Collaboration & Knowledge Sharing):

Partner closely with other Senior DevOps Engineers and Engineering Managers to align infrastructure deliverables with product roadmaps, ensuring DevOps acts as an accelerator.

Collaborate with Quality Engineering and Security teams to enforce "Definition of Done" standards that include automated testing and security gates.

Provide technical guidance to junior engineers on the team, fostering a culture of continuous learning.

Who You Are:

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

6+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering, working within high-performing senior engineering teams.

Expert-level mastery of AWS and hands-on experience managing multi-region, high-availability deployments.

Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in production environments.

High proficiency in Terraform to drive consistency and automation across all infrastructure layers (experience with Atlantis is a plus).

Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.

Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).

Experience acting as an Incident Commander or critical responder for high-severity outages.

Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into your engineering workflow to accelerate delivery, troubleshooting, and system monitoring.
