Senior Manager, Engineering - Observability Platform (Remote Eligible)
For over 20 years, Smartsheet has empowered teams to manage work seamlessly and scale solutions smarter. Now, in our most ambitious chapter yet, we are uniting human teams with AI agents. By orchestrating the work agents do best, automating manual tasks and uncovering insights at scale, we create the space for people to focus on what truly matters: judgment, creativity, and big thinking. That is magic at work, and it’s what we show up for every day.
The Observability Platform team is seeking a Senior Manager of Engineering to build and lead a centralized platform capability that gives Smartsheet full-stack visibility into our most complex and consequential systems. This role owns the engineering strategy and execution for a dedicated platform that consolidates multiple platforms and serves engineering teams across the company, including the Data & AI Platform, Commerce, and Infrastructure pillars.
You will lead a team with strategic ownership spanning metrics, distributed tracing, alerting, log analytics, SLO/SLA management, and AI/ML observability integrations tied to SmartAssist and our agentic AI workstreams on Amazon Bedrock and MLflow. This is a high-leverage, high-visibility role at the intersection of platform reliability and AI-native engineering.
You Will:
Team & Platform Leadership
- Lead a team of engineers focused on observability platform engineering, driving build-out of a unified observability stack used by all engineering teams at Smartsheet.
- Own and evolve the platform's technical roadmap, consolidating multiple tooling platforms and AI observability tooling into a coherent, scalable capability.
- Define platform standards, contribute to architectural direction, and ensure the team operates with engineering rigor and strong operational habits.
- Build and scale the team, hiring senior engineers and establishing effective global practices across distributed stakeholders.
Observability Engineering
- Lead design and delivery of centralized observability infrastructure covering metrics pipelines, distributed tracing, alerting frameworks, and log analytics across Smartsheet services.
- Drive SLO/SLA definition and tooling for platform-wide reliability visibility, partnering closely with infrastructure, platform engineering, and on-call teams.
- Own governance including instrumentation standards, cost optimization, and rollout of advanced capabilities such as APM, RUM, and custom dashboards.
- Lead architecture, scaling, and operational practices for log analytics across high-throughput production workloads.
- Establish shared observability libraries, agents, and SDKs that reduce instrumentation burden for application engineering teams.
AI Observability
- Build and maintain AI/ML observability integrations in partnership with the AI Platform team.
- Partner with the Data & AI Platform team to integrate MLflow tracing, Inference Tables, and LLM-as-judge evaluation pipelines into the observability stack.
- Develop dashboards and alerting for agentic AI workloads, including latency, token consumption, error rates, and evaluation metric drift.
- Contribute to the AI governance and cost observability program, providing telemetry for model usage, cost attribution, and compliance reporting.
Cross-Functional Partnership & Execution
- Serve as the primary engineering partner for platform consumers across Data & AI, Commerce, Infrastructure, and Security teams, ensuring observability needs are met across workstreams.
- Lead complex, cross-functional observability projects with high ambiguity, managing delivery risk, communicating clearly to senior stakeholders, and building alignment across teams.
- Work with delivery partners to coordinate instrumentation across platform modernization and migration workstreams.
- Contribute to quarterly and annual platform goals, reporting on key reliability and observability metrics to engineering leadership.
- Communicate platform status, risks, and roadmap progress to Engineering leadership and more senior audiences in a clear, executive-ready format.
Operational Excellence
- Embed on-call culture and incident management discipline into the team, ensuring clear runbooks, fast MTTR, and post-incident learning loops.
- Drive cost governance for observability tooling, including spend optimization and efficient resource management.
- Champion AI-assisted engineering practices within the team, applying tooling and automation to reduce toil and accelerate delivery.
You Have:
Required
- 10+ years of software or platform engineering experience, with strong fundamentals in distributed systems, infrastructure, and backend services.
- 3 years of engineering management experience, including direct team building, performance management, and cross-functional delivery ownership.
- Deep hands-on expertise with observability tooling: Datadog (APM, metrics, logs, alerting), OpenSearch or Elasticsearch, distributed tracing (OpenTelemetry or equivalent), and SLO/SLA management at scale.
- Proven experience operating observability platforms for high-availability, high-throughput production environments.
- Experience building and scaling engineering teams in distributed or international settings.
- Strong execution track record on complex, cross-functional infrastructure programs with high ambiguity.
- Clear, direct communication (written and verbal) with both technical and non-technical audiences, including leadership and executive stakeholders.
- Proactive risk identification and status communication without prompting.
- Experience managing vendors, external delivery partners, and third-party integrations in a platform context.
Preferred
- Hands-on experience with AI/ML observability: MLflow tracing, LLM evaluation pipelines, or observability for agentic AI systems.
- Familiarity with Amazon Bedrock, ECS Fargate, or LangGraph-based multi-agent architectures.
- Experience with cloud cost governance and FinOps practices for observability tooling.
- Exposure to data platform observability and data quality monitoring in a lakehouse context.
- Experience establishing internal developer platforms, shared libraries, or platform-as-a-service offerings for application teams.
- Prior work in SaaS environments with enterprise compliance requirements (SOC 2, FedRAMP, HIPAA).
Education & Eligibility
- CS, Engineering, or equivalent degree, or commensurate practical experience.
- Legally eligible to work in the U.S. on an ongoing basis.
Current US Perks & Benefits:
- Employer subsidized medical/vision and dental coverage for full-time employees
- 401k Match to help you save for your future (50% of your contribution up to the first 6% of your eligible pay)
- Monthly stipend to support your work and productivity
- Flexible Time Away Program, plus Sick Time Off
- US employees are automatically covered under Smartsheet-sponsored life insurance and short-term and long-term disability plans
- US employees receive 12 paid holidays per year
- Up to 24 weeks of Parental Leave
- Personal paid Volunteer Day to support our community
- Opportunities for professional growth and development including access to Udemy online courses
- Company Funded Perks, including a counseling membership, local retail discounts, and your own personal Smartsheet account
- Teleworking options from any registered location in the U.S. (role specific)
Smartsheet provides a competitive base salary range for roles that may be hired in different geographic areas we are licensed to operate our business from. Actual compensation is determined by several factors including, but not limited to, level of professional and educational experience, skills, and specific candidate location. In addition, this role will be eligible for a market-competitive incentive opportunity.
Get to Know Us:
At Smartsheet, your ideas are heard, your potential is supported, and your contributions have real impact. You’ll have the freedom to explore, push boundaries, and grow beyond your role. We welcome diverse perspectives and nontraditional paths—because we know that impact comes from individuals who care deeply and challenge thoughtfully. When you’re doing work that stretches you, excites you, and connects you to something bigger, that’s magic at work. Let’s build what’s next, together.
Equal Opportunity Employer:
Smartsheet is an Equal Opportunity (EEO) employer committed to fostering an inclusive environment with the best employees. It is our policy to provide equal employment opportunities to all qualified applicants in accordance with applicable laws in the US, UK, Australia, Germany, Costa Rica, Japan, Bulgaria, and India. All qualified applicants will receive consideration without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.
If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.
#LI-Remote
Eng Management pay context
Based on 733 disclosed Eng Management salaries on RoleSuite, the role pays a median of $216K/year, with most offers between $178K and $254K (10th–90th percentile: $156K–$313K).
This posting lists $205K–$275K, above the $216K market median.