Proactive Monitoring Engineering Role (Slack, Salesforce)
Position Summary
As a Slack Proactive Monitoring (ProM) Engineer, you will operate at the intersection of platform engineering, site reliability, and strategic customer success. You are a highly technical, customer-centric specialist dedicated to safeguarding the health and performance of Slack’s largest and most complex global enterprise deployments (Enterprise Grid).
Rather than waiting for customers to report problems, you will continuously monitor Slack workspace metrics, performance thresholds, API integrations, enterprise security events, and custom app behavior. Operating in a 24/7/365 rotation, you will proactively detect anomalies, triage system exceptions, and orchestrate rapid mitigation steps, frequently initiating preemptive outreach before the customer's internal users even experience a slowdown.
Key Responsibilities:
Monitoring & Alerting
Continuously monitor dashboards, alerting systems, and telemetry data (error rates, latency spikes, API failures, deployment anomalies) for early signals of degradation.
Triage and correlate alerts from multiple sources (Splunk, internal tools, etc.) to identify patterns before customers report issues.
Maintain and refine monitoring playbooks and runbooks for common signal patterns.
Actively monitor Slack platform health dashboards, network latency signals, message delivery queues, and database capacities for high-frequency workspaces.
Monitor critical custom automations, Slack Workflow Builder runs, Enterprise Key Management (EKM) operations, and Identity Provider (IdP) authentication syncs.
Proactive Customer Engagement
Identify customers potentially affected by degraded service conditions and coordinate proactive outreach with Customer Success and Support teams.
High-Impact Communications: Draft and send clear, context-rich proactive notices to customer IT/Slack Administrators advising them of anomalies detected in their environment.
Partner with the Incident Management team to escalate signals that meet incident-threshold criteria.
Technical Advisory: Partner with CSMs and Success Architects to deliver annual technical health check reviews, assessing platform metrics, configuration limits, and custom integration health.
Issue Investigation & Resolution
Perform root cause analysis (RCA) on proactively detected issues, documenting findings in internal case and incident management systems.
Work closely with Engineering and SRE teams to drive rapid remediation of identified issues.
Intervene in low-risk system exceptions (e.g., by advising customers on misconfigured Slack webhooks, API rate limit exhaustion, or broken Salesforce-Slack app connections) before they trigger widespread downtime.
Open, document, and manage proactive cases, tracking them from the initial automated alert to complete root-cause resolution.
Tooling & Continuous Improvement
Contribute to improvements in monitoring coverage, alert fidelity, and signal-to-noise ratio.
Identify gaps in observability and work with Engineering to instrument new telemetry.
Build and maintain Slack-based automations and workflows to streamline proactive monitoring operations.
Knowledge & Documentation
Maintain up-to-date runbooks for monitoring scenarios, escalation paths, and known issues.
Share knowledge across the broader Support Engineering team via Slack channels, canvases, and regular syncs.
Required Skills/Experience:
Required
1+ years of experience in technical support engineering, site reliability engineering, or a related operations role.
Hands-on experience with observability and monitoring tools (e.g., Grafana, Splunk, Datadog, PagerDuty, or equivalent).
Strong understanding of cloud-based SaaS architecture, APIs, and common failure modes.
Proficiency in reading and analyzing logs, metrics, and traces.
Excellent written and verbal communication skills; ability to clearly convey technical findings to both technical and non-technical audiences.
Demonstrated ability to work independently in a fast-paced, ambiguous environment.
Preferred
Familiarity with Salesforce Service Cloud / OrgCS case management.
Scripting or automation experience (Python, JavaScript, Bash).
Experience in a customer-facing support engineering or reliability role at a SaaS company.
ITIL, SRE, or similar certification.
Skills & Competencies
Analytical thinking and strong troubleshooting skills
Customer empathy and proactive mindset
Collaboration across Engineering, SRE, and Customer Success
Attention to detail and signal prioritization
Strong time management and ability to manage multiple active issues simultaneously