Filevine is a Legal AI company delivering Legal Operating Intelligence for the future of legal work. Grounded in a singular system of truth, Filevine brings together data, documents, workflows, and teams into one unified platform—where modern legal work happens with clarity and consistency.
Powered by LOIS, the Legal Operating Intelligence System, Filevine connects context across every matter to transform legal operations from reactive to proactive. LOIS reads, understands, and reasons across your data to surface insight, automate complexity, and give professionals the clarity and confidence to see more, know more, and do more. Fueled by a team of exceptional collaborators and innovators, Filevine’s rapid growth has earned AI awards and recognition from Deloitte and Inc. as one of the most innovative and fastest-growing technology companies in the country.
Filevine is seeking a Senior Site Reliability Engineer with a strong focus on Observability and Incident Management. This role will lead efforts to improve system visibility, reliability, and operational excellence across our platform.
You will partner closely with engineering teams to build scalable, resilient systems while driving best practices in monitoring, alerting, distributed tracing, logging, and incident response.
What you will do
- Own and evolve observability strategy, including monitoring, alerting, dashboards, logging, and distributed tracing.
- Define and manage SLIs, SLOs, and reliability metrics.
- Lead incident response, postmortems, and continuous improvement initiatives.
- Improve MTTD and MTTR through automation and operational excellence.
- Integrate observability into CI/CD pipelines and software delivery workflows.
- Build and maintain reliable cloud infrastructure on AWS and Kubernetes.
- Mentor engineers and promote SRE best practices across the organization.
What we are looking for
- 8+ years of experience in software engineering, infrastructure, or operations.
- 5+ years of Site Reliability Engineering experience.
- Deep expertise with observability platforms such as New Relic, Datadog, Dynatrace, Grafana, or Prometheus.
- Strong experience with monitoring, alerting, incident management, and reliability engineering practices.
- Hands-on experience with AWS, Kubernetes, and cloud-native technologies.
- Proficiency in Python, Bash, PowerShell, or similar scripting languages.
- Excellent communication and collaboration skills.
Preferred experience
- Leading observability platform implementations or migrations at scale.
- Building SLI/SLO frameworks and reliability programs.
- Working with OpenTelemetry, distributed tracing, and modern observability architectures.