Data Center Production Operations Engineer (Third Shift)
Meta is seeking a Data Center Production Operations Engineer to support the reliability, efficiency, and scalability of our global data center infrastructure. In this role, you will be responsible for the day-to-day operational health of server fleets and production systems that underpin Meta's family of apps and services. You will work at the intersection of hardware lifecycle management, systems reliability, and operational process improvement, ensuring that production environments meet the demands of billions of users worldwide.

Manage and maintain large-scale server fleets across data center environments, including hardware triage, failure analysis, and coordinating repair and replacement workflows
Monitor production systems health using observability tooling and telemetry data to proactively identify and resolve infrastructure anomalies before they impact service availability
Develop and refine operational runbooks, escalation procedures, and incident response playbooks specific to data center server environments
Collaborate with hardware engineering, network operations, and capacity planning teams to support server deployment, decommissioning, and lifecycle transitions
Analyze failure trends and operational data to identify systemic issues in server hardware or firmware, and drive root cause analysis and corrective action
Contribute to automation initiatives that reduce manual toil in server provisioning, health checks, and fleet management workflows, including leveraging AI-integrated tooling
Partner with cross-functional teams to evaluate and implement process improvements that increase operational efficiency and reduce mean time to resolution for production incidents
Communicate infrastructure status, incident timelines, and risk assessments to engineering and operations stakeholders through clear written and verbal updates
Support capacity readiness activities by validating server acceptance criteria and coordinating with data center technicians during hardware bring-up and commissioning
Identify gaps in monitoring coverage or operational tooling and propose solutions that improve fleet visibility and production reliability

Participate in 24/7 on-call rotation
Ability to travel up to 15% of the time
Required to work a shifted schedule (includes nights and weekends)

6+ years of experience in data center operations, site operations, or production infrastructure engineering supporting large-scale server environments
6+ years of experience with server hardware components including CPUs, memory, storage, and network interface cards, including hands-on troubleshooting and failure diagnosis
Experience using systems monitoring and observability platforms to track fleet health, identify anomalies, and drive incident resolution in production data center environments
Experience developing or improving operational processes, runbooks, or automation scripts to support server fleet management at scale
Experience collaborating with hardware engineering, network, and capacity teams to coordinate infrastructure deployments and lifecycle activities
Experience contributing to post-incident reviews and translating findings into durable operational improvements that reduce recurrence across a server fleet
Background in capacity planning or hardware acceptance testing processes within a large-scale cloud or hyperscale data center organization
Familiarity with server firmware management, BIOS configuration, and out-of-band management interfaces such as IPMI or Redfish in hyperscale data center environments
Experience with scripting languages such as Python or Bash to automate data center operations tasks including health checks, inventory management, or alerting workflows
Operations pay context
Based on 4,559 disclosed Operations salaries on RoleSuite, the role pays a median of $111K/year, with most offers between $83K and $147K (10th–90th percentile: $63K–$188K).