SysOps Engineer – Monitoring & Cloud Operations

Jobgether · India

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a SysOps Engineer – Monitoring & Cloud Operations in India.

This role sits at the core of mission-critical infrastructure operations, ensuring the stability, performance, and resilience of large-scale cloud and hybrid systems. You will be responsible for continuously monitoring production environments, identifying and resolving incidents, and maintaining high availability across distributed services. Working within a fast-paced engineering organization, you will collaborate closely with cloud, DevOps, and DataOps teams to safeguard system health and optimize performance. The environment is highly production-driven, requiring strong operational discipline, rapid troubleshooting skills, and a proactive mindset toward risk prevention. You will play a key role in designing and maintaining observability frameworks, ensuring that alerts, dashboards, and monitoring tools provide actionable insights. This is a high-impact position where your work directly supports system uptime, service reliability, and business continuity.

Accountabilities:

  • Monitor infrastructure and production systems using observability tools such as New Relic, Prometheus, Grafana, or similar platforms, ensuring full visibility into system health.
  • Configure and maintain alerts, dashboards, and service-level monitoring to proactively detect anomalies and prevent incidents.
  • Lead incident management activities including troubleshooting, root cause analysis (RCA), and post-incident reporting.
  • Ensure system uptime, performance, and SLA compliance across cloud and on-premises environments.
  • Manage operating system-level tasks (Linux and Windows), including patching, tuning, and service management.
  • Oversee backup processes and regularly validate restoration procedures to ensure data reliability.
  • Execute and support disaster recovery (DR) plans, including failover/failback testing and DR drills across environments.
  • Collaborate with DataOps and infrastructure teams to ensure replication integrity, system resilience, and business continuity readiness.
  • Perform capacity planning, performance optimization, and infrastructure health assessments.
  • Maintain operational documentation, including runbooks, monitoring guidelines, and incident playbooks.

Requirements:

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
  • Proven experience in SysOps, Cloud Operations, SRE, or Infrastructure Support roles in production environments.
  • Strong hands-on experience with Linux and Windows system administration.
  • Experience using monitoring and observability tools such as New Relic, Prometheus, Grafana, Datadog, or equivalent solutions.
  • Solid understanding of incident management, problem management, and root cause analysis methodologies.
  • Experience working with cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Strong knowledge of disaster recovery, backup strategies, and business continuity planning.
  • Familiarity with infrastructure components such as virtual machines, compute instances, and physical servers.
  • Understanding of web and system services such as Nginx, IIS, and systemd.
  • Strong analytical and troubleshooting skills with the ability to resolve complex production issues under pressure.
  • Excellent communication and collaboration skills for cross-functional coordination.
  • Experience in high-availability, mission-critical environments is highly preferred.

Benefits:

  • Competitive compensation package aligned with experience and market standards.
  • Fully remote work environment with flexible arrangements.
  • Opportunity to work on large-scale, mission-critical infrastructure systems.
  • Exposure to modern cloud technologies and advanced observability platforms.
  • Professional growth in a fast-paced, high-impact engineering organization.
  • Collaborative and cross-functional team culture.
  • Involvement in disaster recovery planning, system resilience design, and cloud operations at scale.