DevOps Engineer - Platform Reliability (Remote, China)

Bjakcareer · China

BJAK’s automation systems support customer journeys across quote generation, policy issuance, claims, payments, renewals and insurer integrations. These systems are business-critical: reliability, uptime and safe deployments directly affect customers and operations.

We're looking for a DevOps Engineer based in China to strengthen platform reliability, improve infrastructure resilience and ensure BJAK’s AI automation systems run safely and consistently at scale.

This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to build and maintain highly reliable production systems.

The Mission

Build and maintain a highly reliable platform for BJAK’s AI automation systems by improving infrastructure stability, deployment safety and operational resilience across all services.

What You’ll Own

  • Own and improve platform reliability across production systems and environments.

  • Manage cloud infrastructure, deployment pipelines and runtime environments.

  • Design and improve CI/CD workflows to enable safe, fast and repeatable releases.

  • Build and enhance monitoring, alerting, logging and system observability.

  • Lead incident response efforts and perform structured root cause analysis.

  • Improve system resilience through redundancy, failover and recovery mechanisms.

  • Work with engineering teams to reduce production risk through better deployment and system design practices.

  • Strengthen infrastructure security, access control and secrets management.

  • Support reliability for business-critical workflows across multiple countries and services.

  • Continuously improve operational discipline, uptime and system stability.

What We're Looking For

  • Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.

  • Strong understanding of cloud infrastructure, CI/CD pipelines and deployment systems.

  • Experience with production monitoring, alerting and incident management practices.

  • Ability to troubleshoot infrastructure and production issues in a structured and calm manner.

  • Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).

  • Experience supporting business-critical or high-availability systems.

  • Strong ownership mindset during incidents and operational failures.

  • Practical judgment on reliability, performance, security and cost trade-offs.

  • Comfortable working closely with engineering teams in fast-paced environments.

  • Low ego, disciplined and focused on long-term system stability.

Bonus Points

  • Experience with AWS, GCP, Azure or similar cloud platforms.

  • Experience with Docker, Kubernetes or other container orchestration tools.

  • Experience with infrastructure-as-code tools (Terraform, Ansible, Pulumi, etc.).

  • Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).

  • Experience with zero-downtime deployments, blue-green or canary release strategies.

  • Experience supporting distributed or high-traffic production systems.

  • Strong knowledge of security best practices in cloud infrastructure.

  • Experience in fintech, insurance or regulated industry environments.

  • Contributions to platform reliability or infrastructure scaling initiatives.

The Kind of Builder We Want

  • Calm and structured under pressure, especially during production incidents.

  • Hands-on with infrastructure and deeply familiar with production systems.

  • Thinks in failure modes, system risks and recovery paths.

  • Proactive in preventing incidents, not just reacting to them.

  • Strong focus on uptime, reliability and operational discipline.

  • Careful and deliberate when making production changes.

  • Builds systems engineers can trust to deploy and operate safely.

This Role Is Not For

  • People who only react after systems fail instead of preventing failures.

  • Engineers who are careless with production changes or access control.

  • Individuals who ignore monitoring, alerting or operational discipline.

  • People who make risky infrastructure changes without proper evaluation.

  • Candidates who cannot stay calm during incidents or outages.

Success in This Role

You'll be successful if you can:

  • Improve platform uptime, stability and deployment safety.

  • Reduce production incidents and infrastructure-related failures.

  • Strengthen monitoring, alerting and system visibility across services.

  • Enable engineers to deploy with confidence and lower operational risk.

  • Improve resilience of BJAK’s AI automation platform as it scales.

Why Join BJAK

  • Build Reliable AI Platform Infrastructure – Support systems powering end-to-end insurance automation.

  • High-Impact Engineering – Solve real-world reliability and scaling challenges.

  • Global Engineering Team – Work with experienced engineers across multiple countries.

  • Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.

  • International Exposure – Build systems used across Southeast Asian markets.

  • Learning & Development Budget – Support continuous technical growth and certifications.

  • High Ownership Environment – Strong autonomy over infrastructure and reliability strategy.

  • Modern Engineering Culture – Focus on stability, observability and engineering excellence.

  • Competitive Compensation – Attractive salary package based on experience and impact.

Interview Process

We assess infrastructure depth, reliability thinking and production problem-solving ability. The process usually includes application review, two interviews and a technical scenario or systems discussion.
