DevOps Engineer - Platform Reliability (Remote, China)

Bjakcareer · China

BJAK’s automation systems support customer journeys across quote generation, policy issuance, claims, payments, renewals and insurer integrations. These systems are business-critical: reliability, uptime and safe deployments directly affect customers and operations.

We're looking for a DevOps Engineer based in China to strengthen platform reliability, improve infrastructure resilience and ensure BJAK’s AI automation systems run safely and consistently at scale.

This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to build and maintain highly reliable production systems.

The Mission

Build and maintain a highly reliable platform for BJAK’s AI automation systems by improving infrastructure stability, deployment safety and operational resilience across all services.

What You’ll Own

Own and improve platform reliability across production systems and environments.
Manage cloud infrastructure, deployment pipelines and runtime environments.
Design and improve CI/CD workflows to enable safe, fast and repeatable releases.
Build and enhance monitoring, alerting, logging and system observability.
Lead incident response efforts and perform structured root cause analysis.
Improve system resilience through redundancy, failover and recovery mechanisms.
Work with engineering teams to reduce production risk through better deployment and system design practices.
Strengthen infrastructure security, access control and secrets management.
Support reliability for business-critical workflows across multiple countries and services.
Continuously improve operational discipline, uptime and system stability.

What We're Looking For

Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.
Strong understanding of cloud infrastructure, CI/CD pipelines and deployment systems.
Experience with production monitoring, alerting and incident management practices.
Ability to troubleshoot infrastructure and production issues in a structured and calm manner.
Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).
Experience supporting business-critical or high-availability systems.
Strong ownership mindset during incidents and operational failures.
Practical judgment on reliability, performance, security and cost trade-offs.
Comfortable working closely with engineering teams in fast-paced environments.
Low ego, disciplined and focused on long-term system stability.

Bonus Points

Experience with AWS, GCP, Azure or similar cloud platforms.
Experience with Kubernetes, Docker or container orchestration.
Experience with infrastructure-as-code tools (Terraform, Ansible, Pulumi, etc.).
Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
Experience with zero-downtime deployments, blue-green or canary release strategies.
Experience supporting distributed or high-traffic production systems.
Strong knowledge of security best practices in cloud infrastructure.
Experience in fintech, insurance or regulated industry environments.
Contributions to platform reliability or infrastructure scaling initiatives.

The Kind of Builder We Want

Calm and structured under pressure, especially during production incidents.
Hands-on with infrastructure and deeply familiar with production systems.
Thinks in failure modes, system risks and recovery paths.
Proactive in preventing incidents, not just reacting to them.
Strong focus on uptime, reliability and operational discipline.
Careful and deliberate when making production changes.
Builds systems engineers can trust to deploy and operate safely.

This Role Is Not For

People who only react after systems fail instead of preventing failures.
Engineers who are careless with production changes or access control.
Individuals who ignore monitoring, alerting or operational discipline.
People who make risky infrastructure changes without proper evaluation.
Candidates who cannot stay calm during incidents or outages.

Success in This Role

You'll be successful if you can:

Improve platform uptime, stability and deployment safety.
Reduce production incidents and infrastructure-related failures.
Strengthen monitoring, alerting and system visibility across services.
Enable engineers to deploy with confidence and lower operational risk.
Improve resilience of BJAK’s AI automation platform as it scales.

Why Join BJAK

Build Reliable AI Platform Infrastructure – Support systems powering end-to-end insurance automation.
High-Impact Engineering – Solve real-world reliability and scaling challenges.
Global Engineering Team – Work with experienced engineers across multiple countries.
Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
International Exposure – Build systems used across Southeast Asian markets.
Learning & Development Budget – Support continuous technical growth and certifications.
High Ownership Environment – Strong autonomy over infrastructure and reliability strategy.
Modern Engineering Culture – Focus on stability, observability and engineering excellence.
Competitive Compensation – Attractive salary package based on experience and impact.

Interview Process

We assess infrastructure depth, reliability thinking and production problem-solving ability. The process usually includes application review, two interviews and a technical scenario or systems discussion.

DevOps pay context

Based on 1,253 disclosed DevOps salaries on RoleSuite, the role pays a median of $141K/year, with most offers between $115K and $173K (10th–90th percentile: $100K–$210K).
