Job Responsibilities
1. Infrastructure and Server Operations (Core Responsibilities)
- Design, deploy, and optimize the company’s server clusters (OCI / AWS).
- Manage Linux servers, system environments, user permissions, SSH keys, SFTP, firewalls, and security groups.
- Manage Nginx reverse proxies, SSL certificates, and domain names to ensure high availability and security.
- Maintain virtual machines, load balancers, object storage, VPC/VCN networks, subnets, and security group policies.
- Troubleshoot production environment issues such as port conflicts, permission errors, service startup failures, full disks, and network anomalies.
2. CI/CD and Deployment Management
- Design, build, and maintain CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
- Write and maintain deployment scripts and automated build scripts; manage environment variables and release workflows.
- Define deployment strategies, rollback plans, and blue-green/canary release processes for test, UAT, and production environments.
- Collaborate with development teams on routine releases, emergency hotfixes, and configuration management.
3. System Reliability and Availability (SRE Focus)
- Establish application monitoring systems (Prometheus, Grafana, ELK, CloudWatch).
- Implement alerting for CPU, memory, disk usage, service anomalies, and API errors.
- Define SLIs and SLOs, and track compliance with SLAs, to improve system reliability.
- Conduct regular capacity planning, performance optimization, and stress testing.
4. Security and Access Management
- Manage server accounts, cloud platform credentials, Git repository permissions, and Jira/Wiki access.
- Deploy and maintain bastion hosts (Jump Server/Bastion) following the principle of least privilege.
- Develop security baseline policies and perform regular patching, vulnerability scans, and security audits.
- Collaborate with security and risk teams to address incidents such as brute-force attacks, abnormal traffic, and service vulnerabilities.
5. Database and Middleware Maintenance
- Maintain deployments, backups, and replication setups (such as master-slave configurations) for MySQL, PostgreSQL, Redis, and Kafka.
- Perform database performance tuning, slow query analysis, and connection pool optimization.
- Define backup strategies, automate backups, maintain disaster recovery readiness, and conduct periodic restore drills.
6. Documentation and Asset Management
- Maintain up-to-date inventories of servers and domain certificates, along with permission lists.
- Create and update operations documentation, including deployment guides, release procedures, security policies, and architecture diagrams.
- Manage operations assets such as server specifications, monitoring dashboards, keys, environment configurations, and network topology diagrams.
7. Team and Process Development
- Oversee daily management and development of the operations team.
- Drive implementation of production change management processes, release policies, access control policies, and disaster recovery procedures.
- Coordinate cross-functionally with development, backend, DBA, and security teams to resolve critical incidents.
Qualifications
- Expert in Linux system administration, shell scripting, and network fundamentals (Layer 3/4/7).
- Experience with cloud platform operations (OCI / AWS).
- Strong knowledge of Nginx, SSL, reverse proxy, Keepalived, and load balancing.
- Familiarity with Docker and Kubernetes (proficiency in Docker and Docker Compose is required).
- Experience with CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
- Proficient in MySQL fundamentals, replication, backup and recovery, and performance tuning.
- Familiarity with at least one middleware component such as Redis, Kafka, or RabbitMQ.
- Experience building monitoring systems (Prometheus / Grafana / ELK / Loki).
Preferred Qualifications
- Strong logical thinking and rapid troubleshooting skills, capable of handling production incidents independently.
- Comprehensive operations mindset covering monitoring, alerting, security, access, and process management.
- Excellent documentation skills, able to maintain asset inventories, network topologies, and process documentation.
- Strong communication and cross-team collaboration abilities.
- Operations experience in the finance, exchange, or blockchain industries.
- Familiarity with high-concurrency, high-availability architecture design.