Job Responsibilities
1. Cloud Native Architecture Design and Governance:
- Design highly available architectures on AWS and Cloudflare, extending beyond CDN configuration to implement edge logic with Cloudflare Workers and secure access layers using Cloudflare Tunnel (formerly Argo Tunnel) and Zero Trust.
- Manage AWS multi-account structures via Organizations, and architect cross-Region networking (Transit Gateway, VPC Peering, VPN) to resolve complex connectivity and latency challenges.
- Enforce Infrastructure as Code (Terraform/Pulumi) across edge rules and underlying resources to minimize manual console operations.
2. Deep Kubernetes Engineering:
- Maintain large-scale EKS or self-managed clusters, performing performance tuning and troubleshooting of core components such as etcd, CNI plugins (Cilium/Calico), and CoreDNS.
- Develop Kubernetes Operators/Controllers or kubectl plugins to enhance platform automation based on business requirements.
- Bridge local development and production environments (Docker Compose to Helm/Kustomize) to ensure consistency.
3. Engineering Productivity and Observability:
- Design and maintain complex CI/CD pipelines, integrating code quality analysis (SonarQube), container image security scanning, and automated testing.
- Implement GitOps workflows using ArgoCD or Flux.
- Build a Prometheus-based monitoring system with in-depth runtime (Go/Java) and system-level (eBPF) performance analysis.
4. System-Level Support and Reliability:
- Maintain middleware such as Nginx, Redis, and Kafka, including source-level debugging and parameter tuning.
- Address system bottlenecks under high concurrency (TCP queues, file descriptor limits, memory management).
Job Requirements
- Linux Systems Expert: Deep understanding of Linux kernel internals and proficient use of perf, strace, tcpdump, eBPF, and other tools to diagnose CPU, I/O, and network issues in production.
- Cloud and Networking Proficiency: Familiarity with AWS infrastructure limits (API rate limits, EBS IOPS) and Cloudflare fundamentals (Anycast, TLS handshake), with a deep understanding of the TCP/IP stack and the HTTP/2 and HTTP/3 protocols.
- Kubernetes Hands-On Experience: In-depth knowledge of cgroups and namespaces, service meshes (Istio/Linkerd), and rapid diagnosis of pod scheduling failures or crashes.
- Development Skills: Proficient in Go or Python, capable of reading open-source code, fixing bugs, and developing backend tools.
Preferred Qualifications
- Contributor to CNCF open source projects.
- Experience maintaining systems handling hundreds of millions of daily requests.
- Hands-on experience implementing chaos engineering in production environments.