Charm Security - Senior DevOps Engineer, Infrastructure
Team8
Description
Charm Security is building an Agentic Workforce that prevents and resolves scams, fraud, and social engineering attacks targeting financial institutions. Our AI Agents are engineered to understand human vulnerabilities and act like expert fraud analysts, investigating cases, intervening in real time, and guiding victims through recovery.
Backed by Team8, we deploy our Resolution Agent with banks and financial services companies as an AI teammate that operates across the full case lifecycle: triage, evidence gathering, live intervention, and remediation.
This role is on our Infrastructure team, responsible for the cloud platform that powers Charm’s agentic products across multi-tenant and customer-deployed environments.
Your Role
You’ll design, build, and maintain Charm’s cloud infrastructure, ensuring our platform is secure, scalable, and production-ready for enterprise financial institutions. You will:
- Own and evolve our multi-cloud infrastructure using Terraform and Terragrunt, managing multi-tenant environments across SaaS, hybrid, and fully isolated deployment modes
- Build and operate Kubernetes clusters, networking, secrets management, and security infrastructure across cloud providers
- Maintain our GitOps deployment pipeline with ArgoCD, Helm charts, and GitHub Actions-based promotion workflows
- Operate and improve our observability stack with Datadog and OpenTelemetry
- Support customer-deployed (Bring Your Own Cloud) architectures with per-tenant isolation, dedicated resources, and encryption
Requirements
- 7+ years of experience as an Infrastructure, Platform, or DevOps Engineer
- Strong hands-on experience with at least one major cloud provider (GCP, AWS, or Azure), including managed Kubernetes, VPC networking, IAM, secrets management, and managed databases
- Proficiency with Terraform for managing production infrastructure
- Deep understanding of Kubernetes: cluster operations, Helm, operators, networking, RBAC, and troubleshooting
- Experience with GitOps workflows (ArgoCD or similar) and CI/CD pipelines (GitHub Actions or similar)
- Strong production monitoring and on-call experience: building SLOs, alerting, and dashboards, and leading incident response for production-grade services, using observability tools such as Datadog or Prometheus
- Comfort working in a small, fast-paced team, taking ownership, and wearing multiple hats
- A builder’s mindset. We’re early stage and move fast
Nice to Have
- Experience operating AI/ML workloads on Kubernetes: GPU node pools, self-hosted LLM serving (vLLM, TGI, Ollama), or running agent frameworks (LangGraph, LangChain) in production (big advantage)
- Prior SRE or production-monitoring role with deep experience in SLOs, on-call rotations, incident response, and observability at scale (big advantage)
- Experience with multi-tenant SaaS platforms and customer-deployed (BYOC) architectures
- Experience with Crossplane (or equivalent tools such as AWS ACK, GCP Config Connector) for managing cloud resources as CRDs from Kubernetes
- Background in financial services or other regulated industries, including familiarity with compliance frameworks such as SOC 2 and ISO 27001
- Familiarity with message streaming systems (NATS, Pub/Sub, RabbitMQ)