Lead Site Reliability Engineer

Job Description

Social Links is a leading global OSINT company headquartered in the US, integrating data from over 500 open sources: social media, messengers, blockchains, and the Dark Web to deliver cutting-edge solutions for investigations, compliance, and risk analytics. Our Open Data & AI platform powers hundreds of organizations across 80+ countries.

Our infrastructure spans both legacy on-premises deployments and a new AWS-based cloud-native platform. To support this transformation, we’re looking for a Lead Site Reliability Engineer who will own the full lifecycle of system reliability, from process design to hands-on implementation.

As Lead SRE, you will:

Define and implement SRE practices: SLO/SLA management, incident response, postmortems, alerting policies.
Lead the team responsible for:
1. On-prem infrastructure (Linux, VPNs, networking, firewalls, Zabbix).
2. DevOps and CI/CD workflows.
3. Platform observability (Prometheus, Grafana, Loki, Tempo).
Architect and scale cloud-native infrastructure using AWS services: EC2, VPC, EKS, S3, IAM, CloudWatch, Route 53, etc.
Oversee migration of services and systems from on-prem to cloud.
Own logging, metrics, recovery processes, DRP, and secure runtime environments.
Implement infrastructure automation and self-healing mechanisms.
Build internal documentation, runbooks, and operational guidelines.
Mentor engineers and champion a reliability culture across engineering.

What We’re Looking For:

5+ years in infrastructure/SRE/DevOps roles, 2+ years in technical leadership.
Expert knowledge of Linux, Bash, system automation.
Deep understanding of core networking: VPN, TCP/IP, DNS, routing, NAT, firewalls.
Hands-on experience with on-prem operations and modernization.
Experience with monitoring: Zabbix, Prometheus, Grafana.
Proven experience with AWS (high priority): EC2, IAM, VPC, EKS, S3, CloudWatch.
Strong skills in CI/CD tooling: GitHub Actions, GitLab CI, ArgoCD, Helm, Kustomize.
Experience implementing SRE disciplines: SLOs, error budgets, incident management.
Proficiency in writing clear documentation and infrastructure standards.

Nice to Have:

Experience with OpenFaaS, Kubernetes, Terraform, Ansible.
Familiarity with SOC 2, ISO 27001, GDPR compliance practices.
Python scripting for automation.
Experience with Vault, OPA, RBAC, and Zero Trust architectures.

Why Join Us

A strategic role where you define infrastructure and reliability culture from the ground up.
Full ownership over reliability, observability, and platform resiliency.
A growing, global, product-driven company with engineering at the center.
Flexible remote environment with stock options and leadership visibility.
A foundational role with a clear growth path toward Head of Infrastructure/SRE.

If you turn chaos into structure and systems into strategy, this role is for you!