Resume
Contact
Location: New York City, NY
Work Authorization: Green Card
Email: iliazlobin91@gmail.com
Professional Portfolio: iliazlobin.com/portfolio
LinkedIn: linkedin.com/in/iliazlobin
GitHub: github.com/iliazlobin
YouTube: youtube.com/@iliazlobin
Summary
Principal Software Engineer with 12+ years designing, building, and operating large-scale cloud platforms and distributed systems across AWS, GCP, and Azure. Strong coding background in Python, Go, and TypeScript, with extensive experience creating infrastructure platforms, developer tooling, and automation frameworks that improve reliability, reduce operational overhead, and accelerate developer productivity.
Brings experience in AI/ML infrastructure and MLOps, including training pipelines, orchestration frameworks, accelerator clusters, and evaluation workflows. Hands-on with LLM fine-tuning and multi-agent orchestration, with an emphasis on enabling scalable platforms that support research and product engineering teams.
Professional Experience
Mastercard, New York, NY (Principal Software Engineer)
October 2024 - Present
- Accelerated developer onboarding to the internal developer platform by introducing guided onboarding flows and rapid-start environments, lowering the learning curve and increasing platform adoption by ~30%.
- Enabled secure, compliant, and rapid experimentation with new applications and services by delivering integrated ephemeral dev environments across the org with built-in lifecycle and budget controls.
- Strengthened security and compliance across Kubernetes runtime environments by designing and integrating certificate validation into the admission pipeline, ensuring only trusted workloads could be deployed.
- Provided technical leadership by representing platform engineering in technical steering committees, aligning platform strategy with business objectives and execution across platform teams.
EPAM Systems, Jersey City, NJ (Systems Architect)
August 2021 - September 2024
- Coordinated the development of a multi-tenant AWS Landing Zone aligned with AWS Well-Architected best practices, using Terraform with Spacelift for custom guardrails and automated provisioning. Achieved 70% faster setup and SOC 2 compliance, and significantly reduced management overhead across 250+ accounts.
- Designed and implemented an active-active disaster recovery (DR) solution in AWS, using CloudFormation for automated infrastructure provisioning, DynamoDB global tables for cross-region data replication, and Step Functions for orchestrating failover processes, ensuring high availability, fault tolerance, and near-zero downtime during regional outages.
- Authored infrastructure and Kubernetes aspects of the GKE modernization factory, resulting in the migration of 200 services over 6 months from GCE to a global GKE platform, operating at a scale of 2,000 nodes over 3 regions.
- Spearheaded a cost optimization initiative for GKE workloads by implementing a metrics collector pipeline, automating data analysis, and establishing both HPA and non-HPA policies with guardrails. Through workload rightsizing, HPA optimization, and leveraging spot instances, achieved over $1M in annual cost savings.
- Introduced serverless patterns using GCP Cloud Functions and Cloud Run within an enterprise organization by identifying needs, gathering requirements, conducting research, and prototyping solutions. The pattern was adopted by more than 10 teams within 3 months for operational tasks.
- Led the development of an organization-wide automated incident response system utilizing serverless technologies in an event-driven architecture with AWS Step Functions, Lambda, DynamoDB, and EventBridge, reducing mean time to detect and remediate critical security events from hours to under 5 minutes.
- Delivered a MySQL cloning solution as part of a developer enablement initiative, providing developers with a self-service portal for on-demand database clones. This solution handled over 100 daily clones, accelerating application feature development by 80%.
- Led the migration of Kubernetes workloads from GCP to AWS by designing and configuring an EKS landing zone with Linkerd service mesh, automating processes with FluxCD, instrumenting applications with New Relic observability libraries, constructing data replication pipelines for MySQL databases, and developing production readiness, cutover, and rollback plans.
- Designed and implemented a secure identity federation pattern enabling critical financial applications in AWS to securely consume GCP services without the use of static credentials. Extended this model to authenticate GitHub Actions deployments, ensuring that only authorized personnel could initiate releases.
EPAM Systems, Minsk, Belarus (Lead DevOps Engineer)
October 2018 - August 2021
- Built out a scalable, multi-region Kubernetes platform supporting 1000+ microservices across 50 development teams. Implemented custom Kubernetes operators, Istio service mesh, and GitOps workflows built with Bash, Python, and Terraform, reducing deployment times by 80%, achieving 99.99% platform availability, and providing platform-level blue-green deployment capability.
- Implemented a multi-cloud landing zone using Cloud Posse’s Atmos, enabling consistent provisioning across AWS, Azure, and GCP. Achieved 70% faster bootstrapping and unified management for 300+ accounts, enhancing operational efficiency and security.
- Developed a custom Kustomize plugin in Go, integrating it with the continuous deployment system for newly created Kubernetes environments. Achieved an 80% reduction in operational time.
- Developed a custom Kubernetes operator in Go to automate database operations, reducing manual DBA interventions by 80% and improving database reliability.
- Implemented a GitOps workflow with FluxCD and progressive deployment with Prometheus metrics. This approach decreased deployment errors by 80% and improved lead time by 70%.
- Engineered a high-performance Terraform drift detection system in Go, processing configurations across 200+ repositories and 1000+ resources daily.
- Built an EKS-based platform with advanced observability (Prometheus/Grafana) and custom blue-green deployment support, reducing mean time to recovery (MTTR) by 60% and improving overall system reliability by 40%.
- Integrated Azure DevOps for continuous integration and continuous deployment (CI/CD) with Terraform, Bash scripts, and comprehensive quality gates to streamline application development and deployment across connected Azure and AWS environments, reducing deployment lead time to 10 minutes.
- Designed and implemented a unified observability system using the ELK stack and Prometheus/Grafana with APM capabilities, enabling DevOps KPIs and reducing Mean Time to Detect (MTTD) issues by 50%.
NIIAS, Saint Petersburg, Russia (Software Developer)
October 2012 - August 2018
- Developed and maintained mission-critical embedded railway control software in C++ on a real-time Linux OS, ensuring 99.99% uptime for systems managing 1000+ daily train operations. Optimized core algorithms, resulting in a 40% reduction in CPU usage and a 50% improvement in response times.
- Led the migration from monolithic architecture to containerized microservices using Docker, enhancing system modularity and portability. This initiative improved deployment frequency by 300% and reduced time-to-market for new features by 50%.
- Implemented automated testing practices, increasing code coverage from 60% to 95% and reducing post-release defects by 80%. Introduced behavior-driven development (BDD) practices, improving collaboration between developers and business stakeholders.
- Engineered an embedded video streaming system in C++ on a real-time Linux OS for 100 trains, optimizing for constrained hardware to deliver 720p@30fps with <1s latency (99th percentile) over 3G networks.
- Developed centralized train control center software using C++ and Qt, enabling remote operation of multiple trains with telemetry visualization, predictive collision alarms, and a digital location map.
- Managed corporate country-wide Linux IT infrastructure, ensuring 99.9% uptime across 5 locations. Implemented automated backups and disaster recovery, reducing recovery time to hours and supporting 100+ employees.
Education
Bachelor’s Degree in Computer Science (with honors)
St. Petersburg State Transport University, 2012
Professional Certifications
- Professional Cloud Architect, Google Cloud
- AWS Certified Solutions Architect – Professional
- AWS Certified DevOps Engineer – Professional
- Microsoft Certified: Azure Solutions Architect Expert
- CKA: Certified Kubernetes Administrator
Personal GenAI Projects (AI/ML Engineer/Architect)
December 2018 - Present
- AI Events Concierge (Full-Stack Agentic System) – Designed and implemented an AI concierge for event discovery and registration using multi-agent workflows. Automated search, ranking, form submission, and calendar integration for end-to-end scheduling with minimal human input.
  GitHub Repository | Video
- Event Ingestion and Ranking Pipeline (Full-Stack Project) – Built a serverless data processing platform (AWS/SST) for ingesting and ranking social events with crawlers. Implemented hybrid search and ranking with OpenSearch and automated publishing to external platforms, supporting analytics and real-time discovery at scale.
  GitHub Repository
- DSPy: Declarative Language Programs (AI/ML Research) – Explored DSPy, a framework for declarative LLM pipelines. Built notebooks and demos for workflow orchestration, prompt optimization, and evaluation.
  GitHub Repository | Video
- Speech Analysis and Visualization with ML (Full-Stack Project) – Developed toolkits and full-stack applications for pronunciation and pitch analysis, deployed on AWS SageMaker and serverless backends. Delivered real-time visual feedback for learners via modern web frontends.
  GitHub Repository
- LLM Fine-Tuning (ML Research) – Explored optimization strategies for customizing transformer models, covering quantization, PEFT, and evaluation frameworks. Delivered reproducible research workflows and benchmarks on GPU infrastructure.
  GitHub Repository | Video
- Cloud Service Providers Blogs Summarizator (Full-Stack Project) – Built a cloud-native summarization pipeline integrating crawlers, LLMs, and AWS Step Functions. Automated crawling, summarization, and publishing into Notion/Next.js applications with full infrastructure-as-code.
  GitHub Repository
- Atmos Landing Zones (IaC Project) – Built secure, scalable AWS Landing Zones with Cloud Posse Atmos, Terraform, and Helmfile. Automated multi-account provisioning, IAM delegation, centralized networking, security guardrails, audit logging, and Kubernetes orchestration. Enabled CI/CD, reproducible developer environments, and extensible configs for future cloud expansion.
  GitHub Repository