Sravan Pusuluri
Generative AI Engineer
Open to Remote
Ready to Relocate



Building LLM inference pipelines, RAG architectures & agentic AI systems at Fortune 50 scale. Based in New York, NY.

42%
Latency Reduced
35%
GPU Memory Saved
$620K+
Infrastructure Saved
4+
Years Experience

The engineer behind
the numbers.

I'm a Generative AI Engineer with 4+ years of experience building LLM inference pipelines, RAG architectures, and agentic AI systems at enterprise scale.

Currently at MetLife (Fortune 50), I work on the full stack of AI infrastructure, from quantization and GPU optimization to multi-agent orchestration and observability.

My work spans regulated industries including finance and healthcare, where responsible AI and production-grade guardrails aren't optional; they're required.

Let's Connect
📍
Location
New York, NY · Open to Remote
New York, NY, where the servers never sleep and neither do I. Open to Remote & Relocation! 🗽
๐Ÿข
Current Role
Generative AI Engineer @ MetLife
Officially: Generative AI Engineer. Unofficially: the person who stops the LLMs from hallucinating at 3am. ๐Ÿค–
🎓
Education
MS Data Analytics, SUNY Albany '25
🎓 SUNY Albany, Class of 2025. Spent more time optimizing LLMs than sleeping. My thesis advisor called it 'impressive'. My mom called it 'concerning'. 😄

Everything I work with,
end to end.

GenAI & LLMs
LangChain LlamaIndex Bedrock vLLM TensorRT-LLM Llama 3 Mistral RAG LoRA/QLoRA RLHF
Agentic AI
LangGraph LangSmith AutoGen ReAct ReWoo Multi-Agent AI Guardrails Responsible AI
Vector & Search
Pinecone pgvector ChromaDB Weaviate OpenSearch Semantic Search Hybrid Search
GPU & Performance
CUDA NCCL Nsight Systems INT8 Quant KV-Cache Dynamic Batching DeepSpeed MII A100/H100
MLOps & Cloud
Docker K8s SageMaker GitHub Actions Airflow FastAPI MLflow Prometheus CloudWatch
Languages & ML
Python PyTorch TensorFlow Hugging Face SQL Scala Spark XGBoost spaCy

Where I've built
real impact.

MetLife · Fortune 50
Generative AI Engineer – LLM Optimization & Inference
Sep 2024 – Present · New York, NY
Current
  • Optimized LLM inference stack using vLLM and TensorRT-LLM, cutting p95 response latency by 42% while preserving output consistency across 3B-parameter model variants in production.
  • Deployed quantized INT8 GPTQ pipelines with KV-cache streaming and head pruning, reducing GPU memory footprint by 35% on NVIDIA A100 clusters, enabling 2× model concurrency and deferring an estimated $500K in GPU hardware procurement.
  • Integrated DeepSpeed MII for multi-GPU orchestration via NCCL and CUDA Graphs, achieving near-linear scalability across 8-GPU nodes.
  • Engineered dynamic batching in Triton Inference Server, increasing GPU utilization from ~58% to 83%, avoiding an estimated $120K+ in annual cloud GPU spend.
  • Converted fine-tuned models to ONNX Runtime for edge deployments, achieving 1.7× token throughput vs. baseline PyTorch.
  • Architected RAG pipelines using LangChain and vector databases (Pinecone, ChromaDB) with measurable hallucination reduction.
  • Built production-grade inference monitoring with Prometheus + CloudWatch, enabling real-time latency alerting and automated horizontal scaling.
  • Automated CI/CD via Docker + GitHub Actions, compressing release cycles from weekly to daily and improving deployment reliability by 50%.
  • Led GPU performance investigations using Nsight Systems, reducing inference variance by 27%.
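The INT8 quantization above can be sketched in miniature. Below is a minimal symmetric quantize/dequantize round trip in pure Python, illustrative only; the production pipelines apply GPTQ to full model weights on GPU, which is considerably more involved than a single per-tensor scale:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.08, 0.93]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight lands within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
assert all(-127 <= v <= 127 for v in q)
```

The memory win follows directly: each weight drops from 2 or 4 bytes to 1, at the cost of bounded rounding error.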
Sage Softtech
Machine Learning Engineer
Mar 2021 – Jul 2023 · India
  • Engineered production ML system on 50M+ EHR records: a patient readmission risk engine improving high-risk detection by 18%.
  • Built clinical NLP pipeline using spaCy NER, boosting predictive AUC by 0.09.
  • Compressed retraining cycles from 4 hours to 45 minutes via Airflow-orchestrated pipelines.
  • Deployed production APIs (FastAPI + Docker on AWS EC2) serving 200+ physicians daily.
  • Implemented MLflow + Prometheus for full model lifecycle governance and automated data drift detection.
  • Operated within a regulated healthcare environment, the same compliance mindset now applied at MetLife.
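Data drift detection of the kind mentioned above is commonly built on the Population Stability Index: bucket a feature's baseline and live distributions, then sum the divergence across buckets. A minimal pure-Python sketch, with bucket count and thresholds that are illustrative rather than the production configuration:

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between two samples of one numeric feature.
    Common rule of thumb: < 0.1 stable, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bucket index
        # Floor at a small epsilon so empty buckets don't blow up the log.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
assert psi(baseline, baseline) < 1e-9                     # no drift
assert psi(baseline, [x + 0.5 for x in baseline]) > 0.25  # shifted: flagged
```

In practice a job like this runs on a schedule, with the score exported to Prometheus and alerting on the drift threshold.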

Things I've built
that actually work.

01
Enterprise RAG Chatbot with Agentic Workflows
Multi-agent conversational AI using LangGraph (ReAct) and Amazon Bedrock, grounded in a 100K+ document corpus via ChromaDB. Implemented AI Guardrails for PII redaction and hallucination mitigation.
91% faithfulness <800ms p95 500+ concurrent
LangChain LangGraph Amazon Bedrock ChromaDB LangSmith AWS Lambda
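The retrieval half of a RAG system like this reduces to: embed the query, rank documents by similarity, pass the top hits to the LLM as grounding context. A toy sketch where a bag-of-words counter stands in for real embeddings; the corpus and `embed` function are invented for illustration, and production uses ChromaDB with a proper embedding model:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {  # hypothetical document snippets
    "claims": "how to file an insurance claim online",
    "billing": "update payment method and billing address",
    "coverage": "what the dental plan covers for dependents",
}

def retrieve(query, k=1):
    """Rank documents by similarity; the top-k become LLM grounding context."""
    ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(corpus[d])),
                    reverse=True)
    return ranked[:k]

assert retrieve("file a claim") == ["claims"]
```

Grounding the model only in retrieved passages is also where the faithfulness metric comes from: generated claims are checked against the retrieved context.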
02
Fraud Detection System – Real-Time Anomaly Scoring
Fraud detection pipeline on highly imbalanced financial transaction data using Isolation Forest and Autoencoder-based anomaly detection. Deployed as real-time scoring API on SageMaker.
94% recall Sub-second alerts
Isolation Forest Autoencoders XGBoost SageMaker CloudWatch
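Anomaly scoring of this kind can be illustrated with the simplest robust baseline: distance from the median measured in MAD (median absolute deviation) units, which survives the heavy class imbalance that breaks mean/std z-scores. A sketch only; the deployed system uses Isolation Forest and autoencoders, and the threshold below is illustrative:

```python
import statistics

def anomaly_scores(amounts):
    """Robust z-score: distance from the median in MAD units."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(x - med) for x in amounts) or 1.0
    return [abs(x - med) / mad for x in amounts]

def flag(amounts, threshold=3.5):
    """Indices of transactions scoring above the anomaly threshold."""
    return [i for i, s in enumerate(anomaly_scores(amounts)) if s > threshold]

txns = [12.5, 9.9, 11.2, 10.4, 980.0, 10.1]
assert flag(txns) == [4]  # the $980 transaction stands out
```

Served behind a low-latency API, a scorer of this shape is what makes sub-second alerting feasible.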
03
Multi-Model Time Series Demand Forecasting
Ensemble forecasting system comparing ARIMA, Prophet, and LSTM for supply chain demand. Enabled 30-day ahead forecasts with automated seasonality detection.
12% lower MAPE 30-day horizon
LSTM Prophet ARIMA Apache Airflow AWS
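The ensemble idea above is just per-horizon averaging of each model's forecast, judged by MAPE. A minimal sketch in pure Python; the model outputs below are hypothetical numbers, not real ARIMA/Prophet/LSTM runs:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def ensemble(*model_forecasts):
    """Average the forecasts from several models at each horizon step."""
    return [sum(step) / len(step) for step in zip(*model_forecasts)]

actual   = [100, 110, 120]
arima    = [ 90, 105, 125]  # hypothetical model outputs
prophet  = [105, 115, 115]
combined = ensemble(arima, prophet)
# When the models' errors point in different directions, the blend wins.
assert mape(actual, combined) < min(mape(actual, arima), mape(actual, prophet))
```

Averaging helps exactly when the component models make uncorrelated errors, which is the usual argument for mixing statistical and neural forecasters.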
04
Customer Churn Prediction
End-to-end machine learning pipeline to predict customer churn using classification models, covering data preprocessing, feature engineering, model training, and evaluation.
End-to-end ML pipeline
Python Scikit-learn Pandas XGBoost EDA
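A churn pipeline of this shape, in caricature: scale the features, combine them into a risk score, rank customers. The feature names and weights below are purely illustrative stand-ins for the trained classifier:

```python
def scale(column):
    """Min-max scale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in column]

def churn_risk(tenure_months, support_tickets):
    """Toy churn score: short tenure and many support tickets raise risk.
    Stands in for a trained model; weights 0.6/0.4 are made up."""
    t = scale(tenure_months)
    s = scale(support_tickets)
    return [round(0.6 * (1 - ti) + 0.4 * si, 3) for ti, si in zip(t, s)]

tenure  = [2, 36, 60, 5]
tickets = [8, 1, 0, 6]
risk = churn_risk(tenure, tickets)
assert risk.index(max(risk)) == 0  # newest, most ticket-heavy customer ranks first
```

In the real pipeline the hand-set weights are replaced by a fitted classifier (e.g. XGBoost), but the preprocess-score-rank flow is the same.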

Thoughts on AI & Engineering

01
How I cut LLM latency by 42% at MetLife
LLM MLOps
March 2025 · 6 min read
02
RAG vs Fine-tuning: What I learned building enterprise AI
RAG LLM
June 2025 · 8 min read
03
Why GPU memory optimization saved us $620K
GPU Infra
November 2025 · 5 min read
04
Agentic AI in production – lessons from multi-agent systems
Agents LangGraph
February 2026 · 7 min read
Featured post
How I cut LLM latency by 42% at MetLife – a production deep dive
March 2025 · 6 min read
A behind-the-scenes look at optimizing vLLM, TensorRT-LLM and INT8 quantization pipelines on NVIDIA A100 clusters at Fortune 50 scale. Covers KV-cache streaming, dynamic batching, and how we deferred $500K in GPU spend.
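The KV-cache numbers in a post like this can be sanity-checked with a back-of-envelope formula: cache bytes = 2 (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. The model dimensions below are an illustrative 3B-class configuration, not the actual production model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Memory for the K and V caches of a decoder-only transformer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 3B-class config: 26 layers, 8 KV heads, head dim 128.
fp16 = kv_cache_bytes(26, 8, 128, seq_len=4096, batch=16, dtype_bytes=2)
int8 = kv_cache_bytes(26, 8, 128, seq_len=4096, batch=16, dtype_bytes=1)
assert int8 * 2 == fp16  # halving the element width halves the cache
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # prints: fp16 KV cache: 6.5 GiB
```

This is why INT8 KV caches and batching discipline translate so directly into concurrency: cache size scales linearly with batch, sequence length, and element width.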
LLM MLOps GPU Production
42%
Latency cut
$500K
GPU savings
6 min
Read time
Read article →

Credentials that prove
the expertise.

Amazon Web Services
AWS
In Progress
AWS Certified Machine Learning Engineer โ€“ Associate
MLA-C01
Amazon Web Services
AWS
Completed
AWS Certified AI Practitioner
AIF-C01
NVIDIA Academy
In Progress
Deep Learning Institute โ€“ Accelerating LLM Inference
NVIDIA DLI
NVIDIA Academy
Completed
AI for All: From Basics to GenAI Practice
Sep 2025

Let's build something
great together.

Open to senior Generative AI / MLOps roles. Feel free to reach out via any channel below.