Sravan Pusuluri
Generative AI Engineer
Open to Remote
Ready to Relocate



Building LLM inference pipelines, RAG architectures & agentic AI systems at Fortune 50 scale. Based in New York, NY.

42%
Latency Reduced
35%
GPU Memory Saved
$620K+
Infrastructure Saved
4+
Years Experience

The engineer behind
the numbers.

I'm a Generative AI Engineer with 4+ years of experience building LLM inference pipelines, RAG architectures, and agentic AI systems at enterprise scale.

Currently at MetLife (Fortune 50), I work on the full stack of AI infrastructure, from quantization and GPU optimization to multi-agent orchestration and observability.

My work spans regulated industries including finance and healthcare, where responsible AI and production-grade guardrails aren't optional; they're required.

Let's Connect
📍
Location
New York, NY · Open to Remote
New York, NY, where the servers never sleep and neither do I. Open to Remote & Relocation! 🗽
๐Ÿข
Current Role
Generative AI Engineer @ MetLife
Officially: Generative AI Engineer. Unofficially: the person who stops the LLMs from hallucinating at 3am. ๐Ÿค–
🎓
Education
MS Data Analytics, SUNY Albany '25
🎓 SUNY Albany, Class of 2025. Spent more time optimizing LLMs than sleeping. My thesis advisor called it 'impressive'. My mom called it 'concerning'. 😄

Everything I work with,
end to end.

GenAI & LLMs
LangChain LlamaIndex Bedrock vLLM TensorRT-LLM Llama 3 Mistral RAG LoRA/QLoRA RLHF
Agentic AI
LangGraph LangSmith AutoGen ReAct ReWoo Multi-Agent AI Guardrails Responsible AI
Vector & Search
Pinecone pgvector ChromaDB Weaviate OpenSearch Semantic Search Hybrid Search
GPU & Performance
CUDA NCCL Nsight Systems INT8 Quant KV-Cache Dynamic Batching DeepSpeed MII A100/H100
MLOps & Cloud
Docker K8s SageMaker GitHub Actions Airflow FastAPI MLflow Prometheus CloudWatch
Languages & ML
Python PyTorch TensorFlow Hugging Face SQL Scala Spark XGBoost spaCy

Where I've built
real impact.

MetLife · Fortune 50
Generative AI Engineer – LLM Optimization & Inference
Sep 2024 – Present · New York, NY
Current
  • Optimized LLM inference stack using vLLM and TensorRT-LLM, cutting p95 response latency by 42% while preserving output consistency across 3B-parameter model variants in production.
  • Deployed quantized INT8 GPTQ pipelines with KV-cache streaming and head pruning, reducing GPU memory footprint by 35% on NVIDIA A100 clusters, enabling 2× model concurrency and deferring an estimated $500K in GPU hardware procurement.
  • Integrated DeepSpeed MII for multi-GPU orchestration via NCCL and CUDA Graphs, achieving near-linear scalability across 8-GPU nodes.
  • Engineered dynamic batching in Triton Inference Server, increasing GPU utilization from ~58% to 83%, avoiding an estimated $120K+ in annual cloud GPU spend.
  • Converted fine-tuned models to ONNX Runtime for edge deployments, achieving 1.7× token throughput vs. baseline PyTorch.
  • Architected RAG pipelines using LangChain and vector databases (Pinecone, ChromaDB) with measurable hallucination reduction.
  • Built production-grade inference monitoring with Prometheus + CloudWatch, enabling real-time latency alerting and automated horizontal scaling.
  • Automated CI/CD via Docker + GitHub Actions, compressing release cycles from weekly to daily and improving deployment reliability by 50%.
  • Led GPU performance investigations using Nsight Systems, reducing inference variance by 27%.
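The INT8 quantization above can be sketched in miniature. Below is a minimal symmetric quantize/dequantize round trip in pure Python, illustrative only; the production pipelines apply GPTQ to full model weights on GPU, which is considerably more involved than a single per-tensor scale:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.08, 0.93]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight lands within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
assert all(-127 <= v <= 127 for v in q)
```

The memory win follows directly: each weight drops from 2 or 4 bytes to 1, at the cost of bounded rounding error.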
Sage Softtech
Machine Learning Engineer
Mar 2021 – Jul 2023 · India
  • Engineered production ML system on 50M+ EHR records: a patient readmission risk engine improving high-risk detection by 18%.
  • Built clinical NLP pipeline using spaCy NER, boosting predictive AUC by 0.09.
  • Compressed retraining cycles from 4 hours to 45 minutes via Airflow-orchestrated pipelines.
  • Deployed production APIs (FastAPI + Docker on AWS EC2) serving 200+ physicians daily.
  • Implemented MLflow + Prometheus for full model lifecycle governance and automated data drift detection.
  • Operated within a regulated healthcare environment, the same compliance mindset now applied at MetLife.
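Data drift detection of the kind mentioned above is commonly built on the Population Stability Index: bucket a feature's baseline and live distributions, then sum the divergence across buckets. A minimal pure-Python sketch, with bucket count and thresholds that are illustrative rather than the production configuration:

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between two samples of one numeric feature.
    Common rule of thumb: < 0.1 stable, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bucket index
        # Floor at a small epsilon so empty buckets don't blow up the log.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
assert psi(baseline, baseline) < 1e-9                     # no drift
assert psi(baseline, [x + 0.5 for x in baseline]) > 0.25  # shifted: flagged
```

In practice a job like this runs on a schedule, with the score exported to Prometheus and alerting on the drift threshold.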

Things I've built
that actually work.

01
Enterprise RAG Chatbot with Agentic Workflows
Multi-agent conversational AI using LangGraph (ReAct) and Amazon Bedrock, grounded in a 100K+ document corpus via ChromaDB. Implemented AI Guardrails for PII redaction and hallucination mitigation.
91% faithfulness <800ms p95 500+ concurrent
LangChain LangGraph Amazon Bedrock ChromaDB LangSmith AWS Lambda
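The retrieval half of a RAG system like this reduces to: embed the query, rank documents by similarity, pass the top hits to the LLM as grounding context. A toy sketch where a bag-of-words counter stands in for real embeddings; the corpus and `embed` function are invented for illustration, and production uses ChromaDB with a proper embedding model:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {  # hypothetical document snippets
    "claims": "how to file an insurance claim online",
    "billing": "update payment method and billing address",
    "coverage": "what the dental plan covers for dependents",
}

def retrieve(query, k=1):
    """Rank documents by similarity; the top-k become LLM grounding context."""
    ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(corpus[d])),
                    reverse=True)
    return ranked[:k]

assert retrieve("file a claim") == ["claims"]
```

Grounding the model only in retrieved passages is also where the faithfulness metric comes from: generated claims are checked against the retrieved context.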
02
Fraud Detection System – Real-Time Anomaly Scoring
Fraud detection pipeline on highly imbalanced financial transaction data using Isolation Forest and Autoencoder-based anomaly detection. Deployed as real-time scoring API on SageMaker.
94% recall Sub-second alerts
Isolation Forest Autoencoders XGBoost SageMaker CloudWatch
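Anomaly scoring of this kind can be illustrated with the simplest robust baseline: distance from the median measured in MAD (median absolute deviation) units, which survives the heavy class imbalance that breaks mean/std z-scores. A sketch only; the deployed system uses Isolation Forest and autoencoders, and the threshold below is illustrative:

```python
import statistics

def anomaly_scores(amounts):
    """Robust z-score: distance from the median in MAD units."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(x - med) for x in amounts) or 1.0
    return [abs(x - med) / mad for x in amounts]

def flag(amounts, threshold=3.5):
    """Indices of transactions scoring above the anomaly threshold."""
    return [i for i, s in enumerate(anomaly_scores(amounts)) if s > threshold]

txns = [12.5, 9.9, 11.2, 10.4, 980.0, 10.1]
assert flag(txns) == [4]  # the $980 transaction stands out
```

Served behind a low-latency API, a scorer of this shape is what makes sub-second alerting feasible.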
03
Multi-Model Time Series Demand Forecasting
Ensemble forecasting system comparing ARIMA, Prophet, and LSTM for supply chain demand. Enabled 30-day ahead forecasts with automated seasonality detection.
12% lower MAPE 30-day horizon
LSTM Prophet ARIMA Apache Airflow AWS
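The ensemble idea above is just per-horizon averaging of each model's forecast, judged by MAPE. A minimal sketch in pure Python; the model outputs below are hypothetical numbers, not real ARIMA/Prophet/LSTM runs:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def ensemble(*model_forecasts):
    """Average the forecasts from several models at each horizon step."""
    return [sum(step) / len(step) for step in zip(*model_forecasts)]

actual   = [100, 110, 120]
arima    = [ 90, 105, 125]  # hypothetical model outputs
prophet  = [105, 115, 115]
combined = ensemble(arima, prophet)
# When the models' errors point in different directions, the blend wins.
assert mape(actual, combined) < min(mape(actual, arima), mape(actual, prophet))
```

Averaging helps exactly when the component models make uncorrelated errors, which is the usual argument for mixing statistical and neural forecasters.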
04
Customer Churn Prediction
End-to-end machine learning pipeline to predict customer churn using classification models, covering data preprocessing, feature engineering, model training, and evaluation.
End-to-end ML pipeline
Python Scikit-learn Pandas XGBoost EDA
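A churn pipeline of this shape, in caricature: scale the features, combine them into a risk score, rank customers. The feature names and weights below are purely illustrative stand-ins for the trained classifier:

```python
def scale(column):
    """Min-max scale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in column]

def churn_risk(tenure_months, support_tickets):
    """Toy churn score: short tenure and many support tickets raise risk.
    Stands in for a trained model; weights 0.6/0.4 are made up."""
    t = scale(tenure_months)
    s = scale(support_tickets)
    return [round(0.6 * (1 - ti) + 0.4 * si, 3) for ti, si in zip(t, s)]

tenure  = [2, 36, 60, 5]
tickets = [8, 1, 0, 6]
risk = churn_risk(tenure, tickets)
assert risk.index(max(risk)) == 0  # newest, most ticket-heavy customer ranks first
```

In the real pipeline the hand-set weights are replaced by a fitted classifier (e.g. XGBoost), but the preprocess-score-rank flow is the same.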

Thoughts on AI & Engineering

01
How I cut LLM latency by 42% at MetLife
LLM MLOps
March 2025 · 6 min read
02
RAG vs Fine-tuning: What I learned building enterprise AI
RAG LLM
June 2025 · 8 min read
03
Why GPU memory optimization saved us $620K
GPU Infra
November 2025 · 5 min read
04
Agentic AI in production – lessons from multi-agent systems
Agents LangGraph
February 2026 · 7 min read
Featured post
How I cut LLM latency by 42% at MetLife – a production deep dive
March 2025 · 6 min read
A behind-the-scenes look at optimizing vLLM, TensorRT-LLM and INT8 quantization pipelines on NVIDIA A100 clusters at Fortune 50 scale. Covers KV-cache streaming, dynamic batching, and how we deferred $500K in GPU spend.
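The KV-cache numbers in a post like this can be sanity-checked with a back-of-envelope formula: cache bytes = 2 (K and V) × layers × KV heads × head dim × sequence length × batch × bytes per element. The model dimensions below are an illustrative 3B-class configuration, not the actual production model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Memory for the K and V caches of a decoder-only transformer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 3B-class config: 26 layers, 8 KV heads, head dim 128.
fp16 = kv_cache_bytes(26, 8, 128, seq_len=4096, batch=16, dtype_bytes=2)
int8 = kv_cache_bytes(26, 8, 128, seq_len=4096, batch=16, dtype_bytes=1)
assert int8 * 2 == fp16  # halving the element width halves the cache
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # prints: fp16 KV cache: 6.5 GiB
```

This is why INT8 KV caches and batching discipline translate so directly into concurrency: cache size scales linearly with batch, sequence length, and element width.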
LLM MLOps GPU Production
42%
Latency cut
$500K
GPU savings
6 min
Read time
Read article →

Credentials that prove
the expertise.

Amazon Web Services
AWS
In Progress
AWS Certified Machine Learning Engineer โ€“ Associate
MLA-C01
Amazon Web Services
AWS
Completed
AWS Certified AI Practitioner
AIF-C01
NVIDIA Academy
In Progress
Deep Learning Institute โ€“ Accelerating LLM Inference
NVIDIA DLI
NVIDIA Academy
Completed
AI for All: From Basics to GenAI Practice
Sep 2025

Let's build something
great together.

Open to senior Generative AI / MLOps roles. Feel free to reach out via any channel below.