AI/ML Engineer · M.S. AI/ML, UW–Madison · graduating May 2026

Ajeenckya Mahadik

I build calibrated probabilistic models and LLM agent systems. Most portfolios tell you about the models — this one lets you run them. Every demo below is powered by my real model outputs.

Ajeenckya Mahadik
0.176Brier score (26.7% better than baseline)
7.5×lower latency than LangChain (own agent loop)
94%task completion, self-improving agent (Tau Bench)
−44%RAG retrieval latency via embedding caching

// the lab

Don't take my word for it. Run the models.

Four systems, live in your browser. The probabilities, simulations, and benchmark numbers below are actual outputs of models I trained — not mockups.

01 · Probabilistic forecasting — UW–Madison research

World Cup 2026 Bayesian Forecaster

A PyTorch policy-gradient network with a Bayesian neural network head (Monte Carlo weight sampling), an XGBoost weather sub-model, temperature-scaled calibration, and a Monte Carlo tournament layer — served via FastAPI in the real system. Brier score 0.24 → 0.176.

PyTorchBayesian NNXGBoostMonte CarloFastAPI
tournaments 0 last champion

Everything here is reproducible — same result on every click, because the model is fixed. ▶ Play the predicted World Cup animates the model's representative tournament — groups, bracket, champion — from a fixed random seed, so it always ends the same way. 📊 The model's forecast is the formal prediction: 5,000 seeded Monte Carlo futures aggregated into calibrated title probabilities, because a probabilistic model's answer is a distribution, not one name. ⚡ Run 5,000 re-runs everything with fresh dice — watch the numbers land on the official ◇ odds again.

group stage · 12 groups · 48 teams · fixtures from the real model

world champion probability · ▮ live simulation vs ◇ official 5,000-run distribution

Group results sample directly from the calibrated ensemble's real per-match probabilities. Knockout matchups use team strengths fitted to my official Monte Carlo run (◇); bracket pairings are approximated.

A broadcast-style match center for all 72 group fixtures — except the win probabilities are the exact outputs of the three model variants: the standard network (SNN), the Bayesian head (BNN), and the calibrated ensemble that blends them with XGBoost weather features.

A Bayesian neural network doesn't have fixed weights — it has distributions over weights. Each forward pass samples a slightly different network. Hundreds of passes produce a spread of predictions: that spread is the model telling you how much to trust it.

samples drawn0
mean P(win)
±2σ interval

Calibration means: when the model says 70%, it should be right ~70% of the time. Temperature scaling + the BNN head moved the Brier score from 0.24 to 0.176 and recall to 84%, validated with k-fold CV and McNemar tests, with diagnostics logged to PostgreSQL.

The predictions on this page were locked in before the tournament. From June 11, a GitHub Action pulls official results every 6 hours and this scoreboard grades every group-stage prediction automatically — accuracy, live Brier score, no excuses.

Brier score = mean squared error of the 3-way probabilities (lower is better; 0.667 = uniform guessing, my backtest = 0.176). Knockout fixtures aren't in the locked group-stage prediction set, so they're tracked as results only. Results source: fixturedownload.com, synced by .github/workflows/update-results.yml.

Loading live tournament data…

the football is the demo — the same model ports directly to these domains

📦 Demand forecasting

Slow-moving SKUs are low-count events, exactly like goals. Calibrated outcome probabilities + the Monte Carlo layer = inventory scenario planning with honest service-level estimates instead of point forecasts.

🏦 Credit risk

Default / cure / prepay is the same 3-way classification as win / draw / loss. Banks are required to prove calibration (Basel) — reliability diagrams, Brier decomposition and temperature scaling are the exact validation toolkit.

🛡 Insurance pricing

Claim frequency is count modeling under exogenous conditions. The XGBoost weather sub-model is literally a catastrophe-covariate block; calibrated probabilities are the difference between profitable and mispriced premiums.

🏭 Predictive maintenance

P(failure within 30 days) with epistemic uncertainty tells you not just what will break but when the model doesn't know — inspect exactly those assets. I shipped this pattern in production at Powertrac and Tata Motors.

🩺 Clinical risk scores

Readmission and complication models are only useful if 70% means 70% — miscalibrated scores erode clinician trust. The BNN spread flags the patients where the model should defer to humans.

⚡ Energy & finance

The time-decay feature weighting is EWMA from volatility modeling; Monte Carlo propagation of match uncertainty to title odds is the same machinery as load forecasting bands and portfolio scenario analysis.

📘 Model deep dive — architecture, features, training, calibration, serving

The problem

Football outcomes are low-count, high-variance events: a 3-way classification (home / draw / away) where even the best team loses often. The interesting challenge isn't predicting a winner — it's producing an honest probability, one you could bet on, plan inventory around, or price risk with. That's why the whole system is built around calibration rather than raw accuracy.

Features

Each fixture is described by engineered team-pair features: rolling form and attack/defense rates with time-decay weighting (the same idea as EWMA in volatility modeling), ranking deltas, rest and travel context, and host advantage. On top of that sit two live signals: an NLP pipeline that pulls RSS news, scores sentiment per team and per key player, and folds it into the features — and a weather block (temperature, humidity, precipitation, wind, cloud cover, pressure) for the actual stadium and kickoff window, which you can toggle in the Match center to see its real effect per fixture.

Architecture — three models, one ensemble

SNN: a PyTorch policy-gradient neural network over the engineered features, trained with a policy-gradient objective and emitting a 3-way softmax. It's the fast, deterministic baseline — one forward pass, one answer. BNN: the same backbone extended with a Bayesian output head. Instead of fixed weights, the head learns weight distributions; at inference I draw hundreds of Monte Carlo weight samples, each producing a slightly different network and prediction. The spread of those predictions decomposes into total uncertainty (predictive entropy), epistemic uncertainty (mutual information — what the model doesn't know because it hasn't seen enough data) and the win-probability standard deviation you can watch being sampled in the "Inside the model" tab. XGBoost weather sub-model: a gradient-boosted tree model that learns how weather shifts outcomes, feeding the ensemble. The final calibrated ensemble blends SNN and BNN with the weather signal and applies temperature scaling so that "70%" means 70%. On top sits a Poisson scoreline layer: it inverts the calibrated win/draw/loss probabilities into goal rates (λhome, λaway) and reports the most likely scoreline consistent with the pick — score predictions are a pure function of the same probabilities, not a separate guess.

Training & validation

Validated with 5-fold cross-validation and historical backtests; every variant comparison (e.g. weather on vs off) goes through McNemar tests in an A/B framework, so a feature only survives if it helps beyond noise. Calibration diagnostics — reliability diagrams, expected calibration error, Brier decomposition — plus experiment metadata are logged to PostgreSQL. Results: Brier score 0.24 → 0.176 (−26.7% vs baseline), recall 84%, scenario variance −17%.

The Monte Carlo tournament layer

Match probabilities become title odds by simulation: sample all 72 group fixtures from the ensemble's distributions, resolve standings (including the 8 best third-placed teams in the 48-team format), play the bracket to the final, repeat 5,000 times, and count champions. Uncertainty propagates from single matches all the way to the trophy — exactly what the simulator tab does in your browser with the same numbers.

Serving & feedback

The real system runs behind FastAPI: REST endpoints for predictions, team data and result uploads, a post-match feedback loop that updates the model as results land, and the news pipeline refreshing sentiment features continuously. It's deployable software, not a notebook.

Why it generalizes

Calibrated low-count forecasting is the same problem as demand planning for slow-moving inventory, credit-risk PDs, and insurance claim frequency. The football is the demo; the method is the product.

02 · LLM agents from scratch

CodeCraft — framework-free CLI coding agent

A coding agent built on raw API tool-calling loops — no LangChain, no CrewAI. ReAct planning, session memory, diff previews, approval guardrails, RLAIF scoring, Docker-packaged. 100% completion and 7.5× lower latency than LangChain on an identical 15-task benchmark.

Raw tool callingReActRLAIFDocker

Type any coding task and race the two agents on it: my raw API loop in one lane, the identical task routed through LangChain's framework stack in the other. Watch where the time goes.

🔑 Real mode — run the actual agent live, with your own key

Paste any OpenAI-compatible API key (free keys: Groq or OpenRouter). It is used only inside your browser — calls go straight from your browser to the provider; this site has no servers and never sees it. Real mode = the agent's genuine tool-calling loop hitting a live LLM, with code actually executed by real Python (Pyodide/WebAssembly) in this page. Type mock as the key to test the full pipeline (scripted LLM, real execution) without any key. With a key saved, the self-improving agent and DocGraph demos below also switch to live runs.

measured · 15 identical tool-calling tasks, same model & tools

CodeCraft
1.0× · 15/15 ✓
LangChain
7.5× latency

Why LangChain is slow: it rebuilds an AgentExecutor chain per task, dispatches a CallbackManager event on every step, renders prompt templates through serialization layers, and re-validates tool output with parser retries. A raw loop does one thing — send messages, execute tools. The on-screen race trace is simulated from your prompt; the latency ratio is my measured benchmark.

⚡ CodeCraft · raw loop 0.0s

type a task → ▶ Race

🐢 LangChain · framework 0.0s
📘 Model deep dive — the loop, the tools, the guardrails, the benchmark

Zero-framework by design

CodeCraft talks to the LLM with raw HTTPS calls to an OpenAI-compatible chat-completions endpoint (xAI by default) using only the Python standard library — no LangChain, no CrewAI, no AutoGen. The point isn't framework allergy; it's that an agent is fundamentally a while loop around an API, and owning that loop means owning latency, cost, and failure modes.

How the loop works

The model receives the conversation plus JSON schemas for the local tools (native function calling). Each turn it either answers or returns tool_calls; the agent executes them in the workspace, appends the observations, and calls the model again — ReAct's think → act → observe cycle until the task is done. Session memory keeps the project map and prior decisions across requests, so follow-ups skip re-discovery.

The tool set

Eleven workspace-scoped tools: project_context, list_files, read_file, search_files, write_file, replace_in_file, make_directory, git_status, git_diff, run_tests, and run_command. Every file write shows a diff preview and every shell command requires approval by default (with an explicit --auto-approve escape hatch) — guardrails are part of the architecture, not an afterthought. A doctor command diagnoses provider/tool-call issues.

RLAIF scoring

Completed trajectories are scored by Grok 4 acting as an AI judge — task success, efficiency, and safety of the action sequence — giving a reward signal used to refine the agent's prompting policy over time (reinforcement learning from AI feedback, without human labeling).

The benchmark

15 identical tool-calling coding tasks, same model, same tools, run through CodeCraft and through a LangChain implementation. Result: 15/15 completed, 7.5× lower latency. The gap is pure orchestration overhead — chain abstractions, callback layers, and serialization the raw loop simply doesn't have. Packaged with Docker for reproducibility.

03 · Agents that learn from failure

Self-Improving LLM Agent

An agent that gets better without retraining: it analyzes its own failed execution traces, writes corrective strategies into ChromaDB semantic memory, and retrieves them on similar future tasks. Distilled Grok 4 → QLoRA LLaMA-3.2-1B. 94% task completion on Tau Bench.

ChromaDBQLoRALLaMA-3.2-1BDistillation

The hypothesis: failure memory compounds. Give it any coding task — the agent first writes a hidden verifier for it, then attempts a solution, learns a strategy from every real failure, and retries with memory injected.

🔑 with a key saved in the CodeCraft panel above, this is a LIVE run on your task: your LLM writes the verifier and the solutions, real Python (Pyodide) judges every attempt, and the strategies in memory are written by the LLM from its own real failures. Without a key it plays a recorded demo.

strategy memory (ChromaDB)

∅ empty — no strategies learned yet

how it performs · task success during the run (final: 94% on Tau Bench)

📘 Model deep dive — failure analysis, strategy memory, distillation, evaluation

The hypothesis

Failure memory compounds. Inspired by SELF-REFINE (Madaan et al., 2023): if an agent analyzes why it failed and stores the lesson, retrieval of those lessons should make every future attempt smarter — improvement without touching the model weights.

Three agents, one benchmark

The system implements three strategies for controlled comparison: ReAct (think-act-observe baseline), Plan-and-Act (upfront planning baseline), and Strategy-Guided — the contribution, which augments planning with retrieved strategies from memory.

The learning loop

Every execution is traced step by step. When a task fails, a failure-analysis pass (heuristic rules plus LLM reasoning over the trace) produces a corrective strategy — a short, generalizable instruction like "sort by total price including fees before selecting." The strategy is embedded and stored in ChromaDB. On a new task, the agent retrieves the top-k semantically similar strategies and injects them into its planning prompt. The demo above is this exact loop: fail → analyze → store → retrieve → pass.

Distillation to a 1B model

To make the agent cheap to run, Grok 4 teacher trajectories are distilled into LLaMA-3.2-1B via QLoRA — 4-bit NF4 quantization with low-rank adapters — so a 1-billion-parameter model inherits behavior from a frontier teacher at a fraction of the inference cost.

Evaluation

94% task completion on Tau Bench, with additional OS-level and web-navigation task suites and an ablation study isolating each component's contribution (memory off vs on, heuristic vs LLM failure analysis). The success curve in the demo is illustrative in shape; the endpoint is the measured result.

04 · Retrieval beyond vector search

DocGraph — Knowledge-Graph RAG

RAG over 10K emails using LightRAG + pgvector + Mistral 7B with JWT multi-user isolation. Graph traversal answers multi-hop questions flat vector search can't, and E5 embedding caching cut retrieval latency 1.8s → 1.0s (−44%).

Knowledge GraphLightRAGpgvectorMistral 7B

Real RAG, running in your browser. Ask anything about the inbox on the right — BM25 retrieval ranks the actual emails, sentence extraction builds the answer from their text, and the knowledge graph lights up the linked entities. 🔑 With a key saved above, the answer is instead generated by your LLM from the retrieved emails, with citations. The demo inbox is entirely fictional — it exists only to show the mechanics. Paste your own emails below and ask about those instead; nothing leaves your browser.

➕ use your own emails — paste them here

Paste anything (one document per blank-line-separated block; first line becomes the subject). It is indexed locally in your browser only — refresh and it's gone.

production metrics from that system: multi-hop answers flat vector search misses, retrieval latency 1.8s → 1.0s (−44%) via E5 embedding caching, JWT isolation at the SQL layer.

📘 Model deep dive — ingestion, graph construction, hybrid retrieval, isolation

Why a graph at all

Flat vector RAG retrieves chunks that look like the query. But "who approved the invoice Sarah flagged?" needs facts from documents that never mention each other in the same paragraph. A knowledge graph stores entities (people, vendors, documents, projects) and their relations explicitly, so retrieval can walk from fact to fact — multi-hop reasoning vector similarity structurally cannot do.

Ingestion & graph construction

10K emails are parsed (.eml, extendable to IMAP), chunked, and passed through LightRAG, which extracts entities and relations to build the knowledge graph alongside the text index. Every chunk is embedded with intfloat/e5-small-v2 (384 dimensions — deliberately CPU-friendly) and stored in Postgres 16 + pgvector with cosine-distance indexing.

Hybrid retrieval

A query first hits pgvector for candidate entry points, then expands along graph edges to pull in connected entities and their supporting chunks — vector search for recall, graph traversal for reasoning. The assembled context goes to a fully local Mistral 7B Instruct (GGUF via llama.cpp), which answers with referenceable /emails/{id} source links. Nothing leaves the machine — by design, since email is about as private as data gets.

Multi-user isolation

Authentication is JWT-based, and isolation is enforced where it can't be bypassed: every retrieval query is filtered by user_id at the SQL layer. One database, many users, zero cross-user leakage.

Performance

E5 embedding caching cut end-to-end retrieval latency from 1.8s to 1.0s (−44%). The FastAPI backend ships with seed scripts, a benchmark harness, and an evaluation report comparing graph-RAG against flat vector baselines on multi-hop questions.

// experience

Four years of ML in production environments.

May 2025 – PresentMadison, WI

Graduate Research Assistant

University of Wisconsin–Madison

  • Built a probabilistic forecasting stack: PyTorch policy-gradient network + XGBoost weather sub-model served via FastAPI, improving forecast accuracy by 12%.
  • Extended it with a Bayesian neural network head via Monte Carlo sampling — Brier score 0.24 → 0.176 (−26.7%), recall 84%, variance −17%.
  • Validated variants with Brier score and McNemar tests in an A/B framework; calibration diagnostics and experiment metadata logged to PostgreSQL.
2023 – 2024India

Graduate Associate Engineer

Fiat India Automobiles

  • Trained a Gradient Boosting Regressor on 150K assembly records to predict operation-level delay, giving managers a 2–4 hour intervention window and lifting line efficiency 9%.
  • Engineered queue-depth, shift-context, operator-history, and product-complexity features with leakage-aware splits.
  • Provisioned the AWS delivery pipeline (S3, Lambda, CloudWatch, IAM) feeding delay features to manufacturing teams.
2022 – 2023India

Analytics Engineer

Powertrac

  • Built a predictive-maintenance XGBoost classifier for 30-day breakdown risk, cutting downtime 12%.
  • Replaced manual triage with a risk-score dispatch system using greedy-knapsack technician assignment — routing efficiency +22%.
  • Ran a monitoring loop tracing low-confidence predictions to feature drift, sustaining 88%+ precision in production.
2021 – 2022India

Graduate Engineer

Tata Motors

  • Deployed Isolation Forest anomaly detection on throughput, error-rate, and readiness signals — efficiency +18%, throughput +15%.
  • Modeled deployment impact via Monte Carlo simulation, reducing operational costs 20%. Kaizen Award winner.

// skills

The stack behind the demos.

ML & Deep Learning

PyTorch · TensorFlow · Scikit-learn · XGBoost · Bayesian modeling · LoRA/QLoRA fine-tuning · calibration & evaluation

LLMs & Agents

RAG · Knowledge-Graph RAG · tool use · agent memory · ReAct · guardrails · LangChain/LangGraph · multi-agent orchestration · RLAIF · LLM evaluation

Languages & Backend

Python · SQL · FastAPI · Streamlit · Playwright · REST APIs · Git

Data & Cloud

PostgreSQL · Redis · pgvector · AWS (S3, Lambda) · Docker · Kubernetes · CI/CD · model deployment

// education

Education

University of Wisconsin–Madison

M.S. Industrial Engineering · AI/ML specialization · May 2026

AI Agents · Advanced Deep Learning · Foundation Models · DL for NLP · Machine Learning

Shivaji University

B.E. Mechanical Engineering · 2020

// contact

Hiring for ML, applied AI, or agent engineering?

I'm graduating in May 2026 and looking for roles where calibrated probabilistic thinking and hands-on LLM systems work matter. The fastest way to evaluate me is two scrolls up — the models speak for themselves.