Mario Brajkovski

Building RL environments and evaluation systems for frontier AI agents

Munich · Remote

Founding engineer at HUD, building RL training environments and evaluation infrastructure for computer-use agents. Co-author of ZeroDayBench (ICLR 2026), a benchmark for evaluating whether LLM agents can autonomously discover and patch zero-day vulnerabilities. Most of my time goes into the infrastructure that makes models better: RL environments, reward design that holds up under hacking, and evals that actually mean something.

What I Work On

RL environment design for computer-use agents. Reward shaping, reward hacking detection, trajectory collection for browser-based tasks.

Evaluation infrastructure for frontier models. Figuring out what to measure and how to measure it without fooling yourself.

Agent safety and red-teaming. Trying to break autonomous agents before they go to production.

Research

ZeroDayBench: Evaluating LLM Agents on Zero-Day Vulnerabilities

Can frontier LLMs autonomously find and patch real vulnerabilities? We tested GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 on 22 novel critical zero-days in open-source codebases. Short answer: not yet. But the failure modes are telling. A couple of frontier labs are now looking at running it internally.

Confidential LLM inference in trusted execution environments

Privacy-preserving ML using AMD SEV-SNP encrypted VMs. Full memory encryption and attestation with <20% overhead on a 110M-parameter BERT model.

Selected Work

RL Training Environments

RL environments for training computer-use agents. Reward hacking analysis and prevention, red-teaming infrastructure, alignment testing.

Evaluation Infrastructure

Eval infrastructure for an unreleased frontier model. Designed the evals from scratch, including the scoring and validation.

ast-pilot

One command turns any well-tested Python module into a shippable coding eval task. AST-level source scanning and LLM-powered prompt generation, with multi-round factual validation that refuses to ship prompts that have drifted from the code. Hidden test injection, bytecode obfuscation, and automatic repo-internal dependency remapping guard against cheating.

CUA Environment Template

Template for evaluating computer-use agents in a full virtual desktop. X11/XFCE4/Chromium stack in Docker, dual-mode tool registration for multiple LLM platforms, and flexible bash/LLM hybrid grading.

Coding Eval Environment

Eval environment for AI agents solving competitive programming problems. Anti-cheating grading server, 21 problem scenarios with hidden test cases.

Projects

TwoPeas

AI companion with a Live2D avatar that talks, emotes, and moves in sync with the conversation. Sub-second voice latency, persistent memory.

Background

2019–2025: Infrastructure and reliability engineering at Joyn (streaming platform, 4 years) and HanseMerkur (insurance). Distributed systems, large-scale ops.