Mario Brajkovski

Building RL environments and evaluation systems for frontier AI agents

North Macedonia · Remote

Founding engineer at HUD, building RL training environments and evaluation infrastructure for computer-use agents. Co-author of ZeroDayBench (ICLR 2026), a benchmark for evaluating whether LLM agents can autonomously discover and patch zero-day vulnerabilities. Most of my work is on the infrastructure side of making models better: environments that surface failure modes, reward structures that don't get gamed, and evals that tell you something useful.

What I Work On

RL environment design for computer-use agents. Reward shaping, reward hacking detection, trajectory collection for browser-based tasks.

Evaluation infrastructure for frontier models. Figuring out what to measure and how to measure it without fooling yourself.

Agent safety and red-teaming. Finding the ways autonomous agents break before they break in production.

Research

ZeroDayBench: Evaluating LLM Agents on Zero-Day Vulnerabilities

Can frontier LLMs autonomously find and patch real vulnerabilities? We tested GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 on 22 novel critical zero-days in open-source codebases. Short answer: not yet. But the failure patterns point to interesting directions for proactive security.

Confidential LLM inference in trusted execution environments

Privacy-preserving ML using AMD SEV-SNP encrypted VMs. Full memory encryption and attestation with under 20% overhead on a 110M-parameter BERT model.

Selected Work

RL Training Environments

RL environments for training computer-use agents. Reward hacking analysis and prevention, red-teaming infrastructure, alignment testing.

Evaluation Infrastructure

Eval infrastructure for an unreleased frontier model. Designed the evals from scratch: what to test, how to score it, and how to know the scores mean something.

Coding Eval Environment

Eval environment for AI agents solving competitive programming problems. Anti-cheating grading server, 21 problem scenarios with hidden test cases.

Projects

TwoPeas

Voice-first AI companion with sub-second end-to-end latency. Real-time audio streaming with custom voice activity detection and persistent memory.

Background

2019–2025: Infrastructure and reliability engineering at Joyn (streaming platform, 4 years) and HanseMerkur (insurance). Distributed systems, large-scale operations.