Mario Brajkovski

Building RL environments and evaluation systems for frontier AI agents

North Macedonia · Remote

Founding engineer at HUD, building RL training environments and evaluation infrastructure for computer-use agents. Co-author of ZeroDayBench (ICLR 2026), a benchmark for evaluating whether LLM agents can autonomously discover and patch zero-day vulnerabilities. Most of my time goes into the infrastructure that makes models better: RL environments, reward design that holds up against reward hacking, and evals that actually mean something.

What I Work On

RL environment design for computer-use agents. Reward shaping, reward hacking detection, trajectory collection for browser-based tasks.

Evaluation infrastructure for frontier models. Figuring out what to measure and how to measure it without fooling yourself.

Agent safety and red-teaming. Trying to break autonomous agents before they go to production.

Research

ZeroDayBench: Evaluating LLM Agents on Zero-Day Vulnerabilities

Can frontier LLMs autonomously find and patch real vulnerabilities? We tested GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 on 22 novel critical zero-days in open-source codebases. Short answer: not yet. But the failure modes are telling. A couple of frontier labs are now looking at running it internally.

Confidential LLM inference in trusted execution environments

Privacy-preserving ML using AMD SEV-SNP encrypted VMs. Full memory encryption and attestation with under 20% overhead on a 110M-parameter BERT model.

Selected Work

RL Training Environments

RL environments for training computer-use agents. Reward hacking analysis and prevention, red-teaming infrastructure, alignment testing.

Evaluation Infrastructure

Eval infrastructure for an unreleased frontier model. Designed the evals from scratch, including the scoring and validation.

Coding Eval Environment

Eval environment for AI agents on competitive programming. Anti-cheating grading server, 21 problem scenarios with hidden test cases.

Projects

TwoPeas

AI companion with a Live2D avatar that talks, emotes, and moves in sync with the conversation. Sub-second voice latency, persistent memory.

Background

2019–2025: Infrastructure and reliability engineering at Joyn (streaming platform, 4 years) and HanseMerkur (insurance). Distributed systems, large-scale ops.