RL environment design for computer-use agents. Reward shaping, reward hacking detection, trajectory collection for browser-based tasks.
Mario Brajkovski
Building RL environments and evaluation systems for frontier AI agents
North Macedonia · Remote
Founding engineer at HUD, building RL training environments and evaluation infrastructure for computer-use agents. Co-author of ZeroDayBench (ICLR 2026), a benchmark for evaluating whether LLM agents can autonomously discover and patch zero-day vulnerabilities. Most of my time goes into the infrastructure that makes models better: RL environments, reward design that holds up under reward hacking, and evals that actually mean something.
What I Work On
Evaluation infrastructure for frontier models. Figuring out what to measure and how to measure it without fooling yourself.
Agent safety and red-teaming. Breaking autonomous agents on purpose so they don't break on their own.
Research
Can frontier LLMs autonomously find and patch real vulnerabilities? We tested GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 on 22 novel critical zero-days in open-source codebases. Short answer: not yet. But the failure modes are telling. A couple of frontier labs are now looking at running it internally.
Privacy-preserving ML using AMD SEV-SNP encrypted VMs. Full memory encryption and attestation with <20% overhead on 110M-param BERT.
Selected Work
RL environments for training computer-use agents. Reward hacking analysis and prevention, red-teaming infrastructure, alignment testing.
Eval infrastructure for an unreleased frontier model. Designed the evals from scratch: what to test, how to score it, and how to know the scores mean something.
Eval environment for AI agents on competitive programming. Anti-cheating grading server, 21 problem scenarios with hidden test cases.
Writing
Projects
Voice-first AI companion with sub-second end-to-end latency. Real-time audio streaming with custom voice activity detection and persistent memory.
Background
2019–2025: Infrastructure and reliability engineering at Joyn (streaming platform, 4 yrs) and HanseMerkur (insurance). Distributed systems, large-scale ops.