RL environment design for computer-use agents. Reward shaping, reward hacking detection, trajectory collection for browser-based tasks.
Mario Brajkovski
Building RL environments and evaluation systems for frontier AI agents
North Macedonia · Remote
Founding engineer at HUD, building RL training environments and evaluation infrastructure for computer-use agents. Co-author of ZeroDayBench (ICLR 2026), a benchmark for evaluating whether LLM agents can autonomously discover and patch zero-day vulnerabilities. Most of my work is on the infrastructure side of making models better: environments that surface failure modes, reward structures that don't get gamed, and evals that tell you something useful.
What I Work On
Evaluation infrastructure for frontier models. Figuring out what to measure and how to measure it without fooling yourself.
Agent safety and red-teaming. Finding the ways autonomous agents break before they break in production.
Research
Can frontier LLMs autonomously find and patch real vulnerabilities? We tested GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 on 22 novel critical zero-days in open-source codebases. Short answer: not yet. But the failure patterns point to interesting directions for proactive security.
Privacy-preserving ML using AMD SEV-SNP encrypted VMs. Full memory encryption and attestation with under 20% overhead on a 110M-parameter BERT model.
Selected Work
RL environments for training computer-use agents. Reward hacking analysis and prevention, red-teaming infrastructure, alignment testing.
Eval infrastructure for an unreleased frontier model. Designed the evals from scratch: what to test, how to score it, and how to know the scores mean something.
Eval environment for AI agents on competitive programming. Anti-cheating grading server, 21 problem scenarios with hidden test cases.
Writing
Projects
Voice-first AI companion with sub-second end-to-end latency. Real-time audio streaming with custom voice activity detection and persistent memory.
Background
2019–2025: Infrastructure and reliability engineering at Joyn (streaming platform, 4 years) and HanseMerkur (insurance). Distributed systems and large-scale operations.