Show HN: I Built a Sandbox for Agents
Show HN: The HN Arcade
I love seeing all the small games that people build and post to this site.
I don't want to forget any, so I have built a directory/arcade for the games here that I maintain.
Feel free to check it out, add your game if it's missing, and let me know what you think. Thanks!
Show HN: I built a small browser engine from scratch in C++
Hi HN! Korean high school senior here, about to start CS in college.
I built a browser engine from scratch in C++ to understand how browsers work. First time using C++, 8 weeks of development, lots of debugging—but it works!
Features:
- HTML parsing with error correction
- CSS cascade and inheritance
- Block/inline layout engine
- Async image loading + caching
- Link navigation + history
Hardest parts:
- String parsing (HTML, CSS)
- Rendering
- Image Caching & Layout Reflowing
What I learned (beyond code):
- Systematic debugging is crucial
- Ship with known bugs rather than chase perfection
- The Power of "Why?"
~3,000 lines of C++17/Qt6. Would love feedback on code architecture and C++ best practices!
GitHub: https://github.com/beginner-jhj/mini_browser
Show HN: Dwm.tmux – a dwm-inspired window manager for tmux
Hey, HN! With so many recent agentic workflows being primarily terminal- and tmux-based, I wanted to share a little project I created about a decade ago.
I've continued to use this as my primary terminal "window manager" and wanted to share in case others might find it useful.
I would love to hear about others' terminal-based workflows and any other tools you may use with similar functionality.
Show HN: Cua-Bench – a benchmark for AI agents in GUI environments
Hey HN, we're excited to share Cua-Bench ( https://github.com/trycua/cua ), an open-source framework for evaluating and training computer-use agents across different environments.
Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.
The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.
With Cua-Bench, you can:
- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)
- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)
- Generate new tasks from natural language prompts
- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)
- Run oracle validations to verify environments before agent evaluation
- Monitor agent runs in real-time with traces and screenshots
All of this works on macOS, Linux, Windows, and Android, and is self-hostable.
To get started:
Install cua-bench:
% pip install cua-bench
Run a basic evaluation:
% cb run dataset datasets/cua-bench-basic --agent demo
Open the monitoring dashboard:
% cb run watch <run_id>
For parallelized evaluations across multiple workers:
% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8
Want to test across different OS variations? Just specify the environment:
% cb run task slack_message --agent your-agent --env windows_xp
% cb run task slack_message --agent your-agent --env macos_sonoma
Generate new tasks from prompts:
% cb task generate "book a flight on kayak.com"
Validate environments with oracle implementations:
% cb run dataset datasets/cua-bench-basic --oracle
The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
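A programmatic reward for a simulated app can be as small as a function over the app's final state. The sketch below is illustrative only: `SlackState`, `reward_slack_message`, and the field names are hypothetical and are not part of the cua-bench API.

```python
from dataclasses import dataclass, field


@dataclass
class SlackState:
    """Hypothetical snapshot of a simulated Slack-like app after an episode."""
    # channel name -> list of message texts posted in that channel
    messages: dict[str, list[str]] = field(default_factory=dict)


def reward_slack_message(state: SlackState, channel: str, expected: str) -> float:
    """Return 1.0 if the expected text was posted in the channel, else 0.0."""
    posted = state.messages.get(channel, [])
    return 1.0 if any(expected.lower() in m.lower() for m in posted) else 0.0


# Example: the agent was asked to announce standup in #general.
state = SlackState(messages={"general": ["Hello team, standup at 10"]})
print(reward_slack_message(state, "general", "standup at 10"))
print(reward_slack_message(state, "random", "standup at 10"))
```

Because the check reads state rather than pixels, the same reward applies no matter which OS theme the app was rendered in.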
We're seeing teams use Cua-Bench for:
- Training computer-use models on mobile and desktop environments
- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)
- RL fine-tuning with shell app simulators
- Systematic evaluation across OS themes and browser versions
- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)
Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.
GitHub: https://github.com/trycua/cua
Docs: https://cua.ai/docs/cuabench
Technical Report: https://cuabench.ai
We'll be here to answer any technical questions and look forward to your comments!
Show HN: Extracting React apps from Figma Make's undocumented binary format
The article explores methods for reverse-engineering Figma design files, allowing users to extract and modify the underlying data, such as vector graphics, text elements, and layer structures, without directly accessing the Figma application.
Show HN: Build Web Automations via Demonstration
Hey HN,
We’ve been building browser agents for a while. In production, we kept converging on the same pattern: deterministic scripts for the happy path, agents only for edge cases. So we built Demonstrate Mode.
The idea is simple: You perform your workflow once in a remote browser. Notte records the interactions and generates deterministic automation code.
How it works:
- Record clicks, inputs, navigations in a cloud browser
- Compile them into deterministic code (no LLM at runtime)
- Run and deploy on managed browser infrastructure
Closest analog is Playwright codegen but:
- Infrastructure is handled (remote browsers, proxies, auth state)
- Code runs in a deployable runtime with logs, retries, and optional agent fallback
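One way to picture "compile recorded interactions into deterministic code" is a function that maps recorded events to Playwright calls. This is a minimal sketch, not Notte's implementation: the event format and `compile_events` are assumptions made for illustration. The emitted `page.goto`, `page.click`, and `page.fill` calls are standard Playwright for Python methods.

```python
def compile_events(events: list[dict]) -> str:
    """Turn recorded events (hypothetical format) into Playwright source lines."""
    lines = []
    for ev in events:
        kind = ev["type"]
        if kind == "navigate":
            lines.append(f"page.goto({ev['url']!r})")
        elif kind == "click":
            lines.append(f"page.click({ev['selector']!r})")
        elif kind == "input":
            lines.append(f"page.fill({ev['selector']!r}, {ev['value']!r})")
        else:
            raise ValueError(f"unsupported event type: {kind}")
    return "\n".join(lines)


# A recording of a login flow, as it might be captured in the remote browser.
recorded = [
    {"type": "navigate", "url": "https://example.com/login"},
    {"type": "input", "selector": "#email", "value": "me@example.com"},
    {"type": "click", "selector": "button[type=submit]"},
]
script = compile_events(recorded)
print(script)
```

The output is plain source text, so it can be versioned and reviewed like any other code, which is the property the post is after.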
Agents are great for prototyping and dynamic steps, but for production we usually want versioned code and predictable cost/behavior. Happy to dive into implementation details in the comments.
Demo: https://www.loom.com/share/f83cb83ecd5e48188dd9741724cde49a
-- Andrea & Lucas, Notte Founders
Show HN: A header-only C++20 compile-time assembler for x86/x64 instructions
The article introduces 'static_asm', a header-only C++20 library that assembles x86/x64 instructions at compile time, so developers can embed low-level machine code directly in their C++ projects.
Show HN: One Human + One Agent = One Browser From Scratch in 20K LOC
Related: https://simonwillison.net/2026/Jan/27/one-human-one-agent-on...
Show HN: We built a type-safe Python ORM for RedisGraph/FalkorDB
We were tired of writing raw Cypher — escaping quotes, zero autocomplete, refactoring nightmares — so we built GraphORM: a type-safe Python ORM for RedisGraph/FalkorDB using pure Python objects.
What it does
Instead of fragile Cypher:
query = """
MATCH (a:User {user_id: 1})-[r1:FRIEND]->(b:User)-[r2:FRIEND]->(c:User)
WHERE c.user_id <> 1 AND b.active = true
WITH b, count(r2) as friend_count
WHERE friend_count > 5
RETURN c, friend_count
ORDER BY friend_count DESC
LIMIT 10
"""
You write type-safe Python:
stmt = select().match(
    (UserA, FRIEND.alias("r1"), UserB),
    (UserB, FRIEND.alias("r2"), UserC)
).where(
    (UserA.user_id == 1) & (UserC.user_id != 1) & (UserB.active == True)
).with_(
    UserB, count(FRIEND.alias("r2")).label("friend_count")
).where(
    count(FRIEND.alias("r2")) > 5
).returns(
    UserC, count(FRIEND.alias("r2")).label("friend_count")
).orderby(
    count(FRIEND.alias("r2")).desc()
).limit(10)
Key features:
• Type-safe schema with Python type hints
• Fluent query builder (select().match().where().returns())
• Automatic batching (flush(batch_size=1000))
• Atomic transactions (with graph.transaction(): ...)
• Zero string escaping: O'Connor and "The Builder" just work
Target audience
• AI/LLM agent devs: store long-term memory as graphs (User → Message → ToolCall)
• Web crawler engineers: insert 10k pages + links in 12 lines vs 80 lines of Cypher
• Social network builders: query "friends of friends" with indegree()/outdegree()
• Data engineers: track lineage (Dataset → Transform → Output)
• Python devs new to graphs: avoid Cypher learning curve
Data insertion: the real game-changer
Raw Cypher nightmare:
queries = [
    """CREATE (:User {email: "alice@example.com", name: "Alice O\\'Connor"})""",
    """CREATE (:User {email: "bob@example.com", name: "Bob \\"The Builder\\""})"""
]
for q in queries:
    graph.query(q)  # No transaction safety!
GraphORM bliss:
alice = User(email="alice@example.com", name="Alice O'Connor")
bob = User(email="bob@example.com", name='Bob "The Builder"')
graph.add_node(alice)
graph.add_edge(Follows(alice, bob, since=1704067200))
graph.flush()  # One network call, atomic transaction
Try it in 30 seconds
pip install graphorm
from graphorm import Node, Edge, Graph
class User(Node):
    __primary_key__ = ["email"]
    email: str
    name: str
class Follows(Edge):
    since: int
graph = Graph("social", host="localhost", port=6379)
graph.create()
alice = User(email="alice@example.com", name="Alice")
bob = User(email="bob@example.com", name="Bob")
graph.add_node(alice)
graph.add_edge(Follows(alice, bob, since=1704067200))
graph.flush()
GitHub: https://github.com/hello-tmst/graphorm
We'd love honest feedback:
• Does this solve a real pain point for you?
• What's missing for production use?
• Any API design suggestions?
Show HN: LemonSlice – Upgrade your voice agents to real-time video
Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9
Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
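The combination of causal and sliding-window attention described above can be pictured as a mask over frames. This is a generic sketch of the technique, not LemonSlice's code: the frame-level granularity, the function name, and the window size are assumptions (a real model masks tokens, not whole frames).

```python
def sliding_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Causal: only j <= i (no looking at future frames, so output can stream).
    Sliding window: only the most recent `window` frames, so memory is bounded.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = sliding_causal_mask(num_frames=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the attention cost per new frame stays constant however long the video runs.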
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
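The post does not spell out the RoPE change. One common reading is that rotary embeddings written as a complex multiplication can be rewritten with real sines and cosines, which gives the same result without complex tensors. The sketch below shows that identity only; the function names, the pairing of adjacent dimensions, and the base of 10000 are assumptions.

```python
import cmath
import math


def rope_complex(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent pair of x by treating it as one complex number."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2 * k / d)
        z = complex(x[2 * k], x[2 * k + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


def rope_real(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Same rotation using only real arithmetic (a 2x2 rotation per pair)."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * k], x[2 * k + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


vec = [0.5, -1.0, 2.0, 0.25]
same = all(
    math.isclose(p, q, abs_tol=1e-9)
    for p, q in zip(rope_complex(vec, pos=7), rope_real(vec, pos=7))
)
print(same)
```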
We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.
Looking forward to your feedback!
EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)
*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.
Show HN: Multi-Agent Framework for Ruby
The article discusses the Chatwoot AI Agents, an open-source project that integrates AI-powered chatbots and virtual agents into the Chatwoot customer engagement platform. The agents leverage large language models to provide intelligent and contextual responses to customer queries, enhancing the overall customer experience.
Show HN: Cloakly – Hide sensitive windows from screen shares in real-time
I’m a developer who spends half my day on screen shares. I got tired of the "pre-call ritual" of closing every private app just to make sure I didn't accidentally show a bank balance or a private message during a demo.
Existing "Focus" modes hide notifications, but they don't help if you need to share your full desktop for context.
I built Cloakly to solve this. It’s a Windows utility that lets you "cloak" specific windows. They stay fully visible and interactive for you, but they are 100% invisible to capture software (Zoom, Teams, Discord, etc.).
How it works (briefly): It leverages Windows OS properties to exclude specific window handles from capture streams. It allows you to keep your reference notes or private chats open on the same screen you are sharing.
Features:
- Dual Reality: You see the window; the audience sees your wallpaper/desktop.
- Ghost Mode: Adjust window transparency to see through to what’s underneath.
- Stealth: The app can hide its own taskbar presence.
I’m launching on Product Hunt today to see if this is a pain point for others or just me. I’d love your feedback on the implementation and whether you’d find this useful in your workflow.
PH Link: https://www.producthunt.com/products/cloakly
Site: https://www.getcloakly.com/
Show HN: PNANA - A TUI Text Editor
I’d like to share PNANA, a lightweight TUI editor built with C++ and FTXUI that I’ve been building for personal use and have now open-sourced. It’s a minimal, fast terminal-based editor focused on simple coding and editing workflows: no bloated features, just the core functionality for terminal-centric use cases.
https://github.com/Cyxuan0311/PNANA
Key pragmatic features
Lightweight C++ core with FTXUI for smooth TUI rendering, fast startup and low resource usage
Basic but solid editing capabilities (syntax highlighting, line numbering, basic navigation)
Simple build process with minimal dependencies, easy to compile and run on Linux/macOS terminals
Early LSP integration support for basic code completion (still polishing, but functional for common languages)
It’s very much an early-stage project—I built it to scratch my own itch for a minimal, self-built TUI editor and learn C++/FTXUI along the way. There are definitely rough edges (e.g., some LSP kinks, limited customization), and it’s not meant to replace mature editors like Vim/Nano—just a small open-source project for folks who like minimal terminal tools or want to learn TUI development with C++.
Any feedback, bug reports, or tiny suggestions are super welcome. I’m slowly iterating on it and would love to learn from the HN community’s insights. Thanks for taking a look!
Show HN: Fuzzy Studio – Apply live effects to videos/camera
Back story:
I've been learning computer graphics on the side for several years now and gain so much joy from smooshing and stretching images/videos. I hope you can get a little joy as well with Fuzzy Studio!
Try applying effects to your camera! My housemates and I have giggled so much making faces with weird effects!
Nothing gets sent to the server; everything is done in the browser! Amazing what we can do. I've only tested on macOS... apologies if your browser/OS is not supported (yet).
Show HN: I wrapped the Zorks with an LLM
I grew up on the Infocom games, and when Microsoft actually open-sourced Zork 1/2/3 I really wanted to figure out how to use LLMs to let you type whatever you want. I always found the amount of language that the games "understood" to be so limiting, even if it was pretty state of the art at the time.
So I figured out how to wrap it with Tambo (and run the game engine in the browser). Basically, whatever you type gets "translated" into zork-speak and passed to the game, and then the LLM takes the game's output and optionally adds flavor. (The little ">_" button at the top exposes the actual game input.)
What was a big surprise to me is multi-turn instructions - you can ask it to "Explore all the rooms in the house until you can't find any more" and it will plug away at the game for 10+ "turns" at a time... like Claude Code for Zork or something
Show HN: AI PDF to ePub Converter
The article discusses the use of PDF to ePub conversion tools powered by AI technology, providing users with a hassle-free method to convert PDF files into ePub format for a seamless reading experience across various devices.
Show HN: We Built the First EU-Sovereignty Audit for Websites
The article discusses an audit of the European Union's policies and institutions, highlighting the need for greater transparency, accountability, and efficiency in the EU's governance. It emphasizes the importance of addressing concerns about the EU's democratic legitimacy and decision-making processes.
Show HN: Marches & Gnats – Coding puzzle game where you program Turing machine
Marches & Gnats is a browser-based coding puzzle game inspired by Advent of Code, but instead of writing code in a conventional programming language, you program a Turing machine.
Each quest presents a concrete problem and a minimal model of computation. You define transition rules, run the machine, inspect the output (or errors), and iterate until it works.
The game is set in 19th-century Estonia during the Romantic era and combines narrative with progressively harder problems, including arithmetic, sorting, parsing, ciphers, and cellular automata.
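The define-rules, run, inspect loop described above can be tried outside the game with a tiny simulator. This is a minimal sketch of a one-tape Turing machine. The rule format, the state names "start" and "halt", and the "_" blank symbol are assumptions and may not match the game's own syntax.

```python
def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Run a one-tape machine and return the final tape contents.

    rules maps (state, symbol) -> (new_symbol, move, new_state),
    where move is "L" or "R". The machine stops in state "halt".
    """
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = cells.get(head, blank)
        if (state, symbol) not in rules:
            raise KeyError(f"no rule for state={state!r}, symbol={symbol!r}")
        new_symbol, move, state = rules[(state, symbol)]
        cells[head] = new_symbol
        head += 1 if move == "R" else -1
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)


# Unary increment: skip over the 1s, then write one more 1 and halt.
increment = {
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("1", "R", "halt"),
}
print(run_turing_machine(increment, "111"))
```

A missing rule raises an error naming the state and symbol, which is roughly the kind of feedback you iterate on in a puzzle like this.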
Show HN: mute your macOS mic to ZERO. But Siri keeps listening
Show HN: Only 1 LLM can fly a drone
SnapBench is a benchmarking tool for serverless functions, enabling developers to measure the performance and cost-efficiency of their cloud functions across different cloud providers and configurations.
Show HN: TetrisBench – Gemini Flash reaches 66% win rate on Tetris against Opus
TetrisBench is a website that provides benchmarking tools and resources for the classic puzzle game Tetris. It offers performance analysis, comparison of different Tetris implementations, and insights into the game's mechanics and optimization techniques.
Show HN: A blog that deletes itself if you stop writing
I built Lapse because most of my blogs died quietly, abandoned after a few months.
lapse.blog is a minimal blogging platform with one rule: if you don't post for 30 days, your blog is permanently deleted. No warnings, no recovery.
How it works:
- No signup. Your unique passphrase grants access to your blog. Same passphrase = same blog. (If two people pick the same one, they'll control the same blog by design)
- The longer you post consistently, the longer it lives. Borrowing from social-media streaks, but for writing. If they can encourage you to Snapchat someone, surely we can encourage ourselves to write.
- Markdown only. No images, no embeds.
- RSS and Atom feeds included.
- Forget your passphrase? Blog gets deleted. Stop posting? Blog gets deleted.
- No ads, no tracking.
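The two core rules above (passphrase as identity, deletion after 30 days of silence) can be sketched in a few lines. This is not Lapse's implementation: hashing the passphrase with SHA-256, the 16-character ID, and the function names are assumptions, and the fixed 30-day window ignores any streak-based extension.

```python
import hashlib
from datetime import datetime, timedelta, timezone

LAPSE_AFTER = timedelta(days=30)  # the 30-day rule stated in the post


def blog_id(passphrase: str) -> str:
    """Derive a stable blog identifier from a passphrase (assumed scheme).

    Same passphrase -> same ID, so there is nothing to sign up for.
    """
    return hashlib.sha256(passphrase.encode("utf-8")).hexdigest()[:16]


def is_lapsed(last_post: datetime, now: datetime) -> bool:
    """True when more than 30 days have passed since the last post."""
    return now - last_post > LAPSE_AFTER


now = datetime(2026, 2, 1, tzinfo=timezone.utc)
print(blog_id("correct horse battery staple") == blog_id("correct horse battery staple"))
print(is_lapsed(now - timedelta(days=31), now))
print(is_lapsed(now - timedelta(days=29), now))
```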
The idea is that impermanence, hopefully, removes the pressure to be perfect, and the deadline offers an incentive to keep writing.
Show HN: An interactive map of US lighthouses and navigational aids
This is an interactive map of US navigational aids and lighthouses, which indicates their location, color, characteristic and any remarks the Coast Guard has attached.
I was sick at home with the flu this weekend, and went on a bit of a Wikipedia deep dive about active American lighthouses. Searching around a bit, it was very hard to find a single source or interactive map of active beacons, or a description of what the "characteristic" meant. The Coast Guard does maintain a list of active lights, though, which it publishes annually (https://www.navcen.uscg.gov/light-list-annual-publication). With some help from Claude Code, it wasn't hard to extract the lat/long and put together a small webapp that shows a map of these light stations and illustrates their characteristic with an animated visualization.
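For readers wondering what a "characteristic" looks like, it is a short code such as "Fl W 4s" (flashing, white, repeating every 4 seconds). The sketch below decodes only the simplest form, rhythm plus color plus optional period. It is not the webapp's parser, the abbreviation tables are a small subset, and grouped forms such as "Fl (2) W 10s" are deliberately rejected.

```python
import re

# Small subset of standard light-rhythm and color abbreviations.
RHYTHMS = {
    "F": "fixed",
    "Fl": "flashing",
    "Q": "quick flashing",
    "Iso": "isophase",
    "Oc": "occulting",
}
COLORS = {"W": "white", "R": "red", "G": "green", "Y": "yellow"}

PATTERN = re.compile(
    r"^(?P<rhythm>[A-Za-z]+) (?P<color>[WRGY])(?: (?P<period>\d+(?:\.\d+)?)s)?$"
)


def describe(characteristic: str) -> str:
    """Expand a simple characteristic such as 'Fl W 4s' into plain English."""
    m = PATTERN.match(characteristic.strip())
    if not m or m["rhythm"] not in RHYTHMS:
        raise ValueError(f"unsupported characteristic: {characteristic!r}")
    text = f"{RHYTHMS[m['rhythm']]} {COLORS[m['color']]}"
    if m["period"]:
        text += f", repeating every {m['period']} s"
    return text


print(describe("Fl W 4s"))
print(describe("Q G"))
```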
Of course, this shouldn't be used as a navigational aid, merely for informational purposes! Though having lived in Seattle and San Francisco I thought it was quite interesting.
Show HN: A 4.8MB native iOS voice notes app built with SwiftUI
Hey HN,
I wanted to share a project I’ve been working on called Convoxa. It’s a native iOS transcriber/summarizer. I had two main goals: keep it efficient and keep it private.
THE TECH STACK
100% Swift & SwiftUI: No heavy cross-platform wrappers or bloated dependencies.
Binary Size: The final build is only 4.8 MB.
Transcription: Uses Apple's latest speech APIs for maximum privacy and efficiency.
THE CHALLENGE: BYPASSING THE 4K CONTEXT LIMIT
The biggest technical hurdle was working with Apple’s foundation models. The default context window is capped at 4096 tokens, which is practically useless for anything over a 10-minute meeting transcript.
I ended up building a recursive chunking method to "feed" the model long-form data without losing the global context of the conversation. I use a sliding window approach where each chunk's summary informs the next, ensuring the final output doesn't "hallucinate" at the seams where the chunks meet. It’s now stable enough for long-form audio while remaining entirely on-device for supported hardware.
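The idea of each chunk's summary informing the next can be sketched as a rolling summary. This is an illustration in Python rather than the app's Swift code: `rolling_summary`, the prompt wording, and the use of a word count as a stand-in for tokens are all assumptions, and `summarize` is whatever model call you have available.

```python
from typing import Callable


def rolling_summary(
    transcript: str,
    summarize: Callable[[str], str],
    chunk_words: int = 200,
) -> str:
    """Summarize a long transcript chunk by chunk.

    Each call sees the summary so far plus the next chunk, so every prompt
    stays small while earlier context is carried forward.
    """
    words = transcript.split()
    summary = ""
    for start in range(0, len(words), chunk_words):
        chunk = " ".join(words[start:start + chunk_words])
        prompt = f"Summary so far:\n{summary}\n\nNext part:\n{chunk}"
        summary = summarize(prompt)
    return summary


# Stand-in model for demonstration: records prompts and returns a numbered tag.
calls: list[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"summary#{len(calls)}"


result = rolling_summary("word " * 450, fake_model, chunk_words=200)
print(result, len(calls))
```

With a real model, `chunk_words` would be chosen so that the running summary plus one chunk fits comfortably under the context limit.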
PRIVACY & AI MODES
On-Device: (Apple Intelligence required) - Total local processing.
Cloud: With reasoning for intelligent insights (Zero Data Retention).
I’m currently in the pre-order phase (out on Feb 3rd) and would love to get some feedback from this community on the performance and the chunking logic.
App Store: https://apps.apple.com/us/app/convoxa-ai-meeting-minutes/id6...
Show HN: TUI for managing XDG default applications
Author here. I made this little TUI program for managing default applications on the Linux desktop.
Maybe some of you will find it useful.
Happy to answer any questions.
Show HN: SF Microclimates
https://microclimates.solofounders.com/
Show HN: Nyxi – Execution-time governance for irreversible
With AI agents getting more autonomous, controlling irreversible actions (sending money, emails, deploying code) is becoming critical.
Nyxi introduces execution-time governance: a clean veto/allow boundary that works regardless of whether proposals come from humans or models.
Public docs and demos here (proprietary, no source): https://github.com/indyh91/Nyxi-Showcase
Main overview: https://github.com/indyh91/Nyxi-Showcase/blob/main/docs/PROD...
Would love feedback on the concept!
Show HN: My AI tracks Polymarket whales with guardrails so it won't bankrupt me
Built two things:
Predictor Agent - Scrapes top Polymarket traders, finds their consensus bets, scores entry quality. Currently tracking 51 real signals.
AgentWallet - The "financial leash" I built so the agent can't go rogue. Spend limits, approval thresholds, time windows, full audit trail.
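The guardrails listed above (spend limits, approval thresholds, time windows, audit trail) can be sketched as one policy check. This is not AgentWallet's API: `SpendPolicy`, its fields, and the decision strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SpendPolicy:
    """Hypothetical guardrail policy for an agent that can spend money."""
    per_tx_limit: float        # hard cap per transaction
    approval_threshold: float  # above this, a human must approve
    allowed_hours: range       # hours of the day when spending is permitted
    audit_log: list = field(default_factory=list)

    def check(self, amount: float, when: datetime) -> str:
        """Return 'allow', 'needs_approval', or 'deny', and record the decision."""
        if when.hour not in self.allowed_hours:
            decision = "deny"
        elif amount > self.per_tx_limit:
            decision = "deny"
        elif amount > self.approval_threshold:
            decision = "needs_approval"
        else:
            decision = "allow"
        self.audit_log.append((when.isoformat(), amount, decision))
        return decision


policy = SpendPolicy(per_tx_limit=100.0, approval_threshold=25.0, allowed_hours=range(9, 17))
morning = datetime(2026, 1, 6, 10, 0)
print(policy.check(10.0, morning))
print(policy.check(50.0, morning))
print(policy.check(500.0, morning))
print(policy.check(10.0, datetime(2026, 1, 6, 23, 0)))
```

Every call is logged, including denials, so the audit trail shows what the agent attempted as well as what it was allowed to do.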
Live demos:
Predictor signals: https://predictor-dashboard.vercel.app
AgentWallet: https://agentwallet-dashboard.vercel.app
The idea: AI agents will need to spend money. Someone needs to build the guardrails. That's AgentWallet.
GitHub: https://github.com/JackD720/agentwallet
Show HN: Netfence – Like Envoy for eBPF Filters
To power the firewalling for our agents so that they couldn't contact arbitrary services, I built netfence. It's like Envoy but for eBPF filters.
It allows you to define different DNS-based rules that are resolved in a local daemon to IPs, then pushed to the eBPF filter to allow traffic. By doing it this way, we can still allow DNS-defined rules, but prevent contacting random IPs.
There's also no network performance penalty, since it's just DNS lookups and eBPF filters referencing memory.
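The resolve-then-allow flow described above can be modeled in userspace. This is a sketch of the idea, not netfence's code: the real filter lives in eBPF, and `build_allowlist`, `is_allowed`, the hostname, and the injected `resolve` function are all assumptions. The IPs come from documentation-reserved ranges.

```python
import ipaddress
from typing import Callable, Iterable


def build_allowlist(
    hostnames: Iterable[str],
    resolve: Callable[[str], Iterable[str]],
) -> set[str]:
    """Resolve DNS-based rules to the concrete IPs the filter should allow."""
    allowed = set()
    for name in hostnames:
        for ip in resolve(name):
            # Normalize and validate each address before it reaches the filter.
            allowed.add(str(ipaddress.ip_address(ip)))
    return allowed


def is_allowed(dest_ip: str, allowlist: set[str]) -> bool:
    """Per-packet decision: a plain set lookup, no DNS involved."""
    return str(ipaddress.ip_address(dest_ip)) in allowlist


# Fake resolver standing in for the local daemon's DNS lookups.
fake_dns = {"pkg.example.com": ["192.0.2.10", "192.0.2.11"]}
allowlist = build_allowlist(["pkg.example.com"], lambda name: fake_dns.get(name, []))

print(is_allowed("192.0.2.10", allowlist))
print(is_allowed("198.51.100.7", allowlist))
```

The split mirrors the post's point: resolution happens once in the daemon, and the hot path is only a membership check against memory.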
It also means you don't have to tamper with the base image, which the agent could potentially manipulate to remove rules (unless you prevent root maybe).
It automatically manages the lifecycle of eBPF filters on cgroups and interfaces, so it works well for both containers and micro VMs (like Firecracker).
You implement a control plane, just like Envoy xDS, through which you can manage the rules of each cgroup/interface. You can even manage DNS through the control plane to dynamically resolve records (which is helpful as a normal DNS server doesn't know which interface/cgroup a request might be coming from).
We specifically use this to allow our agents to only contact S3, pip, apt, and npm.