Show stories

Show HN: Kolibri, a DIY music club in Sweden
EastLondonCoder about 10 hours ago

We’re Maria and Jonatan, and we run a small DIY music club in Norrköping, Sweden, called Kolibri.

We run it through a small Swedish company. We pay artists, handle logistics, and take operations seriously. But it has still behaved like a tiny cultural startup in the most relevant way: you have to build trust, form a recognisable identity, pace yourself, avoid burnout, and make something people genuinely return to, without big budgets or growth hacks. We run it on the last Friday of every month in a small restaurant venue, typically 50–70 paying guests.

What we built isn’t an app. It’s a repeatable local format: a standing night where strangers become regulars, centred on music rather than networking.

We put up a simple anchor site with schedule + photos/video: https://kolibrinkpg.com/

What you can “try” on the site:

  * Photos and short videos from nights (atmosphere + scale)
  * A sense of programming/curation (what we book, how we sequence a night)
  * Enough context to copy parts of the format if you’re building something similar locally
How it started: almost accidentally. I was doing one of many remote music sessions with a friend from London, passing Ableton projects back and forth while talking over FaceTime. One evening I ran out of beer and wandered into a nearby restaurant (Mitropa). A few conversations later we had a date on the calendar.

That restaurant is still the venue. It’s owned by a local family: one runs the kitchen, another manages the space. Over time they’ve become close to us, so I’ll put it plainly: if they called and needed help, we’d drop everything.

Maria was quickly dubbed klubbvärdinnan (hostess), partly as a joke. In Sweden in the 1970s, posh nightclubs sometimes had a klubbvärdinna, a kind of social anchor. She later adopted it as her DJ alias, and the role became real: greeting people, recognising newcomers who look uncertain, and quietly setting the tone for how people treat one another.

The novelty (if there is any) is that we treat the night like a designed social system:

  * Curation is governance. If the music is coherent and emotionally “true”, people relax. If it’s generic, people perform.
  * The room needs a host layer. Someone has to make it socially safe to arrive alone.
  * Regulars are made, not acquired. People return when they feel recognised and when the night has a consistent identity.
  * DIY constraints create legitimacy. Turning a corner restaurant into a club on a shoestring sounds amateurish, but it reads as real.
  * Behavioural boundaries are practical. If newcomers can’t trust the room, the whole thing stops working.
On marketing: we learned quickly that “posting harder” isn’t the same as building a local thing. What worked best was analogue outreach: we walked around town, visited local businesses we genuinely like, bought something, introduced ourselves, and asked if we could leave a flyer. It’s boring, but it builds trust because it’s human, not algorithmic.

A concrete example: early on we needed Instagram content that could show music visually without filming crowds in a club. We started filming headphone-walk clips: one person, headphones on, walking through town to a track we chose. It looked good, stylised, cinematic, and that mattered more than we expected. People didn’t just tolerate being filmed; many wanted to be in the videos. Then we’d invite them for a couple of free drinks afterwards as a thank-you and a chance to actually talk. That was a reliable early trust-building mechanism.

At one point we were offered a larger venue with a proper budget. It was tempting. But we’d just hosted our first live gig at Mitropa and felt something click. We realised the format works because it’s small and grounded. Scale would change the social physics.

kolibrinkpg.com
Show HN: Play Zener Cards
nirvanist about 2 hours ago

just play zener cards. don't judge :)

zener.cards
tullie 3 days ago

Show HN: ShapedQL – A SQL engine for multi-stage ranking and RAG

Hi HN,

I’m Tullie, founder of Shaped. Previously, I was a researcher at Meta AI, worked on ranking for Instagram Reels, and was a contributor to PyTorch Lightning.

We built ShapedQL because we noticed that while retrieval (finding 1,000 items) has been commoditized by vector DBs, ranking (finding the best 10 items) is still an infrastructure problem.

To build a decent "For You" feed or a RAG system with long-term memory, you usually have to stitch together a vector DB (Pinecone/Milvus), a feature store (Redis), an inference service, and thousands of lines of Python to handle business logic and reranking.

We built an engine that consolidates this into a single SQL dialect. It compiles declarative queries into high-performance, multi-stage ranking pipelines.

HOW IT WORKS:

Instead of just SELECT, ShapedQL operates in four stages native to recommendation systems:

  * RETRIEVE: Fetch candidates via Hybrid Search (Keywords + Vectors) or Collaborative Filtering.
  * FILTER: Apply hard constraints (e.g., "inventory > 0").
  * SCORE: Rank results using real-time models (e.g., p(click) or p(relevance)).
  * REORDER: Apply diversity logic so your Agent/User doesn’t see 10 nearly identical results.
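For intuition, the four stages can be sketched in plain Python. This is an illustrative toy, not the ShapedQL engine or SDK; `preference_score` and `relevance_score` are hypothetical stand-ins for real-time model calls, and retrieval is assumed to have happened upstream:

```python
# Toy four-stage ranking pipeline (illustrative only, not ShapedQL internals).

def preference_score(user, item):
    # hypothetical personalization model: affinity for the item's category
    return user["affinity"].get(item["category"], 0.0)

def relevance_score(item, prompt):
    # hypothetical relevance model: crude keyword overlap
    return float(any(w in item["description"].lower() for w in prompt.lower().split()))

def rank(candidates, *, budget, user, prompt, limit=10):
    # FILTER: hard business constraints
    pool = [c for c in candidates if c["price"] <= budget]
    # SCORE: weighted blend of model scores, mirroring the ORDER BY clause below
    scored = sorted(
        pool,
        key=lambda c: 0.5 * preference_score(user, c) + 0.3 * relevance_score(c, prompt),
        reverse=True,
    )
    # REORDER: toy diversity rule -- at most one result per category
    out, seen = [], set()
    for c in scored:
        if c["category"] not in seen:
            out.append(c)
            seen.add(c["category"])
        if len(out) == limit:
            break
    return out

candidates = [
    {"item_id": 1, "category": "flight", "price": 300, "description": "Flight to Paris"},
    {"item_id": 2, "category": "flight", "price": 250, "description": "Flight to Paris, red-eye"},
    {"item_id": 3, "category": "hotel", "price": 900, "description": "Hotel near the Louvre"},
    {"item_id": 4, "category": "hotel", "price": 120, "description": "Budget hotel in Paris"},
]
user = {"affinity": {"flight": 1.0, "hotel": 0.5}}
results = rank(candidates, budget=500, user=user, prompt="trip to Paris")
```

The point of the SQL dialect is that this kind of filter/score/reorder logic becomes a declarative query instead of hand-rolled glue code.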

THE SYNTAX: Here is what a RAG query looks like. This replaces about 500 lines of standard Python/LangChain code:

    SELECT item_id, description, price
    FROM
      -- Retrieval: Hybrid search across multiple indexes
      search_flights("$param.user_prompt", "$param.context"),
      search_hotels("$param.user_prompt", "$param.context")
    WHERE
      -- Filtering: Hard business constraints
      price <= "$param.budget" AND is_available("$param.dates")
    ORDER BY
      -- Scoring: Real-time reranking (Personalization + Relevance)
      0.5 * preference_score(user, item) +
      0.3 * relevance_score(item, "$param.user_prompt")
    LIMIT 20

If you don’t like SQL, you can also use our Python and Typescript SDKs. I’d love to know what you think of the syntax and the abstraction layer!

playground.shaped.ai
Show HN: Autonomous recovery for distributed training jobs
tsvoboda about 9 hours ago

Hi HN! We’re TensorPool. We help companies access and optimize large scale compute for training foundation models.

The Problem

It’s been almost a year since we finished YC, and we’ve just crossed 100,000 multinode training GPU hours run on our platform.

On those training runs, we’ve seen countless 3am job crashes because of issues like an Xid error from a flaky GPU or an S3 timeout that corrupted a checkpoint save. By the time you wake up and notice, you've lost 8+ hours of compute. You scramble to diagnose the issue, manually restart from the last checkpoint, and hope it doesn't happen again. Rinse and repeat.

For training runs that take days to weeks, this constant babysitting is exhausting and expensive. The research iteration cycles lost can also make or break a model release (especially for short reservations).

What We Built

This agent monitors your training jobs and autonomously recovers them when things go wrong. It works with Kubernetes, Slurm, and TensorPool Jobs.

We originally built the TensorPool Agent as an internal tool to help us debug failures with our own customers. Over time, we realized its performance was so good that we could automate the entire triage process. We're now releasing a public beta for people to use.

Best case: The TensorPool Agent detects the failure, diagnoses the root cause, fixes it, and restarts your job from the last checkpoint – all while you sleep ;)

Worst case: If the TensorPool agent can't fix the issue automatically, it delivers a preliminary RCA and a list of actions it attempted, giving you a head start on debugging.

How It Works

1) Registration – You provide credentials to your job scheduler via our dashboard. Permissions are granted on a whitelist basis; you explicitly control what actions the agent can take.

2) Monitoring – The agent continuously monitors your job for failure conditions.

3) Recovery – On failure, the agent analyzes logs and attempts to diagnose the issue. If successful, it restarts the job from the last checkpoint and resumes monitoring. If not, you get an alert with full context.

Target Failure Modes

The agent is specifically designed for runtime errors that occur deep into training, like:

- CUDA OOM: Memory leaks, gradient explosions

- Xid errors: GPU hardware faults (Xid 79, 63, 48, etc.)

- Distributed communication failures: NCCL timeouts, rank failures

- Storage I/O errors: Checkpoint corruption

- Network issues: S3 request timeouts on mounted object storage
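The failure modes above lend themselves to a triage sketch. This is illustrative only, not the TensorPool agent's actual logic (the post says the real triage is agent-driven); the log patterns and recoverability judgments here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    cause: str
    recoverable: bool  # can we auto-restart from the last checkpoint?

def diagnose(logs: str) -> Diagnosis:
    # Toy pattern matcher over a few of the failure modes listed above.
    if "CUDA out of memory" in logs:
        return Diagnosis("cuda_oom", recoverable=True)
    if "NCCL" in logs and "timeout" in logs.lower():
        return Diagnosis("nccl_timeout", recoverable=True)
    if "Xid 79" in logs:  # GPU fell off the bus: needs a node swap, not a restart
        return Diagnosis("gpu_hardware_fault", recoverable=False)
    return Diagnosis("unknown", recoverable=False)

def next_action(status: str, logs: str) -> str:
    """One step of the monitor loop: keep watching, restart, or escalate."""
    if status != "failed":
        return "keep_watching"
    d = diagnose(logs)
    return "restart_from_checkpoint" if d.recoverable else "alert_with_rca"
```

The interesting engineering is in making `diagnose` reliable enough to trust with automatic restarts, and in falling back to an RCA report when it isn't.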

docs.tensorpool.dev
Show HN: Transcribee: YouTube transcriber that builds a knowledge base
ofabioroma about 7 hours ago

Transcribee is an open-source automated transcription tool that uses natural language processing to convert audio files into text transcripts. It provides a simple and efficient way to transcribe audio content, with features like speaker diarization and multi-language support.

github.com
gjarrosson about 4 hours ago

Show HN: We review YC applications for free – with feedback from YC founders

Hi HN, We just launched YC Roaster, a free YC application review service run by Lobster Capital. If you’re applying to Y Combinator, you can submit your application and get written feedback from founders who’ve actually gone through YC and built real companies.

How it works:

  * You submit your YC application (PDF)
  * It’s reviewed by successful YC founders in our network
  * Everyone gets written feedback
  * If we think you have a strong shot, we offer a short 1-on-1 Zoom call

Why are we doing this for free? Lobster Capital is a VC fund that exclusively invests in YC companies. We’ve reviewed hundreds of YC applications over the years and backed 100+ YC startups. This is our way of giving back, and hopefully helping more strong teams get in. There’s no guarantee of acceptance, no pitch requirement, and no obligation to talk to us afterward.

What we’re curious about:

  * What parts of the YC application do founders struggle with the most?
  * What feedback have you found most useful when applying?

Website: https://www.ycroaster.com

Happy to answer questions or take feedback (including criticism).

ycroaster.com
Show HN: VCluster Free – Free K8s Multi-Tenancy with Virtual Clusters
gentele about 7 hours ago

vCluster is launching a free version of its enterprise-grade Kubernetes platform, offering advanced features like multi-tenancy, resource isolation, and high availability at no cost, enabling developers to build and deploy applications more efficiently.

vcluster.com
Show HN: ARC-AGI-3 Toolkit
gkamradt about 5 hours ago

docs.arcprize.org
Show HN: SimpleSVGs – Free Online SVG Optimizer Multiple SVG Files at Once
firtaet about 8 hours ago

SimpleSVGs is a free online SVG optimizer that can process multiple SVG files at once.

simplesvgs.online
Show HN: Dwm.tmux – a dwm-inspired window manager for tmux
saysjonathan 6 days ago

Hey, HN! With all the recent agentic workflows being primarily terminal- and tmux-based, I wanted to share a little project I created about a decade ago.

I've continued to use this as my primary terminal "window manager" and wanted to share in case others might find it useful.

I would love to hear about others' terminal-based workflows and any other tools you may use with similar functionality.

github.com
Show HN: SHDL – A minimal hardware description language built from logic gates
rafa_rrayes 1 day ago

Hi, everyone!

I built SHDL (Simple Hardware Description Language) as an experiment in stripping hardware description down to its absolute fundamentals.

In SHDL, there are no arithmetic operators, no implicit bit widths, and no high-level constructs. You build everything explicitly from logic gates and wires, and then compose larger components hierarchically. The goal is not synthesis or performance, but understanding: what digital systems actually look like when abstractions are removed.

SHDL is accompanied by PySHDL, a Python interface that lets you load circuits, poke inputs, step the simulation, and observe outputs. Under the hood, SHDL compiles circuits to C for fast execution, but the language itself remains intentionally small and transparent.

This is not meant to replace Verilog or VHDL. It’s aimed at:

  * learning digital logic from first principles
  * experimenting with HDL and language design
  * teaching or visualizing how complex hardware emerges from simple gates

I would especially appreciate feedback on:

  * the language design choices
  * what feels unnecessarily restrictive vs. educationally valuable
  * whether this kind of “anti-abstraction” HDL is useful to you

Repo: https://github.com/rafa-rrayes/SHDL

Python package: PySHDL on PyPI

To make this concrete, here are a few small working examples written in SHDL:

1. Full Adder

component FullAdder(A, B, Cin) -> (Sum, Cout) {

    x1: XOR; a1: AND;
    x2: XOR; a2: AND;
    o1: OR;

    connect {
        A -> x1.A; B -> x1.B;
        A -> a1.A; B -> a1.B;

        x1.O -> x2.A; Cin -> x2.B;
        x1.O -> a2.A; Cin -> a2.B;
        a1.O -> o1.A; a2.O -> o1.B;

        x2.O -> Sum; o1.O -> Cout;
    }
}
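The gate-level wiring above can be cross-checked against integer addition with a few lines of Python. This mirrors the netlist in the host language (it doesn't run SHDL itself): the two XORs produce Sum, and the two ANDs feeding an OR produce Cout.

```python
from itertools import product

def full_adder(a: int, b: int, cin: int):
    # Mirrors the SHDL netlist: x1/x2 compute Sum, a1/a2/o1 compute Cout.
    x1 = a ^ b                    # x1.O
    s = x1 ^ cin                  # x2.O -> Sum
    cout = (a & b) | (x1 & cin)   # a1.O, a2.O -> o1.O -> Cout
    return s, cout

# Exhaustively verify against arithmetic: a + b + cin == Sum + 2*Cout
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert a + b + cin == s + 2 * cout
```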

2. 16-bit Register

# clk must be high for two cycles to store a value

component Register16(In[16], clk) -> (Out[16]) {

    >i[16]{
        a1{i}: AND;
        a2{i}: AND;
        not1{i}: NOT;
        nor1{i}: NOR;
        nor2{i}: NOR;
    }
    
    connect {
        >i[16]{
            # Capture on clk
            In[{i}] -> a1{i}.A;
            In[{i}] -> not1{i}.A;
            not1{i}.O -> a2{i}.A;
            
            clk -> a1{i}.B;
            clk -> a2{i}.B;
            
            a1{i}.O -> nor1{i}.A;
            a2{i}.O -> nor2{i}.A;
            nor1{i}.O -> nor2{i}.B;
            nor2{i}.O -> nor1{i}.B;
            nor2{i}.O -> Out[{i}];
        }
    }
}

3. 16-bit Ripple-Carry Adder

use fullAdder::{FullAdder};

component Adder16(A[16], B[16], Cin) -> (Sum[16], Cout) {

    >i[16]{ fa{i}: FullAdder; }

    connect {
        A[1] -> fa1.A;
        B[1] -> fa1.B;
        Cin -> fa1.Cin;
        fa1.Sum -> Sum[1];

        >i[2,16]{
            A[{i}] -> fa{i}.A;
            B[{i}] -> fa{i}.B;
            fa{i-1}.Cout -> fa{i}.Cin;
            fa{i}.Sum -> Sum[{i}];
        }

        fa16.Cout -> Cout;
    }
}

github.com
Show HN: A MitM proxy to see what your LLM tools are sending
jmuncor 1 day ago

I built this out of curiosity about what Claude Code was actually sending to the API. Turns out, watching your tokens tick up in real-time is oddly satisfying.

Sherlock sits between your LLM tools and the API, showing you every request on a live dashboard and auto-saving a copy of every prompt as Markdown and JSON.

github.com
yuppiepuppie 1 day ago

Show HN: The HN Arcade

I love seeing all the small games that people build and post to this site.

I don't want to forget any, so I have built a directory/arcade for the games here that I maintain.

Feel free to check it out, add your game if it's missing, and let me know what you think. Thanks!

andrewgy8.github.io
Show HN: Shelvy Books
tekkie00 1 day ago

Hey HN! I built a little side project I wanted to share.

Shelvy is a free, visual bookshelf app where you can organize books you're reading, want to read, or have finished. Sign in to save your own collection.

Not monetized, no ads, no tracking beyond basic auth. Just a fun weekend project that grew a bit.

Live: https://shelvybooks.com

Would love any feedback on the UX or feature ideas!

shelvybooks.com
thirdavenue about 15 hours ago

Show HN: An Open Source Alternative to Vercel/Render/Netlify

Shor Labs is an open-source alternative to hosting platforms like Vercel, Render, and Netlify.

shorlabs.com
Show HN: Build Web Automations via Demonstration
ogandreakiro 3 days ago

Hey HN,

We’ve been building browser agents for a while. In production, we kept converging on the same pattern: deterministic scripts for the happy path, agents only for edge cases. So we built Demonstrate Mode.

The idea is simple: You perform your workflow once in a remote browser. Notte records the interactions and generates deterministic automation code.

How it works:

  * Record clicks, inputs, and navigations in a cloud browser
  * Compile them into deterministic code (no LLM at runtime)
  * Run and deploy on managed browser infrastructure

Closest analog is Playwright codegen, but:

  * Infrastructure is handled (remote browsers, proxies, auth state)
  * Code runs in a deployable runtime with logs, retries, and optional agent fallback

Agents are great for prototyping and dynamic steps, but for production we usually want versioned code and predictable cost/behavior. Happy to dive into implementation details in the comments.

Demo: https://www.loom.com/share/f83cb83ecd5e48188dd9741724cde49a

-- Andrea & Lucas, Notte Founders

notte.cc
Show HN: Externalized Properties, a modern Java configuration library
jeyjeyemem 3 days ago

Externalized Properties is a powerful configuration library that supports resolving properties from external sources such as files, databases, Git repositories, and custom sources.

github.com
lcolucci 2 days ago

Show HN: LemonSlice – Upgrade your voice agents to real-time video

Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
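For intuition on the sliding-window change, here is a minimal NumPy sketch of the attention mask. This illustrates the general technique, not LemonSlice's implementation: each position attends to itself and at most window-1 previous positions, so per-step attention cost and KV-cache size stop growing with sequence (video) length.

```python
import numpy as np

def causal_sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """mask[i, j] is True iff position i may attend to position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # causal: no future positions; sliding window: only the last `window` positions
    return (j <= i) & (j > i - window)

mask = causal_sliding_window_mask(seq_len=8, window=3)
# each row attends to at most 3 positions, and never to the future
```

A full (bidirectional) mask would be all-True, which is why standard video diffusion can't stream: every frame depends on frames after it.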

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback!

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.

Show HN: Pinecone Explorer – Desktop GUI for the Pinecone vector database
arsentjev 2 days ago

https://github.com/stepandel/pinecone-explorer

pinecone-explorer.com
Show HN: I built a small browser engine from scratch in C++
crediblejhj 1 day ago

Hi HN! Korean high school senior here, about to start CS in college.

I built a browser engine from scratch in C++ to understand how browsers work. First time using C++, 8 weeks of development, lots of debugging—but it works!

Features:

- HTML parsing with error correction

- CSS cascade and inheritance

- Block/inline layout engine

- Async image loading + caching

- Link navigation + history

Hardest parts:

- String parsing (HTML, CSS)

- Rendering

- Image Caching & Layout Reflowing

What I learned (beyond code):

- Systematic debugging is crucial

- Ship with known bugs rather than chase perfection

- The Power of "Why?"

~3,000 lines of C++17/Qt6. Would love feedback on code architecture and C++ best practices!

GitHub: https://github.com/beginner-jhj/mini_browser

github.com
Show HN: Built a way to validate ideas with AI personas and Simulated Community
justincxa about 9 hours ago

Built a way to validate ideas with AI personas and a simulated community: input a niche and a seed prompt, test your ideas, and watch the community respond the way a real one would. The beta is free now!

nichesim.com
Show HN: vind – A Better Kind (Kubernetes in Docker)
saiyampathak about 9 hours ago

Vind is an open-source tool for running Kubernetes clusters inside Docker, positioned as an alternative to Kind.

github.com
Show HN: I made a dual-bootable NixBSD (NixOS and FreeBSD) image
jonhermansen about 10 hours ago

I've been working on getting NixBSD (Nix package manager + FreeBSD) to boot alongside NixOS on a shared ZFS pool. The result is a <2GB disk image you can try in QEMU or virt-manager.

What works:

    - GRUB chainloads FreeBSD's bootloader
    - Both systems share a ZFS pool
    - Everything is defined in a single Nix flake
    - Fully reproducible builds (some dependencies are now cached on Cachix)
Planned:

    - Support native compilation of NixBSD (currently cross-compiled on Linux)
    - Many shortcuts were taken to get this working, needs lots of cleanup
    - Add a semi-automated installer like nixos-wizard
Try it:

    qemu-system-x86_64 -enable-kvm -m 2048 \
      -bios /usr/share/ovmf/OVMF.fd \
      -drive file=nixos.root.img,format=raw
Login: nixos/nixos or root/toor

The hardest parts were getting mounts working at boot, making the bootloader setup idempotent, and debugging early init. This disk image could potentially work on a USB stick with a bit more work.

This is very much experimental. My goal is to eventually produce a proper NixBSD installation ISO and consolidate all configuration into one repository while still consuming upstream NixBSD as a flake.

Download: https://github.com/jonhermansen/nixbsd-demo/releases/tag/bui...

Feel free to leave feedback here or on GitHub! Thanks!

github.com
Show HN: Cursor for Userscripts
mifydev 1 day ago

I’ve been experimenting with embedding a Claude Code/Cursor-style coding agent directly into the browser.

At a high level, the agent generates and maintains userscripts and CSS that are re-applied on page load. Rather than just editing the DOM via JS in the console, the agent treats the page and its DOM as files.

The models are often trained in RL sandboxes with full access to the filesystem and bash, so they are really good at using it. So to make the agent behave well, I've simulated this environment.

The whole state of a page and its scripts is implemented as a virtual filesystem hacked on top of browser.local storage. URLs are mapped to directories, and the agent starts inside the corresponding directory. It has tools to read/edit files, grep around, and a fake bash command that is just used for running scripts and executing JS code.
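The URL-to-directory mapping can be sketched in a few lines (shown here in Python for brevity; the path scheme is hypothetical, not the extension's actual layout):

```python
from urllib.parse import urlparse

def url_to_dir(url: str) -> str:
    # Map a page URL to a virtual-filesystem directory, e.g.
    # https://news.ycombinator.com/item?id=1 -> /sites/news.ycombinator.com/item
    # (hypothetical scheme: host becomes a directory, path nests beneath it)
    p = urlparse(url)
    path = p.path.strip("/") or "index"
    return f"/sites/{p.netloc}/{path}"
```

Starting the agent inside the mapped directory means its scripts and notes for a given page are exactly the files it sees first, which is the environment shape the models were trained on.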

I've tested only with Opus 4.5 so far, and it works pretty reliably. The state of the file system can be synced to the real filesystem, although because Firefox doesn't support Filesystem API, you need to manually import the fs contents first.

This agent is really useful for extracting things to CSV, but it can also be used for fun.

Demo: https://x.com/ichebykin/status/2015686974439608607

github.com
embedding-shape 3 days ago

Show HN: One Human + One Agent = One Browser From Scratch in 20K LOC

Related: https://simonwillison.net/2026/Jan/27/one-human-one-agent-on...

emsh.cat
Show HN: Cua-Bench – a benchmark for AI agents in GUI environments
someguy101010 3 days ago

Hey HN, we're excited to share Cua-Bench ( https://github.com/trycua/cua ), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

github.com
ShreyaChaurasia about 10 hours ago

Show HN: Nomod payment integrated into usage-based billing stack

Hi HN,

We just shipped a Nomod integration in Flexprice. For context, Flexprice is an open-source billing system that handles invoices, usage, and credit wallets. One gap we wanted to close was supporting region-specific payment providers without breaking billing state.

With this integration:

  * Invoices finalized in Flexprice can be synced to Nomod
  * A hosted Nomod payment link is generated for the invoice
  * Payment status updates flow back into Flexprice
  * Invoices and payment records stay in sync
  * Credits (if applicable) are applied only after payment succeeds

This keeps billing logic simple and avoids reconciliation issues later. There's no demo yet, but docs are live here: https://docs.flexprice.io/integrations/nomod/

Happy to answer questions or hear feedback from folks who've built billing or payment integrations before. Or feel free to join our open-source community if that interests you: http://bit.ly/4huvkDm

cmkr 3 days ago

Show HN: We Built the First EU-Sovereignty Audit for Websites

An automated audit that checks websites for EU digital sovereignty.

lightwaves.io
Weves about 10 hours ago

Show HN: Craft – Claude Code running on a VM with all your workplace docs

I’ve found coding agents to be great at 1/ finding everything they need across large codebases using only bash commands (grep, glob, ls, etc.) and 2/ building new things based on their findings (duh).

What if, instead of a codebase, the files were all your workplace docs? There was a `Google_Drive` folder, a `Linear` folder, a `Slack` folder, and so on. Over the last week, we put together Craft to test this out.

It’s an interface to a coding agent (OpenCode, for model flexibility) running on a virtual machine with:

  1. your company's complete knowledge base represented as directories/files (kept in sync)
  2. free rein to write and execute python/javascript
  3. the ability to create and render artifacts to the user

Demo: https://www.youtube.com/watch?v=Hvjn76YSIRY

Github: https://github.com/onyx-dot-app/onyx/blob/main/web/src/app/c...

It turns out OpenCode does a very good job with docs. Workplace apps also have a natural structure (Slack channels about certain topics, Drive folders for teams, etc.). And since the full metadata of each document can be written to the file, the LLM can define arbitrarily complex filters. At scale, it can write and execute python to extract and filter (and even re-use the verified correct logic later).

Put another way, bash + a file system provides a much more flexible and powerful interface than traditional RAG or MCP, which today’s smarter LLMs are able to take advantage of to great effect. This comes especially in handy for aggregation style questions that require considering thousands (or more) documents.
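One way to picture the "metadata written to the file" point: if each synced doc is a small file carrying its own metadata, arbitrarily complex filters are just ordinary code over files. This is a sketch under assumed file layout, not Onyx's implementation:

```python
import json
import pathlib
import tempfile

# Hypothetical layout: one JSON file per doc, metadata + text together.
root = pathlib.Path(tempfile.mkdtemp())
(root / "Slack").mkdir()
(root / "Slack" / "outage-2024-06-01.json").write_text(json.dumps(
    {"channel": "#incidents", "author": "alice", "text": "db outage, backend-a"}))
(root / "Slack" / "launch-2024-06-02.json").write_text(json.dumps(
    {"channel": "#general", "author": "bob", "text": "launch day!"}))

# An "arbitrarily complex filter" is plain code over files -- no retrieval API.
incidents = [
    json.loads(p.read_text())
    for p in (root / "Slack").glob("*.json")
    if json.loads(p.read_text())["channel"] == "#incidents"
]
```

Aggregation-style questions ("what % of outages involved backend-a?") reduce to the same pattern: glob, parse, count, which is exactly what a coding agent is good at writing and re-running.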

Naturally, it can also create artifacts that stay up to date based on your company docs. So if you wanted “a dashboard to check realtime what % of outages were caused by each backend service” or simply “slides following XYZ format covering the topic I’m presenting at next week’s dev knowledge sharing session”, it can do that too.

Craft (like the rest of Onyx) is open-source, so if you want to run it locally (or mess around with the implementation) you can.

Quickstart guide: https://docs.onyx.app/deployment/getting_started/quickstart Or, you can try it on our cloud: https://cloud.onyx.app/auth/signup (all your data goes on an isolated sandbox).

Either way, we’ve set up a “demo” environment that you can play with while your data gets indexed. Really curious to hear what y’all think!

Show HN: Fuzzy Studio – Apply live effects to videos/camera
ulyssepence 2 days ago

Back story:

I've been learning computer graphics on the side for several years now, and I get so much joy from smooshing and stretching images/videos. I hope you can get a little joy as well with Fuzzy Studio!

Try applying effects to your camera! My housemates and I have giggled so much making faces with weird effects!

Nothing gets sent to the server; everything is done in the browser! Amazing what we can do. I've only tested on macOS... apologies if your browser/OS is not supported (yet).

fuzzy.ulyssepence.com