Show HN: Crawlspace – A centralized web crawling platform built on Cloudflare
andrethegiant Tuesday, January 21, 2025
Crawlspace is a centralized web crawling platform that benefits crawler developers AND website owners. Developers can affordably crawl tens of millions of pages per month, scrape with LLMs, and save data in attached storage. Website owners are shielded by a platform-wide TTL cache that absorbs redundant bot traffic.
AI bots are running rampant on the open web. Many recent HN stories[1][2][3][4] describe how web crawlers run amok and hammer websites with DDoS-like traffic. They often do this with blatant disregard for website owners' wishes (e.g. ignoring robots.txt, 429s, Retry-After headers, etc.) because they face no repercussions for deploying poorly-behaved crawlers (and are not incentivized to improve them).
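To make those signals concrete, here's a minimal sketch of what honoring them could look like, using only the Python standard library. This is an illustration, not Crawlspace's actual code: the helper names `is_allowed` and `retry_after_seconds` and the `examplebot` user agent are made up for the example.

```python
import urllib.robotparser
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt text that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header into a wait time in seconds.

    The header is either a number of seconds or an HTTP-date.
    Unparseable values fall back to 0.0 (a choice made for this sketch).
    """
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    now = now or datetime.now(timezone.utc)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0
    if retry_at.tzinfo is None:
        # Treat a date without timezone information as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "examplebot", "https://example.com/private/x"))
print(retry_after_seconds("120"))
```

A crawler would call `is_allowed` before every request and sleep for `retry_after_seconds(...)` whenever it gets a 429 that carries the header.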
The knee-jerk reaction to fix this problem is to give more tools to website owners. Maintaining denylists of IP addresses and user agents, implementing honeypots and tarpits, and so on are tactics that website owners use to combat the problem. However, this results in an endless arms race between web crawlers and website owners, as each side tries new ways of one-upping the other.
Crawlspace takes a different approach _by providing a convenient and affordable platform to web crawler developers_. By funneling web crawling traffic through a centralized platform, we can do neat things like making crawlers well-behaved by default, implementing proper caching, and more: all the tedium that developers don't want to (and therefore don't) do themselves. Music streaming services like Spotify used convenience and affordability to curb music piracy; we're following the same playbook to curb rampant bot traffic on the internet.
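As a rough illustration of how a shared TTL cache absorbs redundant traffic, here's a minimal in-memory sketch. It is not how Crawlspace implements its cache; `TTLCache`, `get_or_fetch`, and the injected `fetch` callable are hypothetical names, and a real platform-wide cache would live in shared storage rather than a Python dict.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve repeated requests for a URL from memory until the entry expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # url -> (expiry time, cached body)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            # Fresh entry: the origin server is not contacted.
            return entry[1]
        # Missing or expired: hit the origin once and remember the result.
        body = fetch(url)
        self._entries[url] = (now + self.ttl_seconds, body)
        return body


origin_hits = []


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request; records each origin hit.
    origin_hits.append(url)
    return "<html>page</html>"


cache = TTLCache(ttl_seconds=60)
for _ in range(3):
    cache.get_or_fetch("https://example.com/", fake_fetch)
print(len(origin_hits))
```

If three different crawlers ask for the same page within the TTL, the website only sees one request.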
In about 50 lines of code, you can deploy a performant and polite web crawler on Cloudflare's network. Every crawler gets its own queue, SQLite database, vector database, and S3-compatible bucket, which allows you to query your crawl as it's crawling with either SQL statements or a RAG chat interface. We've stitched together 10+ Cloudflare products including Queues, Durable Objects, Browser Rendering, Workers AI, D1, R2, and Vectorize.
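To give a feel for querying a crawl with SQL, here's a small sketch against a local in-memory SQLite database. The `pages` table, its columns, and the sample rows are invented for the example; they are not Crawlspace's schema or API.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `pages` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, status INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages (url, status, title) VALUES (?, ?, ?)",
        [
            ("https://example.com/", 200, "Home"),
            ("https://example.com/about", 200, "About"),
            ("https://example.com/busy", 429, None),
        ],
    )
    conn.commit()
    return conn


def status_counts(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Count crawled pages per HTTP status code."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM pages GROUP BY status ORDER BY status"
    ).fetchall()


conn = build_demo_db()
print(status_counts(conn))
```

The same kind of `GROUP BY` query can be run while rows are still being inserted, which is what makes checking on a crawl in progress practical.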
Please let us know what you think! Happy to answer any questions.
[1] https://news.ycombinator.com/item?id=42549624
[2] https://news.ycombinator.com/item?id=42660377
[3] https://news.ycombinator.com/item?id=42725147
[4] https://news.ycombinator.com/item?id=42750420