Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge (4 minute read)
Meta re-architected Facebook Groups scoped search with a hybrid retrieval stack that combines Unicorn inverted-index lexical search and a 12-layer, 200M-parameter semantic retriever using Faiss ANN over precomputed embeddings. Query preprocessing, feature-level ranking with BM25/TF-IDF plus cosine similarity, and an MTML supermodel jointly optimize clicks, shares, and comments. To scale validation, Meta added an automated Llama 3-based judge in BVT, including a "somewhat relevant" class for finer judgment.
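The core blending idea can be sketched in miniature: score each document with both a lexical signal and embedding cosine similarity, then combine them before ranking. This is a hedged toy sketch, not Meta's implementation; the trivial term-overlap function stands in for BM25/TF-IDF, and the tiny hand-made embeddings stand in for the retriever's output.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lexical_score(query_terms, doc_terms):
    # Toy stand-in for BM25/TF-IDF: fraction of query terms found in the doc.
    return sum(t in doc_terms for t in query_terms) / len(query_terms)

def hybrid_score(query_terms, doc_terms, q_emb, d_emb, alpha=0.5):
    # Blend lexical and semantic relevance into one retrieval score.
    return alpha * lexical_score(query_terms, doc_terms) + (1 - alpha) * cosine(q_emb, d_emb)

# Two candidate posts: (term set, precomputed embedding).
docs = [
    ({"hiking", "boots", "advice"}, [0.9, 0.1, 0.0]),
    ({"trail", "running"}, [0.7, 0.6, 0.1]),
]
query_terms, query_emb = {"hiking", "advice"}, [0.8, 0.2, 0.1]
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(query_terms, d[0], query_emb, d[1]),
    reverse=True,
)
```

At scale the semantic half runs as an approximate nearest-neighbor lookup (Faiss) rather than a brute-force loop, but the blended score is the same shape.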
|
Building a fault-tolerant metrics storage system at Airbnb (9 minute read)
Airbnb built an internal metrics storage system capable of ingesting ~50 million samples/sec across ~1.3 billion time series by introducing strict multi-tenant isolation (per-service tenancy, shuffle sharding) and guardrails on reads/writes to prevent any single workload from overwhelming the system.
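Shuffle sharding is the key isolation trick: each tenant is deterministically assigned a small subset of shards, so two tenants rarly share their full shard set and a noisy tenant can only saturate its own slice of the fleet. A minimal sketch of the assignment step (the parameters and hashing scheme here are illustrative, not Airbnb's):

```python
import hashlib

def shuffle_shard(tenant_id: str, num_shards: int = 16, shard_size: int = 4):
    """Deterministically pick a tenant-specific subset of shards.

    The same tenant always maps to the same subset, and distinct tenants
    get mostly non-overlapping subsets, which bounds the blast radius of
    any single overloaded workload.
    """
    digest = hashlib.sha256(tenant_id.encode()).digest()
    shards = list(range(num_shards))
    # Fisher-Yates shuffle seeded by the tenant's hash bytes.
    for i in range(num_shards - 1, 0, -1):
        j = digest[i % len(digest)] % (i + 1)
        shards[i], shards[j] = shards[j], shards[i]
    return sorted(shards[:shard_size])
```

With 16 shards and subsets of 4, most tenant pairs overlap on at most a shard or two, so one misbehaving service degrades only a fraction of any other tenant's capacity.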
|
|
The Interface Is the Contract (14 minute read)
Global enterprise ontologies often fail because they force different business contexts to share one denotational model for terms like customer, product, and location. The proposed interface-driven approach keeps rich domain-specific ontologies inside each boundary, and exposes only context-aware projections through RDF 1.2 reification, SHACL 1.2 connotations, named graphs, and SPARQL transforms. That enables auditable meaning shifts, safer cross-domain interoperability, and a practical mix of open-world discovery with closed-world reasoning at the interface layer.
|
AI-Ready Data vs. Analytics-Ready Data (10 minute read)
Analytics-ready data is designed for humans: it is aggregated, stable, and explainable, so dashboards can reliably answer "what happened". AI-ready data is built for models: it preserves raw detail, context, semantics, and timeliness so systems can reason about "what should happen next", whereas aggregation often destroys the very signal AI needs.
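The "aggregation destroys signal" point is easy to see with a toy series: a one-hour latency spike that an incident-detection model would care about vanishes entirely in the daily rollup. (The numbers below are invented for illustration.)

```python
# 24 hourly request latencies (ms); hour 13 contains an incident spike.
hourly = [20] * 24
hourly[13] = 500

daily_avg = sum(hourly) / len(hourly)  # analytics-ready rollup: looks healthy
peak = max(hourly)                     # signal still present in the raw data

# daily_avg smooths the spike away; only the raw hourly series exposes it.
```

A dashboard reading the 40 ms daily average sees nothing wrong; a model fed the raw hourly data can still detect and reason about the anomaly.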
|
|
ggsql: A grammar of graphics for SQL (11 minute read)
ggsql is a tool, currently in alpha, that lets users create charts directly inside SQL queries instead of switching to Python or R. It's designed to make data visualization faster, clearer, and more scalable by running chart calculations in the database, while also being easier for AI tools to generate.
|
ML Intern (GitHub Repo)
Hugging Face's ML Intern is an autonomous coding agent that researches, writes, and ships ML projects using docs, datasets, GitHub, and cloud tools. It's basically an AI junior engineer focused on machine learning workflows.
|
Pgweb (GitHub Repo)
pgweb is a lightweight, open-source PostgreSQL client that runs as a local web server, exposing a browser-based UI for exploring tables, running queries, and exporting data, all packaged as a single Go binary with zero dependencies for easy setup across platforms.
|
dbt-score (GitHub Repo)
dbt-score is a linter for dbt metadata quality. It scores models and projects against rules for docs, tests, ownership, naming, and SQL complexity, so teams can enforce standards in CI/CD and catch weak models early. It supports custom rules for org-specific governance.
|
|
Entropy-Guided KV Cache Summarization via Low-Rank Attention Reconstruction (9 minute read)
A new KV-cache compression method for LLMs replaces simple token pruning with a smarter approach: it identifies low-value context, summarizes it mathematically, and stores a compact version instead of deleting it. In tests, this delivered better accuracy and lower memory use than common Top-K or sliding-window methods, suggesting longer context windows can be handled more efficiently.
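The shape of the idea, reduced to a toy: score cached tokens by how informative their attention is, keep the high-value ones verbatim, and collapse the rest into a compact summary slot instead of deleting them. This sketch is a simplified stand-in for the paper's method; it uses a mean vector as the "summary" where the paper reconstructs a low-rank approximation, and all inputs are made up.

```python
import math

def entropy(probs):
    # Shannon entropy of one token's attention distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def summarize_kv(keys, attn_rows, keep=2):
    """Toy KV-cache compression: retain the `keep` highest-entropy tokens
    verbatim and replace the remaining tokens with one mean-vector summary
    slot, so their contribution is compressed rather than discarded."""
    scores = [entropy(row) for row in attn_rows]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept, dropped = sorted(order[:keep]), sorted(order[keep:])
    dim = len(keys[0])
    summary = [sum(keys[i][d] for i in dropped) / len(dropped) for d in range(dim)]
    return [keys[i] for i in kept] + [summary]

keys = [[1, 0], [0, 1], [2, 2], [4, 0]]
attn = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0], [0.25, 0.75]]
compressed = summarize_kv(keys, attn, keep=2)
```

The cache shrinks from four entries to three while still carrying an averaged trace of the pruned tokens, which is the intuition behind beating plain Top-K or sliding-window eviction on accuracy per byte.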
|
Four Horsemen of the AIpocalypse (16 minute read)
Anthropic, OpenAI, and NVIDIA are all running into hard limits of AI economics and infrastructure: uptime issues, capacity shortages, and compute buildouts that lag far behind announced demand. Anthropic's Claude services are cited at 98.79%-99.25% uptime over 90 days, while the broader market reportedly has only 15.2GW of the 114GW of promised AI data-center capacity actually under construction. Rising inference costs are pushing major vendors like Microsoft and Anthropic toward token-based billing, tighter rate limits, and reduced subsidies.
|