

TLDR Data 2026-04-27

πŸ“±

Deep Dives

How Airtable Saved Millions by Cutting Archive Storage Costs by 100x (11 minute read)

Airtable cut archive storage costs by about 100x by moving cold, mostly immutable MySQL data into S3 as partitioned Parquet files and querying it with embedded Apache DataFusion. The dataset became 10x smaller, while S3 was about 10x cheaper per byte. A Flink-based migration, bulk and shadow validation, tiered caching, custom secondary indexes, and Parquet bloom filters preserved interactive latency and enterprise guarantees.
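The headline number falls out of simple arithmetic: a dataset that shrinks ~10x, stored on media that is ~10x cheaper per byte, multiplies to ~100x. A back-of-envelope sketch (the byte counts and prices below are illustrative assumptions, not Airtable's actual figures):

```python
# Sketch of the ~100x savings: 10x smaller data on ~10x cheaper storage.
# All inputs are illustrative assumptions, not Airtable's real numbers.
mysql_bytes = 100e12          # hypothetical 100 TB of cold archive data in MySQL
compression = 10              # Parquet dataset became ~10x smaller
price_mysql_per_gb = 0.20     # assumed $/GB-month on provisioned database storage
price_s3_per_gb = 0.02        # assumed $/GB-month on S3 (~10x cheaper per byte)

gb = 1e9
cost_before = mysql_bytes / gb * price_mysql_per_gb
cost_after = (mysql_bytes / compression) / gb * price_s3_per_gb

savings_factor = cost_before / cost_after
print(round(savings_factor))  # the two ~10x factors compound to ~100x
```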
Internal vs. External Storage: What's the Limit of External Tables? (26 minute read)

Internal tables store and manage both data and metadata within the database system, while external tables only store metadata and reference data that lives outside the system, leaving the underlying data untouched. Internal tables enable tighter lifecycle management, whereas external tables decouple storage and compute, making it easier to scale, share, and access large datasets without moving or duplicating data.
Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations (5 minute read)

Spotify's coding agent Honk automated a complex migration of ~1,800 data pipelines by using tooling (Backstage + Fleet Management) to find dependencies, generate code changes, and manage rollout, saving 10 weeks of engineering work. This worked because Spotify's systems were standardized and well-instrumented, with testing in place to reliably make and validate changes at scale.

πŸš€

Opinions & Advice

Measure Less to Learn More: Using Fewer, Higher-quality Metrics to Capture What Matters (8 minute read)

Discord improved experimentation by removing redundant metrics, grouping related ones, and focusing on a small set of clearly defined β€œnorth-star” and guardrail metrics. Adding too many metrics to experiments increases multiple-testing issues and metric correlation, which can require stricter statistical corrections and make real effects harder to detect.
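The multiple-testing problem the piece describes can be made concrete with two standard formulas: the uncorrected family-wise error rate grows as 1 − (1 − α)^n, while a Bonferroni correction shrinks the per-metric threshold to α / n. A minimal sketch with illustrative numbers:

```python
# Why more metrics make real effects harder to detect. With more metrics,
# the chance of at least one false positive rises unless the per-metric
# significance threshold is tightened (here, via Bonferroni: alpha / n).
alpha = 0.05  # experiment-wide false-positive budget

for n in (1, 5, 20, 100):
    fwer_uncorrected = 1 - (1 - alpha) ** n  # P(>=1 false positive) uncorrected
    bonferroni_alpha = alpha / n             # stricter per-metric threshold
    print(n, round(fwer_uncorrected, 3), bonferroni_alpha)
```

At 20 metrics, the uncorrected chance of a spurious "win" is already ~64%, and the corrected per-metric threshold drops to 0.0025, which is why genuine but modest effects become harder to detect.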
Databases Were Not Designed For This (16 minute read)

Databases were built for predictable apps and human-written queries, not AI agents that generate queries on the fly, retry automatically, and can make silent mistakes at scale. Teams now need stronger guardrails like tighter permissions, timeouts, audit logs, idempotent writes, and clearer schemas so databases stay safe when AI becomes the caller.
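One of the guardrails named above, idempotent writes, can be sketched with a client-chosen idempotency key enforced by a primary key, so an agent's automatic retries become no-ops. A minimal stdlib sketch (table and column names are illustrative):

```python
# Idempotent-write guardrail: retried writes carry the same client-chosen
# key, and a primary-key constraint deduplicates them at the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        idempotency_key TEXT PRIMARY KEY,  -- dedupes retried writes
        amount_cents    INTEGER NOT NULL
    )
""")

def record_payment(key: str, amount_cents: int) -> None:
    # INSERT OR IGNORE makes a retry with the same key a silent no-op.
    conn.execute(
        "INSERT OR IGNORE INTO payments VALUES (?, ?)", (key, amount_cents)
    )

record_payment("req-42", 500)
record_payment("req-42", 500)  # agent retries automatically: deduplicated

count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # one row despite two write attempts
```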
When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World (15 minute read)

Cloud high availability can no longer assume regions are safe, independent failure domains: sanctions, data localization laws, conflict zones, and submarine cable cuts can take out an entire region or make it noncompliant. Treat region-level disruption as a first-class risk, with multi-region, jurisdiction-aware data placement, control-plane separation, and dependency audits. The added cost and complexity should be justified with Annual Loss Expectancy modeling rather than assumed.
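The Annual Loss Expectancy framing mentioned above is just ALE = annual rate of occurrence (ARO) × single loss expectancy (SLE), compared against the yearly cost of the mitigation. A sketch with made-up inputs:

```python
# Annual Loss Expectancy (ALE) check for multi-region: ALE = ARO x SLE.
# Every figure below is an illustrative assumption, not real data.
def ale(aro: float, sle: float) -> float:
    return aro * sle

single_loss = 2_000_000        # assumed loss per region-level event ($)
events_per_year = 0.1          # assumed: one such event per decade

ale_single_region = ale(events_per_year, single_loss)   # expected yearly loss
multi_region_extra_cost = 150_000                       # assumed added cost/yr

# Multi-region is justified only if the avoided expected loss exceeds its cost.
justified = ale_single_region > multi_region_extra_cost
print(ale_single_region, justified)
```

Under these assumed numbers the expected loss ($200k/yr) exceeds the mitigation cost ($150k/yr), so the spend is justified; flip the inputs and the same model argues against it, which is the article's point about modeling rather than assuming.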
Stop Letting Tools Lead Your Platform Decisions (3 minute read)

Data platform decisions should start with use cases, constraints, and operating requirements, not with Kafka, Spark, Snowflake, or Airflow. The key questions are latency, data freshness, cost, failure handling, and who will consume the system. Choose the simplest stack that fits the problem, team, budget, and timelines.
πŸ’»

Launches & Tools

DuckDB Extension - Whisper (Tool)

Whisper is a DuckDB extension that lets you transcribe audio into text directly with SQL, making voice data easier to search, analyze, and use alongside your normal tables.
Jaeger adopts OpenTelemetry at its core to solve the AI agent observability gap (4 minute read)

Jaeger v2 rebuilds its core on the OpenTelemetry Collector, natively ingesting OTLP and unifying metrics, logs, and traces in one deployment model to improve ingestion and eliminate translation steps. It's also adding agent-facing interfaces like MCP, ACP, and AG-UI so engineers can use natural language to translate incident context into deterministic trace queries and collaborate with AI agents.
tda-mapper (GitHub Repo)

tda-mapper is a Python library that helps find hidden shapes, clusters, and patterns in messy data using the Mapper algorithm from topological data analysis. It's built to scale to large datasets, works with scikit-learn pipelines, and includes visual tools to explore complex data more clearly.
🎁

Miscellaneous

Measurement Engineering: The Part of Data Science That Will Thrive in AI (13 minute read)

As AI takes over more coding, SQL, and dashboard work, the most valuable data skill may become judgment: knowing what to measure, whether metrics are trustworthy, and how to make decisions when results are unclear. Future top performers won't just build analyses; they'll own the harder question of whether the analysis actually reflects reality.
Fixing What LLMs Get Wrong (22 minute read)

Enterprise LLM systems can produce fluent but factually wrong answers against private structured knowledge, creating a β€œhallucination tax” on pricing, policy, org, and legal data. Fine-tuning, RAG, and static verification each help, but none learn from repeated failures. Reflexion closes the loop by storing natural-language reflections from verified errors in episodic memory and reinjecting them into future prompts.
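The Reflexion loop described above can be reduced to a few moving parts: verify an answer, turn a verified failure into a natural-language reflection, store it in episodic memory, and prepend stored reflections to future prompts. A minimal sketch in which the verifier, names, and example answers are all illustrative stand-ins:

```python
# Reflexion-style loop: verified errors become natural-language reflections
# in an episodic memory that is reinjected into future prompts.
# The verifier and all names/values here are illustrative stand-ins.
episodic_memory: list[str] = []

def verify(answer: str, ground_truth: str) -> bool:
    # Stand-in for verification against private structured knowledge.
    return answer == ground_truth

def record_failure(question: str, wrong: str, right: str) -> None:
    episodic_memory.append(
        f"For '{question}', do not answer '{wrong}'; verified value is '{right}'."
    )

def build_prompt(question: str) -> str:
    # Past reflections are prepended so the model can avoid repeat mistakes.
    lessons = "\n".join(f"Lesson: {r}" for r in episodic_memory)
    return f"{lessons}\nQuestion: {question}" if lessons else f"Question: {question}"

# First attempt fails verification, producing a stored reflection.
answer = "$99/mo"  # imagined wrong model output on pricing data
if not verify(answer, "$79/mo"):
    record_failure("What is the Pro plan price?", answer, "$79/mo")

prompt = build_prompt("What is the Pro plan price?")
print(prompt)  # next attempt sees the lesson alongside the question
```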
⚑

Quick Links

HDFS Lost. How Object Storage and Table Formats Won the Data Lake (3 minute read)

Data systems evolved to decouple storage and compute, making it cheaper and easier to scale.
Airflow 2 reaches end of life (2 minute read)

Security patches and provider updates stopped last week.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? πŸ“°

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? πŸ’Ό

Apply here, create your own role or send a friend's resume to [email protected] and get $1k if we hire them! TLDR is one of Inc.'s Best Bootstrapped businesses of 2025.

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.