📱

Deep Dives

A Conceptual Model for Storage Unification (16 minute read)

Storage unification is becoming increasingly important as object storage dominates, though hot data still requires low-latency solutions. Architectural choices in the tiering approach, virtualization layer, and access method all involve tradeoffs. Shared tiering can reduce duplication (and system cost) but requires strict coordination, ownership, and lifecycle governance to avoid becoming a liability. Materialization, grounded in decades of proven practice, remains the simpler and more reliable approach.

Datadog's Rust-Based Timeseries Engine: Throttling Under Heavy Load (10 minute read)

Datadog's new timeseries engine, built in Rust, achieves major performance gains versus its previous Go and RocksDB-based implementation (60x faster ingestion and 5x faster queries) using a shard‑per‑core architecture and modular LSM-tree storage. Despite perfect sharding, surge traffic or complex queries can push nodes to their limits. To stay robust, Datadog implemented permit-based throttling, shedding or rejecting workloads when metrics like ingestion lag, memory usage, and concurrent query count cross thresholds.
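
As a rough illustration of permit-based throttling, here is a minimal Python sketch. It is not Datadog's implementation: the class name, parameters, and thresholds are all hypothetical, and it only assumes that a query is admitted when a permit is free and the node's health metrics are under their limits.

```python
import threading


class PermitThrottle:
    """Hypothetical admission controller (illustrative names and thresholds)."""

    def __init__(self, max_concurrent_queries, max_ingestion_lag_s, max_memory_fraction):
        # One permit per query that is allowed to run at the same time.
        self._permits = threading.BoundedSemaphore(max_concurrent_queries)
        self.max_ingestion_lag_s = max_ingestion_lag_s
        self.max_memory_fraction = max_memory_fraction

    def try_acquire(self, ingestion_lag_s, memory_fraction):
        """Return True if the query may run; False means shed or reject it."""
        # Reject early when the node is already past a health threshold.
        if ingestion_lag_s > self.max_ingestion_lag_s:
            return False
        if memory_fraction > self.max_memory_fraction:
            return False
        # Non-blocking: fail fast instead of queueing when no permit is free.
        return self._permits.acquire(blocking=False)

    def release(self):
        """Hand the permit back once the query finishes."""
        self._permits.release()
```

A caller would wrap each query in `try_acquire(...)` and `release()`, returning an overload error to the client whenever `try_acquire` returns False.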

How Fresha Accidentally Became one of UK's First StarRocks Production Pioneer (11 minute read)

Fresha's data stack relied on Postgres and Snowflake for ad-hoc analytics, but performance bottlenecks and unpredictable dashboard latency under load prompted a shift. It adopted StarRocks, valued for MySQL protocol connectivity, federated querying over open formats (Iceberg/Paimon), and sub-second real-time analytics via internal tables. Engineering integrated Flink and Kafka pipelines into a hybrid model (real-time, historical, and search lanes), all unified under a single SQL surface. Post-migration, critical dashboards' p95 latency dropped from 20 s to ~200 ms.

🚀

Opinions & Advice

Creating AI Agent Solutions for Warehouse Data Access and Security (12 minute read)

Meta is introducing a multi-agent system to streamline and secure data warehouse access, with user agents helping request data and owner agents managing permissions. The system uses LLMs for context- and task-specific decision-making combined with guardrails like query-level controls, data-access budgets, and rule-based risk checks. This approach reduces friction in access requests while maintaining strong security and auditability.

The Database Has a New User—LLMs—and They Need a Different Database (6 minute read)

Embedding natural language semantic descriptions within PostgreSQL schemas enables self-describing databases, significantly enhancing LLM-driven SQL query generation. TigerData's experiments show up to a 27% boost in SQL accuracy (58% to 86% on certain schemas) using an LLM-generated semantic catalog. Storing and version-controlling these YAML-based semantic annotations tightens context, mitigates misinterpretation, and streamlines agentic data interaction. An implementation is shared in a linked GitHub repository.
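
To make the idea concrete, here is a small Python sketch of how semantic descriptions might be folded into the prompt for SQL generation. It is an assumption-laden illustration, not the linked repository's code: the catalog is a plain dict standing in for the YAML annotations, and the table, columns, and function names are invented for the example.

```python
# Hypothetical semantic catalog: a dict standing in for YAML annotations.
SEMANTIC_CATALOG = {
    "orders": {
        "description": "One row per customer order.",
        "columns": {
            "id": "Primary key.",
            "total_cents": "Order total in cents, including tax.",
        },
    },
}


def render_schema_context(catalog):
    """Turn the catalog into plain-text notes an LLM can read."""
    lines = []
    for table, meta in catalog.items():
        lines.append(f"Table {table}: {meta['description']}")
        for column, description in meta["columns"].items():
            lines.append(f"  - {column}: {description}")
    return "\n".join(lines)


def build_prompt(question, catalog):
    """Put the schema notes ahead of the user's question."""
    return (
        "Use the schema notes below to write one PostgreSQL query.\n\n"
        f"{render_schema_context(catalog)}\n\n"
        f"Question: {question}\nSQL:"
    )
```

Without the note on `total_cents`, a model has to guess the unit; with it, a question about revenue in dollars carries the hint to divide by 100.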

Enterpriseland and Productland (6 minute read)

Data teams operate in two distinct models: Enterpriseland, where data is a cost center focused on internal reporting, and Productland, where data is integral to revenue-driving products and user experiences. In Productland, modular, domain-driven data models directly power product features and have clear, measurable ROI, while Enterpriseland struggles to quantify value and justify investment. Understanding where your organization stands clarifies expectations and sharpens team focus and impact. This matters more as the shift toward "data-as-a-product" and AI adoption redefines data's role from operational support to strategic value creation.

Knowledge, Metrics, and AI: Rethinking the Semantic Layer with David Jayatillake (41 minute podcast)

Semantic layers are shifting from BI lock-in to dynamic, AI-maintained infrastructure. They matter most when teams face inconsistent metrics (e.g., revenue and churn) or slow query responses that AI and semantics could resolve instantly. AI enhances semantic layers by generating new metrics, ensuring governance, and enabling natural language queries, making semantics “invisible” and responsive to executive demands for instant answers.

💻

Launches & Tools

Unifying ecomm and operational data with CData and BigQuery (Sponsor)

Join global ecommerce brand Medik8, Google Cloud, and CData for a live webinar on August 27. In the session, you'll see how a modern data stack powered by BigQuery + CData Sync helped Medik8 unify ecommerce data, scale analytics, and fuel global growth. Sign up for free

Build Reliable AI Agents with the dbt MCP Server (6 minute read)

The dbt Model Context Protocol (MCP) server provides a standardized, open interface that bridges AI agents with governed, structured metadata, lineage, and execution context from dbt projects. It supports metadata discovery, semantic-layer querying, and executing dbt commands such as build, run, compile, and test. This enables automation of workflows from answering business questions via natural language to running and validating dbt migrations.

Apache Paimon: Real-Time Lake Storage with Iceberg Compatibility (20 minute read)

Apache Paimon is a streaming-optimized table format built on a Log‑Structured Merge‑tree (LSM) architecture that integrates with Apache Flink for low-latency ingestion and merging. A recent key feature, Iceberg compatibility via deletion vectors, allows real-time data to be queried alongside batch data within Iceberg ecosystems, enabling minute‑level freshness. Production deployments at Alibaba, ByteDance, Vivo, and Shopee attest to Paimon's increasing maturity and adoption.
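
For readers new to LSM trees, the merge step can be sketched in a few lines of Python. This is a generic illustration, not Paimon's code: the `TOMBSTONE` marker and `compact` function are hypothetical, and dropping delete markers is assumed safe only because every run is merged at once.

```python
import heapq

# Hypothetical delete marker; real engines encode deletes differently.
TOMBSTONE = object()


def compact(runs):
    """Merge sorted runs of (key, value) pairs, listed oldest first.

    The newest value for each key wins, and keys whose newest entry is a
    tombstone are dropped from the output.
    """
    # Tag entries with a negated sequence number so that, for equal keys,
    # the entry from the newest run sorts first.
    tagged = (
        [(key, -seq, value) for key, value in run]
        for seq, run in enumerate(runs)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

Writes only ever append new runs, which is what keeps ingestion cheap; the cost of reconciling versions is deferred to merges like this one.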

From Rainfall to Rows - The Butterfly Effect in PostgreSQL (8 minute read)

Small changes in a PostgreSQL database, like updating a single row, can trigger significant downstream impacts, such as altering dashboards or influencing decisions. Unlike a static tool, PostgreSQL behaves like a responsive system with features like triggers, notify events, and dynamic query planning, allowing it to intelligently adapt to data changes.

🎁

Miscellaneous

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix (6 minute read)

Netflix is evolving data engineering into Media ML Data Engineering, creating a Media Data Lake to standardize and serve multi-modal media data for ML. This enables faster experimentation, richer insights, and tighter integration between creative workflows and machine learning.

Securing private data at scale with differentially private partition selection (6 minute read)

Differentially private (DP) partition selection is the task of safely selecting frequently appearing items across users while preserving privacy by adding noise and thresholding. Google introduced a new algorithm family, MAD, with a two-round variant, MAD2R, that adaptively redistributes “weight” from overly common items to less frequent ones, boosting utility. It achieves state-of-the-art performance across diverse datasets, including the massive Common Crawl corpus (~800 billion entries), covering 99.9% of partitions and 97% of database records, all while maintaining rigorous DP guarantees.
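
The noise-and-threshold idea can be sketched as follows. This is a simple uniform-weighting baseline, not MAD or MAD2R, and the function and parameter names are hypothetical. Calibrating `noise_std` and `threshold` to a target (ε, δ) is left out, so the sketch shows the mechanics only and gives no privacy guarantee as written.

```python
import math
import random
from collections import defaultdict


def select_partitions(user_items, noise_std, threshold, rng=None):
    """Return the items whose noisy total weight exceeds the threshold.

    user_items: one iterable of items per user.
    """
    rng = rng or random.Random()
    weights = defaultdict(float)
    for items in user_items:
        unique = set(items)
        if not unique:
            continue
        # Spread each user's contribution so its L2 norm is 1; this bounds
        # how much any single user can move the totals.
        share = 1.0 / math.sqrt(len(unique))
        for item in unique:
            weights[item] += share
    # Add Gaussian noise to every total, then keep only items that clear
    # the threshold.
    return sorted(
        item
        for item, weight in weights.items()
        if weight + rng.gauss(0.0, noise_std) > threshold
    )
```

Adaptive schemes like the one described above improve on this baseline by moving weight away from items that already sit far above the threshold.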

⚡

Quick Links

Tecton is Joining Databricks to Power Real-Time Data for Personalized AI Agents (3 minute read)

Tecton is joining Databricks to enhance real-time data capabilities for AI agents, integrating Tecton's leading enterprise feature store with Databricks' Agent Bricks.

Debunking myths about Airflow's architecture and performance (7 minute read)

Concerns that Airflow has an unreliable scheduler, is hard to scale, cannot process data in tasks, or lacks versioning are outdated.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

TLDR Data 2025-08-25
