A Conceptual Model for Storage Unification (16 minute read)
Storage unification is becoming increasingly important as object storage dominates, though hot data still requires low-latency solutions. Architectural choices in the tiering approach, virtualization layer, and access method all involve tradeoffs. Shared tiering can reduce duplication (and system cost) but requires strict coordination, ownership, and lifecycle governance to avoid becoming a liability. Materialization, grounded in decades of proven practice, remains the simpler and more reliable approach.
|
Datadog's Rust-Based Timeseries Engine: Throttling Under Heavy Load (10 minute read)
Datadog's new timeseries engine, built in Rust, achieves major performance gains over its previous Go and RocksDB-based implementation (60x faster ingestion and 5x faster queries) using a shard-per-core architecture and modular LSM-tree storage. Even with perfect sharding, surge traffic or complex queries can push nodes to their limits. To stay robust, Datadog implemented permit-based throttling, shedding or rejecting workloads when metrics like ingestion lag, memory usage, and concurrent query count cross thresholds.
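A minimal sketch of the permit-based throttling idea (not Datadog's implementation): work is admitted only while a concurrency permit is free and health signals stay under thresholds. The signal names and threshold values here are illustrative assumptions.

```python
import threading

class PermitThrottle:
    """Admit work only while node health signals stay under thresholds.

    Thresholds and signal names are illustrative, not Datadog's actual values.
    """
    def __init__(self, max_concurrent=64, max_lag_secs=30.0, max_mem_frac=0.85):
        self._permits = threading.Semaphore(max_concurrent)
        self.max_lag_secs = max_lag_secs
        self.max_mem_frac = max_mem_frac

    def try_acquire(self, ingestion_lag_secs, mem_frac):
        # Shed load outright when the node is already unhealthy.
        if ingestion_lag_secs > self.max_lag_secs or mem_frac > self.max_mem_frac:
            return False
        # Otherwise admit only if a concurrency permit is free.
        return self._permits.acquire(blocking=False)

    def release(self):
        self._permits.release()

throttle = PermitThrottle()
if throttle.try_acquire(ingestion_lag_secs=2.0, mem_frac=0.6):
    try:
        pass  # run the query / apply the write
    finally:
        throttle.release()
else:
    pass  # reject or queue the request (backpressure)
```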
|
How Fresha Accidentally Became One of the UK's First StarRocks Production Pioneers (11 minute read)
Fresha's data stack relied on Postgres and Snowflake for ad-hoc analytics, but performance bottlenecks and unpredictable dashboard latency under load prompted a shift. It adopted StarRocks, valued for its MySQL protocol connectivity, federated querying over open table formats (Iceberg/Paimon), and sub-second real-time analytics via internal tables. Engineering integrated Flink and Kafka pipelines into a hybrid model (real-time, historical, and search lanes), all unified under a single SQL surface. Post-migration, critical dashboards' p95 latency dropped from ~20 seconds to ~200 ms.
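Because StarRocks speaks the MySQL wire protocol, any MySQL client can query it. A sketch with pymysql, where the host, credentials, and table are placeholders (9030 is StarRocks' conventional default FE query port):

```python
import pymysql  # pip install pymysql

# StarRocks accepts standard MySQL-protocol connections.
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analytics", password="...", database="dashboards")
try:
    with conn.cursor() as cur:
        # The same SQL surface covers internal tables and federated
        # external catalogs (e.g., Iceberg) in the hybrid model.
        cur.execute("SELECT booking_date, COUNT(*) FROM bookings "
                    "GROUP BY booking_date ORDER BY booking_date DESC LIMIT 7")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```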
|
|
Creating AI Agent Solutions for Warehouse Data Access and Security (12 minute read)
Meta is introducing a multi-agent system to streamline and secure data warehouse access, with user agents helping request data and owner agents managing permissions. The system uses LLMs for context- and task-specific decision-making combined with guardrails like query-level controls, data-access budgets, and rule-based risk checks. This approach reduces friction in access requests while maintaining strong security and auditability.
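A toy sketch of the guardrail idea the post describes: an LLM's approval is only honored if deterministic rule-based checks (here, a data-access budget and a PII rule) also pass. All names, budgets, and fields are hypothetical, not Meta's system.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user: str
    table: str
    rows_estimate: int
    contains_pii: bool

# Hypothetical per-user data-access budgets (rows per day).
BUDGETS = {"alice": 1_000_000}
USED = {"alice": 250_000}

def rule_based_checks(req: AccessRequest) -> bool:
    """Deterministic guardrails that bound what any LLM decision can grant."""
    if req.contains_pii:
        return False  # PII always escalates to a human owner
    remaining = BUDGETS.get(req.user, 0) - USED.get(req.user, 0)
    return req.rows_estimate <= remaining

def decide(req: AccessRequest, llm_approves: bool) -> str:
    # The LLM supplies context-aware judgment; the rules cap its authority.
    if llm_approves and rule_based_checks(req):
        return "granted"
    return "escalated to data owner"

print(decide(AccessRequest("alice", "orders", 10_000, False), llm_approves=True))
```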
|
The Database Has a New User - LLMs - and They Need a Different Database (6 minute read)
Embedding natural language semantic descriptions within PostgreSQL schemas enables self-describing databases, significantly enhancing LLM-driven SQL query generation. TigerData's experiments show up to a 27% boost in SQL accuracy (58% to 86% on certain schemas) using an LLM-generated semantic catalog. Storing and version-controlling these YAML-based semantic annotations tightens context, mitigates misinterpretation, and streamlines agentic data interaction. An implementation is shared in a linked GitHub repository.
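A sketch of the idea rather than TigerData's actual catalog format: semantic descriptions live in version-controlled YAML and get injected into the LLM's prompt alongside the schema. The field names below are illustrative assumptions.

```python
import yaml  # pip install pyyaml

# Hypothetical semantic catalog entry; field names are illustrative.
CATALOG_YAML = """
tables:
  orders:
    description: One row per customer order; soft-deleted rows have status='void'.
    columns:
      amount_cents:
        description: Order total in cents (integer), excluding tax.
      created_at:
        description: UTC timestamp when the order was placed.
"""

def build_prompt(question: str) -> str:
    catalog = yaml.safe_load(CATALOG_YAML)
    # Tight, schema-scoped context reduces the LLM's room to misinterpret columns.
    context = yaml.dump(catalog["tables"], sort_keys=False)
    return (f"Schema semantics:\n{context}\n"
            f"Write a PostgreSQL query for: {question}")

print(build_prompt("total revenue in dollars last month, excluding voided orders"))
```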
|
Enterpriseland and Productland (6 minute read)
Data teams operate in two distinct models: Enterpriseland, where data is a cost center focused on internal reporting, and Productland, where data is integral to revenue-driving products and user experiences. In Productland, modular, domain-driven data models directly power product features and have clear, measurable ROI, while Enterpriseland struggles to quantify value and justify investment. Understanding where your organization stands clarifies expectations and sharpens team focus, especially as the shift toward "data-as-a-product" and AI adoption redefines data's role from operational support to strategic value creation.
|
Knowledge, Metrics, and AI: Rethinking the Semantic Layer with David Jayatillake (41 minute podcast)
Semantic layers are shifting from BI lock-in to dynamic, AI-maintained infrastructure. They matter most when teams face inconsistent metric definitions (e.g., revenue and churn) or slow answers to questions that semantics plus AI could resolve instantly. AI enhances semantic layers by generating new metrics, enforcing governance, and enabling natural language queries, making semantics "invisible" and responsive to executive demands for instant answers.
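A toy illustration of the core argument: define a metric once in a machine-readable layer so every consumer (BI tool or AI agent) compiles identical SQL. The registry format here is invented for illustration; real semantic layers (dbt, Cube, etc.) use richer definitions.

```python
# Hypothetical metric registry: one definition, many consumers.
METRICS = {
    "revenue": {
        "table": "orders",
        "expr": "SUM(amount_cents) / 100.0",
        "filters": ["status != 'void'"],
    },
}

def compile_metric(name: str, group_by: str) -> str:
    m = METRICS[name]
    where = " AND ".join(m["filters"])
    return (f"SELECT {group_by}, {m['expr']} AS {name} "
            f"FROM {m['table']} WHERE {where} GROUP BY {group_by}")

# A dashboard and an AI agent asking "revenue by month" get identical SQL,
# so the numbers cannot drift apart.
print(compile_metric("revenue", group_by="date_trunc('month', created_at)"))
```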
|
|
Build Reliable AI Agents with the dbt MCP Server (6 minute read)
The dbt Model Context Protocol (MCP) server provides a standardized, open interface that bridges AI agents with governed, structured metadata, lineage, and execution context from dbt projects. It supports metadata discovery, semantic-layer querying, and executing dbt commands such as build, run, compile, and test. This enables automation of workflows from answering business questions via natural language to running and validating dbt migrations.
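Under the hood, MCP is JSON-RPC: clients discover tools with tools/list and invoke them with tools/call. A sketch of the request envelopes, where the tool name and arguments are hypothetical placeholders rather than the dbt MCP server's documented interface:

```python
import json

# Generic MCP JSON-RPC envelopes (the protocol itself).
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "run_dbt_command",          # hypothetical tool name
        "arguments": {"command": "build", "select": "my_model+"},
    },
}

print(json.dumps(call_tool, indent=2))
```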
|
Apache Paimon: Real-Time Lake Storage with Iceberg Compatibility (20 minute read)
Apache Paimon is a streaming-optimized table format built on a Log-Structured Merge-tree (LSM) architecture that integrates seamlessly with Apache Flink for low-latency ingestion and merging. A recent key feature, Iceberg compatibility via deletion vectors, allows real-time data to be queried alongside batch data within Iceberg ecosystems, enabling minute-level freshness. Production deployments at Alibaba, ByteDance, Vivo, and Shopee attest to Paimon's increasing maturity and adoption.
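Deletion vectors mark rows as deleted without rewriting data files, which is what lets batch readers see streaming updates cheaply. A conceptual sketch only; real implementations use compressed bitmaps (e.g., roaring bitmaps), not Python sets:

```python
class DataFileWithDeletionVector:
    """Conceptual model: an immutable data file plus a bitmap of deleted positions.

    Streaming updates flip bits instead of rewriting the file, so batch readers
    (e.g., Iceberg-compatible engines) apply the vector at scan time.
    """
    def __init__(self, rows):
        self.rows = rows
        self.deleted = set()        # real systems use compressed bitmaps

    def delete_position(self, pos: int):
        self.deleted.add(pos)       # O(1) logical delete, no file rewrite

    def scan(self):
        for pos, row in enumerate(self.rows):
            if pos not in self.deleted:
                yield row

f = DataFileWithDeletionVector([("k1", 1), ("k2", 2), ("k3", 3)])
f.delete_position(1)                # a streaming upsert replaced k2 elsewhere
print(list(f.scan()))               # [('k1', 1), ('k3', 3)]
```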
|
From Rainfall to Rows - The Butterfly Effect in PostgreSQL (8 minute read)
Small changes in a PostgreSQL database, like updating a single row, can trigger significant downstream impacts, such as altering dashboards or influencing decisions. Far from a static tool, PostgreSQL behaves like a responsive system: triggers, LISTEN/NOTIFY events, and dynamic query planning let it adapt to data changes as they happen.
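A minimal listener sketch using psycopg2, showing how one small write ripples outward: a trigger (or the app) calls NOTIFY and any listener reacts. The channel name, payload, and connection string are placeholders.

```python
import select
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=app user=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN dashboard_refresh;")

# Elsewhere, a single-row UPDATE fires: NOTIFY dashboard_refresh, 'orders:42';
print("waiting for a single-row change to ripple out...")
while True:
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timeout, keep waiting
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print(f"channel={note.channel} payload={note.payload}")
```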
|
|
Securing private data at scale with differentially private partition selection (6 minute read)
Differentially private (DP) partition selection is the task of safely selecting frequently appearing items across users while preserving privacy by adding noise and thresholding. Google introduced a new algorithm family, MAD, with a two-round variant, MAD2R, that adaptively redistributes "weight" from overly common items to less frequent ones, boosting utility. It achieves state-of-the-art performance across diverse datasets, including the massive Common Crawl corpus (~800 billion entries), covering 99.9% of partitions and 97% of database records, all while maintaining rigorous DP guarantees.
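MAD and MAD2R build on the classic baseline sketched below: bound each user's contributions, add noise to partition counts, and release only partitions whose noisy count clears a threshold. This is the standard noise-and-threshold mechanism, not Google's MAD algorithm, and the parameters are illustrative; calibrating sigma and the threshold to a target (epsilon, delta) is where the real DP accounting lives.

```python
import numpy as np
from collections import Counter

def dp_partition_selection(user_items, max_per_user=1, sigma=4.0, threshold=10.0):
    """Classic noise-and-threshold partition selection (baseline, not MAD)."""
    counts = Counter()
    for items in user_items:
        # Bound each user's contribution to cap sensitivity.
        for item in list(dict.fromkeys(items))[:max_per_user]:
            counts[item] += 1
    rng = np.random.default_rng()
    # Release a partition only if its noisy count clears the threshold.
    return {p for p, c in counts.items() if c + rng.normal(0, sigma) > threshold}

users = [["the", "cat"], ["the"], ["the", "dog"]] * 20
print(dp_partition_selection(users))  # common items survive noise + threshold
```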