TLDR

Together With You.com

TLDR Data 2025-05-15

☠️ OpenAI Admits You.com's Deep Research is More Accurate Than Its Own (Sponsor)

According to benchmarks, ARI beats OpenAI's Deep Research with a 76% Win Rate.

Put it to the test! You.com is giving TLDR Data readers their Team plan ($375 value), including:

πŸ“Š 5 ARI reports – deep research agent that analyzes 400+ sources across private, public, and premium data to produce fully-customizable, McKinsey-style reports in < 5 mins

🧠 Access to 31+ latest AI models in one place – includes ChatGPT, Claude, Gemini, Llama, Grok, DeepSeek, and more

πŸ€– Data AI Agents – automatically synthesize insights, generate executive summaries, stimulate stakeholder responses, track updates across news, etc.

↗️ Try ARI for FREE. Offer expires 5/31.↗️

πŸ“±

Deep Dives

Behind the Scenes: Building a Robust Ads Event Processing Pipeline (8 minute read)

Netflix's Ads Event Processing Pipeline was designed to handle the massive scale of ad serving, tracking, and optimization across its platform. Initially built in partnership with Microsoft, the system evolved with a focus on improving scalability, supporting frequency capping, tracking impressions, and providing accurate billing. Key features include real-time event handling, centralized metadata storage, and extensive reporting and analytics. The platform's evolution continues with plans for advanced capabilities like programmatic buying, event de-duplication, and enhanced conversion tracking.
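The post itself doesn't include code, but as a rough sketch of one capability it mentions, event de-duplication, here is a minimal Python illustration; the event fields, window, and in-memory store are all assumptions, and a real pipeline would keep this state in a durable keyed store.

```python
import time

# Hypothetical sketch of ad-event de-duplication keyed by event ID.
# A production pipeline would keep this state in a durable, keyed state
# store (e.g. a stream processor's state backend), not a local dict.
DEDUP_WINDOW_SECS = 6 * 3600          # assumed de-duplication window
_first_seen: dict[str, float] = {}    # event_id -> time first seen

def process_ad_event(event: dict) -> bool:
    """Return True if the event is new and should be counted and billed."""
    now = time.time()
    event_id = event["event_id"]      # assumed field name

    # Drop duplicates arriving within the window (client retries, re-sends).
    seen_at = _first_seen.get(event_id)
    if seen_at is not None and now - seen_at < DEDUP_WINDOW_SECS:
        return False

    _first_seen[event_id] = now
    # ...forward downstream for impression counting, frequency capping, billing...
    return True
```
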
The Evolution of Arbitrary Stateful Stream Processing in Spark (10 minute read)

Apache Spark's Structured Streaming continues to evolve with the transformWithStateInPandas API, introduced in Spark 4.0 as a replacement for applyInPandasWithState that simplifies stateful stream processing and adds flexibility. It supports multiple state types, timers, TTL, and schema evolution, streamlining complex streaming pipelines with built-in state management, easier debugging, and integration with Databricks features like Unity Catalog, while reducing the need for workarounds such as reprocessing state or manual schema handling. The API lets developers write custom state logic in Python, leveraging pandas for operational workloads, and has native support for RocksDB state stores for scalability.
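As a rough sketch of what the new API looks like, assuming the Spark 4.0 StatefulProcessor interface (exact method names, state schemas, and the per-key counting logic here are illustrative and worth checking against the Spark documentation):

```python
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, LongType

class RunningCount(StatefulProcessor):
    """Keeps a per-key running count across micro-batches."""

    def init(self, handle: StatefulProcessorHandle) -> None:
        schema = StructType([StructField("count", LongType(), True)])
        self.count = handle.getValueState("count", schema)

    def handleInputRows(self, key, rows, timer_values):
        total = self.count.get()[0] if self.count.exists() else 0
        for pdf in rows:              # each element is a pandas DataFrame
            total += len(pdf)
        self.count.update((total,))
        yield pd.DataFrame({"id": [key[0]], "count": [total]})

    def close(self) -> None:
        pass

# Hypothetical usage on a streaming DataFrame `events` keyed by "id":
# result = (events.groupBy("id")
#     .transformWithStateInPandas(
#         statefulProcessor=RunningCount(),
#         outputStructType="id string, count long",
#         outputMode="Update",
#         timeMode="None"))
```
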
Plato AI: Revolutionising Flipkart's Data Interaction with Natural Language (6 minute read)

Flipkart's Plato-AI enables over 200,000 monthly ad-hoc SQL queries to be executed via natural language, abstracting schema complexity and SQL proficiency with a modular, agent-driven architecture powered by LLMs and an engine-agnostic JSON-based DSL. Users, from analysts to business leaders, can access insights, visualizations, and contextual suggestions in plain English, accelerating data-driven decisions while maintaining security and governance. Key advancements include anomaly detection, personalized memory layers, and smart query autocompletion, dramatically reducing data access friction and democratizing enterprise-scale analytics.
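The article doesn't publish Plato-AI's DSL, but purely as a hypothetical illustration of what an engine-agnostic, JSON-style query representation might look like (every field name below is invented, not Flipkart's):

```python
# Hypothetical, invented sketch of an engine-agnostic query DSL expressed as a
# Python dict; a per-engine compiler would translate it into the target SQL dialect.
question = "What were the top 5 categories by ad spend last week?"

dsl_query = {
    "source": "ads_spend",                        # logical dataset, resolved per engine
    "metrics": [{"agg": "sum", "field": "spend", "alias": "total_spend"}],
    "dimensions": ["category"],
    "filters": [{"field": "event_date", "op": "between",
                 "value": ["2025-05-05", "2025-05-11"]}],
    "order_by": [{"field": "total_spend", "direction": "desc"}],
    "limit": 5,
}
```
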
πŸš€

Opinions & Advice

Why CDAIOs Are Losing Sleep—Top 3 Priorities for 2025 (5 minute read)

Data and AI leaders face three critical challenges in 2025: managing unreliable data lakes, handling resource constraints like token limits, and mitigating AI hallucinations. Unreliable data lakes, often described as "swamps," undermine AI reliability, requiring robust data observability to ensure trusted outputs. Resource constraints, such as token limits, demand optimized data pipelines to maximize efficiency, while hallucinations necessitate improved context curation and validation to enhance model accuracy.
Practical Data Modeling with dbt: Star vs. Snowflake Compared (10 minute read)

This hands-on review compares star and snowflake schemas in data modeling, focusing on their differences in normalization and performance. Star schemas prioritize query speed and ease for BI tools but require more storage due to redundancy. In contrast, snowflake schemas reduce redundancy and improve data integrity but lead to more complex queries and potential performance overhead. The decision between them depends on analytical needs, team workflows, and performance requirements.
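A minimal pandas sketch of the structural difference (tables and columns invented for illustration): the star variant denormalizes the category into the product dimension, while the snowflake variant normalizes it into a separate table, trading redundancy for an extra join.

```python
import pandas as pd

fact_sales = pd.DataFrame({"product_id": [1, 1, 2], "amount": [10.0, 12.5, 7.0]})

# Star schema: one wide, denormalized dimension -> a single join for BI queries.
dim_product_star = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Mug", "Pen"],
    "category_name": ["Kitchen", "Office"],   # redundant if many products share a category
})
star = fact_sales.merge(dim_product_star, on="product_id")

# Snowflake schema: category normalized into its own table -> less redundancy, extra join.
dim_product_snow = pd.DataFrame({"product_id": [1, 2],
                                 "product_name": ["Mug", "Pen"],
                                 "category_id": [10, 20]})
dim_category = pd.DataFrame({"category_id": [10, 20],
                             "category_name": ["Kitchen", "Office"]})
snow = (fact_sales.merge(dim_product_snow, on="product_id")
                  .merge(dim_category, on="category_id"))

# Same answer either way; the snowflake version just needed one more join.
print(star.groupby("category_name")["amount"].sum())
print(snow.groupby("category_name")["amount"].sum())
```
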
Example - Building a Traditional Data Model (10 minute read)

Most data modeling tasks occur in brownfield environments, requiring reverse engineering to document and refine existing conceptual, logical, and physical models. Effective modeling hinges on translating stakeholder language into clear, business-aligned entities, attributes, and relationships, highlighting divergent definitions and semantic alignment between business and technical teams. This collaborative approach mitigates risks like poor data quality and performance, supporting analytics and ML pipelines. Systematically layering conceptual, logical, and physical models, platform-agnostic at first, ensures robust documentation, governance, and future-proofed data architectures.
πŸ’»

Launches & Tools

Why Compilers Matter (w/ Lukas Schulte) (42 minute podcast)

dbt Labs recently acquired SDF Labs, signaling a transformative push toward a unified, warehouse-agnostic SQL compiler that aims to abstract away warehouse-specific SQL dialect fragmentation. By leveraging concepts familiar from LLVM and modern language ecosystems, SDF's compiler introduces a shared intermediate representation, enabling portable logic, reusable SQL libraries, and robust package management across platforms like BigQuery, Snowflake, and Redshift. This evolution promises to streamline analytics pipelines, boost developer productivity, and open the door to an open, interoperable data ecosystem.
How to Get Ready for the New dbt Engine (6 minute read)

The upcoming dbt engine upgrade, launching on May 28, will deliver significant improvements, including faster development, better cost efficiency, and enhanced SQL comprehension. Users will benefit from features like live error detection, smart autocomplete, and automatic orchestration to reduce warehouse costs. To prepare, teams should upgrade to dbt Core v1.10 and resolve any deprecation warnings.
LangGraph Platform is Now Generally Available: Deploy & Manage Long-running, Stateful Agents (4 minute read)

LangGraph Platform is now generally available, providing infrastructure for deploying and managing long-running, stateful agents at scale. It simplifies agent deployment with 1-click integration, supports burst traffic, and handles asynchronous collaboration and memory persistence. The platform enhances agent development with real-time debugging and workflow visualization while offering centralized management for teams to iterate and scale across use cases.
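For a sense of what gets deployed, here is a minimal stateful graph built with the open-source langgraph library; the node is a stand-in for an LLM call, and platform deployment details (persistence, scaling) are configured separately.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    messages: list                     # conversation history carried between nodes

def respond(state: AgentState) -> dict:
    # Stand-in for an LLM call; a real agent would invoke a model here.
    last = state["messages"][-1]
    return {"messages": state["messages"] + [f"echo: {last}"]}

builder = StateGraph(AgentState)
builder.add_node("respond", respond)
builder.add_edge(START, "respond")
builder.add_edge("respond", END)
graph = builder.compile()

print(graph.invoke({"messages": ["hello"]}))
```
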
🎁

Miscellaneous

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant (2 minute read)

Apple's StreamBridge framework transforms offline Video-LLMs into proactive, real-time streaming assistants, enabling multi-turn video understanding and dynamic interaction without auxiliary models. It processes live video streams, leveraging temporal context to anticipate user needs and deliver context-aware responses. The system achieves low-latency performance on resource-constrained devices, suitable for applications like AR and smart home assistants.
QCon London 2025: How to Build a Database Without a Server (3 minute read)

Man Group presented a migration of a hedge fund trading application from a high-maintenance MongoDB cluster to a serverless database using object storage and Conflict-Free Replicated Data Types (CRDTs) for simplified scalability. The serverless approach, defined as a software library paired with object storage, adjusts capacity dynamically based on application demands, addressing issues like clock drift with system or storage clocks. CRDTs enable convergent, commutative updates across replicas without coordination, ensuring consistency.
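As a sketch of the CRDT idea behind that design, a grow-only counter lets each replica increment its own slot and merge by taking per-replica maxima, so merges are commutative, associative, and idempotent and replicas converge without coordination (illustrative Python, not Man Group's implementation):

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot;
    merging takes the element-wise max, so replicas converge regardless
    of the order in which they exchange state."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas update independently (e.g. each persisting state to object storage)...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# ...and converge once each has merged the other's state, in any order.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```
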
Ask a Data Ethicist: Is Consent the Wrong Approach for Modern Data Regulation? (6 minute read)

Traditional consent-based data governance is increasingly inadequate due to complex power imbalances, opaque data-sharing practices, and harm propagation through inferred data patterns. Individual consent cannot address downstream impacts on non-consenting users when predictive models act on relational or "data like mine" inferences, leading to exclusion or other harms beyond the original data subject. For data leaders, this highlights the need to shift from user-centric control toward institutional, population-level governance frameworks that better mediate data risks and maximize societal benefit.
⚑

Quick Links

Who's the best CDO? (Website)

A fun web-based simulation game that puts players in the role of a Chief Data Officer, challenging them to balance data innovation, compliance, and risk while optimizing profit and reputation.
MCP Toolbox for Databases (GitHub Repo)

The MCP Toolbox for Databases simplifies the process of building generative AI tools that interact with databases via a unified server interface.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to [email protected] and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.