Behind the Scenes: Building a Robust Ads Event Processing Pipeline (8 minute read)
Netflix's Ads Event Processing Pipeline was designed to handle the massive scale of ad serving, tracking, and optimization across its platform. Initially built in partnership with Microsoft, the system evolved with a focus on improving scalability, supporting frequency capping, tracking impressions, and providing accurate billing. Key features include real-time event handling, centralized metadata storage, and extensive reporting and analytics. The platform's evolution continues with plans for advanced capabilities like programmatic buying, event de-duplication, and enhanced conversion tracking.
|
The Evolution of Arbitrary Stateful Stream Processing in Spark (10 minute read)
Apache Spark's Structured Streaming has evolved with the transformWithStateInPandas API, introduced in Spark 4.0 as a replacement for applyInPandasWithState that simplifies stateful stream processing with greater flexibility. It supports multiple state types, timers, TTL, and schema evolution, streamlining complex streaming pipelines through built-in state management, easier debugging, and integration with Databricks features like Unity Catalog, which reduces the need for workarounds such as reprocessing state or manual schema handling. Developers write custom state logic in Python, leveraging pandas for operational workloads, and native support for RocksDB state stores keeps state management scalable.
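For readers who want a feel for the new API, below is a minimal sketch of a per-key running count written against the Spark 4.0 transformWithStateInPandas interface. The processor class, state schema, and column names are illustrative rather than taken from the article, and exact method signatures may vary slightly between releases.

    import pandas as pd
    from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle

    class RunningCountProcessor(StatefulProcessor):
        def init(self, handle: StatefulProcessorHandle) -> None:
            # Register the per-key value state this processor maintains.
            self.count_state = handle.getValueState("count", "count LONG")

        def handleInputRows(self, key, rows, timerValues):
            # rows is an iterator of pandas DataFrames for this grouping key.
            count = self.count_state.get()[0] if self.count_state.exists() else 0
            for pdf in rows:
                count += len(pdf)
            self.count_state.update((count,))  # persisted in the (RocksDB-backed) state store
            yield pd.DataFrame({"user_id": [key[0]], "count": [count]})

        def close(self) -> None:
            pass

    counts = (events.groupBy("user_id").transformWithStateInPandas(
        statefulProcessor=RunningCountProcessor(),
        outputStructType="user_id STRING, count LONG",
        outputMode="Update",
        timeMode="None"))

Here, events stands in for any streaming DataFrame keyed by user_id; the per-key count survives across micro-batches without manual state reprocessing.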
|
Plato AI: Revolutionising Flipkart's Data Interaction with Natural Language (6 minute read)
Flipkart's Plato AI enables over 200,000 monthly ad-hoc SQL queries to be executed via natural language, abstracting away schema complexity and the need for SQL proficiency with a modular, agent-driven architecture powered by LLMs and an engine-agnostic, JSON-based DSL. Users, from analysts to business leaders, can access insights, visualizations, and contextual suggestions in plain English, accelerating data-driven decisions while maintaining security and governance. Key advancements include anomaly detection, personalized memory layers, and smart query autocompletion, dramatically reducing data access friction and democratizing enterprise-scale analytics.
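The post does not publish the DSL itself, but the general pattern of having an LLM emit a small JSON plan that a separate component compiles into SQL can be sketched roughly as follows; the plan shape, field names, and to_sql helper are hypothetical, not Flipkart's actual format.

    import json

    # Hypothetical engine-agnostic query plan an LLM agent might produce.
    plan = json.loads("""
    {
      "source": "orders",
      "metrics": [{"fn": "sum", "column": "gmv", "alias": "total_gmv"}],
      "dimensions": ["category"],
      "filters": [{"column": "order_date", "op": ">=", "value": "2025-01-01"}],
      "limit": 10
    }
    """)

    def to_sql(plan: dict) -> str:
        # Compile the JSON plan into generic SQL; a real system would target each engine's dialect.
        metrics = ", ".join("{}({}) AS {}".format(m["fn"], m["column"], m["alias"]) for m in plan["metrics"])
        dims = ", ".join(plan["dimensions"])
        where = " AND ".join("{} {} '{}'".format(c["column"], c["op"], c["value"]) for c in plan["filters"])
        return ("SELECT {}, {} FROM {} WHERE {} GROUP BY {} LIMIT {}"
                .format(dims, metrics, plan["source"], where, dims, plan["limit"]))

    print(to_sql(plan))

Keeping the intermediate plan in JSON rather than raw SQL is what makes the approach engine-agnostic and easier to validate and govern.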
|
|
Why CDAIOs Are Losing Sleep: Top 3 Priorities for 2025 (5 minute read)
Data and AI leaders face three critical challenges in 2025: managing unreliable data lakes, handling resource constraints like token limits, and mitigating AI hallucinations. Unreliable data lakes, often described as "swamps," undermine AI reliability, requiring robust data observability to ensure trusted outputs. Resource constraints, such as token limits, demand optimized data pipelines to maximize efficiency, while hallucinations necessitate improved context curation and validation to enhance model accuracy.
|
Practical Data Modeling with dbt: Star vs. Snowflake Compared (10 minute read)
This hands-on review compares star and snowflake schemas in data modeling, focusing on how they differ in normalization and performance. Star schemas prioritize query speed and ease of use for BI tools but require more storage due to redundancy. Snowflake schemas reduce redundancy and improve data integrity but add join complexity and can slow queries. The choice between them depends on analytical needs, team workflows, and performance requirements.
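A toy pandas comparison makes the trade-off concrete: the star-style dimension repeats the category on every product row but needs a single join, while the snowflake version normalizes the category out at the cost of an extra join. All table and column names here are invented for illustration.

    import pandas as pd

    sales = pd.DataFrame({"product_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})

    # Star schema: one wide, denormalized dimension -> one join, repeated category values.
    dim_product_star = pd.DataFrame({
        "product_id": [1, 2],
        "product_name": ["kettle", "toaster"],
        "category_name": ["kitchen", "kitchen"],
    })
    star = sales.merge(dim_product_star, on="product_id")

    # Snowflake schema: category normalized into its own table -> extra join, no repetition.
    dim_product = pd.DataFrame({"product_id": [1, 2],
                                "product_name": ["kettle", "toaster"],
                                "category_id": [100, 100]})
    dim_category = pd.DataFrame({"category_id": [100], "category_name": ["kitchen"]})
    snowflake = sales.merge(dim_product, on="product_id").merge(dim_category, on="category_id")

    print(star.groupby("category_name")["amount"].sum())
    print(snowflake.groupby("category_name")["amount"].sum())  # same answer, one more join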
|
Example - Building a Traditional Data Model (10 minute read)
Most data modeling tasks occur in brownfield environments, requiring reverse engineering to document and refine existing conceptual, logical, and physical models. Effective modeling hinges on translating stakeholder language into clear, business-aligned entities, attributes, and relationships, highlighting divergent definitions and semantic alignment between business and technical teams. This collaborative approach mitigates risks like poor data quality and performance, supporting analytics and ML pipelines. Systematically layering conceptual, logical, and physical models, platform-agnostic at first, ensures robust documentation, governance, and future-proofed data architectures.
|
|
Why Compilers Matter (w/ Lukas Schulte) (42 minute podcast)
dbt Labs recently acquired SDF Labs, signaling a transformative push toward a unified, warehouse-agnostic SQL compiler that abstracts away warehouse-specific SQL dialect fragmentation. Borrowing concepts familiar from LLVM and modern language ecosystems, SDF's compiler introduces a shared intermediate representation, enabling portable logic, reusable SQL libraries, and robust package management across platforms like BigQuery, Snowflake, and Redshift. This evolution promises to streamline analytics pipelines, boost developer productivity, and open the door to an open, interoperable data ecosystem.
|
How to Get Ready for the New dbt Engine (6 minute read)
The upcoming dbt engine upgrade, launching on May 28, will deliver significant improvements, including faster development, better cost efficiency, and enhanced SQL comprehension. Users will benefit from features like live error detection, smart autocomplete, and automatic orchestration to reduce warehouse costs. To prepare, teams should upgrade to dbt Core v1.10 and resolve any deprecation warnings.
|
LangGraph Platform is Now Generally Available: Deploy & Manage Long-running, Stateful Agents (4 minute read)
LangGraph Platform is now generally available, providing infrastructure for deploying and managing long-running, stateful agents at scale. It simplifies agent deployment with 1-click integration, supports bursty traffic, and handles asynchronous collaboration and memory persistence. The platform enhances agent development with real-time debugging and workflow visualization while offering centralized management so teams can iterate and scale across use cases.
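As a rough illustration of the kind of long-running, stateful agent the platform hosts, here is a minimal graph built with the open-source langgraph library and an in-memory checkpointer; the node logic and thread id are placeholders, and on LangGraph Platform the local MemorySaver is replaced by managed persistence.

    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END
    from langgraph.checkpoint.memory import MemorySaver

    class AgentState(TypedDict):
        messages: list[str]

    def respond(state: AgentState) -> AgentState:
        # A real agent node would call an LLM or a tool; this just appends a canned reply.
        return {"messages": state["messages"] + ["ack: " + state["messages"][-1]]}

    builder = StateGraph(AgentState)
    builder.add_node("respond", respond)
    builder.add_edge(START, "respond")
    builder.add_edge("respond", END)

    # The checkpointer persists state per thread_id, so a conversation can resume across calls.
    graph = builder.compile(checkpointer=MemorySaver())
    config = {"configurable": {"thread_id": "demo-thread"}}
    print(graph.invoke({"messages": ["hello"]}, config))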
|
|
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant (2 minute read)
Apple's StreamBridge framework transforms offline Video-LLMs into proactive, real-time streaming assistants, enabling multi-turn video understanding and dynamic interaction without auxiliary models. It processes live video streams, leveraging temporal context to anticipate user needs and deliver context-aware responses. The system achieves low-latency performance on resource-constrained devices, suitable for applications like AR and smart home assistants.
|
QCon London 2025: How to Build a Database Without a Server (3 minute read)
Man Group presented a migration of a hedge fund trading application from a high-maintenance MongoDB cluster to a serverless database built on object storage and Conflict-Free Replicated Data Types (CRDTs) for simpler scalability. The serverless approach, defined as a software library paired with object storage, scales capacity dynamically with application demand and must contend with issues like clock drift when relying on system or storage clocks. CRDTs enable convergent, commutative updates across replicas without coordination, ensuring consistency.
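CRDTs are what make coordination-free writes safe: every replica can update independently, and merging in any order converges to the same value. A grow-only counter is the classic minimal example; the sketch below is generic and not tied to Man Group's implementation.

    class GCounter:
        # Grow-only counter CRDT: each replica increments only its own slot,
        # and merge takes the element-wise max, so merges commute and converge.
        def __init__(self, replica_id: str):
            self.replica_id = replica_id
            self.counts: dict[str, int] = {}

        def increment(self, n: int = 1) -> None:
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def merge(self, other: "GCounter") -> None:
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

        def value(self) -> int:
            return sum(self.counts.values())

    # Two replicas update independently (e.g. each writing to object storage) and later merge.
    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(2)
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 5  # same result regardless of merge order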
|
Ask a Data Ethicist: Is Consent the Wrong Approach for Modern Data Regulation? (6 minute read)
Traditional consent-based data governance is increasingly inadequate due to complex power imbalances, opaque data-sharing practices, and harm propagation through inferred data patterns. Individual consent cannot address downstream impacts on non-consenting users when predictive models act on relational or "data like mine" inferences, leading to exclusion or other harms beyond the original data subject. For data leaders, this highlights the need to shift from user-centric control toward institutional, population-level governance frameworks that better mediate data risks and maximize societal benefit.
|
|
Who's the best CDO? (Website)
A fun web-based simulation game that puts players in the role of a Chief Data Officer, challenging them to balance data innovation, compliance, and risk while optimizing profit and reputation.
|
|