TLDR

TLDR Data 2026-01-05

πŸ“±

Deep Dives

Towards Generalizable and Efficient Large-Scale Generative Recommenders (10 minute read)

Netflix scaled its generative recommendation models from 1M to 1B parameters, processing 2 trillion tokens and handling item catalogs up to 40x larger than GPT-3's vocabulary. Efficiency breakthroughs included sampled softmax, projected heads, and multimodal semantic towers, enabling effective cold-start adaptation and robust handling of both real-time and high-latency serving. A multi-token prediction objective addressed permutation invariance and latency misalignment.
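Sampled softmax is what makes training against a catalog far larger than an LLM vocabulary tractable: each positive item is scored against a small random sample of negatives instead of the full item set. Below is a minimal PyTorch sketch of the idea with made-up tensor names and a plain uniform sampler; it is not Netflix's implementation, which would also correct for sampling bias.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, item_table, pos_ids, num_neg=1024):
    """Approximate the full softmax by scoring the positive item against a
    random sample of negatives rather than the entire catalog.

    user_emb:   (batch, dim) user/context representations
    item_table: (num_items, dim) item embedding table (the "catalog")
    pos_ids:    (batch,) index of the next (positive) item per example
    """
    batch = user_emb.shape[0]
    num_items = item_table.shape[0]

    # Uniform negative sampling for simplicity; production systems typically
    # use log-uniform or in-batch sampling with a logQ bias correction.
    neg_ids = torch.randint(0, num_items, (num_neg,), device=user_emb.device)

    pos_logit = (user_emb * item_table[pos_ids]).sum(-1, keepdim=True)  # (batch, 1)
    neg_logits = user_emb @ item_table[neg_ids].T                       # (batch, num_neg)

    logits = torch.cat([pos_logit, neg_logits], dim=-1)   # (batch, 1 + num_neg)
    labels = torch.zeros(batch, dtype=torch.long, device=user_emb.device)
    return F.cross_entropy(logits, labels)                # positive sits at index 0
```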
We Built Our Employees a Wrapped: Using SQL and MotherDuck (6 minute read)

MotherDuck built an internal β€œWrapped” for employees using simple SQL on existing query and usage data. It generated rankings and playful personas in about an hour by filtering out service accounts and aggregating stats, like queries run, streaks, and databases created. The takeaway is that engaging, shareable analytics often come from straightforward queries and good data hygiene, not complex tooling.
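As a sketch of the pattern (not MotherDuck's actual schema or code), the whole exercise reduces to one aggregation over a query log with service accounts filtered out; here in Python on DuckDB with hypothetical table and column names.

```python
import duckdb

# Hypothetical query-log schema: query_log(user_email, query_text, created_at).
con = duckdb.connect("usage.duckdb")
print(con.sql("""
    SELECT
        user_email,
        count(*)                         AS queries_run,
        count(DISTINCT created_at::DATE) AS active_days,
        min(created_at)                  AS first_query_at
    FROM query_log
    WHERE user_email NOT LIKE '%service%'   -- drop service accounts
    GROUP BY user_email
    ORDER BY queries_run DESC
    LIMIT 10                                -- top "power users" for the rankings
"""))
```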
Drift Detection in Robust Machine Learning Systems (12 minute read)

Long-lived machine learning systems hinge on continuous drift detection, covering both data drift (feature distribution shift) and concept drift (label-feature relationship shift). Robust monitoring employs univariate metrics (Kolmogorov-Smirnov, PSI, and chi-squared) and multivariate approaches like autoencoder-based reconstruction error, accommodating situations where ground truth is delayed or unavailable. Automated detection, clear fallback strategies, and timely retraining safeguard model reliability, preventing revenue loss and reputational or legal damage.
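For the univariate checks, PSI and the KS test are only a few lines of NumPy/SciPy; the snippet below is a generic illustration (the thresholds in the docstring are common rules of thumb, not figures from the article).

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a live sample of one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)             # avoid log(0) and divide-by-zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

reference = np.random.normal(0.0, 1.0, 10_000)   # training-time feature sample
live = np.random.normal(0.3, 1.0, 10_000)        # shifted production sample
print("PSI:", population_stability_index(reference, live))
print("KS p-value:", ks_2samp(reference, live).pvalue)   # tiny p-value => drift
```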
The 2026 Data Engineering Roadmap: Building Data Systems for the Agentic AI Era (16 minute read)

Data engineering in 2026 must shift from traditional ETL pipelines to building "context systems" that provide rich semantic metadata, knowledge graphs, provenance, and high-quality embeddings to support autonomous AI agents. Key skills include mastering vector databases, active metadata management, agent-friendly APIs, advanced data quality for AI, governance for bias and ethics, and storage optimization.
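To make the "context system" idea concrete, here is a toy retrieval sketch: catalog descriptions are embedded and an agent's question is matched by similarity. Everything below is hypothetical, and the hash-based toy_embed stands in for a real embedding model plus vector database.

```python
import numpy as np

def toy_embed(text, dim=64):
    """Stand-in for a real embedding model: hash tokens into a fixed-size vector.
    A production context system would use a learned model and a vector store."""
    vec = np.zeros(dim)
    for tok in text.lower().replace(",", "").replace("?", "").split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Semantic metadata an agent can search over (made-up catalog entries).
catalog = {
    "orders":    "customer orders with totals, currency, and order status",
    "shipments": "carrier, tracking number, and delivery timestamps per order",
}
doc_vecs = {name: toy_embed(desc) for name, desc in catalog.items()}

def best_table(question):
    """Return the table whose description is most similar to the question."""
    q = toy_embed(question)
    return max(doc_vecs, key=lambda name: float(doc_vecs[name] @ q))

print(best_table("which carrier and tracking number delivered order 42"))  # shipments
```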
πŸš€

Opinions & Advice

CDC Strategies in Apache Iceberg (8 minute read)

CDC into Iceberg is a set of trade-offs rather than a single best pattern. Writing changes directly into tables is simpler but limits control, while keeping a raw change log adds complexity in exchange for flexibility, replay, and safer recovery. At scale, constant updates make merge-on-read, careful partitioning, and regular compaction essential for stable performance.
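As a sketch of the "write changes directly into the table" pattern, a CDC batch can be applied with a single MERGE. The PySpark snippet below assumes an Iceberg catalog named demo is already configured on the session and uses invented table, path, and column names; at scale you would pair this with merge-on-read tables and scheduled compaction.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime and a catalog named `demo` are configured.
spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

# One row per change event: op in ('I', 'U', 'D') plus the record payload.
changes = spark.read.parquet("s3://bucket/cdc/orders/batch=2026-01-05/")
changes.createOrReplaceTempView("changes")

spark.sql("""
    MERGE INTO demo.sales.orders t
    USING changes s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED AND s.op <> 'D' THEN
        INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount)
""")
```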
The Hidden Cost Crisis in Data Engineering (6 minute read)

Data engineering faces a hidden cost crisis as cloud expenses balloon from inefficient pipelines, poorly optimized queries, duplicated storage, and over-provisioning. Without proactive governance like query optimization, data pruning, resource rightsizing, and regular audits, companies risk severe budget overruns as they chase the latest tools.
Have You Tried a Text Box? (7 minute read)

Many β€œAI-for-enterprise” schemes overengineer structure too early. A simple baseline often works: store messy, natural-language explanations (β€œwhy we did this”) and let LLMs classify/summarize later. As an example, OpenAI's internal ChatGPT-usage economics study classifies ~1.1M conversations by prompting an LLM with strict taxonomies, then validating against human labels.
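A minimal version of that pattern, sketched with the OpenAI Python SDK: the taxonomy, model name, and note text below are illustrative, and a real pipeline would batch calls and validate a sample against human labels.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical strict taxonomy for "why we did this" notes.
TAXONOMY = ["pricing_exception", "inventory_issue", "customer_request", "other"]

def classify_note(note: str) -> str:
    """Classify one messy free-text note into exactly one taxonomy label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Reply with exactly one label from: " + ", ".join(TAXONOMY)},
            {"role": "user", "content": note},
        ],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip()
    return label if label in TAXONOMY else "other"  # guard against off-taxonomy replies

print(classify_note("Gave the customer 10% off because the shipment arrived two weeks late."))
```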
πŸ’»

Launches & Tools

tablediff (GitHub Repo)

tablediff is a lightweight CLI that compares two database tables by primary key to find missing, extra, or mismatched rows. It works across engines via reladiff adapters, with tested support for DuckDB and Snowflake, and is designed for quick, ad-hoc validation rather than heavy data quality frameworks.
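The underlying keyed-diff idea (illustrated here conceptually, not tablediff's actual interface) fits in a single full outer join; below is a self-contained DuckDB example with made-up tables.

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE src AS SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c')) t(id, val)")
con.sql("CREATE TABLE dst AS SELECT * FROM (VALUES (1, 'a'), (2, 'B'), (4, 'd')) t(id, val)")

# Join on the primary key and keep only rows that are missing, extra, or mismatched.
print(con.sql("""
    SELECT
        coalesce(s.id, d.id) AS id,
        CASE
            WHEN d.id IS NULL THEN 'missing in dst'
            WHEN s.id IS NULL THEN 'extra in dst'
            ELSE 'value mismatch'
        END AS status
    FROM src s
    FULL OUTER JOIN dst d ON s.id = d.id
    WHERE s.id IS NULL OR d.id IS NULL OR s.val IS DISTINCT FROM d.val
"""))
```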
Bun Introduces Built-in Database Clients and Zero-Config Frontend Development (3 minute read)

Bun, a JavaScript runtime, has added a unified SQL API for MySQL, MariaDB, PostgreSQL, and SQLite with zero dependencies, along with a built-in Redis client and zero-config frontend development with hot module replacement. Performance improvements yield 10–30% lower memory usage on major frameworks, 60% faster build times on macOS, and 9% speed gains for Express.
SAFE-MCP, a Community-Built Framework for AI Agent Security (5 minute read)

SAFE-MCP, now formally adopted by the Linux Foundation and OpenID Foundation, delivers a standardized, community-driven security framework for AI agent ecosystems using Model Context Protocol (MCP). Offering over 80 documented techniques and more than a dozen tactic categories, it provides actionable, MITRE ATT&CK-style guidance for threat detection and mitigation (e.g. prompt manipulation, tool poisoning, and OAuth abuse). This enables auditable, collaborative defense strategies for securing MCP-powered AI systems.
🎁

Miscellaneous

Cloudflare Year in Review: AI Bots Crawl Aggressively, Post-Quantum Encryption Hits 50%, Go Doubles (47 minute read)

Cloudflare's 2025 lookback flags a more bot-heavy, disruption-prone Internet: global traffic grew 19%, Starlink traffic doubled, and post-quantum encryption hit ~52% of human web traffic. AI crawling surged, with "user action" crawling up 15x. Non-Google AI bots averaged 4.2% of HTML requests, while Googlebot alone was 4.5% (and >25% of Verified Bot traffic). Cloudflare tracked 174 major outages; about half were tied to government-directed shutdowns.
2025: The Year in LLMs (10 minute read)

LLMs saw rapid advancements last year in reasoning capabilities, agentic systems (especially coding agents), and multimodal features like prompt-driven image editing. Chinese labs dominated open-weight models. Breakthroughs enabled models to win gold at the IMO and handle multi-hour tasks. Despite OpenAI and Anthropic's strong releases, progress raised significant concerns around security risks like prompt injection, environmental impacts from data centers, and the proliferation of low-quality AI-generated content.
⚑

Quick Links

The Most Dangerous Shortcuts in Software (29 minute podcast)

Pressure to ship fast pushes teams to skip testing, reviews, and security, creating technical debt and risk.
Designing API Contracts for Legacy System Modernization (5 minute read)

Practical versioning and contract-testing patterns avert silent failures during incremental legacy migrations.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? πŸ“°

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? πŸ’Ό

Apply here, create your own role or send a friend's resume to [email protected] and get $1k if we hire them! TLDR is one of Inc.'s Best Bootstrapped businesses of 2025.

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.