
TLDR Data 2025-08-28

πŸ“±

Deep Dives

Rebuilding Event Infrastructure at Scale (5 minute read)

Klaviyo migrated its event pipeline from RabbitMQ to Kafka, enabling the platform to process 170,000 events per second at peak and reliably manage about 100 billion events monthly with zero data loss and under 5-second real-time SLAs. It reduced infrastructure costs by 30% while enhancing operational reliability and scalability by decoupling ingestion from processing, eliminating head-of-line blocking, and transitioning to unified processing lanes on AWS MSK.
How Wix Slashed Spark Costs by 50% and Migrated 5,000+ Daily Workflows from EMR to EMR on EKS (6 minute read)

Wix migrated 5,000+ daily Spark workflows from AWS EMR to EMR on EKS, achieving about 60% cost reductions on shared clusters and 35-50% on dedicated ones. Startup times dramatically improved as nodes now spin up in 2-3 minutes, with pods launching in seconds. It also streamlined Spark version management via isolated environments, optimized spot instance use across AZs, replaced Livy with direct EMR Containers API calls, and leveraged YuniKorn for sophisticated Kubernetes-native scheduling.
Comment ranker – An ML-based classifier to improve LLM code review quality using Atlassian's proprietary data (7 minute read)

The "comment ranker," a fine-tuned ModernBERT model, improves LLM-generated code review comments for the Rovo Dev agent, reducing pull request cycle time by 30% and achieving a 40-45% code resolution rate (close to the human benchmark of 45%) by filtering low-quality comments using 53K+ internal dogfooding comments as ground truth. It is limited by its dependence on raw comment text, frequent retraining needs due to data drift, computationally heavy fine-tuning, and reliance on A/B testing for threshold optimization.
πŸš€

Opinions & Advice

Why 'Big' and 'Large' matter (5 minute read)

At a small scale, rare failures feel negligible, but at billions of executions, even β€œone in a billion” becomes a daily outage. Accuracy and resilience requirements rise sharply with scale. Processes that once worked will break at new magnitudes. Scaling is not linear, so engineers must rethink tradeoffs, error tolerance, and efficiency at each order of growth.
The Medallion Architecture Farce (3 minute read)

The Medallion Architecture's "Bronze, Silver, and Gold" layering is an inflexible, oversimplified model that prioritizes Databricks' marketing over practical data engineering. It fails to accommodate diverse use cases like real-time streaming or machine learning, and the unclear distinction between the Silver and Gold layers confuses users and results in inefficiencies.
The 8 Principles of Great DX for Data & Analytics Infrastructure (17 minute read)

MooseStack enhances ClickHouse by providing a modern DX tailored for both data and software engineers, inspired by web development frameworks like Ruby on Rails and Next.js. It introduces eight principles: git-based version control, local-first development, native programming languages (TypeScript/Python over YAML), infrastructure boilerplate abstractions, horizontal integration with modularization, open-source native design, AI copilot compatibility, and transparent migrations with CI/CD integration.
From Academia to Industry: Bridging Data Engineering Challenges (50 minute podcast)

In this podcast, Paul Groth, professor at the University of Amsterdam, discusses the intersection of AI and data engineering, from lineage vs. provenance to challenges in semantics, access control, and knowledge graph adoption. He highlights how LLMs ease knowledge graph construction, enable multimodal queries, and even act as databases, reshaping architectures around GPUs and edge devices. The biggest gaps remain messy real-world data, fragmented stacks, and the difficulty of choosing the right technologies for evolving data needs.
πŸ’»

Launches & Tools

FilterQL (Github Repo)

FilterQL is a lightweight query language for efficiently filtering structured data, useful for data engineers who need to streamline data retrieval and analysis workflows. Key features include the ability to search across various data types such as code, repositories, users, issues, and pull requests.
Base (Tool)

Base is a lightweight yet powerful SQLite editor for macOS. It provides an intuitive schema inspector, a visual table editor, and a simple data browser to manage structures and contents without heavy SQL work. With query tools, autocomplete, and import/export support, it makes database design and analysis efficient for both beginners and advanced users.
Polars GPU Execution (70% speed up) (7 minute read)

Polars now integrates GPU acceleration within its Lazy API, yielding a dramatic performance boost for large dataset aggregations (from 40s to 12s for a 40GB CSV aggregation on an M4 MacBook). While setup friction remains (schema alignment, Python/CUDA dependencies, and lack of Mac support), data engineers can now leverage significant runtime reductions with minimal code changes (engine="gpu"). This positions Polars as a compelling, cost-efficient alternative for compute-intensive ETL, particularly when using ephemeral GPU infrastructure.
🎁

Miscellaneous

Introducing the Data Act: the Act-cess right (10 minute read)

The EU Data Act, which will be fully enforced from September 12, imposes sweeping, sector-wide requirements on data access, portability, transparency, and contractual obligations for manufacturers, data holders, and users of connected devices. Unlike the narrower AI Act, the Data Act demands complex integration with GDPR and other legal frameworks, mandating machine-readable data provision, pre-contractual disclosures, and multiparty compliance on both personal and non-personal data, with strict delineation of roles (data holder, user, and controller).
Building a CBIR Benchmark with TotalSegmentator and FAISS (5 minute read)

This study established a metadata-independent, large-scale content-based image retrieval (CBIR) benchmark leveraging 290,757 medical image embeddings from the TotalSegmentator CT dataset. High-speed retrieval was achieved using HNSW indexing via FAISS, demonstrating superior performance over LSH while extracting features from 2D slices with ViT, SwinTransformer, and ResNet50 models.
⚑

Quick Links

Widespread Data Theft Targets Salesforce Instances via Salesloft Drift (4 minute read)

UNC6395 orchestrated a widespread data exfiltration campaign from August 8 to 18 that targeted Salesforce instances via compromised OAuth tokens from the Salesloft Drift app, harvesting sensitive credentials, including AWS keys and Snowflake tokens.
AI Contrarians on the Problems With Vibe Coding (6 minute read)

AI-assisted "vibe coding" is leading to increased developer burnout, unpredictable outcomes, and code quality issues.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? πŸ“°

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? πŸ’Ό

Apply here or send a friend's resume to [email protected] and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.