

TLDR Data 2025-04-28

📱

Deep Dives

Building Dash: How RAG and AI agents help us meet the needs of businesses (7 minute read)

Dropbox Dash leverages retrieval-augmented generation (RAG) for efficient information retrieval and multi-step AI agents to handle complex, multi-faceted business tasks, such as tracking project progress across OKRs. RAG retrieves relevant content from diverse enterprise data sources, while AI agents break down tasks into steps, using domain knowledge and contextual planning for accurate execution. Continuous fine-tuning of LLMs ensures business-specific relevance, and platform-agnostic integration with apps like Google Drive and Salesforce enhances accessibility.
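
As a loose illustration of the retrieve-then-reason flow described above (not Dropbox's actual code), the sketch below embeds a query, pulls the most similar documents by cosine similarity, and packs them into a prompt that an agent step could act on; embed_text and the document list are made-up stand-ins.

```python
# Loose sketch of the retrieve-then-reason flow (not Dropbox's code): embed the
# query, pull the most similar documents by cosine similarity, and pack them
# into a prompt for the next agent step. embed_text and the docs are stand-ins.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

documents = [
    "Q2 OKR: ship Dash connectors for Google Drive",
    "Salesforce pipeline review notes, April",
    "Project Alpha status: blocked on legal review",
]
doc_vectors = np.stack([embed_text(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed_text(query)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

# An agent step then reasons over the retrieved snippets as grounded context.
context = "\n".join(retrieve("What is blocking Project Alpha?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is blocking Project Alpha?"
print(prompt)
```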
How Discord Indexes Trillions of Messages (6 minute read)

Discord's search infrastructure indexes trillions of messages using a multi-cluster Elasticsearch setup with logical “cells” of dedicated ingest, master-eligible, and data nodes, optimized for performance. Messages are processed via a PubSub queue for reliable delivery, with Rust-based workers batching them to specific clusters. Most guilds get a single primary shard per index, which co-locates a guild's messages on one node and eliminates query-time fan-out and coordination overhead; Big Freaking Guilds (BFGs) that exceed Lucene's 2-billion-message limit instead get dedicated Elasticsearch cells that fan indexing out across multiple primary shards to distribute the load.
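
A rough sketch of that shard trade-off, using the Elasticsearch REST index-creation API: one primary shard keeps a guild's messages on a single node, while a BFG index spreads indexing across several. Index names, the mapping, and the localhost endpoint are illustrative stand-ins, not Discord's actual tooling.

```python
# One primary shard = no query fan-out for a typical guild; several primary
# shards spread indexing load for a BFG. Names and mapping are hypothetical.
import requests

ES = "http://localhost:9200"

def create_message_index(name: str, primary_shards: int) -> None:
    body = {
        "settings": {"number_of_shards": primary_shards, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "guild_id": {"type": "keyword"},
                "channel_id": {"type": "keyword"},
                "content": {"type": "text"},
                "timestamp": {"type": "date"},
            }
        },
    }
    requests.put(f"{ES}/{name}", json=body).raise_for_status()

create_message_index("messages-guild-12345", primary_shards=1)  # typical guild
create_message_index("messages-bfg-67890", primary_shards=4)    # BFG cell: spread the load
```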
🚀

Opinions & Advice

The 2025 AI-enabled Data Engineering roadmap (6 minute read)

AI automates data engineering tasks like Spark/SQL coding (medium disruption risk) and pipeline fixes (high disruption risk) using tools like Cursor and Windsurf. Windsurf streamlines complex tasks when given a structured prompt that spells out the inputs, the orchestration/processing technology, the design pattern, and any quality concerns and best practices. To write those prompts well, engineers still have to master design patterns and best practices such as idempotency. Strategic tasks like framework building and conceptual modeling face low disruption because of AI's limitations in innovation and consensus-building. Soft skills and strategic responsibilities, like creating trusted processes, remain critical and less AI-vulnerable.
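
Idempotency, the best practice called out above, is easy to illustrate. Below is a minimal sketch of an idempotent partition reload using SQLite from the standard library; the table, column, and job names are hypothetical.

```python
# Minimal idempotent pipeline write with SQLite (standard library): the job
# deletes and reloads its target partition in one transaction, so reruns
# converge to the same state. Table and column names are hypothetical.
import sqlite3

def load_daily_sales(conn: sqlite3.Connection, ds: str, rows: list[tuple]) -> None:
    with conn:  # one transaction: delete + insert commit together
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany("INSERT INTO sales (ds, amount) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount REAL)")
batch = [("2025-04-28", 10.0), ("2025-04-28", 5.5)]
load_daily_sales(conn, "2025-04-28", batch)
load_daily_sales(conn, "2025-04-28", batch)  # rerun: still 2 rows, not 4
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
```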
Why You're Stuck at Senior Data Engineer (And How to Break Out) (8 minute read)

Breaking past the "senior plateau" for technical professionals involves more than enhancing technical skills. It requires strategic shifts in understanding business impacts and taking ownership of higher-risk decisions. Senior engineers often stall as they focus on execution over influence and fail to gain visibility in strategic initiatives. To advance, data professionals should develop business acumen, engage in cross-functional collaborations, and take initiative on ambiguous projects. This transition requires prioritization, clear communication, and strategic risk-taking, aligning technical expertise with the business's goals for broader impact and leadership roles.
The Top 5 AI Reliability Pitfalls (4 minute read)

AI reliability issues, often mistaken for hallucinations, stem from five key pitfalls: poor source data quality causing garbage outputs, embedding drift degrading retrieval relevance, confused context from ambiguous or incomplete data, output sensitivity to minor prompt or model changes, and imbalanced human-in-the-loop evaluation. Poor data quality, like outdated knowledge bases, leads to inaccurate AI responses, while embedding drift from changing data or models silently reduces performance. Ambiguous or partial context confuses models, small configuration tweaks cause unpredictable outputs, and over- or under-reliance on human evaluators hampers scalability and accuracy.
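
Embedding drift, the quietest of these pitfalls, can be caught with a simple canary check like the sketch below: re-embed a fixed set of reference queries after any model or data change and compare against the stored vectors. The 0.9 threshold and the stand-in embedding function are assumptions for illustration.

```python
# Simple canary check for embedding drift: re-embed fixed reference queries
# after any model or data change and compare against the stored vectors.
# embed_v2 and the 0.9 threshold are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
canaries = ["reset my password", "refund policy", "api rate limits"]
stored = {q: rng.standard_normal(16) for q in canaries}  # embeddings at index time

def embed_v2(q: str) -> np.ndarray:
    return stored[q] + 0.3 * rng.standard_normal(16)  # "new" model: slightly shifted

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for q in canaries:
    sim = cosine(stored[q], embed_v2(q))
    print(f"{q!r}: similarity={sim:.2f} {'OK' if sim > 0.9 else 'DRIFT?'}")
```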
AI and Data in Production: Insights from Avinash Narasimha [AI Solutions Leader at Koch Industries] (37 minute podcast)

Generative AI has matured beyond pilots, with Koch Industries deploying it in production for over two years, using rigorous experimentation, scalable environments, and continuous feedback loops to ensure reliability. Data readiness, encompassing high-quality data, robust infrastructure, and seamless integration, is critical to AI success but is often hindered by fragmented ecosystems and legacy systems. Strategies like strong data governance, cloud-based tools, and iterative refinement address these challenges, enhancing AI performance.
💻

Launches & Tools

ClickHouse gets lazier (and faster): Introducing lazy materialization (24 minute read)

ClickHouse's lazy materialization extends its I/O optimization stack by deferring column reads until the query execution plan actually needs them, which is especially powerful for Top N queries (ORDER BY with LIMIT) over large datasets, a pattern common in observability and general analytics. Building on columnar storage, sparse indexes, data-skipping indexes, projections, and PREWHERE, it ensures only the necessary columns are loaded for operations like sorting, minimizing data access. Unlike row-oriented databases, ClickHouse's architecture makes such deferred I/O feasible, complementing its existing granule-level pruning techniques.
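
The kind of query that benefits looks roughly like the sketch below: a Top N over a wide log table where the heavy column is only needed for the few surviving rows. The table, columns, and clickhouse-connect client usage are illustrative assumptions, not taken from the article.

```python
# The shape of query that benefits: a Top N where the wide column (`message`)
# is only needed for the final rows. Table, columns, and the clickhouse-connect
# client call are assumptions for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

top_n_sql = """
SELECT timestamp, service, message      -- `message` is the expensive column
FROM logs
WHERE service = 'checkout'
ORDER BY timestamp DESC
LIMIT 5
"""
# With lazy materialization, ClickHouse can sort on the cheap columns first and
# read `message` only for the 5 surviving rows instead of for every match.
for row in client.query(top_n_sql).result_rows:
    print(row)
```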
Clustering vs Partitions - Pick your poison (6 minute read)

Traditional Hive-style partitioning, while simple and effective for stable query patterns, is rigid, prone to data skew, and requires costly data rewrites. Features like Z-ordering and OPTIMIZE in Delta Lake improve performance by colocating related data, addressing the limitations of static partitioning. Apache Iceberg's hidden partitioning is defined at the metadata level, offering flexibility without exposing partition columns to query authors. Delta Lake's liquid clustering dynamically organizes data using clustering keys, enabling continuous optimization without rigid partitions. These modern approaches provide more flexible, automated data layouts to meet the demands of complex, high-performance Lakehouse workloads.
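
For reference, the layouts roughly correspond to the Spark SQL statements sketched below; table names are invented and exact DDL varies by engine and runtime version (Z-ordering and liquid clustering need Delta/Databricks, hidden partitioning needs Iceberg), so treat this as an assumption-laden illustration rather than copy-paste DDL.

```python
# Sketch of the layouts discussed, expressed as Spark SQL strings. Table names
# are invented; exact syntax varies by engine and runtime version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Static partitioning: the layout is baked into a physical column.
spark.sql("""CREATE TABLE events_static (user_id STRING, ts TIMESTAMP, event_date DATE)
             USING parquet PARTITIONED BY (event_date)""")

# 2. Delta Lake: compact files and co-locate related rows with Z-ordering.
spark.sql("OPTIMIZE events_delta ZORDER BY (user_id)")

# 3. Iceberg hidden partitioning: the transform lives in table metadata;
#    queries simply filter on ts and pruning still happens.
spark.sql("""CREATE TABLE events_iceberg (user_id STRING, ts TIMESTAMP)
             USING iceberg PARTITIONED BY (days(ts))""")

# 4. Liquid clustering: declare clustering keys and let the engine maintain the layout.
spark.sql("CREATE TABLE events_liquid (user_id STRING, ts TIMESTAMP) CLUSTER BY (user_id)")
```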
Deep Dive: How Row-level Concurrency Works Out of the Box (4 minute read)

Databricks' row-level concurrency, enabled by default in Databricks Runtime 14.2+ with deletion vectors or Liquid Clustering, automatically resolves write conflicts at the row level for operations like MERGE, UPDATE, and DELETE, eliminating the need for complex data layouts or retry logic. Liquid Clustering avoids small file issues and provides granular concurrency guarantees, while deletion vectors mark rows as deleted without rewriting files, reconciling concurrent modifications efficiently during commits.
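
As a hedged illustration of the prerequisites, the sketch below opts a hypothetical Delta table into deletion vectors and then issues the kind of concurrent row-level writes the feature reconciles; the table name and workload are invented, and the property name follows the Delta Lake documentation.

```python
# Hedged sketch: opting a (hypothetical) Delta table into deletion vectors, the
# feature that lets concurrent row-level writes be reconciled at commit time
# instead of conflicting at file level. Requires a compatible Databricks Runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE orders
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Two writers touching different rows of the same files can now both commit:
spark.sql("DELETE FROM orders WHERE status = 'cancelled'")             # writer A
spark.sql("UPDATE orders SET status = 'shipped' WHERE order_id = 42")  # writer B
```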
🎁

Miscellaneous

Abusing DuckDB-Wasm to Create Doom in SQL (2 minute read)

A developer has recreated a simplified version of Doom using SQL and DuckDB-WASM. In this implementation, the entire game stateβ€”including world geometry, player position, and eventsβ€”is managed within the database. Rendering is achieved through SQL views that perform raycasting, while game actions like movement and shooting are executed via SQL statements, with events such as collisions resulting in DELETE operations. JavaScript's role is minimal, limited to handling keyboard input and assembling SQL queries. A link to the GitHub repository is available in the article.
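
The sketch below is not the author's implementation, just a toy showing the same pattern with DuckDB's Python API: game state as rows, a view deriving what the player can see, and a DELETE standing in for a hit.

```python
# Toy version of the "game state lives in the database" idea, not the author's
# implementation: entities are rows, a view derives what the player can see,
# and a hit becomes a DELETE.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE player (x DOUBLE, y DOUBLE)")
con.execute("CREATE TABLE enemies (id INTEGER, x DOUBLE, y DOUBLE)")
con.execute("INSERT INTO player VALUES (0, 0)")
con.execute("INSERT INTO enemies VALUES (1, 1.0, 1.0), (2, 8.0, 3.0)")

# Crude stand-in for the raycasting views: distance from the player to each enemy.
con.execute("""
    CREATE VIEW visible AS
    SELECT e.id, sqrt(power(e.x - p.x, 2) + power(e.y - p.y, 2)) AS dist
    FROM enemies e, player p
""")

# "Shooting": anything close enough is removed from the world via DELETE.
con.execute("DELETE FROM enemies WHERE id IN (SELECT id FROM visible WHERE dist < 2)")
print(con.execute("SELECT * FROM enemies").fetchall())  # only the far enemy remains
```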
Government Funding Graph RAG (19 minute read)

Graph Retrieval-Augmented Generation (Graph-RAG) can be used to analyze government funding data. By integrating structured graph databases with large language models (LLMs), Graph-RAG enhances the retrieval of relevant information, enabling more accurate and context-aware responses. This approach allows for efficient querying of complex funding relationships and dependencies, providing deeper insights into governmental financial allocations. This use case highlights the potential of combining graph structures with LLMs to improve data analysis and decision-making processes in public sector funding.
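
A minimal illustration of the pattern (not the article's actual pipeline): funding relationships stored as a graph, with retrieval meaning serializing a node's neighborhood into the LLM prompt as context. The agencies, programs, and relation names below are invented.

```python
# Illustrative Graph-RAG-style retrieval (not the article's pipeline): funding
# relationships live in a graph, and "retrieval" means pulling a node's
# neighborhood into the prompt. Agencies, programs, and relations are invented.
import networkx as nx

G = nx.DiGraph()
G.add_edge("Dept. of Energy", "Grid Modernization Grant", relation="funds")
G.add_edge("Grid Modernization Grant", "Acme Utilities", relation="awarded_to")
G.add_edge("Dept. of Energy", "Solar R&D Program", relation="funds")

def neighborhood_facts(node: str) -> list[str]:
    out_facts = [f"{u} --{d['relation']}--> {v}" for u, v, d in G.out_edges(node, data=True)]
    in_facts = [f"{u} --{d['relation']}--> {v}" for u, v, d in G.in_edges(node, data=True)]
    return in_facts + out_facts

context = "\n".join(neighborhood_facts("Grid Modernization Grant"))
prompt = f"Using these relationships:\n{context}\n\nWho ultimately funds Acme Utilities?"
print(prompt)  # this prompt would then be sent to an LLM
```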
Distributed Cloud Computing: Enhancing Privacy with AI-Driven Solutions (8 minute read)

Distributed cloud computing, enhanced by privacy-enhancing technologies (PETs) and AI, provides a secure and scalable framework for processing data without compromising individual privacy. PETs like homomorphic encryption and secure multi-party computation enable collaborative data analysis while maintaining strict privacy guarantees. Cloud service providers offer some of these capabilities through tools like AWS Clean Rooms and Microsoft Azure Purview, which leverage AI to optimize workflows, detect anomalies, reduce latency, and minimize the risk of data breaches.
⚑

Quick Links

Convert CSV to Excel with DuckDB, Polars, etc (1 minute read)

DuckDB, Polars, and Pandas can each convert a CSV file to Excel with a one-liner, covering a common business need for sharing data in a format stakeholders can easily consume.
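
The pandas and Polars one-liners look roughly like the sketch below (file names are examples; pandas needs openpyxl and Polars needs xlsxwriter installed). DuckDB's equivalent is a COPY ... TO an .xlsx target, but its exact syntax depends on the DuckDB version and extension, so it is not shown.

```python
# The pandas and Polars one-liners, roughly as described; file names are
# examples, and the optional Excel-writer dependencies must be installed.
import pandas as pd
import polars as pl

pd.read_csv("report.csv").to_excel("report_pandas.xlsx", index=False)
pl.read_csv("report.csv").write_excel("report_polars.xlsx")
```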
Today's LLMs craft exploits from patches at lightning speed (1 minute read)

Generative AI models have significantly accelerated the timeline from vulnerability disclosure to the creation of proof-of-concept (PoC) exploit code, reducing it to just a few hours.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to [email protected] and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.