ClickHouse and Parquet: A Foundation for Fast Lakehouse Analytics (28 minute read)
ClickHouse's query engine processes Apache Parquet, the columnar storage format behind open table formats like Iceberg and Delta Lake, with high efficiency: it parallelizes file reads, parsing, filtering, sorting, and aggregation, and uses metadata such as min/max statistics and Bloom filters to skip unnecessary data. Because it can query external Parquet files directly without ingestion, ClickHouse is inherently Lakehouse-ready and often outperforms systems querying their own native formats. A new native Parquet reader in development will add dictionary-based filtering, page-level min/max statistics, and optimizations like PREWHERE and lazy materialization to push performance further.
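As a minimal sketch of the "query Parquet in place" idea, the snippet below runs a ClickHouse query over an external Parquet file via the s3() table function, using the clickhouse-connect Python client. It assumes a running ClickHouse server; the bucket URL and column names are placeholders.

```python
# A minimal sketch of querying an external Parquet file directly from ClickHouse,
# assuming a running ClickHouse server and the clickhouse-connect client.
# The S3 URL and column names below are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# The s3() table function reads Parquet in place -- no ingestion step.
# Min/max statistics in the file let the engine skip row groups that
# cannot match the WHERE clause.
result = client.query(
    """
    SELECT toYear(pickup_date) AS year, count() AS trips
    FROM s3('https://example-bucket.s3.amazonaws.com/trips/*.parquet', 'Parquet')
    WHERE pickup_date >= '2024-01-01'
    GROUP BY year
    ORDER BY year
    """
)
for row in result.result_rows:
    print(row)
```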
|
Evolution of Product Classification at Shopify: From Categories to Comprehensive Product Understanding (4 minute read)
Shopify's product classification system uses Vision-Language Models (VLMs) and Shopify's Standard Product Taxonomy for scalable, accurate categorization. It runs on a Kubernetes cluster with NVIDIA GPUs and Dynamo for model serving, batching and validating incoming requests dynamically, then making a two-stage prediction: first the product category, then its attributes. Transaction-like handling, retries, and quality monitoring keep results consistent; outputs are validated against taxonomy rules, formatted, stored, and followed by completion notifications.
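To make the two-stage flow concrete, here is a hypothetical, heavily simplified sketch in plain Python. None of this is Shopify's code: the taxonomy, the stand-in model calls, and all names are invented purely to illustrate "predict category, then attributes, then validate against taxonomy rules."

```python
# Hypothetical sketch of a two-stage prediction validated against a taxonomy.
# The taxonomy, model stand-ins, and field names are invented for illustration.

TAXONOMY = {
    "Apparel > Shoes": {"size": {"7", "8", "9"}, "color": {"black", "white"}},
    "Home > Mugs": {"capacity_ml": {"250", "350"}, "color": {"black", "white"}},
}

def predict_category(title: str) -> str:
    """Stage 1: stand-in for a VLM call that maps a product to a taxonomy category."""
    return "Home > Mugs" if "mug" in title.lower() else "Apparel > Shoes"

def predict_attributes(title: str, category: str) -> dict:
    """Stage 2: stand-in for a VLM call constrained to the category's attribute schema."""
    return {"capacity_ml": "350", "color": "black"}

def validate(category: str, attributes: dict) -> dict:
    """Drop any attribute value the taxonomy does not allow for this category."""
    allowed = TAXONOMY[category]
    return {k: v for k, v in attributes.items() if k in allowed and v in allowed[k]}

title = "Ceramic coffee mug, 350 ml"
category = predict_category(title)
attributes = validate(category, predict_attributes(title, category))
print(category, attributes)
```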
|
From DuckDB to DuckHouse: Scalable, S3-Backed Lakehouse with DuckDB (8 minute read)
DuckHouse extends DuckDB beyond a local OLAP engine by pairing it with Apache Iceberg and a Flight SQL server, creating a lightweight, fully S3-backed lakehouse that supports concurrent reads and writes and integrates seamlessly with dbt. Small, frequent queries run in-process on DuckDB for sub-second performance, while large tables live in Iceberg for cloud-scale storage and interoperability. This “best of both” stack delivers cost-efficient durability, simple deployment (single Flight server), and horizontal scalability for modern data platforms. The post includes a link to the complete demonstration code on GitHub.
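A rough sketch of the "small queries in-process, large tables in Iceberg" split, using DuckDB's Python API with its httpfs and iceberg extensions. The bucket path is a placeholder, S3 credentials are assumed to be configured, and the Flight SQL server layer is not shown.

```python
# Minimal sketch of the DuckDB-plus-Iceberg split described above: quick queries
# run in-process, while large tables are scanned from Iceberg on S3.
# Assumes the duckdb package with its httpfs and iceberg extensions and
# configured S3 credentials; the bucket path is a placeholder.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Small, hot data can live in plain local DuckDB tables for sub-second queries.
con.execute("CREATE TABLE recent_events AS SELECT 1 AS user_id, 'click' AS action")

# Large, durable tables are read straight from an Iceberg table on S3.
rows = con.sql(
    "SELECT count(*) FROM iceberg_scan('s3://example-bucket/warehouse/events')"
).fetchall()
print(rows)
```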
|
|
AI Won't Save You From Your Data Modeling Problems (8 minute read)
GenAI's need for real-time, context-rich decision-making exposes the limitations of traditional “clean-up-in-analytics” data quality strategies. Embedding AI at the operational layer demands polyglot data modeling—integrating relational, graph, document, and vector data—to unify structured and unstructured sources and provide accurate, up-to-date context. Robust data models and knowledge graphs act as ground truth, enabling reliable agent actions, enforcing business rules, and reducing error propagation. For AI-driven workflows to scale and deliver trustworthy results, data teams must shift left and prioritize comprehensive, flexible modeling across the entire data stack.
|
No, Really, Everything Becomes BI (10 minute read)
Organizations typically treat AI as a passive BI tool—summarizing data for human judgment—yet decision quality hinges on context that only people possess. By reversing this dynamic and having AI actively query executives for their proprietary knowledge and gut insights, companies can build richer, personalized context windows and enable more precise, closed-loop decision-making. As BI has evolved from pivot tables to chatbots on a semantic layer, the next frontier is for it to absorb our conversations and documents, along with our insights and reactions.
|
Why APIs Are Essential and MCP Is Optional (for Now) (3 minute read)
The Model Context Protocol (MCP) is rapidly gaining adoption among leading LLM providers such as Anthropic, OpenAI, and Google's Gemini, offering a standardized way for AI assistants to manage context when interfacing with external data. The protocol still fundamentally relies on APIs for core data transmission and authentication, and it is most valuable in complex architectures that require structured contracts. Robust API infrastructure remains essential for security, access control, and operational clarity, so data teams should build strong API and retrieval foundations before layering in MCP for context management, and be cautious about adopting MCP prematurely as a substitute for sound API design.
|
|
ServiceNow Acquires Data.World to Bolster its AI-ready Data Capabilities (3 minute read)
ServiceNow has agreed to acquire Data.World—a cloud-native data catalog and governance platform valued at $350 million in 2022—to help customers organize and search their data for large-scale AI deployments. This strategic buy follows March's $2.85 billion Moveworks acquisition and embeds metadata management and purpose-limitation controls into ServiceNow's workflow automation suite. By integrating Data.World's cataloging tools, ServiceNow aims to make enterprise data “AI-ready,” reducing the “data hell” barrier and enhancing the effectiveness of its agentic AI solutions.
|
Announcing pg_parquet v0.4.0: Google Cloud Storage, HTTPS Storage, and More (8 minute read)
pg_parquet is a PostgreSQL extension that allows users to read and write Parquet files. Release 0.4 enables direct import/export of Parquet files in Postgres via native COPY commands and supports Google Cloud Storage, HTTP(S), stdin/stdout, and extended data types (UUID, JSON, and JSONB). This streamlines data migration to and from cloud data lakes and warehouses like Snowflake, Redshift, and ClickHouse without third-party tools. pg_parquet offers lightweight, high-performance integration for archiving, lakehouse population, and cross-system analytics, ready for modern, cloud-centric data workflows.
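A small sketch of the COPY-based Parquet export/import described above, driven from Python with psycopg2. It assumes a Postgres server with the pg_parquet extension installed and credentials for the object store; the bucket and table names are placeholders.

```python
# Sketch of pg_parquet's COPY-based Parquet import/export, run from Python.
# Assumes Postgres with the pg_parquet extension and object-store credentials;
# bucket and table names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Export a table to Parquet on Google Cloud Storage (supported as of 0.4).
cur.execute(
    "COPY public.orders TO 'gs://example-bucket/exports/orders.parquet' "
    "WITH (format 'parquet')"
)

# Import the same file back into another table.
cur.execute(
    "COPY public.orders_archive FROM 'gs://example-bucket/exports/orders.parquet' "
    "WITH (format 'parquet')"
)
cur.close()
conn.close()
```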
|
toyDB (GitHub Repo)
toyDB is a distributed SQL database built in Rust that focuses on simplicity and educational purposes. It features Raft-based consensus, ACID transactions, and a pluggable storage engine. While not optimized for performance, toyDB supports basic SQL features like joins, aggregates, and transactions. It includes a benchmarking tool to measure workload performance across different configurations.
|
|
Back To The Basics: What Is Columnar Storage (7 minute read)
Columnar storage optimizes analytical queries by processing only relevant columns instead of reading entire rows. Its encoding mechanism (dictionary encoding, run-length encoding, and bit-packing) maximizes storage efficiency, while separate column storage allows parallel reads across multiple CPU cores or nodes, accelerating queries in engines like DuckDB, Presto, and Spark. Columnar file formats (Apache Parquet and Apache ORC) abstract away the complexity to achieve fast query speeds and storage efficiency. Uber adopted Parquet with ZSTD compression, reducing file sizes by up to 39%. Criteo migrated a petabyte of data from RCFile to Parquet, achieving better integration with Spark and Impala and improved compression for storage savings.
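To make the encoding mechanisms concrete, here is a tiny worked example in plain Python (no libraries) showing how dictionary encoding and run-length encoding shrink a repetitive column; the data is invented for illustration.

```python
# A tiny worked example of the encodings mentioned above: dictionary encoding
# maps repeated string values to small integer codes, and run-length encoding
# collapses consecutive repeats into (code, run_length) pairs.
from itertools import groupby

column = ["US", "US", "US", "DE", "DE", "US", "US", "FR", "FR", "FR", "FR"]

# Dictionary encoding: each distinct value gets an integer code.
dictionary = {value: code for code, value in enumerate(dict.fromkeys(column))}
encoded = [dictionary[value] for value in column]

# Run-length encoding over the dictionary codes.
rle = [(code, len(list(run))) for code, run in groupby(encoded)]

print("dictionary:", dictionary)   # {'US': 0, 'DE': 1, 'FR': 2}
print("codes:     ", encoded)      # [0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 2]
print("rle:       ", rle)          # [(0, 3), (1, 2), (0, 2), (2, 4)]
```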
|
Real-Time Database Change Tracking in Go: Implementing PostgreSQL CDC (2 minute read)
Implementing Change Data Capture (CDC) in Go with PostgreSQL's logical replication and the pgx/pglogrepl libraries offers low-latency, event-driven data pipelines without significant database performance impact. While PostgreSQL CDC requires tables to have primary keys and careful schema management, this approach enables granular, real-time tracking of database changes, making it ideal for modern analytics, microservices, and data sync needs.
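The article implements CDC in Go with pgx/pglogrepl; as an analogous, assumption-laden sketch, the snippet below consumes a logical replication stream in Python using psycopg2's replication support. The slot name, output plugin, and connection string are placeholders, and the server must have wal_level=logical.

```python
# Analogous minimal sketch of logical-replication CDC using psycopg2 instead
# of the article's Go/pgx/pglogrepl stack. Assumes wal_level=logical; slot
# name, output plugin, and DSN are placeholders.
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "dbname=app user=postgres", connection_factory=LogicalReplicationConnection
)
cur = conn.cursor()

# Create a replication slot (once) and start streaming decoded changes.
cur.create_replication_slot("cdc_demo", output_plugin="test_decoding")
cur.start_replication(slot_name="cdc_demo", decode=True)

def handle_change(msg):
    # Each message is one decoded change (INSERT/UPDATE/DELETE).
    print(msg.payload)
    # Acknowledge so the server can recycle WAL.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, invoking handle_change per change
```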
|
|
CREATE INDEX: Data Types Matter (2 minute read)
The choice of data type significantly impacts PostgreSQL index creation performance, with integer indexes building fastest due to CPU efficiency, while floating-point and numeric types are slower, and text is the most expensive.
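A rough way to reproduce the comparison yourself: time CREATE INDEX on an integer column versus a text column of the same synthetic table. The sketch below assumes a local Postgres you can connect to; table size and timings are arbitrary and will vary by hardware.

```python
# Rough sketch of benchmarking CREATE INDEX on different column types.
# Assumes a reachable Postgres; the table is synthetic and timings will vary.
import time
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS idx_bench")
cur.execute(
    "CREATE TABLE idx_bench AS "
    "SELECT g AS int_col, md5(g::text) AS text_col "
    "FROM generate_series(1, 1000000) AS g"
)

for column in ("int_col", "text_col"):
    start = time.perf_counter()
    cur.execute(f"CREATE INDEX ON idx_bench ({column})")
    print(f"{column}: {time.perf_counter() - start:.2f}s")
```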
|
Public-apis (GitHub Repo)
The Public APIs repository is an extensive and manually curated list of public APIs from many domains that you can use for your own products.
|
gmail-to-sqlite (GitHub Repo)
The gmail-to-sqlite project allows users to sync their Gmail account's emails into a SQLite database, making it easier to analyze email data through SQL queries.
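Once the sync has run, analysis is ordinary SQL against the SQLite file, as in the sketch below using Python's sqlite3. The database path, table, and column names are assumptions about the project's schema, not taken from its documentation.

```python
# Example analysis against the synced database. The table and column names
# are assumptions about the project's schema -- adjust to what the tool produces.
import sqlite3

conn = sqlite3.connect("gmail.db")
top_senders = conn.execute(
    """
    SELECT sender, COUNT(*) AS emails
    FROM messages
    GROUP BY sender
    ORDER BY emails DESC
    LIMIT 10
    """
).fetchall()
for sender, count in top_senders:
    print(f"{count:5d}  {sender}")
```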
|
|