TLDR

Together With AWS

TLDR Data 2025-10-30

Data architecture tools and guides for your next AI project (Sponsor)

Accelerate your AI/ML initiatives with enterprise-ready solutions in AWS Marketplace. From vector databases to ML workflow orchestration, explore our tools today to scale your AI applications while maintaining security standards.

Discover our technical guides to streamline implementation, or start your free trial to see how these solutions can transform your AI development journey from proof-of-concept to production.

📱

Deep Dives

Datology's Distributed Pipelines for Handling PBs of Image Data (15 minute read)

Datology built distributed pipelines to curate petabyte-scale image datasets, enabling AI researchers to deduplicate, filter, and cluster billions of web images using custom Spark/Ray operations. The pipelines run on a modified Flyte orchestrator and a Postgres catalog and deploy directly into customer environments, with the goal of faster, cheaper, and more efficient models.
We Built a Vector Search Engine that Lets You Choose Precision at Query Time (26 minute read)

ClickHouse's QBit is a new column type that stores floating-point vectors as bit planes, enabling users to dynamically choose precision at query time by reading only the needed bits. This eliminates upfront quantization trade-offs, reduces I/O and compute by up to 75%, and delivers tunable recall vs. speed.
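
To make the bit-plane idea concrete, here is a minimal sketch in plain Python. It is not ClickHouse's implementation: the function names and the 32-bit float layout are assumptions chosen for illustration. Plane `i` holds bit `i` of every element, so a reader that wants lower precision simply stops after the first few planes.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the IEEE 754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b: int) -> float:
    """Inverse of float_to_bits."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def to_bit_planes(vector):
    """Transpose a vector into 32 planes; plane 0 is the most significant bit."""
    patterns = [float_to_bits(x) for x in vector]
    return [[(p >> (31 - i)) & 1 for p in patterns] for i in range(32)]

def from_bit_planes(planes, precision, dim):
    """Rebuild the vector while reading only the first `precision` planes.

    Unread low-order bits are treated as zero, which truncates the mantissa.
    """
    patterns = [0] * dim
    for i in range(precision):
        for j, bit in enumerate(planes[i]):
            patterns[j] |= bit << (31 - i)
    return [bits_to_float(p) for p in patterns]

planes = to_bit_planes([0.5, -1.25, 3.0])
full = from_bit_planes(planes, 32, 3)    # all bits: exact round trip
coarse = from_bit_planes(planes, 9, 3)   # sign + exponent only
```

Reading 16 of 32 planes touches half the stored bits, which is where the I/O saving in a scheme like this would come from.
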
How Nubank Built an In-house Logging Platform for 1 Trillion Log Entries (5 minute read)

Nubank replaced a costly, inflexible third-party logging system with an in-house platform that ingests 1 trillion logs daily using Fluent Bit, micro-batching with custom buffering, and processing services. It stores 45 PB in Parquet on AWS S3 and enables 15,000 fast Trino queries per day while cutting costs by 50%.
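
The summary does not describe Nubank's buffering logic, so the following is only a generic sketch of micro-batching: entries accumulate in memory and are flushed as one batch when the buffer is full or its oldest entry is too old. `MicroBatcher`, `sink`, and both thresholds are hypothetical names and values, not taken from the article.

```python
import time

class MicroBatcher:
    """Collect entries and hand them to `sink` as one list, by size or age."""

    def __init__(self, sink, max_items=1000, max_age_s=5.0, clock=time.monotonic):
        self.sink = sink            # callable that receives a list of entries
        self.max_items = max_items
        self.max_age_s = max_age_s
        self.clock = clock          # injectable so tests can fake time
        self.buffer = []
        self.first_at = None        # arrival time of the oldest buffered entry

    def add(self, entry):
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(entry)
        too_big = len(self.buffer) >= self.max_items
        too_old = self.clock() - self.first_at >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Emit whatever is buffered; call once more on shutdown."""
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.sink(batch)

batches = []
batcher = MicroBatcher(batches.append, max_items=3, max_age_s=60.0)
for line in ["a", "b", "c", "d"]:
    batcher.add(line)
batcher.flush()  # drain the remainder
```

Note that this sketch only checks age when a new entry arrives; a production version would also need a timer so a quiet stream still flushes.
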
Vector Sync Patterns: Keeping AI Features Fresh When Your Data Changes (51 minute video)

Vector embeddings in AI apps become stale when source data, models, or business rules change, requiring event-driven synchronization via CDC, Kafka, and Flink to keep semantic search and RAG accurate. Five vector sync patterns, such as Dependency-Aware Propagator and Versioned Vector Registry, are designed to solve this complex, multi-dimensional challenge.
🚀

Opinions & Advice

Beyond the Perimeter: Practical Patterns for Fine-Grained Data Access (100 minute podcast)

Composable data stacks fracture Identity, Credentials, and Access Management. This discussion maps how to restore identity and auditability: propagate short-lived JWT/OIDC tokens across hops, externalize policy (OPA/Rego, Cedar), enforce it via database RLS/CLS or proxies, label data from catalog and lineage metadata, bind policy to data (OpenTDF), and log provenance. Bottom line: compose trust across identity, policy, and data paths, secure choke points, standardize claims, and design streaming interfaces to avoid brittle per-system hacks.
Are We Thinking About Ontologies Wrong? (16 minute read)

The Resource Description Framework's (RDF) reliance on global Internationalized Resource Identifiers (IRIs) is limiting for real-world data modeling, where context and local scoping often define semantics more effectively than strict global standards. Knowledge graphs benefit from contextual, composable property shapes (via SHACL), late binding, and scoped ontologies, which enable schema evolution, effective disambiguation, and federated interoperability without global consensus on identifiers (via blank nodes). Applying these ideas supports continuous integration of new facts, entity resolution, and knowledge refinement as world views evolve.
💻

Launches & Tools

Write Kafka streams directly to S3 to slash 80% of costs (Sponsor)

Aiven Inkless is diskless Kafka. Run sub-100ms streams and 80% cheaper batch topics — in the same Kafka cluster. Inkless replaces Kafka I/O with cloud-native storage to deliver data persistence via decentralized architecture, deployed as a stateless service directly in your VPC. See how much you can save
SQLite Graph Database Extension (GitHub Repo)

sqlite-graph is a SQLite extension that turns SQLite into a graph database, letting you store nodes and relationships and query them using the Cypher graph language. It supports creating and querying graph structures directly from SQL or Cypher, with basic graph algorithms and Python bindings included. Still in alpha, it is useful for prototyping graph workloads without needing a separate graph database.
Iceberg CDC: Stream a Little Dream of Me (11 minute read)

Apache Iceberg's immutable snapshots excel at batch processing, but struggle with real-time CDC: frequent small updates rely on costly equality deletes. Iceberg v3 introduces deletion vectors for precise row masking without full scans, and row lineage for stable identities, enabling efficient CDC views. v4 proposes a single Root Manifest per snapshot to consolidate deltas, allowing CDC readers to diff changes with minimal I/O.
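
As a rough illustration of the difference, the sketch below models a deletion vector as a bitmap of row positions within one data file, next to an equality delete that must compare values for every row. It is a toy model, not Iceberg's file format or API; the function names are hypothetical and a Python integer stands in for the bitmap.

```python
def make_deletion_vector(positions):
    """Build a bitmap with one set bit per deleted row position."""
    dv = 0
    for pos in positions:
        dv |= 1 << pos
    return dv

def read_with_deletion_vector(rows, dv):
    """Mask rows by position: no value comparison is needed."""
    return [row for pos, row in enumerate(rows) if not (dv >> pos) & 1]

def read_with_equality_deletes(rows, key, deleted_keys):
    """Equality deletes: every row's key must be checked against the delete set."""
    return [row for row in rows if row[key] not in deleted_keys]

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
dv = make_deletion_vector([1])                   # row at position 1 is deleted
live = read_with_deletion_vector(rows, dv)
same = read_with_equality_deletes(rows, "id", {2})
```

Both reads return the same live rows here; the positional mask gets there without inspecting row contents.
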
Valkey 9.0 Debuts Multidatabase Clustering for Massive-Scale Workloads (3 minute read)

Valkey is an in-memory datastore, backward-compatible with Redis and released under the BSD 3-Clause License. Valkey 9.0 introduces multidatabase clustering, atomic slot migration, and major performance optimizations, scaling to over 1 billion requests per second. This aims to ease migration for large-scale, production-critical workloads across cloud and on-premises environments.
What's New in Apache Polaris 1.2.0: Fine-Grained Access, Event Persistence, and Better Federation (4 minute read)

Apache Polaris 1.2.0 introduces granular access controls, sub-catalog RBAC for federated catalogs, and persistent catalog event logging (currently in preview), enhancing governance and observability across multi-engine Iceberg lakehouses. Additional features include IAM-based authentication for Amazon RDS/Aurora PostgreSQL, extended S3-compatible storage support, and streamlined credential management. The release strengthens security, catalog integrity, and operational flexibility.
🎁

Miscellaneous

Backpressure in Distributed Systems (20 minute read)

Backpressure happens when fast producers overwhelm slower consumers, causing memory issues, dropped data, or high latency. Systems handle it by slowing producers, dropping queued or incoming messages, or scaling consumers so processing keeps pace. The key insight for data professionals is to design pipelines with explicit backpressure strategies rather than relying on infinite buffering, which helps maintain stability and predictable performance in distributed systems.
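
A minimal sketch of two of these strategies, using only Python's standard library: a bounded `queue.Queue` either blocks the producer until the consumer catches up, or rejects the item so the producer can drop it. The function name, capacity, and sleep time are illustrative assumptions.

```python
import queue
import threading
import time

def run_pipeline(n_items, capacity, drop_when_full):
    """Push n_items through a bounded buffer; return (processed, dropped)."""
    buf = queue.Queue(maxsize=capacity)   # bounded: never grows past capacity
    processed = []
    dropped = 0

    def consumer():
        while True:
            item = buf.get()
            if item is None:              # sentinel: producer is done
                return
            time.sleep(0.001)             # simulate a slow consumer
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    for i in range(n_items):
        if drop_when_full:
            try:
                buf.put_nowait(i)         # shed load instead of waiting
            except queue.Full:
                dropped += 1
        else:
            buf.put(i)                    # block until there is room

    buf.put(None)                         # blocking put, so the sentinel is never dropped
    worker.join()
    return processed, dropped
```

With `drop_when_full=False` the producer is slowed to the consumer's pace and nothing is lost; with `drop_when_full=True` the producer never waits, and the drop count shows how much load was shed. Memory stays bounded by `capacity` in both modes.
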
huggingface_hub v1.0: Five Years of Building the Foundation of Open Machine Learning (8 minute read)

huggingface_hub has reached v1.0 after five years. It now serves as a core dependency for 200,000+ repositories and powers access to over 2 million models, 500,000 datasets, and 1 million Spaces. Version 1.0 delivers major upgrades: a migration to httpx (enabling HTTP/2 and unified async/sync APIs), hf_xet for chunk-based file transfers (77 PB migrated), and a fully revamped CLI built on Typer.

Quick Links

Streaming Datasets: 100x More Efficient (4 minute read)

Hugging Face's new `streaming=True` enables instant high-speed training on multi-terabyte remote datasets, often faster than local SSDs.
OpenTelemetry Adoption Update: Rust, Prometheus and Other Speed Bumps (5 minute read)

OpenTelemetry is becoming the standard for observability, but adoption is slowed by complexity and incomplete language support, especially for Rust.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

