Datology's Distributed Pipelines for Handling PBs of Image Data (15 minute read)
Datology built distributed pipelines to curate petabyte-scale image datasets, enabling AI researchers to deduplicate, filter, and cluster billions of web images using custom Spark/Ray operations. Powered by a modified Flyte orchestrator and a Postgres catalog, the platform deploys seamlessly into customer environments, delivering faster, cheaper, and more efficient models.
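To make the curation step concrete, here is a minimal PySpark sketch of one such operation, exact-duplicate removal by content hash. The dataset path, column names, and hashing choice are assumptions for illustration, not Datology's actual implementation (their pipelines also cover filtering and clustering).

```python
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("image-dedup-sketch").getOrCreate()

# Assume a Parquet dataset with (url, image_bytes) columns -- hypothetical path.
images = spark.read.parquet("s3://bucket/images/")

@F.udf(returnType=StringType())
def content_hash(image_bytes):
    # Exact dedup via content hash; real curation pipelines also use
    # perceptual or embedding-based near-duplicate detection.
    return hashlib.sha256(image_bytes).hexdigest()

deduped = (
    images
    .withColumn("hash", content_hash("image_bytes"))
    .dropDuplicates(["hash"])
)
deduped.write.mode("overwrite").parquet("s3://bucket/images_deduped/")
```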
Vector Sync Patterns: Keeping AI Features Fresh When Your Data Changes (51 minute video)
Vector embeddings in AI apps become stale when source data, models, or business rules change, requiring event-driven synchronization via CDC, Kafka, and Flink to keep semantic search and RAG accurate. The talk presents five vector sync patterns, such as the Dependency-Aware Propagator and the Versioned Vector Registry, designed to solve this complex, multi-dimensional challenge.
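As one concrete sketch of the event-driven approach, the loop below consumes Debezium-style CDC events from Kafka, re-embeds changed rows, and upserts versioned vectors. The topic name, payload shape, and the embed/VectorStore stubs are assumptions, not the talk's reference implementation.

```python
import json

from kafka import KafkaConsumer  # kafka-python

def embed(text):
    """Stand-in for a real embedding-model call."""
    return [float(len(text))]  # dummy vector

class VectorStore:
    """Stand-in for a real vector database client."""
    def __init__(self):
        self.rows = {}
    def upsert(self, id, vector, metadata):
        self.rows[id] = (vector, metadata)
    def delete(self, id):
        self.rows.pop(id, None)

store = VectorStore()
consumer = KafkaConsumer(
    "cdc.public.documents",  # hypothetical Debezium-style topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for event in consumer:
    change = event.value
    if change["after"] is None:               # row deleted upstream
        store.delete(change["before"]["id"])
        continue
    row = change["after"]
    store.upsert(
        id=row["id"],
        vector=embed(row["body"]),
        # Record model + source versions so stale vectors are detectable when
        # the row or the embedding model changes (the versioned-registry idea).
        metadata={"model": "emb-v1", "ts_ms": change["ts_ms"]},
    )
```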
Beyond the Perimeter: Practical Patterns for Fine-Grained Data Access (100 minute podcast)
Composable data stacks fracture Identity, Credentials, and Access Management (ICAM). This discussion maps how to restore identity and auditability: propagate short-lived JWT/OIDC tokens across hops, externalize policy (OPA/Rego, Cedar), enforce it via database RLS/CLS or proxies, derive data labels from the catalog and lineage, bind policy directly to data (OpenTDF), and log provenance. Bottom line: compose trust across identity, policy, and data paths, secure the choke points, standardize claims, and design streaming interfaces to avoid brittle per-system hacks.
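As one concrete sketch of the "externalize policy" pattern, the snippet below has a Python data service forward the caller's JWT claims to OPA's data API and enforce the boolean decision locally. The policy path and claim names are assumptions.

```python
import requests

def is_allowed(claims: dict, table: str, columns: list[str]) -> bool:
    # POST the request context as policy input to OPA's data API;
    # the decision logic itself lives in Rego, outside this service.
    decision = requests.post(
        "http://localhost:8181/v1/data/datastack/allow",  # hypothetical policy path
        json={"input": {"claims": claims, "table": table, "columns": columns}},
        timeout=2,
    ).json()
    return decision.get("result", False)

claims = {"sub": "analyst-42", "groups": ["finance"]}  # decoded from a short-lived JWT
if is_allowed(claims, "payments", ["amount", "merchant"]):
    ...  # run the query, with row/column-level filters still applied downstream
```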
Are We Thinking About Ontologies Wrong? (16 minute read)
The Resource Description Framework's (RDF) reliance on global Internationalized Resource Identifiers (IRIs) limits real-world data modeling, where context and local scoping often define semantics more effectively than strict global standards. Knowledge graphs benefit from contextual, composable property shapes (via SHACL), late binding, and scoped ontologies, enabling schema evolution, effective disambiguation, and federated interoperability without global identifier consensus (via blank nodes). Applying these ideas enables continuous integration of new facts, entity resolution, and knowledge refinement into ever-evolving world views.
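A small rdflib sketch of the local-scoping idea: model an entity as a blank node with contextual properties instead of minting a global IRI up front, deferring identity to later entity resolution. The namespace and properties are invented for illustration.

```python
from rdflib import RDF, BNode, Graph, Literal, Namespace

EX = Namespace("http://example.org/schema/")  # hypothetical local vocabulary

g = Graph()
person = BNode()  # locally scoped node: no global identifier consensus needed
g.add((person, RDF.type, EX.Person))
g.add((person, EX.name, Literal("Ada Lovelace")))
g.add((person, EX.context, Literal("payroll-system")))  # context carried as data

# Later, entity resolution can bind this node to a shared identifier, and
# SHACL shapes can validate it against a scoped, composable schema.
print(g.serialize(format="turtle"))
```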
SQLite Graph Database Extension (GitHub Repo)
sqlite-graph is a SQLite extension that turns SQLite into a graph database, letting you store nodes and relationships and query them with the Cypher graph query language. It supports creating and querying graph structures directly from SQL or Cypher, with basic graph algorithms and Python bindings included. Still in alpha, it is useful for prototyping graph workloads without needing a separate graph database.
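A hedged sketch of loading the extension from Python's stdlib sqlite3 module; the shared-library name and the Cypher entry point below are assumptions based on this summary, so check the repo's README for the actual API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension("./libsqlite_graph")  # hypothetical build artifact name
conn.enable_load_extension(False)

# Hypothetical Cypher-over-SQL entry point; the repo documents the real one.
for row in conn.execute(
    "SELECT * FROM cypher('MATCH (a:Person)-[:KNOWS]->(b) RETURN a, b')"
):
    print(row)
```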
Iceberg CDC: Stream a Little Dream of Me (11 minute read)
Apache Iceberg's immutable snapshots excel at batch processing but struggle with real-time CDC, where frequent small updates rely on costly equality deletes. Iceberg v3 introduces deletion vectors for precise row masking without full scans, plus row lineage for stable row identities, enabling efficient CDC views. v4 proposes a single Root Manifest per snapshot to consolidate deltas, letting CDC readers diff changes with minimal I/O.
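A conceptual Python sketch (not Iceberg's implementation) of why deletion vectors beat equality deletes for CDC: a delete becomes a positional bit to mask at read time, rather than a predicate that must be re-evaluated against every row.

```python
rows = ["r0", "r1", "r2", "r3", "r4"]  # rows in one data file

# Equality-delete style: keep predicates, re-check every row on read.
equality_deletes = [lambda r: r == "r2"]
visible_eq = [r for r in rows if not any(p(r) for p in equality_deletes)]

# Deletion-vector style: a bitmap of deleted row positions for the file.
deletion_vector = {2}  # row at position 2 is deleted
visible_dv = [r for i, r in enumerate(rows) if i not in deletion_vector]

assert visible_eq == visible_dv == ["r0", "r1", "r3", "r4"]
```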
Valkey 9.0 Debuts Multidatabase Clustering for Massive-Scale Workloads (3 minute read)
Valkey is an in-memory datastore, backward-compatible with Redis and licensed under BSD 3-Clause. Valkey 9.0 introduces multidatabase clustering, atomic slot migration, and major performance optimizations, scaling to over 1 billion requests per second. The release aims to ease migration for large-scale, production-critical workloads across cloud and on-premises environments.
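A minimal sketch of the Redis compatibility, using redis-py against a hypothetical Valkey endpoint: logical databases isolate keyspaces, which 9.0's multidatabase clustering extends to cluster mode. The host and keys are placeholders.

```python
import redis

# Same endpoint, two logical databases -- isolated keyspaces per tenant.
tenant_a = redis.Redis(host="valkey.internal", port=6379, db=0)
tenant_b = redis.Redis(host="valkey.internal", port=6379, db=1)

tenant_a.set("session:123", "alice")
tenant_b.set("session:123", "bob")  # same key, no collision across databases

assert tenant_a.get("session:123") != tenant_b.get("session:123")
```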
What's New in Apache Polaris 1.2.0: Fine-Grained Access, Event Persistence, and Better Federation (4 minute read)
Apache Polaris 1.2.0 introduces granular access controls, sub-catalog RBAC for federated catalogs, and persistent catalog event logging (currently in preview), enhancing governance and observability across multi-engine Iceberg lakehouses. Additional features include IAM-based authentication for Amazon RDS/Aurora PostgreSQL, extended S3-compatible storage support, and streamlined credential management. The release strengthens security, catalog integrity, and operational flexibility.
Backpressure in Distributed Systems (20 minute read)
Backpressure happens when fast producers overwhelm slower consumers, causing memory issues, dropped data, or high latency. Systems handle it by slowing producers, dropping queued or incoming messages, or scaling consumers so processing keeps pace. The key insight for data professionals is to design pipelines with explicit backpressure strategies rather than relying on infinite buffering, which helps maintain stability and predictable performance in distributed systems.
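A minimal Python sketch of the "slow the producer" strategy using a bounded queue: put() blocks when the consumer falls behind, so the bound itself is the backpressure mechanism rather than unbounded buffering.

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=100)  # the bound IS the backpressure mechanism

def producer():
    for i in range(10_000):
        buf.put(i)  # blocks when the queue is full -> producer slows down

def consumer():
    while True:
        item = buf.get()
        time.sleep(0.001)  # simulate a slower consumer
        buf.task_done()

p = threading.Thread(target=producer)
p.start()
threading.Thread(target=consumer, daemon=True).start()
p.join()    # wait for the producer (it was throttled by the blocking put)
buf.join()  # then wait for the consumer to drain the remaining items
```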