Eventual-Inc
Daft
Latest Results
feat(functions): add string distance/similarity functions

- add levenshtein_distance, jaro_similarity, jaro_winkler_similarity, damerau_levenshtein_distance
- pure Rust implementations with no external dependencies, following hamming_distance_str pattern
- expose as top-level daft.functions API and Expression methods
- handle null inputs (return null) and null-typed columns (DataType::Null)
- include 24 pytest test cases covering correctness, edge cases, and null handling
nish2292:feat/string-distance-functions
12 minutes ago
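To make the first of the listed functions concrete, here is a minimal Python sketch of Levenshtein distance with the null handling the commit describes (null in, null out). It is not Daft's Rust implementation; the name `levenshtein_sketch` is hypothetical, and `None` stands in for a null value.

```python
from typing import Optional


def levenshtein_sketch(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Edit distance counting insertions, deletions, and substitutions.

    Hypothetical helper for illustration only; None models a null input.
    """
    # Null inputs propagate: if either side is null, the result is null.
    if a is None or b is None:
        return None
    # Keep the shorter string as the inner dimension to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute (or match)
                )
            )
        prev = curr
    return prev[-1]
```

The two-row formulation keeps memory proportional to the shorter string rather than to the product of both lengths.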
perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (#7059)

## Summary

- Bumps `tikv-jemallocator` from 0.6.x to 0.7.0 and `tikv-jemalloc-ctl` from 0.6.x to 0.7.0 (underlying jemalloc 5.3.0 → 5.3.1)
- jemalloc 5.3.0 has a bug where `muzzy_decay_ms` values other than `-1` cause severe performance degradation (up to 7x slower in Polars benchmarks)
- Daft sets `muzzy_decay_ms:1000` on Linux and `muzzy_decay_ms:0` on macOS; both hit this bug

## Context

Polars discovered and fixed this in [pola-rs/polars#27797](https://github.com/pola-rs/polars/pull/27797). The jemalloc 5.3.1 release fixes the muzzy decay codepath so non-`-1` values no longer tank performance.

## Test plan

- [ ] CI passes on Linux (where `muzzy_decay_ms:1000` was actively hitting the bug)
- [ ] Local macOS build succeeds (where `muzzy_decay_ms:0` was hitting the bug)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
main
40 minutes ago
feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (#7067)

Same as https://github.com/Eventual-Inc/Daft/pull/7065, but targeting main.
main
1 hour ago
clippy
euan/asof-join-aligned-api-surface
2 hours ago
fix(shuffle): compact flight repartition accumulators to bound map-task memory

The flight repartition sink accumulates a map task's partitioned input as raw partition slices until finalize writes them in one shot. At high partition counts the per-slice allocator slack (alignment, growth headroom) dwarfs the data: ~1.7GB of measured input reached 40+GB resident at 8192 partitions on TPC-H SF1000 (512 x ~2GB files), OOM-killing workers ~60s into the map phase.

Fuse the accumulated output into one batch per partition on every flush, so resident memory tracks the measured data size instead of slice-count-dependent slack. The write path and on-disk layout are unchanged.

Validated on a 33-node cluster: 1TB/8192-partition shuffle goes from worker OOM to completing with exact row conservation (5,999,989,709 rows).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
colin/flight-shuffle-map-side-memory
2 hours ago
fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (#7056)

## Problem

The flight-shuffle coordinator retains one `FlightPartitionRef` per (map task × output partition) and transposes the full matrix before emitting any reduce task. At 10k map tasks × 8192 partitions that is ~82M refs (~26 GB of heap), and Ray's memory monitor kills the head-node actor right at map completion:

```
ray.exceptions.OutOfMemoryError: ... task name=flotilla-plan-runner ... actual memory used=26.08GB
```

## Fix

The ref matrix is fully structured, with `partition_ref_id = (input_id << 32) | partition_idx` and one ref per partition per map input, so it is recoverable from just the map input ids per server. The coordinator now folds the map-output stream into that single map, shared via `Arc` by all reduce tasks (each carrying only its `partition_idx`), and readers reconstruct their exact ref ids at fetch time.

Coordinator memory: **O(map_tasks × partitions) → O(map_tasks + partitions)**.

The reconstructed requests are byte-identical to what the full matrix would have produced, so the flight server is untouched and fault tolerance is unchanged: retried map tasks' stale registrations are never addressed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
main
2 hours ago
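The id layout quoted in the fix, `partition_ref_id = (input_id << 32) | partition_idx`, can be illustrated with a short Python sketch. Everything here other than that formula is an assumption made for illustration: the helper names are hypothetical, the map input ids are kept in one flat list rather than grouped per server, and this is not the coordinator's actual Rust code.

```python
from typing import List, Tuple

PARTITION_BITS = 32                          # low 32 bits hold partition_idx
PARTITION_MASK = (1 << PARTITION_BITS) - 1


def pack_ref_id(input_id: int, partition_idx: int) -> int:
    """Combine a map input id and a partition index into one ref id."""
    if not 0 <= partition_idx <= PARTITION_MASK:
        raise ValueError("partition_idx must fit in 32 bits")
    return (input_id << PARTITION_BITS) | partition_idx


def unpack_ref_id(ref_id: int) -> Tuple[int, int]:
    """Recover (input_id, partition_idx) from a packed ref id."""
    return ref_id >> PARTITION_BITS, ref_id & PARTITION_MASK


def refs_for_partition(map_input_ids: List[int], partition_idx: int) -> List[int]:
    """Rebuild one reduce task's ref ids from the shared list of map input ids.

    Only the input ids (O(map_tasks)) need to be stored; each reducer derives
    its own column of the matrix on demand.
    """
    return [pack_ref_id(i, partition_idx) for i in map_input_ids]


def full_matrix_column(map_input_ids: List[int], num_partitions: int,
                       partition_idx: int) -> List[int]:
    """Materialize the full (map task x partition) matrix, then take one column.

    Shown only to check that reconstruction yields the same ids.
    """
    matrix = [[pack_ref_id(i, p) for p in range(num_partitions)]
              for i in map_input_ids]
    return [row[partition_idx] for row in matrix]
```

For a small case such as `map_input_ids = [0, 1, 2]` and 4 partitions, `refs_for_partition(ids, 3)` and `full_matrix_column(ids, 4, 3)` return the same list, while the first never builds the 3 × 4 matrix.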
refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (#7053)

Renames `needs_hash_repartition` to `can_skip_hash_repartition` to fix misleading semantics: the function returns `true` when the shuffle can be skipped, not when it's needed.
main
2 hours ago
docs: fix remaining Slack links missed in first pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
everettVT/slack-link-audit
3 hours ago
Latest Branches
CodSpeed Performance Gauge: 0%
feat(functions): add string distance/similarity functions
#7068
42 minutes ago
2d4b9b4
nish2292:feat/string-distance-functions
CodSpeed Performance Gauge: 0%
feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join
#7067
2 hours ago
60da44f
euan/asof-join-aligned-api-surface
CodSpeed Performance Gauge: 0%
fix(shuffle): fuse repartition flush buffers before partitioning to bound map memory
#7064
3 hours ago
a5b09dc
colin/flight-shuffle-map-side-memory
© 2026 CodSpeed Technology