Eventual-Inc/Daft - CodSpeed

Daft


Performance History

Latest Results

feat(functions): add string distance/similarity functions

- add levenshtein_distance, jaro_similarity, jaro_winkler_similarity, damerau_levenshtein_distance
- pure Rust implementations with no external dependencies, following hamming_distance_str pattern
- expose as top-level daft.functions API and Expression methods
- handle null inputs (return null) and null-typed columns (DataType::Null)
- include 24 pytest test cases covering correctness, edge cases, and null handling
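For readers unfamiliar with the first of these metrics, the following is a minimal pure-Python reference sketch of Levenshtein distance with the null handling described above (a null input yields a null result). It is an illustration only, not the Rust implementation from this commit; the function signature and the use of `None` to stand in for null are assumptions.

```python
from typing import Optional


def levenshtein_distance(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Illustrative edit distance; None stands in for a null input (assumption)."""
    # Null in, null out, mirroring the behavior described in the commit message.
    if a is None or b is None:
        return None
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]
```

The two-row formulation keeps memory proportional to the shorter string rather than to the product of both lengths.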

nish2292:feat/string-distance-functions

12 minutes ago

perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (#7059)

## Summary

- Bumps `tikv-jemallocator` from 0.6.x to 0.7.0 and `tikv-jemalloc-ctl` from 0.6.x to 0.7.0 (underlying jemalloc 5.3.0 → 5.3.1)
- jemalloc 5.3.0 has a bug where `muzzy_decay_ms` values other than `-1` cause severe performance degradation (up to 7x slower in Polars benchmarks)
- Daft sets `muzzy_decay_ms:1000` on Linux and `muzzy_decay_ms:0` on macOS; both hit this bug

## Context

Polars discovered and fixed this in [pola-rs/polars#27797](https://github.com/pola-rs/polars/pull/27797). The jemalloc 5.3.1 release fixes the muzzy decay codepath so non-`-1` values no longer tank performance.

## Test plan

- [ ] CI passes on Linux (where `muzzy_decay_ms:1000` was actively hitting the bug)
- [ ] Local macOS build succeeds (where `muzzy_decay_ms:0` was hitting the bug)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

main

40 minutes ago

feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (#7067)

Same change as https://github.com/Eventual-Inc/Daft/pull/7065, but targeting main.

main

1 hour ago

euan/asof-join-aligned-api-surface

2 hours ago

fix(shuffle): compact flight repartition accumulators to bound map-task memory

The flight repartition sink accumulates a map task's partitioned input as raw partition slices until finalize writes them in one shot. At high partition counts the per-slice allocator slack (alignment, growth headroom) dwarfs the data: ~1.7GB of measured input reached 40+GB resident at 8192 partitions on TPC-H SF1000 (512 x ~2GB files), OOM-killing workers ~60s into the map phase.

Fuse the accumulated output into one batch per partition on every flush, so resident memory tracks the measured data size instead of slice-count-dependent slack. The write path and on-disk layout are unchanged.

Validated on a 33-node cluster: 1TB/8192-partition shuffle goes from worker OOM to completing with exact row conservation (5,999,989,709 rows).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

colin/flight-shuffle-map-side-memory

2 hours ago

fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (#7056)

## Problem

The flight-shuffle coordinator retains one `FlightPartitionRef` per (map task × output partition) and transposes the full matrix before emitting any reduce task. At 10k map tasks × 8192 partitions that is ~82M refs (~26 GB of heap), and Ray's memory monitor kills the head-node actor right at map completion:

```
ray.exceptions.OutOfMemoryError: ... task name=flotilla-plan-runner ... actual memory used=26.08GB
```

## Fix

The ref matrix is fully structured (`partition_ref_id = (input_id << 32) | partition_idx`, one ref per partition per map input), so it is recoverable from just the map input ids per server. The coordinator now folds the map-output stream into that single map, shared via `Arc` by all reduce tasks (each carrying only its `partition_idx`), and readers reconstruct their exact ref ids at fetch time.

Coordinator memory: **O(map_tasks × partitions) → O(map_tasks + partitions)**.

The reconstructed requests are byte-identical to what the full matrix would have produced, so the flight server is untouched and fault tolerance is unchanged: retried map tasks' stale registrations are never addressed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
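The id scheme quoted above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the coordinator's Rust code: the helper names (`partition_ref_id`, `refs_for_partition`, `full_matrix`) and the server-to-input-ids dictionary are assumptions, and only the formula `(input_id << 32) | partition_idx` comes from the commit message.

```python
from typing import Dict, List

PARTITION_BITS = 32                      # low bits hold the partition index
PARTITION_MASK = (1 << PARTITION_BITS) - 1


def partition_ref_id(input_id: int, partition_idx: int) -> int:
    """Pack a map input id and a partition index into one ref id."""
    if not 0 <= partition_idx <= PARTITION_MASK:
        raise ValueError("partition_idx must fit in 32 bits")
    return (input_id << PARTITION_BITS) | partition_idx


def refs_for_partition(inputs_by_server: Dict[str, List[int]],
                       partition_idx: int) -> Dict[str, List[int]]:
    """Rebuild one reduce task's ref ids from the shared per-server input ids."""
    return {
        server: [partition_ref_id(i, partition_idx) for i in input_ids]
        for server, input_ids in inputs_by_server.items()
    }


def full_matrix(inputs_by_server: Dict[str, List[int]],
                num_partitions: int) -> List[Dict[str, List[int]]]:
    """The O(map_tasks x partitions) structure that no longer needs storing."""
    return [refs_for_partition(inputs_by_server, p) for p in range(num_partitions)]


# Hypothetical example: two servers, three map inputs, four output partitions.
shared = {"server-a": [0, 1], "server-b": [2]}
matrix = full_matrix(shared, num_partitions=4)
```

Each reduce task then needs only the shared map plus its own `partition_idx`; calling `refs_for_partition(shared, 3)` yields the same ids as row 3 of the full matrix.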

main

2 hours ago

refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (#7053)

Renames `needs_hash_repartition` to `can_skip_hash_repartition` to fix misleading semantics: the function returns `true` when the shuffle can be skipped, not when it is needed.

main

2 hours ago

docs: fix remaining Slack links missed in first pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

everettVT/slack-link-audit

3 hours ago

Latest Branches

0%

feat(functions): add string distance/similarity functions #7068

42 minutes ago

2d4b9b4

nish2292:feat/string-distance-functions

0%

feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join #7067

2 hours ago

60da44f

euan/asof-join-aligned-api-surface

0%

fix(shuffle): fuse repartition flush buffers before partitioning to bound map memory #7064

3 hours ago

a5b09dc

colin/flight-shuffle-map-side-memory

© 2026 CodSpeed Technology