Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize feature(binary_heap_into_iter_sorted) #76234

Closed

Conversation

scottmcm
Copy link
Member

@scottmcm scottmcm commented Sep 2, 2020

This came up on an itertools issue, and looks like it's baked sufficiently, so I figured I'd ask about stabilization.

This would stabilize the following API:
https://doc.rust-lang.org/nightly/std/collections/struct.BinaryHeap.html#method.into_iter_sorted

impl<T: Ord> BinaryHeap {
    fn into_iter_sorted(self) -> IntoIterSorted<T>;
}
impl<T: Ord> Iterator for IntoIterSorted<T> { type Item = T; }
impl<T: Ord> ExactSizeIterator for IntoIterSorted<T> ;
impl<T: Ord> FusedIterator for IntoIterSorted<T> ;
impl<T: Clone> Clone for IntoIterSorted<T>;
impl<T: Debug> Debug for IntoIterSorted<T>;

There were comments about wishing that the ordinary .into_iter() could just be the sorted one, but .iter() cannot be in that order so that's not necessarily good. And anyways, IntoIter already implements DoubleEndedIterator so we can't really change the order even if we wanted to.

So I think the remaining potential questions are naming (it was originally proposed as into_iter_ordered but went in as into_iter_sorted, which I agree is more consistent with elsewhere in the library) and the standard whether this is useful enough (one could certainly use from_fn on the heap's pop, but that's potentially suboptimal as it doesn't have a size_hint).

This needs an FCP, so picking a libs member explicitly:
r? @KodrAus

Tracking issue #59278 (which also covers a drain version which I've intentionally not included here)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 2, 2020
@matklad
Copy link
Member

matklad commented Sep 2, 2020

should we stabilize drain_sorted in the same step? the two methods feel pretty similar.

@jonas-schievink jonas-schievink added needs-fcp This change is insta-stable, so needs a completed FCP to proceed. relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Sep 2, 2020
@jonas-schievink jonas-schievink added this to the 1.48 milestone Sep 2, 2020
@scottmcm
Copy link
Member Author

scottmcm commented Sep 2, 2020

@matklad There were concerns about that version in the tracking issue and the use I ran into didn't need it, so I didn't include it here. We certainly could, if people wanted, though.

@scottmcm
Copy link
Member Author

scottmcm commented Sep 2, 2020

Hmm, I just saw #76250 -- if a min-max heap is strictly better than a binary heap then another option would be to implement that, which could efficiently be a DoubleEndedIterator...

@KodrAus
Copy link
Contributor

KodrAus commented Sep 3, 2020

Revisiting this after the original PR, the different ordering of into_iter_sorted and into_sorted_vec is a little unfortunate.

One way to resolve this could be to consider deprecating into_sorted_vec over into_iter_sorted().collect::<Vec<_>>(), but I don't think that's what we'd want to do. It looks like the reason to prefer into_sorted_vec (and thus keep it around) over into_iter_sorted().collect::<Vec<_>>() is that into_sorted_vec is more efficient, and specialization of into_iter_sorted().collect::<Vec<_>>() wouldn't help cases where you want to collect into something Vec-like, but isn't a Vec itself.

Another way could be to tweak this API a bit would be to slot a type in-between the iterators that lives in binary_heap:

pub struct Sorted<T>(BinaryHeap<T>);
pub struct SortedMut<'a, T>(&'a mut BinaryHeap<T>);

// current into_iter_sorted
impl<T> IntoIterator for Sorted<T> {
    type Item = T;
}

// equivalent to SortedMut<'a, T> as IntoIterator
impl<'a, T> IntoIterator for &'a mut Sorted<T> {
    type Item = T;
}

impl<'a, T> IntoIterator for SortedMut<'a, T> {
    type Item = T;
}

impl<'a, T> SortedMut<'a, BinaryHeap<T>> {
    // current drain_sorted
    pub fn drain(self) -> SortedMutDrain<'a, T>;
}

impl<T> BinaryHeap<T> {
    pub fn into_sorted(self) -> Sorted<T>;
    pub fn sorted_mut(&mut self) -> SortedMut<T>;
}

It would then be called like:

// instead of into_iter_sorted
for item in heap.into_sorted() {
    ..
}

// into_sorted_vec stays around
let v = heap.into_sorted_vec();

// instead of drain_sorted
for item in heap.sorted_mut().drain() {
    ..
}

// like drain_sorted, but doesn't clear the heap
let max_5 = heap.sorted_mut().into_iter().take(5).collect::<Vec<_>>();

That's extra machinery and indirection though we may not think is worthwhile. I do think it's worth exploring though, since into_iter_sorted is also the only into_iter-style named method we have that isn't just an implementation of IntoIterator, and this would avoid that.

@scottmcm
Copy link
Member Author

scottmcm commented Sep 3, 2020

Hmm, into_sorted_vec is a bit of a strange API -- it forces a heapsort, but even though the input is already in heap order, that's not the expensive part of the sort, so it's plausibly faster to just .into_vec() and call .sort_unstable() (since a heapsort is also unstable). To me, the only reason to use the heap to get sorted things is because you want to be lazy about it since you're going to stop early but at some unknown point. (If you have a known point then [T]::partition_at_index and sorting the subslice is faster.)

Your point about IntoIterator is a good one. It really makes me wish for a completely different heap API: one that doesn't have .iter() and .iter_mut() at all, just .into_iter() and .drain() (with people being expected to flip back to Vec if they really need arbitrary order). Especially if it were double-ended, so that both ascending and descending are available cheaply (one via .rev()) and to address #58174 directly.

@KodrAus
Copy link
Contributor

KodrAus commented Sep 3, 2020

Your point about IntoIterator is a good one. It really makes me wish for a completely different heap API: one that doesn't have .iter() and .iter_mut() at all, just .into_iter() and .drain() (with people being expected to flip back to Vec if they really need arbitrary order).

Do you think the way we'd want to do this would be to introduce a new MinMaxHeap type? I think since our hands seems a bit tied with BinaryHeap maybe we could just rename the into_iter_sorted method to into_sorted_iter and stabilize that along with drain_sorted, since they're useful to have. The naming convention being:

{op}_{sorted?}_{container?}

which then covers:

drain()
drain_sorted()

into_vec()
into_sorted_vec()

into_iter()
into_sorted_iter()

What do you think?

Hmm, into_sorted_vec is a bit of a strange API -- it forces a heapsort, but even though the input is already in heap order, that's not the expensive part of the sort, so it's plausibly faster to just .into_vec() and call .sort_unstable() (since a heapsort is also unstable).

I was curious about this so thought I'd give it a try. A simple microbenchmark seems to support the idea that sort_unstable is better as the size of the heap increases. I ran these on my i9 9900K Windows machine.

With a heap size of 10 elements:

test already_sorted_heap_collect_sorted_vec     ... bench:          79 ns/iter (+/- 0)
test already_sorted_heap_into_sorted_vec        ... bench:          69 ns/iter (+/- 0)
test already_sorted_heap_into_vec_sort_unstable ... bench:          71 ns/iter (+/- 0)
test interleaved_heap_collect_sorted_vec        ... bench:          78 ns/iter (+/- 0)
test interleaved_heap_into_sorted_vec           ... bench:          70 ns/iter (+/- 0)
test interleaved_heap_into_vec_sort_unstable    ... bench:          70 ns/iter (+/- 0)

With a heap size of 100 elements:

test already_sorted_heap_collect_sorted_vec     ... bench:         563 ns/iter (+/- 3)
test already_sorted_heap_into_sorted_vec        ... bench:         462 ns/iter (+/- 1)
test already_sorted_heap_into_vec_sort_unstable ... bench:         139 ns/iter (+/- 4)
test interleaved_heap_collect_sorted_vec        ... bench:         565 ns/iter (+/- 5)
test interleaved_heap_into_sorted_vec           ... bench:         467 ns/iter (+/- 3)
test interleaved_heap_into_vec_sort_unstable    ... bench:         489 ns/iter (+/- 1)

With a heap size of 1000 elements:

test already_sorted_heap_collect_sorted_vec     ... bench:      21,534 ns/iter (+/- 235)
test already_sorted_heap_into_sorted_vec        ... bench:      21,182 ns/iter (+/- 186)
test already_sorted_heap_into_vec_sort_unstable ... bench:         751 ns/iter (+/- 11)
test interleaved_heap_collect_sorted_vec        ... bench:      22,008 ns/iter (+/- 219)
test interleaved_heap_into_sorted_vec           ... bench:      22,112 ns/iter (+/- 381)
test interleaved_heap_into_vec_sort_unstable    ... bench:       6,044 ns/iter (+/- 32)
#![feature(test, binary_heap_into_iter_sorted)]
extern crate test;

use std::collections::BinaryHeap;

// tweak this to change the size of the heap
const N: usize = 1000;

fn interleaved() -> BinaryHeap<usize> {
    let mut heap = BinaryHeap::new();

    for (a, b) in (0..N / 2).zip((N / 2..N).into_iter().rev()) {
        heap.push(a);
        heap.push(b);
    }

    heap
}

fn already_sorted() -> BinaryHeap<usize> {
    let mut heap = BinaryHeap::new();

    for i in (0..N).into_iter().rev() {
        heap.push(i);
    }

    heap
}

#[bench]
fn interleaved_heap_into_sorted_vec(b: &mut test::Bencher) {
    let heap = interleaved();

    b.iter(|| {
        let heap = heap.clone();
        heap.into_sorted_vec()
    });
}

#[bench]
fn interleaved_heap_collect_sorted_vec(b: &mut test::Bencher) {
    let heap = interleaved();
    let mut sorted = Vec::new();

    sorted.extend(heap.clone().into_iter_sorted());

    b.iter(|| {
        let heap = heap.clone();
        sorted.clear();

        // Unlike the `into_vec` methods, this creates a new `Vec` each time
        // We amortize the cost a little be re-using it
        sorted.extend(heap.into_iter_sorted());

        test::black_box(&sorted);
    });
}

#[bench]
fn interleaved_heap_into_vec_sort_unstable(b: &mut test::Bencher) {
    let heap = interleaved();

    b.iter(|| {
        let mut heap = heap.clone().into_vec();
        heap.sort_unstable();
        heap
    });
}

#[bench]
fn already_sorted_heap_into_sorted_vec(b: &mut test::Bencher) {
    let heap = already_sorted();

    b.iter(|| {
        let heap = heap.clone();
        heap.into_sorted_vec()
    });
}

#[bench]
fn already_sorted_heap_collect_sorted_vec(b: &mut test::Bencher) {
    let heap = already_sorted();
    let mut sorted = Vec::new();

    sorted.extend(heap.clone().into_iter_sorted());

    b.iter(|| {
        let heap = heap.clone();
        sorted.clear();

        // Unlike the `into_vec` methods, this creates a new `Vec` each time
        // We amortize the cost a little be re-using it
        sorted.extend(heap.into_iter_sorted());

        test::black_box(&sorted);
    });
}

#[bench]
fn already_sorted_heap_into_vec_sort_unstable(b: &mut test::Bencher) {
    let heap = already_sorted();

    b.iter(|| {
        let mut heap = heap.clone().into_vec();
        heap.sort_unstable();
        heap
    });
}

Should we update the implementation of into_sorted_vec to just do into_vec + sort_unstable? 🙂

@matklad
Copy link
Member

matklad commented Sep 3, 2020

so it's plausibly faster to just .into_vec() and call .sort_unstable()

Oh wow, that means that max heap really doesn’t make sense.

the different ordering of into_iter_sorted and into_sorted_vec is a little unfortunate.

As well as just is the fact that iter_sorted returns reversely-sorted Items. I would totally expect anything _sorted to be ascending. Should we stick a rev into method‘s name?

@bors
Copy link
Contributor

bors commented Sep 3, 2020

☔ The latest upstream changes (presumably #70793) made this pull request unmergeable. Please resolve the merge conflicts.

@KodrAus
Copy link
Contributor

KodrAus commented Sep 11, 2020

As well as just is the fact that iter_sorted returns reversely-sorted Items.

That tripped me up too. Since it's role is to call pop until you're done, maybe we should be thinking of it as iter_priority or something?

@sekineh
Copy link
Contributor

sekineh commented Sep 29, 2020

For record, the reason I picked the method name into_iter_sorted() was:

Discoverability in IDE use case.

  1. A user type heap.into_iter in IDE editor.
  2. IDE will display method candidates:
.into_iter()
.into_iter_sorted()
  1. User can discover the _sorted() version.

@sekineh
Copy link
Contributor

sekineh commented Sep 30, 2020

As well as just is the fact that iter_sorted returns reversely-sorted Items.

That tripped me up too. Since it's role is to call pop until you're done, maybe we should be thinking of it as iter_priority or something?

In rust iterator design, .iter() produces &Ts and .into_iter() produces Ts.
If .iter_priority() or somthing iter_*() produces Ts, it's confusing in my veiw.

My opinion is that the method is a variant of into_iter(), i.e. produces Ts, so it's natural to call it into_iter_<suffixhere>()

@sekineh
Copy link
Contributor

sekineh commented Sep 30, 2020

pub struct Sorted<T>(BinaryHeap<T>);
pub struct SortedMut<'a, T>(&'a mut BinaryHeap<T>);

It's an interesting API redesign, but cannot be done in a backward compatible way.

Sorted adapter newtype / sorted() method feels redundant because BinaryHeap is inherently sorted data structure from the begining.
I'd do in the opposite way, if redesign is permitted:

pub struct Unsorted<T>(BinaryHeap<T>);
pub struct UnsortedMut<'a, T>(&'a mut BinaryHeap<T>);

In this new world, we'd use iterator like below (LHS).

  • heap.unsorted().iter() == current heap.iter()
  • heap.unsorted().into_iter() == current heap.into_iter()

But actualy I won't implement unsorted() beacuse it's nearly same as into_vec() which already exists.

  • heap.into_vec().iter() == heap.iter()
  • heap.into_vec().into_iter() == heap.into_iter()

@sekineh
Copy link
Contributor

sekineh commented Sep 30, 2020

{op}_{sorted?}_{container?}

For the order of name, I feel like into_sorted_vec() should be named as into_vec_sorted(). (But can't change now!)
It's common to define set of similar methods with different suffixes:

{basename}_{suffix?}

For example, min-max-heap crate provide the following set of methods [1]:

  • into_vec(): Returns a vector in arbitrary order
  • into_vec_asc(): Returns an ascending (sorted) vector.
  • into_vec_desc(): Returns an descending (sorted) vector.

They are highly discoverable. I'm not endorsing asc/desc suffixes for this feature. They are using the suffixes because it's double ended.

@KodrAus
Copy link
Contributor

KodrAus commented Oct 5, 2020

In rust iterator design, .iter() produces &Ts and .into_iter() produces Ts.

I actually think of it slightly differently. .iter() takes &self (which might mean it needs to produce &Ts) and .into_iter() consumes self (so will probably produce Ts). I was a bit inaccurate with my comment though and it should've been into_priority_iter.

I can appreciate the discoverability use-case preferring suffixes, and do think into_iter_{sorting} is actually better than into_{sorting}_iter. But I think aligning with the existing and stable into_sorted_vec is more important here.

Since I think into_iter_sorted().collect() is actually into_sorted_vec().rev(), I think my best suggestion right now is into_priority_iter and drain_priority.

@sekineh
Copy link
Contributor

sekineh commented Oct 5, 2020

I agree that into_priority_iter and drain_priority are consistent with existing method name.
The word priority will signal different (reversed) ordering than into_sorted_vec().

Since I think into_iter_sorted().collect() is actually into_sorted_vec().rev(), I think my best suggestion right now is into_priority_iter and drain_priority.

@scottmcm
Copy link
Member Author

I'm going to withdraw this PR, since the conversation has convinced me that things aren't ready yet here.

My personal conclusions:

  • Given a time machine I'd remove direct iteration in heap order, and would make that go through Vec: From<Heap<_>> or some sort of .as_unordered_slice().
  • Assuming it can be implemented efficiently, .into_iter() and something-like-.drain() on a heap should both be double-ended in sorted order. (With probably no .iter() nor .iter_mut().)
  • As a bonus, that would avoid issues like BinaryHeap example should show how to make a min-heap. #58174 as such a heap would natively have .pop_min() and .pop_max()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-fcp This change is insta-stable, so needs a completed FCP to proceed. relnotes Marks issues that should be documented in the release notes of the next release. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants