Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort: add Find #50340

Closed
robpike opened this issue Dec 24, 2021 · 51 comments
Closed

sort: add Find #50340

robpike opened this issue Dec 24, 2021 · 51 comments

Comments

@robpike
Copy link
Contributor

robpike commented Dec 24, 2021

I have never used sort.Search to build a sorted list (slice). It's too expensive and there are better ways. However, I use it often to do what its name suggests: search.

The API is designed for the growth case, which is fine and general, but the general (growth) case is rare, perhaps very rare, and handling the return value is clumsy†.

I suggest adding another function, say sort.Contains, that just says yes or no to the question of presence of the element. Here is sort.Search's use for the pure search case, right from the documentation:

i := sort.Search(len(data), func(i int) bool { return data[i] >= x })
        if i < len(data) && data[i] == x {
        	// x is present at data[i]
        } else {
        	// x is not present in data,
        	// but i is the index where it would be inserted.
        }

Besides the messy handling, that extra comparison has always bothered me.

Here is what that would look like with the proposed function:

 if sort.Contains(len(data), func(i int) bool { return data[i] >= x }) {
      	// x is present at data[i]
}

This seems so much more intuitive and semantically lightweight. If you needed the index - but why would you if you weren't going to insert there? - you could go back to the existing Search function.

It should be very simple to implement by having both Search and Contains call a helper that reports presence, and returning the appropriate result, or by just writing Contains itself - like Search, it's very short.

I already filled in the proposal questionnaire here: #45624

† And poorly documented, but that is fixable.

@gopherbot gopherbot added this to the Proposal milestone Dec 24, 2021
@robpike
Copy link
Contributor Author

robpike commented Dec 24, 2021

By the way, even in the event type parameters cause a rethink of this package, a lighter-weight Contains function would be nice to have.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Dec 24, 2021
@DeedleFake
Copy link

DeedleFake commented Dec 26, 2021

How would this work? I was under the impression that the reason for the secondary data[i] == x check is because the f function doesn't necessarily provide enough information to be sure if the value at i is actually the one being searched for or not. Since sort.Contains() doesn't actually return i, how would you be sure if the value was correct or not?

I'm also curious what the better way to build a sorted slice is that you mentioned. I have used sort.Search() to do so, many times. The only way I can think of that's even equivalent would be to build a heap and then pop the whole thing once you're done adding stuff to it, but just off the top of my head both seem to be O(n * log(n)) to me.

Edit: Actually, I think that the heap method would be better as it doesn't require as many copies, and potentially allocations, when adding elements thanks to not having to insert into the middle of a slice. It does use interfaces if using container/heap, though, which is a whole different type of overhead.

@robpike
Copy link
Contributor Author

robpike commented Dec 26, 2021

Whatever else it might do, it can do another test. It only has <= but two calls of that will tell it == (NaNs aside, but we're already on thin ice with NaNs. They're not even numbers.)

In case it's not clear from what I wrote above, that final test (or two) may be necessary, but I want the function to do it, not the caller.

@DeedleFake
Copy link

DeedleFake commented Dec 27, 2021

As proposed, there's no way for the function to determine actual equality. The function only takes the index to check against and, therefore, knows nothing about the value actually being checked. For example, given f is func(i int) { return data[i] >= 3 }, consider the the slices []int{1, 2, 3, 4} and []int{1, 2, 4, 5}. f(i) returns the exact same values for all indices in both cases, but the value is actually only present in one of the slices. Unless I'm somehow just completely missing something here, the API proposed does not seem capable of determining for sure whether or not the value being searched for is actually in the slice.

It only has <= but two calls of that will tell it ==

It doesn't have <=, though. It only has x <=. The x is built in to the function passed in. Without the ability change the order of the two things being compared, it can't be used to determine equality.

The only mention that the proposal seems to make for separately determining equality is

It should be very simple to implement by having both Search and Contains call a helper that reports presence, and returning the appropriate result, or by just writing Contains itself - like Search, it's very short.

but it's not clear what exactly that would look like. It sounds like you're saying that sort.Contains() needs a second function passed to it, maybe func(i int) bool, that it will call to determine the final equality. If so, that would work, but I don't see a whole lot of advantage over the existing sort.Search(), and having to pass two closures to sort.Contains() seems more cumbersome to me, not less.

This

present := sort.Contains(
  len(data),
  func(i int) bool { return data[i] >= x },
  func(i int) bool { return data[i] == x },
)
if present {
  // x is present in data
}

doesn't seem much better to me than this

i := sort.Search(len(data), func(i int) bool { return data[i] >= x })
if (i < len(data)) && (data[i] == x) {
  // x is present at data[i]
}

Edit: It occurs to me that this is fairly straightforwards with generics:

Playground Link

func Contains[T any, S ~[]T](s S, x T, less func(v1, v2 T) bool) bool {
	i := sort.Search(len(s), func(i int) bool { return less(x, s[i]) }) - 1
	return (i >= 0) && !less(s[i], x)
}

This is limited to slices, but that probably covers 99% of cases. A more complicated one that covers other cases is also possible, but probably not worth the trouble.

Edit 2: Fix an instance of return condition ? true : false. Sigh... That's what I get for writing code at 2:00 A.M.

@robpike
Copy link
Contributor Author

robpike commented Dec 27, 2021

I had forgotten the challenge of moving the equality check away from the caller. I hadn't thought it through - even though I remember talking to @griesemer about the challenge when he was designing it. Mea ignoramica.

But that affects only the mechanism, not my desire, which is to simplify the API. I like your solution, at least as a discussion point. I can't help feel there's a clean API trying to be found. C's bsearch isn't all that nice to use either, though.

@robpike
Copy link
Contributor Author

robpike commented Dec 27, 2021

There's even this nice (untested) possibility for the simplest and possibly commonest case, if we can choose a different name for it.

func Lookup[T constraints.Comparable, S ~[]T](s S, x T) bool {
	i := sort.Search(len(s), func(i int) bool { return x <= s[i] })
	return i < len(s) && x==s[i]
}

@mlevieux
Copy link
Contributor

mlevieux commented Jan 3, 2022

I'm probably about to say something completely wrong, but isn't the problem solved by asking for a function able to report for multiple states? Like here we're focusing on a closure like func (i int) bool, but we could totally go for func (i int) int that'd be a comparator function so -1 on <, 0 on == and 1 on >. So sort.Contains would take this function and return true if it could compare some value for equality, false otherwise. This makes writing the closure longer and probably a little-less readable. But it also has the advantage of letting the caller define the equality relations according the use case, which is not possible through the use of constraints.Comparable.

@rsc
Copy link
Contributor

rsc commented Jan 12, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Jan 12, 2022
@rsc
Copy link
Contributor

rsc commented Jan 19, 2022

It sounds like we agree that sort.Search is hard to use, but there's not an obvious replacement. In particular, anything that can report an exact position has to take a different function (a 3-way result like strings.Compare, or multiple functions).

For sake of argument, we could do:

func Find(n int, cmp func (i int) int) int

where the cmp function returns <0, 0, or >0 to say whether the item being sought is less than, equal to, or greater than element at position i.

(In comparison, sort.Search's f reports whether the item being sought is less than the element at position i; that is, it's Less(item, elem[i]), whereas Find takes Cmp(item, elem[i]).)

Is that something we should consider adding?

@rsc
Copy link
Contributor

rsc commented Jan 19, 2022

When we do resolve this discussion, perhaps we should revisit #47619 (comment).

@robpike
Copy link
Contributor Author

robpike commented Jan 20, 2022

@rsc I am in favor of adding a function with the signature you suggest, and happy with the name.

@rsc
Copy link
Contributor

rsc commented Jan 26, 2022

Sounds like people are generally in favor of adding sort.Find.
It looks like we'd need to add slices.BinaryFind and slices.BinaryFindFunc as well.
The difference between Search and Find in all cases would be that Find returns -1 when the element is not present,
whereas Search returns the index in the slice where it would be inserted.

Do I have that right? Does anyone object to adding Find (and slices.BinaryFind and slices.BinaryFindFunc)?

@rsc rsc changed the title proposal: sort: add a boolean-valued search function proposal: sort: add Find Jan 26, 2022
@rogpeppe
Copy link
Contributor

I am at least partially responsible for the design of the original API and I also take Rob's view that it's needlessly awkward in the common case (it's also annoying because it's backward from most of the rest of the sort package's API - you have to think in terms of greater-than rather than less-than, but that's another matter).

where the cmp function returns <0, 0, or >0 to say whether the item being sought is less than, equal to, or greater than element at position i.

I think I'd find it be more intuitive to say that cmp returned whether the element is <0, 0 or >0 than the item being sought ("how does element i compare to what I'm looking for?"), but that would probably be an unwarrantable difference in approach from sort.Search.

What would the signature of BinaryFind look like? I'm thinking something like:

func BinaryFind[T any](x []T, item T, func(T, T) int) int

for example:

BinaryFind([]string{"a", "b", "c"}, "b", strings.Compare)

Maybe a good reason to add a time.Time.Compare method:

BinaryFind([]time.Time{...}, t, time.Time.Compare)

@ianlancetaylor
Copy link
Contributor

I think the analogous slices signatures would be

func BinaryFind[E comparable](s []E, E) int
func BinaryFindFunc[E any](s []E, func(E) int) int

It's a reasonable question to ask whether BinaryFindFunc should instead be

func BinaryFindFunc[E any](s []E, v E, func(E, E) int) int

I'm not sure. I think the main argument for the latter would be that we can pass functions like bytes.Compare, as you suggest.

@rogpeppe
Copy link
Contributor

rogpeppe commented Jan 27, 2022

I think the analogous slices signatures would be

func BinaryFind[E comparable](s []E, E) int
func BinaryFindFunc[E any](s []E, func(E) int) int

It's a reasonable question to ask whether BinaryFindFunc should instead be

func BinaryFindFunc[E any](s []E, v E, func(E, E) int) int

I'm not sure. I think the main argument for the latter would be that we can pass functions like bytes.Compare, as you suggest.

The main reason from my point of view is that the item that's being searched for is rarely a constant, so the former always in practice requires the caller to create a closure that refers to that value, which is unwieldy and not really that much easier to use than sort.Find would be, I think.

Example calls:

	i = sort.Find(len(xs), func(i int) int {
		return strings.Compare(x, xs[i])
	})
	i = slices.FindFunc1(xs, func(y string) int {
		return strings.Compare(x, y)
	})
	i = slices.FindFunc2(xs, x, strings.Compare)

The latter is the clear winner, I think - it's shorter and easier to get right. With the closure versions,
you have to remember the order of the comparison - strings.Compare(y, x) would
be just as possible, but a function of the form func(T, T) int has a natural ordering.

I suspect it might end up more efficient too, because FindFunc is unlikely to be inlined, so
the closure wouldn't be inlined, so there would be an extra indirect function call on each comparison AFAICS.

@rsc
Copy link
Contributor

rsc commented Feb 2, 2022

The three-argument FindFunc (FindFunc2 above) would be different from the other Funcs we have, like IndexFunc. It doesn't make sense to me to diverge from those.

Should we leave BinaryFindFunc out until we are more sure it's needed?

@DeedleFake
Copy link

DeedleFake commented Feb 2, 2022

I feel like the FindFunc2() signature is just a workaround for the unwieldy nature of simple closures. Maybe something like that should be reconsidered in the context of #21498? A shorter syntax would make that entirely unnecessary, in my opinion.

@rogpeppe
Copy link
Contributor

rogpeppe commented Feb 3, 2022

The three-argument FindFunc (FindFunc2 above) would be different from the other Funcs we have, like IndexFunc. It doesn't make sense to me to diverge from those.

I'd suggest that FindFunc is different from IndexFunc in that IndexFunc is only rarely used to find a locally defined value.

A quick scan of a random corpus of a few million lines of code turned up 20 uses of IndexFunc, of which only one used a closure to define the value being searched for. All the rest were similar to strings.IndexFunc(s, unicode.IsSpace) - that is, the predicate function was independently defined only in terms of its argument.

One of the nice things about IndexFunc is that it can be used with many existing "out of the box" functions (unicode.IsSpace being just one example). With FindFunc1 above, that's not the case - you will always need to define your own closure, and there's always to possibility of getting the argument ordering wrong and the potential subtle bugs that could arise as a result:

i = slices.FindFunc1(xs, func(y string) int {
	return strings.Compare(y, x)   // WRONG!
})

On the other hand, with a two-argument form, many existing functions and methods can be used as they are:

slices.FindFunc2(byteSlices, byteSlice, bytes.Compare)
slices.FindFunc2(ipAddrs, ipAddr, netip.Addr.Compare)
slices.FindFunc2(semvers, version, semver.Compare)
slices.FindFunc2(bigInts, i, (*big.Int).Cmp)

I'd argue that passing a compare function is more natural than passing a single argument function, has less potential for subtle bugs, and makes better use of existing available functions and methods.

I don't believe this is about closures being unwieldy. Even if closures were more syntactically lightweight, you'd still need to remember to pass the arguments to the underlying comparison function (and there will usually be an underlying comparison function, I suspect) in the right order.

@rogpeppe
Copy link
Contributor

rogpeppe commented Feb 3, 2022

One other thing that's worth clarifying: currently sort.Search is guaranteed to find the smallest value such that f(i) is true, which means if there are multiple exact matches, it will always return the earliest one.

With sort.Find, we could potentially return immediately when the compare function returns 0. That's more efficient but a bit inconsistent with sort.Search. Personally I don't think that matters (if someone really needs the very first instance in a run, they can still use sort.Search), but opinions might well vary. WDYT?

@ianlancetaylor
Copy link
Contributor

I think sort.Find should be defined as returning any arbitrary index for which the comparison function returns zero.

@rsc
Copy link
Contributor

rsc commented Feb 9, 2022

Okay, so it sounds like we all agree on sort.Find.

slices.Anything is a bit harder. It sounds like maybe the signature should be different from sort.Find.
Also BinaryFind is not a real thing like binary search is.

Perhaps slices.BinarySearchFunc should change to be the FindFunc2 we've been talking about.
Specifically, right now we have:

slices.BinarySearch(slice []T, target T)
slices.BinarySearchFunc(slice []T, less func(value T) bool)

and maybe we should change it to:

slices.BinarySearch(slice []T, target T)
slices.BinarySearchFunc(slice []T, target T, cmp func(T, T) int)

That would line up well with what other languages do and it avoids the problems @rogpeppe points out.
And we'd be back down to having just one.

Thoughts?

@ianlancetaylor
Copy link
Contributor

CC @eliben

@eliben
Copy link
Member

eliben commented Feb 10, 2022

I agree with decoupling the interfaces the sort and slices packages. The former works on abstract indices, whereas the latter works on actual slices.

The changes to slices sound alright to me; to be clear, there's no change to BinarySearch, but there is change to the semantics of BinarySearchFunc, from the current semantics of taking ok func(Elem) bool where it finds and returns the smallest index i at which element ok returns true for element i. The new semantics will be taking cmp func(T, T) int and returning the smallest index i at which cmp returns 0?

I don't have a preference w.r.t. sort package, other than to express that having both Search and Find functions with subtly different semantics in it somewhat confusing.

@rogpeppe
Copy link
Contributor

The new semantics will be taking cmp func(T, T) int and returning the smallest index i at which cmp returns 0?

I believe as per this comment the new semantics will be returning the first (arbitrary) index i encountered for which cmp returns 0.

@rsc rsc added this to the Backlog milestone Mar 23, 2022
@eliben
Copy link
Member

eliben commented Mar 24, 2022

I'll take a stab at implementing the slices functions first.

@gopherbot
Copy link

Change https://go.dev/cl/395414 mentions this issue: slices: rework the APIs of BinarySearch*

gopherbot pushed a commit to golang/exp that referenced this issue Mar 28, 2022
For golang/go#50340

Change-Id: If115b2b66d463d5f3788d017924f8dd38867551c
Reviewed-on: https://go-review.googlesource.com/c/exp/+/395414
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Trust: Eli Bendersky‎ <eliben@golang.org>
@eliben
Copy link
Member

eliben commented Mar 28, 2022

Now that the new functions have landed in exp/slices, I can work on sort.Find as described in #50340 (comment)

Unless @robpike you have a prototype already that you want to send...

@gopherbot
Copy link

Change https://go.dev/cl/396514 mentions this issue: sort: add Find function

@Merovius
Copy link
Contributor

Merovius commented Apr 5, 2022

This came up on golang-nuts, where someone asked the predicate-based version to return. I also think it should.

It is easy, when you have a comparison function, to get to a predicate:

var b T
pred := func(a T) bool { return cmp(a, b) >= 0 }

But not the other way around, so a predicate is more general.

I used the predicate version of sort.Search to look for prefix-matches in a sorted slice of paths. i.e. I was not looking for an equality match of the needle, but also something of which the needle is a prefix.

When I saw this issue fly by, I assumed this would be about adding a simpler version of sort.Search to make the common use-case simpler, which seemed a good idea, so I didn't follow this discussion. I would've spoken up sooner, if I expected the more general version to disappear altogether.

@ianlancetaylor
Copy link
Contributor

I agree that a predicate is more general, but the slice still has to be sorted according to the predicate. And the comparison function seems easier to reason about. And given a predicate you can get the comparison by calling the predicate twice. And slices.BinarySearchFunc will return the index you want even if the element is not present.

So, maybe?

gopherbot pushed a commit that referenced this issue Apr 7, 2022
For #50340

Change-Id: I3b4d278affc8e7ec706db8c9777f7a8c8ce7441d
Reviewed-on: https://go-review.googlesource.com/c/go/+/396514
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
Auto-Submit: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@DeedleFake
Copy link

@rogpeppe

Maybe a good reason to add a time.Time.Compare method:

As long as the comparison function isn't required to only return -1, 0, or 1, time.Time.Sub is already a comparison function, except that it returns time.Duration, not int.

@rogpeppe
Copy link
Contributor

@DeedleFake Yes, true, although we do have a fairly strong convention for the way that comparison functions work already.
Still, we could potentially allow this with:

func BinarySearchFunc[T any, C constraints.Ordered](slice []T, target T, cmp func(T, T) C) (int, bool)

which would allow BinarySearchFunc([]time.Time{...}, t, time.Time.Sub)

Maybe worth considering?

@DeedleFake
Copy link

I was actually going to mention that in my comment, but decided against it because it seemed too silly, and it also doesn't work with the index-based search, anyways.

@rogpeppe
Copy link
Contributor

@DeedleFake

it also doesn't work with the index-based search, anyways.

How so?

@DeedleFake
Copy link

Because the comparison function in BinarySearch(length int, cmp func(int) int) wants an index into the data structure, so time.Time.Sub can't be used directly regardless. I completely forgot about the BinarySearchFunc variant.

@Merovius
Copy link
Contributor

If anything, we should make the return type of the comparison stronger, not weaker. That is, something like

type Ordering int

const (
    Lt Ordering = -1
    Eq Ordering = 0
    Gt Ordering = 1 
)

constraints.Ordered seems very wrong, in any case. It includes string types, which… okay. But also float64, which is very bad with all them NaNs.

@DeedleFake
Copy link

DeedleFake commented Apr 21, 2022

Oh, I didn't notice which constraint that was. Yeah, don't use constraints.Ordered. It also includes string and unsigned ints, neither which makes any sense. Use constraints.Signed.

@odeke-em
Copy link
Member

odeke-em commented May 4, 2022

@eliben thank you for CL https://go-review.googlesource.com/c/go/+/396514 which I believe implemented this feature. @eliben @rsc @ianlancetaylor shall we close this issue then? Just waiting for your confirmation before I close it. Thank you everyone for the discourse and hard work.

@rsc
Copy link
Contributor

rsc commented Aug 5, 2022

Shipped in Go 1.19.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests