August 26th, 2024

Thoughts on finding the essential elements of a set

Suppose you have a set of n items, two of which are essential, and the rest are superfluous. You can pass any subset of these items to an oracle, and the oracle will tell you whether the set contains all of the essential elements. The objective is to identify those essential elements.

You might run into this problem if the elements are package dependencies, and you want to figure out which ones are actually necessary for your project to build, and which ones are just cargo cult.

If the problem had been formulated with just one essential element, then it would be a simply binary search: Divide the set into equal-sized subsets and ask the oracle which subset contains the essential element. Recurse on that subset, and you can find the essential element in O(log n) steps.

But what if there are two essential elements? You could try the same thing and divide in half, but if the oracle says “Neither half contains both of the essential elements,” then you’re in a bit of a pickle because you don’t know which pieces of the two halves need to be combined.

One option is to try to peel off the essential elements one at a time. For example, an inefficient algorithm would be to remove one element and ask the oracle of the remaining elements include all the essential elements. If it says yes, then you can recurse with the smaller set. if it says no, then you know that the element you removed is one of the essential elements, and you can now use the “find one essential element” algorithm on the rest. (Just remember to add the essential element you already found to each query you pass to the oracle.)

Now that we have an inefficient algorithm, we can try to make it more efficient: Instead of removing one element at a time, you can use a binary search to find the “highest-numbered essential element”: At the start, you know that the (zero-based) index of the highest-numbered essential element is somewhere between 1 and n − 1,¹ inclusive. At each step, find the midpoint between the low and high boundaries of the range and ask the oracle whether all the elements up to that midpoint element include all the essential elements. If so, then you can move the upper boundary of the range down to the midpoint; if not, then you can move the lower boundary of the range up to the midpoint. In this way, you can do a binary search on “the highest-numbered essential elements.”

And then once you’ve found one essential element, you can use a regular binary search to find the other one.

We can generalize this to the case where there are m essential elements: Start with a known range of (m − 1) to (n − 1), and use binary search to find the highest-numbered essential element. Once you’ve done that, you’ve reduced the problem to finding the m − 1 essential elements below the highest-numbered essential element, and so on, for a total complexity of O(m log n).

¹ You know that it cannot be zero, because there are two essential elements, and the earliest you can get both of them is if one of them is at index 0 and the other is at index 1.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

Discussion is closed. Login to edit/delete existing comments.

Newest
Newest
Popular
Oldest
  • Ian Yates

    Reminds me of a YouTube video I just watched last night. It's an efficient way to solve the flashlight & 4 good vs 4 bad batteries problem. Flashlight only works if it's given 2 good batteries simultaneously - so 1 good + 1 bad, or 2 bad, means no light. You need to find 2 good batteries - how many steps are required in worst case?
    Surprisingly it's 7 (including the final "test" where you...

    Read more
  • Jonas Barklund · Edited

    Sorry if I’m repeating myself, I cannot see my previous reply so trying one more time.
    If one thinks of this as a sorting problem where element A < element B if A is essential and B is not, then it seems fairly obvious that one can find the essential elements in an array of n elements in O(n log n) time.

  • alan robinson

    Depends on who's efficiency you care about, the programmer or the CPU. This if it's the programmer, a genetic algorithm (GA) inspired approach where you are evolving which part to discard would do the job nicely with minimal problem-specific code. All you do is randomly discard half the set and submit it to the oracle; if it passes restart with that as the new set to search. Otherwise, submit a new random permutation of...

    Read more
  • David Gershnik

    A different algorithm, though one unfortunately requiring exact knowledge of m (how many essential elements).
    Partition the set into m+1 parts. I'll number them starting at 1. (For m=2, thirds)
    Test the whole set except part 1 - If yes, eliminate part 1 and restart with your smaller set
    If no, test the whole set except part 2. Continue like this until one of your tests says yes. (One of them will by the pigeonhole...

    Read more
  • Stuart Ballard

    For the specific case of two essential elements, how about this approach:

    Split the set in two. Test each half. If one half passes, that set contains both the essential elements; throw out the rest and recurse.
    If neither half passes, you know you have one essential element in each half. That means you can use the algorithm for finding a single essential element on each half, by asking if (subset of A) u (whole of...

    Read more
  • Kevin Norris

    Harder generalization: What if we don't know how many essential elements there are? We just know that "some" elements are essential.

    Raymond's naive algorithm does work for this case (with slight modifications to ensure it does not terminate until it has gone through the whole set). I would tend to assume that you can adapt the fast algorithm in a similar way... but that would seem to have a running time of O(n log n), which...

    Read more
    • Raymond ChenMicrosoft employee Author

      Yeah, if you extend this algorithm to an unknown number of essential elements, you degrade to O(n log n) in the worst case that there are n/2 essential elements spread out at every other position.

      • Maks Verver · Edited

        There is a simple algorithm that doesn't go slower than O(n) in the case where the number of required elements is unknown.

        Instead of starting a binary search to find the first required element, try dropping the last 1, 2, 4, 8, 16 etc. unknown elements, until you get a negative response (meaning: you dropped a required element). Then you know that the last chunk contains a required element, and you can do a binary search...

        Read more
    • GL · Edited

      If the only a priori information is that some elements are essential, say an arbitrary non-empty subset is essential, then there are 2^n-1>2^(n-1) possible answers (if n>1), so a (binary) decision tree of depth at least n is necessary to fully determine the answer, so the naive algorithm is the best algorithm already.

Feedback