Python/greedy_methods/greedy_set_cover.py at c5d0461b752f3ebde5de65cd17c0190395ee1ad3 · TheAlgorithms/Python · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
"""
Greedy approximation algorithm for the minimum set cover problem.

Author: Ben Chaddha (https://github.com/benchaddha)

Problem Definition:
    Given a universe U and a collection S of subsets of U such that the union
    of all subsets equals U, find the minimum number of subsets whose union
    covers U.

    This problem is NP-complete (Karp, 1972), making exact solutions
    computationally infeasible for large instances.

Algorithm:
    This implementation uses the standard greedy heuristic that iteratively
    selects the subset covering the most uncovered elements until all elements
    are covered.

Complexity:
    Time: O(|U| * |S|) where |S| is the number of subsets
    Space: O(|U| + |S|)

Approximation Guarantee:
    The greedy algorithm achieves an H_d-approximation where:
    - d = max_i |S_i| (maximum subset size)
    - H_d = sum(1/i for i in 1..d) (d-th harmonic number)
    - H_d ≤ ln(d) + 1

    For the general case, this gives a ln(|U|) + 1 approximation. Feige (1998)
    proved this is essentially optimal: no polynomial-time algorithm can achieve
    a (1 - ε) * ln(|U|) approximation for any ε > 0, unless P = NP.

Limitations:
    - This implementation assumes all subsets have equal cost (unweighted).
    - For the weighted set cover variant, subset selection should be based on
      the cost-effectiveness ratio (uncovered elements / cost).
    - The greedy approach may be far from optimal for specific instances,
      though it provides the theoretical guarantee stated above.

References:
- R. M. Karp, "Reducibility Among Combinatorial Problems", 1972.
- D. S. Johnson, "Approximation algorithms for combinatorial problems", 1974.
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein,
  "Introduction to Algorithms", Chapter 35.3, Set Cover.
- U. Feige, "A Threshold of ln n for Approximating Set Cover",
  Journal of the ACM, 45(4), 634-652, 1998.

Note:
    Element and subset identifier types must be hashable for set operations.

Example:
    >>> universe = {1, 2, 3, 4, 5}
    >>> subsets = {
    ...     "A": {1, 2},
    ...     "B": {2, 3, 4},
    ...     "C": {3, 4},
    ...     "D": {4, 5},
    ... }
    >>> cover = greedy_set_cover(universe, subsets)
    >>> # Verify the cover is valid
    >>> set().union(*(subsets[i] for i in cover)) == universe
    True
    >>> # The greedy algorithm selects B and D (or B and a subset covering 1, 5)
    >>> len(cover) <= 3
    True

    >>> # Optimal case: one subset covers everything
    >>> universe2 = {1, 2, 3}
    >>> subsets2 = {"A": {1, 2, 3}, "B": {1}, "C": {2}}
    >>> cover2 = greedy_set_cover(universe2, subsets2)
    >>> cover2 == {"A"}
    True

    >>> # Example showing greedy may not be optimal
    >>> # Optimal is 2 sets {B,C}, but greedy might pick A first
    >>> universe3 = {1, 2, 3, 4}
    >>> subsets3 = {"A": {1, 2}, "B": {1, 3, 4}, "C": {2, 3, 4}}
    >>> cover3 = greedy_set_cover(universe3, subsets3)
    >>> len(cover3) <= 3
    True
    >>> set().union(*(subsets3[i] for i in cover3)) == universe3
    True

    >>> # Error handling - empty universe
    >>> greedy_set_cover(set(), {"A": {1}})
    Traceback (most recent call last):
        ...
    ValueError: Universe must be non-empty.

    >>> # Error handling - empty subsets
    >>> greedy_set_cover({1, 2}, {})
    Traceback (most recent call last):
        ...
    ValueError: Subsets mapping must be non-empty.

    >>> # Error handling - subsets don't cover universe
    >>> greedy_set_cover({1, 2, 3}, {"A": {1}, "B": {2}})
    Traceback (most recent call last):
        ...
    ValueError: The provided subsets do not cover the universe.
"""

from __future__ import annotations

from collections.abc import Hashable, Iterable, Mapping


def greedy_set_cover(
    universe: Iterable[Hashable],
    subsets: Mapping[Hashable, Iterable[Hashable]],
) -> set[Hashable]:
    """
    Greedy approximation for minimum set cover.

    Args:
        universe: The set of elements to be covered.
        subsets: A mapping from subset identifiers to their elements.

    Returns:
        A set of subset identifiers that covers the universe.

    Raises:
        ValueError: If the universe is empty, subsets is empty, or if the
                   provided subsets cannot cover the universe.

    Time Complexity: O(|universe| * |subsets|)
    Space Complexity: O(|universe| + |subsets|)
    """

    # Normalize inputs to sets so we do not mutate user-provided structures.
    universe_set = set(universe)
    if not universe_set:
        raise ValueError("Universe must be non-empty.")

    if not subsets:
        raise ValueError("Subsets mapping must be non-empty.")

    normalized_subsets: dict[Hashable, set[Hashable]] = {
        key: set(s) for key, s in subsets.items()
    }

    # Quick feasibility check: if the union of all subsets does not cover U,
    # we can terminate early. This is preferable to silently returning
    # an incomplete "cover".
    union_of_subsets: set[Hashable] = set().union(*normalized_subsets.values())
    if not universe_set.issubset(union_of_subsets):
        raise ValueError("The provided subsets do not cover the universe.")

    uncovered: set[Hashable] = set(universe_set)
    chosen_subsets: set[Hashable] = set()

    # Standard greedy loop: at each step, select the subset that covers
    # the largest number of remaining uncovered elements.
    while uncovered:
        best_key: Hashable | None = None
        best_gain = 0

        for key, subset in normalized_subsets.items():
            if key in chosen_subsets:
                continue  # already selected
            # Intersection with uncovered elements gives the marginal gain.
            gain = len(uncovered & subset)
            if gain > best_gain:
                best_gain = gain
                best_key = key

        # If no subset yields a positive gain, but uncovered is non-empty,
        # the instance is effectively uncoverable (should not happen if the
        # feasibility check above passed, but we keep this for robustness).
        if best_key is None or best_gain == 0:
            raise ValueError("The provided subsets do not cover the universe.")

        # Commit to the chosen subset and mark its elements as covered.
        chosen_subsets.add(best_key)
        uncovered -= normalized_subsets[best_key]

    return chosen_subsets


if __name__ == "__main__":
    import doctest

    doctest.testmod()