Multisets

In mathematics, a set is a collection of distinct objects, while a multiset (or bag) is a generalization of the concept of a set, allowing multiple instances of a given object to exist in the same multiset.



Working with multisets is a little bit more trickier and more computationally challenging, but it's quite fun, as we'll see in a moment.

In programming, we can simulate a multiset using an array. The fun part comes when we have to implement some operations on this multisets, but we have to chose the right algorithms that will scale nicely and work efficiently even with multisets that contain more than a thousand elements.

Before implementing the algorithms, let's take a brief look at some definitions of multiset operations.

# Operations


1) Intersection (AND):

The intersection of two multisets, A and B, denoted by A ∩ B, is the multiset of all things that are members of both A and B.

[1, 2]        [2, 3]          = [2]
[1, 1, 1, 3]  [1, 1, 2]       = [1, 1]
[1, 1, 2, 4]  [1, 1, 1, 2, 3] = [1, 1, 2]
2) Symmetric difference (XOR):

The symmetric difference of two multisets, denoted by A △ B, is the multiset of elements which are in either of the multisets and not in their intersection.

[1, 2, 3]      [2]             = [1, 3]
[1, 2, 3]      [3, 4]          = [1, 2, 4]
[1, 1, 2]      [1, 4, 4]       = [1, 2, 4, 4]
[7, 8, 9, 10]  [9, 10, 11, 12] = [7, 8, 11, 12]
3) Union (OR):

The union of A and B, denoted by A ∪ B, is the multiset of all things that are members of either A or B.

[1, 2]  [1]          = [1, 2]
[1, 2]  [1, 1]       = [1, 1, 2]
[1, 1]  [1, 1, 1, 2] = [1, 1, 1, 2]

# Algorithms

1) Intersection of A and B:
sub intersect {    # aka "AND"
    my ($A, $B) = @_;

    my @x = sort { $a cmp $b } @{$A};
    my @y = sort { $a cmp $b } @{$B};

    my $i = 0;
    my $j = 0;

    my $endx = $#x;
    my $endy = $#y;

    my @new;
    while ($i <= $endx and $j <= $endy) {

        my $cmp = $x[$i] cmp $y[$j];

        if ($cmp < 0) {
            ++$i;
        }
        elsif ($cmp > 0) {
            ++$j;
        }
        else {
            push @new, $x[$i];
            ++$i;
            ++$j;
        }
    }

    return @new;
}

2) Symmetric difference of A and B:

sub symdiff {    # aka "XOR"
    my ($A, $B) = @_;

    my @x = sort { $a cmp $b } @{$A};
    my @y = sort { $a cmp $b } @{$B};

    my $endx = $#x;
    my $endy = $#y;

    my $i = 0;
    my $j = 0;

    my @new;
    while (1) {

        my $cmp = $x[$i] cmp $y[$j];

        if ($cmp < 0) {
            push @new, $x[$i];
            ++$i;
        }
        elsif ($cmp > 0) {
            push @new, $y[$j];
            ++$j;
        }
        else {
            ++$i;
            ++$j;
        }

        if ($i > $endx) {
            push @new, @y[$j .. $endy];
            last;
        }
        elsif ($j > $endy) {
            push @new, @x[$i .. $endx];
            last;
        }
    }

    return @new;
}
3) Union of A and B:

The union operation is equivalent with the concatenation between the symmetric difference and the intersection of A and B, therefore we can simply say:
sub union {    # aka "OR"
    my ($A, $B) = @_;

    my @x = intersect($A, $B);
    my @y = symdiff($A, $B);

    return (@x, @y);
}

# Conclusion

The beauty and elegance of this algorithms is astonishing, and beside their elegance, they also have some pretty good scalability properties, the intersection algorithm having a time complexity of O(|A|*log(|A|) + |B|*log(|B|) + |A| + |B|), where |X| is the number of elements in the multiset X, which is far better than O(|A| * |B|) of a naive implementation of the same operation.

As an exercise, consider implementing the "union" operation explicitly, using a similar algorithm to those for the "intersection" and the "symmetric difference" operations.

Comments