Removing Items From a Set — remove(), pop(), and difference

ODSC - Open Data Science
Oct 22, 2020

Python has a rich collection of built-in data structures. These data structures are sometimes called “containers” or “collections” because they contain a collection of individual items. These structures cover a wide variety of common programming situations. In this recipe, we’ll look at how we can update a set by removing or replacing items.

This article is an excerpt from the book Modern Python Cookbook, Second Edition by Steven F. Lott. This book features 133 recipes using Python 3.8. The recipes will benefit everyone, from beginners just starting out with Python to experts. You’ll not only learn Python programming concepts but also how to build complex applications.

Python gives us several ways to remove items from a set collection. We can use the remove() method to remove a specific item. We can use the pop() method to remove (and return) an arbitrary item.

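Additionally, we can compute a new set using the set intersection, difference, and symmetric difference operators: &, -, and ^. Intersection and difference produce a new set that is a subset of the left-hand input set; symmetric difference produces a new set containing the items that are in exactly one of the two input sets.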
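As a quick illustration of the three operators, here is a minimal sketch. The two sets of integers are made up for the example and are not taken from the log discussed in this recipe.

```python
# Two small example sets (illustrative values only).
a = {1, 2, 3, 4}
b = {3, 4, 5}

common = a & b        # intersection: items in both sets
only_in_a = a - b     # difference: items in a that are not in b
in_one_only = a ^ b   # symmetric difference: items in exactly one set

# sorted() is used only to make the printed order predictable.
print(sorted(common))       # [3, 4]
print(sorted(only_in_a))    # [1, 2]
print(sorted(in_one_only))  # [1, 2, 5]

# Each operator built a new set; the inputs are unchanged.
print(sorted(a), sorted(b))  # [1, 2, 3, 4] [3, 4, 5]
```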

Getting ready

Sometimes, we’ll have log files that contain lines with complex and varied formats. Here’s a small snippet from a long, complex log:

[2016-03-05T09:29:31-05:00] INFO: Processing ruby_block[print IP] action run (@recipe_files::/home/slott/ch4/deploy.rb line 9)
[2016-03-05T09:29:31-05:00] INFO: Installed IP: 111.222.111.222
[2016-03-05T09:29:31-05:00] INFO: ruby_block[print IP] called

- execute the ruby block print IP
[2016-03-05T09:29:31-05:00] INFO: Chef Run complete in 23.233811181 seconds

Running handlers:
[2016-03-05T09:29:31-05:00] INFO: Running report handlers
Running handlers complete
[2016-03-05T09:29:31-05:00] INFO: Report handlers complete
Chef Client finished, 2/2 resources updated in 29.233811181 seconds

We need to find all of the IP: 111.222.111.222 lines in this log.

Here’s how we can create a set of matches:

>>> import re
>>> pattern = re.compile(r"IP: \d+\.\d+\.\d+\.\d+")
>>> matches = set(pattern.findall(log))
>>> matches
{'IP: 111.222.111.222'}
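The session above assumes a variable named log that already holds the log text. A self-contained sketch might look like the following; the three-line log string is a shortened stand-in (with the "Installed IP" line repeated on purpose), not the real file.

```python
import re

# A shortened stand-in for the full log text (an assumption for this sketch).
log = """\
[2016-03-05T09:29:31-05:00] INFO: Installed IP: 111.222.111.222
[2016-03-05T09:29:31-05:00] INFO: ruby_block[print IP] called
[2016-03-05T09:29:31-05:00] INFO: Installed IP: 111.222.111.222
"""

pattern = re.compile(r"IP: \d+\.\d+\.\d+\.\d+")

all_hits = pattern.findall(log)  # a list; repeated matches are kept
matches = set(all_hits)          # a set; repeated matches collapse to one

print(len(all_hits))  # 2
print(matches)        # {'IP: 111.222.111.222'}
```

Building a set from the findall() result is what removes the duplicates: the list keeps every occurrence, while the set keeps each distinct string once.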

The problem we have is extraneous matches. The full log also contains lines that look similar but are only examples, such as text like IP: 1.2.3.4. It turns out there is a known set of irrelevant values like this that need to be ignored.

This is a place where set intersection and set subtraction can be very helpful.

How to do it…

1. Create a set of items we’d like to ignore:

>>> to_be_ignored = {'IP: 0.0.0.0', 'IP: 1.2.3.4'}

2. Collect all entries from the log. We’ll use the re module for this, as shown earlier. Assume we have data that includes good addresses, plus dummy and placeholder addresses from other parts of the log:

>>> matches = {'IP: 111.222.111.222', 'IP: 1.2.3.4'}

3. Remove items from the set of matches using a form of set subtraction. Here are two examples:

>>> matches - to_be_ignored 
{'IP: 111.222.111.222'}
>>> matches.difference(to_be_ignored)
{'IP: 111.222.111.222'}

Both the - operator and the difference() method return a new set as their result. Neither of these will mutate the underlying set objects.

It turns out the difference() method can work with any iterable collection, including lists and tuples. While permitted, mixing sets and lists can be confusing, and it can be challenging to write type hints for them.
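The difference between the method and the operator can be seen in a short sketch. The ignored_list name is hypothetical, introduced here only to hold the values to ignore in a list.

```python
matches = {'IP: 111.222.111.222', 'IP: 1.2.3.4'}
ignored_list = ['IP: 0.0.0.0', 'IP: 1.2.3.4']  # a list, not a set

# The method accepts any iterable, so a list works.
via_method = matches.difference(ignored_list)
print(via_method)  # {'IP: 111.222.111.222'}

# The - operator requires both operands to be sets.
operator_failed = False
try:
    matches - ignored_list
except TypeError as error:
    operator_failed = True
    print("TypeError:", error)

# Converting the list first keeps both sides the same type.
via_operator = matches - set(ignored_list)
print(via_operator)  # {'IP: 111.222.111.222'}
```

Converting to a set up front is one way to keep the types uniform, which also makes the type hints simpler to write.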

We’ll often use these in statements, like this:

>>> valid_matches = matches - to_be_ignored 
>>> valid_matches
{'IP: 111.222.111.222'}

This will assign the resulting set to a new variable, valid_matches, so that we can do the required processing on this new set.

We can also use the remove() and pop() methods to remove items: remove() removes a specific item, and pop() removes an arbitrary one. The remove() method raises a KeyError exception when the item is not in the set. We can use this behavior to both confirm that an item is in the set and remove it. In this example, the to_be_ignored set contains an item that is not present in the original matches set, so these methods aren’t as helpful.

How it works…

A set object tracks membership of items. An item is either in the set or not. We specify the item we want to remove. Removing an item doesn’t depend on an index position or a key value.

Because we have set operators, we can remove any of the items in one set from a target set. We don’t need to process the items individually.
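To make the point about not processing items individually concrete, here is a small sketch that compares an item-by-item filter with the set operator. The one_at_a_time and all_at_once names are hypothetical.

```python
matches = {'IP: 111.222.111.222', 'IP: 1.2.3.4'}
to_be_ignored = {'IP: 0.0.0.0', 'IP: 1.2.3.4'}

# Item by item: test each match for membership in the ignore set.
one_at_a_time = {item for item in matches if item not in to_be_ignored}

# All at once: the operator does the same membership work for us.
all_at_once = matches - to_be_ignored

print(one_at_a_time == all_at_once)  # True
print(all_at_once)                   # {'IP: 111.222.111.222'}
```

Note that 'IP: 0.0.0.0' is in to_be_ignored but not in matches. Set difference is based purely on membership, so the absent item is simply skipped.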

There’s more…

We have several other ways to remove items from a set:

  • In this example, we used the difference() method and the - operator. The difference() method behaves like an operator and creates a new set.
  • We can also use the difference_update() method. This will mutate a set in place. It does not return a value.
  • We can remove an individual item with the remove() method.
  • We can also remove an arbitrary item with the pop() method. This doesn’t apply to this example very well because we can’t control which item is popped.

Here’s how the difference_update() method looks:

>>> valid_matches = matches.copy()
>>> valid_matches.difference_update(to_be_ignored)
>>> valid_matches
{'IP: 111.222.111.222'}

We applied the difference_update() method to remove the undesirable items from the valid_matches set. Since the valid_matches set was mutated, no value is returned. Also, since the set is a copy, this operation doesn’t modify the original matches set.
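A short sketch can confirm both points: the method returns None, and the original set is left alone. It also shows the -= augmented assignment, which is the in-place operator counterpart of difference_update(); the result and also_valid names are hypothetical.

```python
matches = {'IP: 111.222.111.222', 'IP: 1.2.3.4'}
to_be_ignored = {'IP: 0.0.0.0', 'IP: 1.2.3.4'}

valid_matches = matches.copy()
result = valid_matches.difference_update(to_be_ignored)

print(result)           # None: the method mutates and returns nothing
print(valid_matches)    # {'IP: 111.222.111.222'}
print(len(matches))     # 2: the original set is unchanged

# The -= operator also mutates the left-hand set in place.
also_valid = matches.copy()
also_valid -= to_be_ignored
print(also_valid == valid_matches)  # True
```

Because difference_update() returns None, writing valid_matches = valid_matches.difference_update(to_be_ignored) would replace the set with None, which is an easy mistake to make.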

We could do something like this to use the remove() method. Note that remove() will raise an exception if an item is not present in the set:

>>> valid_matches = matches.copy()
>>> for item in to_be_ignored:
...     if item in valid_matches:
...         valid_matches.remove(item)
...
>>> valid_matches
{'IP: 111.222.111.222'}

We tested to see if the item was in the valid_matches set before attempting to remove it. Using an if statement is one way to avoid raising a KeyError exception. An alternative is to use a try: statement to silence the exception that’s raised when an item is not present:

>>> valid_matches = matches.copy()
>>> for item in to_be_ignored:
...     try:
...         valid_matches.remove(item)
...     except KeyError:
...         pass
...
>>> valid_matches
{'IP: 111.222.111.222'}

We can also use the pop() method to remove an arbitrary item. This method is unusual in that it both mutates the set and returns the item that was removed. Unlike remove(), it takes no argument, so we can’t choose which item is removed, which makes it less useful for this application.
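Where pop() does fit well is draining a set one item at a time, when the order doesn't matter. The following is a minimal sketch; the pending and seen names are hypothetical.

```python
pending = {'IP: 111.222.111.222', 'IP: 1.2.3.4'}
seen = []

# An empty set is falsy, so the loop stops once every item is popped.
while pending:
    item = pending.pop()  # removes and returns an arbitrary item
    seen.append(item)

print(sorted(seen))  # ['IP: 1.2.3.4', 'IP: 111.222.111.222']
print(pending)       # set()

# Like remove(), pop() raises KeyError when there is nothing to remove.
pop_failed = False
try:
    pending.pop()
except KeyError:
    pop_failed = True
print(pop_failed)    # True
```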

About the Author

Steven F. Lott has been programming since the ’70s, when computers were large, expensive, and rare. As a contract software developer and architect, he has worked on hundreds of projects, from very small to very large. He’s been using Python to solve business problems for almost 20 years.

He’s currently leveraging Python to implement cloud management tools. His other titles with Packt Publishing include Python Essentials, Mastering Object-Oriented Python, Functional Python Programming, and Python for Secret Agents.
