For example, the mere act of stopping certain kinds of loops requires that a computer match a pattern between the existing state of a variable and the desired state. Likewise, user input requires that the application match the user’s input to a set of acceptable inputs.
Using pattern matching in data analysis
Developers recognize that function declarations also form a kind of pattern and that to call the function successfully, the caller must match the pattern. Sending the wrong number or types of variables as part of the function call causes the call to fail. Data structures also form a kind of pattern because the data must appear in a certain order and be of a specific type.
Where you choose to set the beginning for pattern matching depends on how you interpret the act. Certainly, pattern matching isn’t the same as counting, as in a for loop in an application. However, someone could argue that testing for a condition in a while loop matches the definition of pattern matching to some extent.
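As a minimal sketch of the idea that a function declaration is a pattern the caller must match, the following example uses a hypothetical area() function (it is not part of the original discussion). Python checks the number of arguments at call time, though it does not check their types unless the function body relies on them:

```python
def area(width, height):
    """Hypothetical function whose declaration expects exactly two arguments."""
    return width * height

# The call matches the declared pattern, so it succeeds.
ok_result = area(3, 4)

# The call supplies too few arguments, so it fails with a TypeError.
try:
    area(3)
    call_failed = False
except TypeError as err:
    call_failed = True
    print("Call did not match the declaration:", err)

print(ok_result)
```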
The act of searching is just one aspect, however, of a broader application of pattern matching in analysis. The act of filtering data also requires pattern matching. A search is a singular approach to pattern matching in that the search succeeds the moment that the application locates a match.
Filtering is a batch process that accepts all the matches in a document and discards anything that doesn’t match, enabling you to see all the matches without doing anything else. Filtering can also vary from searching in that searching generally employs static conditions, while filtering can employ some level of dynamic condition, such as locating the members of a set or finding a value within a given range.
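The difference between searching and filtering can be sketched in a few lines of Python. The sample lines and the number_in_range() helper are assumptions made up for illustration: the search stops at the first match, while the filters keep every match, including one that uses a range condition.

```python
import re

# Hypothetical sample data for illustration.
lines = [
    "order 17 shipped",
    "order 42 pending",
    "invoice 99 paid",
    "order 58 shipped",
]

# Searching: succeed at the first line that contains "shipped" and stop.
first_hit = next((line for line in lines if re.search(r"shipped", line)), None)

# Filtering: keep every line that matches and discard the rest.
shipped = [line for line in lines if re.search(r"shipped", line)]

def number_in_range(line, low, high):
    """Return True when the first number in the line lies between low and high."""
    m = re.search(r"\d+", line)
    return m is not None and low <= int(m.group()) <= high

# Filtering with a range condition rather than a fixed value.
in_range = [line for line in lines if number_in_range(line, 40, 60)]

print(first_hit)
print(shipped)
print(in_range)
```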
Filtering is the basis for many of the analysis features in declarative languages, such as SQL, when you want to locate all the instances of a particular data structure (a record) in a large data store (the database). Filtering in SQL is much more dynamic than simple filtering because you can apply conditional sets and limited algorithms to the process of locating particular data elements.
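To give a rough sense of SQL filtering with a set and a range, here is a small sketch that uses Python’s built-in sqlite3 module with an in-memory database. The orders table, its columns, and its rows are invented for this example:

```python
import sqlite3

# An in-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "shipped", 20.0),
        (2, "pending", 75.5),
        (3, "shipped", 120.0),
        (4, "cancelled", 60.0),
    ],
)

# The WHERE clause filters on set membership (IN) and on a range (BETWEEN).
rows = conn.execute(
    "SELECT id, status, total FROM orders "
    "WHERE status IN ('shipped', 'pending') AND total BETWEEN 50 AND 150 "
    "ORDER BY id"
).fetchall()
conn.close()

print(rows)
```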
Regular expressions, although not the most advanced of modern pattern-matching techniques, offer a good view of how pattern matching works in modern applications. You can check for ranges and conditional situations, and you can even apply a certain level of dynamic control. Even so, the current master of pattern matching is the algorithm, which can be fully dynamic and incredibly responsive to particular conditions.
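The following sketch shows what ranges, conditions, and a degree of dynamic control can look like in a regular expression. The sample text and the wanted list are assumptions for illustration only:

```python
import re

# Hypothetical sample text.
text = "Rooms 7, 12, and 305 are open; room 48 is closed."

# A character range with a repetition count: numbers of exactly two digits.
two_digit = re.findall(r"\b[0-9]{2}\b", text)

# Alternation acts as a simple either/or condition.
states = re.findall(r"open|closed", text)

# A pattern built at run time from a list of wanted values.
wanted = ["7", "305"]
dynamic = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in wanted) + r")\b")
found = dynamic.findall(text)

print(two_digit)
print(states)
print(found)
```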
Working with pattern matching
Pattern matching in Python closely matches the functionality found in many other languages. Python provides robust pattern-matching capabilities using the regular expression (re) library. Here’s a good overview of the Python capabilities. The sections below detail Python functionality using a number of examples.
Performing simple Python matches
All the functionality you need for employing Python in basic RegEx tasks appears in the re library. The following code shows how to use this library:
    import re

    vowels = "[aeiou]"
    print(re.search(vowels, "This is a test sentence.").group())
The search() function locates only the first match, so you see the letter i as output because it’s the first vowel that appears in the sentence. You need the group() function call to output an actual value because search() returns a match object.
When you look at the Python documentation, you find quite a few functions devoted to working with regular expressions, some of them not entirely clear in their purpose. For example, you have a choice between performing a search or a match. A match works only at the beginning of a string. Consequently, this code:
    print(re.match(vowels, "This is a test sentence."))
returns a value of None because none of the vowels appears at the beginning of the sentence. However, this code:
    print(re.match("a", "abcde").group())
returns a value of a because the letter a appears at the beginning of the test string.
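Because both search() and match() return None when nothing matches, calling group() directly on the result raises an AttributeError in that case. One way to guard against this is sketched below; the first_vowel() helper is a hypothetical name used only for this example:

```python
import re

vowels = "[aeiou]"

def first_vowel(text):
    """Return the first vowel in text, or None when there is no match."""
    m = re.search(vowels, text)
    # A match object is truthy; None signals that nothing matched.
    return m.group() if m else None

with_match = first_vowel("This is a test sentence.")
without_match = first_vowel("Rhythm")

print(with_match)
print(without_match)
```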
Neither search nor match will locate all occurrences of the pattern in the target string. To locate all the matches, you use findall or finditer instead. For example, this code:
    print(re.findall(vowels, "This is a test sentence."))
returns a list like this:
    ['i', 'i', 'a', 'e', 'e', 'e', 'e']
Because this is a list, you can manipulate it as you would any other list.
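As one illustration of treating the findall() result as an ordinary list, this sketch counts each vowel with collections.Counter and builds a sorted list of the distinct vowels. The choice of Counter is an assumption made for the example, not something the original text prescribes:

```python
import re
from collections import Counter

vowels = "[aeiou]"
found = re.findall(vowels, "This is a test sentence.")

# Count how often each vowel appears in the list of matches.
counts = Counter(found)

# Reduce the list to its distinct values, in alphabetical order.
unique_sorted = sorted(set(found))

print(counts)
print(unique_sorted)
```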
Match objects are useful in other ways. For example, you can create a more complete search by using the start() and end() functions, as shown in the following code:
    testSentence = "This is a test sentence."
    m = re.search(vowels, testSentence)
    while m:
        print(testSentence[m.start():m.end()])
        testSentence = testSentence[m.end():]
        m = re.search(vowels, testSentence)
This code keeps performing searches on the remainder of the sentence after each search until it no longer finds a match, as shown here:
    i
    i
    a
    e
    e
    e
    e
Using the finditer() function would be easier, but this code points out that Python does provide everything needed to create relatively complex pattern-matching code.
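For comparison, a finditer() version of the same idea might look like the sketch below. It yields one match object per occurrence, and the start() positions refer to the original sentence because the string is never sliced:

```python
import re

vowels = "[aeiou]"
testSentence = "This is a test sentence."

# Each match object carries both the matched text and its position.
positions = [(m.start(), m.group()) for m in re.finditer(vowels, testSentence)]

for start, vowel in positions:
    print(start, vowel)
```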
Doing more than pattern matching
Python’s regular expression library makes it quite easy to perform a wide variety of tasks that don’t strictly fall into the category of pattern matching. One of the most commonly used is splitting strings. For example, you might use the following code to split a test string using a number of whitespace characters:
    testString = "This is\ta test string.\nYippee!"
    # A raw string keeps the backslash in \s intact for the re library.
    whiteSpace = r"[\s]"
    print(re.split(whiteSpace, testString))
The escaped character, \s, stands for all space characters, which includes the set of [ \t\n\r\f\v]. The split() function can split any content using any of the accepted regular expression characters, so it’s an extremely powerful data manipulation function. The output from this example looks like this:
    ['This', 'is', 'a', 'test', 'string.', 'Yippee!']
Performing substitutions using the sub() function is another forte of Python. Rather than perform common substitutions one at a time, you can perform them all simultaneously, as long as the replacement value is the same in all cases. Consider the following code:
    testString = "Stan says hello to Margot from Estoria."
    pattern = "Stan|hello|Margot|Estoria"
    replace = "Unknown"
    print(re.sub(pattern, replace, testString))
The output of this example is
    Unknown says Unknown to Unknown from Unknown.
You can create a pattern of any complexity and use a single replacement value to represent each match. This is handy when performing certain kinds of data manipulation for tasks such as dataset cleanup prior to analysis.
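As a rough sketch of the dataset cleanup idea, the following example applies sub() to a list of messy values. The sample values, the placeholder spellings treated as missing, and the clean() helper are all assumptions made for illustration:

```python
import re

# Hypothetical raw values with stray whitespace and missing-data placeholders.
raw = ["  Stan   Smith ", "MARGOT\tjones", "n/a", "Estoria  --  HQ"]

def clean(value):
    """Collapse whitespace runs, then map placeholder values to 'Unknown'."""
    value = re.sub(r"\s+", " ", value).strip()
    # One pattern covers several placeholder spellings with a single replacement.
    return re.sub(r"^(?:n/a|none|-)$", "Unknown", value, flags=re.IGNORECASE)

cleaned = [clean(v) for v in raw]
print(cleaned)
```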