
How Pattern Matching Works in Data Science

Updated: 2020-02-17 15:12:14
Data Science Essentials For Dummies
Pattern matching is extremely useful in data science. But how exactly does pattern matching work? Patterns consist of a set of qualities, properties, or tendencies that form a characteristic or consistent arrangement: a repetitive model. Humans are good at seeing strong patterns everywhere and in everything. In fact, we purposely place patterns in everyday things, such as wallpaper or fabric.

However, computers are better than humans are at seeing weak or extremely complex patterns because computers have the memory capacity and processing speed to do so. The capability to see a pattern is pattern matching. Pattern matching is an essential component in the usefulness of computer systems and has been from the outset, so learning the functionality is hardly something radical or new.

Even so, understanding how computers find patterns is incredibly important in defining how this seemingly old technology plays such an important part in new applications such as AI, machine learning, deep learning, and data analysis of all sorts.

The most useful patterns are those that we can share with others. To share a pattern with someone else, you must create a language to define it — an expression. Here, you also discover regular expressions, a particular kind of pattern language, and their use in performing tasks such as data analysis.

The creation of a regular expression helps you describe to an application what sort of pattern it should find, and then the computer, with its faster processing power, can locate the precise data you need in a minimum amount of time. This basic information helps you understand more complex pattern matching of the sort that occurs within the realms of AI and advanced data analysis.

Of course, working with patterns using pattern matching through expressions of various sorts works a little differently in the functional programming paradigm.

How to find patterns in data

When you look at the world around you, you see patterns of all sorts. The same holds true for data that you work with, even if you aren’t fully aware of seeing the pattern at all. For example, telephone numbers and social security numbers follow one sort of pattern: a positional pattern.

A telephone number in the United States consists of an area code of three digits, an exchange of three digits (even though the exchange number is no longer held by a specific exchange), and an actual number within that exchange of four digits.

The positions of these three entities are important to the formation of the telephone number, so you often see a telephone number pattern expressed as (999) 999-9999 (or some variant), where the value 9 is representative of a digit. The other characters provide separation between the pattern elements to help humans see the pattern.
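The positional pattern above can be sketched in Python with a regular expression (a kind of pattern language that this article introduces later). This is a minimal sketch that assumes the exact (999) 999-9999 layout; the names PHONE_PATTERN and parse_phone are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern for the (999) 999-9999 layout: three digits in
# parentheses, a space, three digits, a hyphen, and four digits.
PHONE_PATTERN = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")


def parse_phone(text):
    """Return (area_code, exchange, number), or None if text does not fit."""
    match = PHONE_PATTERN.fullmatch(text)
    return match.groups() if match else None


print(parse_phone("(555) 867-5309"))  # ('555', '867', '5309')
print(parse_phone("555-867-5309"))    # None (a variant this sketch rejects)
```

Real-world telephone data usually arrives in several variants, so a production pattern would likely need to be more forgiving than this one.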

Other sorts of patterns exist in data, even if you don’t think of them as such. For example, the arrangement of letters from A to Z is a pattern. This statement may not seem like a revelation, but the use of this particular pattern occurs almost constantly in applications when the application presents data in ordered form to make it easier for humans to understand and interact with the data. Organizational patterns are essential to the proper functioning of applications today, yet humans take them for granted, for the most part.

Another sort of pattern is the progression. One of the easiest and most often applied patterns in this category is the exponential progression expressed as N^x, where a number N is raised to the power x. For example, the exponential progression for N = 2, with x starting at 0 and ending at 4, would be: 1, 2, 4, 8, and 16. The language used to express a pattern of this sort is the algorithm, and you often use programming language features, such as recursion, to express it in code.
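As an illustration of expressing this progression with recursion, here is a minimal Python sketch. The function names power and progression are hypothetical, and the sketch assumes non-negative integer exponents.

```python
def power(base, exponent):
    """Recursively raise base to a non-negative integer exponent."""
    if exponent == 0:
        return 1  # Base case: anything to the power 0 is 1.
    return base * power(base, exponent - 1)


def progression(base, start, end):
    """List base**x for every x from start through end, inclusive."""
    return [power(base, x) for x in range(start, end + 1)]


print(progression(2, 0, 4))  # [1, 2, 4, 8, 16]
```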

Some patterns are abstractions of real-world experiences. Consider color, for example. To express color in terms that a computer can understand requires the use of three or four numeric values (commonly in the range 0 to 255), where the first three are always some value of red, green, and blue. The fourth entry can be an alpha value, which expresses opacity, or a gamma value, which expresses a correction used to define a particular color with the display capabilities of a device in mind.

These abstract patterns help humans model the real world in the computer environment so that still other forms of pattern matching can occur (along with other tasks, such as image augmentation or color correction).

Transitional patterns help humans make sense of other data. For example, referencing all data to a known base value enables you to compare data from different sources, collected at different times and in different ways, using the same scale. Knowing how various entities collect the required data provides the means for determining which transition to apply to the data so that it can become useful as part of a data analysis.

Data can even have patterns when missing or damaged. The pattern of unusable data could signal a device malfunction, a lack of understanding of how the data collection process should occur, or even human behavioral tendencies. The point is that patterns occur in all sorts of places and in all sorts of ways, which is why having a computer recognize them can be important. Humans may see only part of the picture, but a properly trained computer can potentially see them all.

So many kinds of patterns exist that documenting them all fully would easily take an entire book. Just keep in mind that you can train computers to recognize and react to data patterns automatically in such a manner that the data becomes useful to humans in various endeavors.

The automation of data patterns is perhaps one of the most useful applications of computer technology today, yet very few people even know that the act is taking place. What they see instead is an organized list of product recommendations on their favorite site or a map containing instructions on how to get from one point to another — both of which require the recognition of various sorts of patterns and the transition of data to meet human needs.

What are regular expressions?

Regular expressions are special strings that describe a data pattern. The use of these special strings is so consistent across programming languages that knowing how to use regular expressions in one language makes it significantly easier to use them in all other languages that support regular expressions.

As with all reasonably flexible and feature-complete syntaxes, regular expressions can become quite complex, which is why you’ll likely spend more than a little time working out the precise manner by which to represent a particular pattern to use in pattern matching.

You use the term regular expressions to refer to the technique of performing pattern matching using specially formatted strings in applications. However, the actual code class used to perform the technique appears as Regex, regex, or even RegEx, depending on the language you use. Some languages use a different term entirely, but they’re in the minority. Consequently, when referring to the code class rather than the technique, use Regex (or one of its other capitalizations).

The following information constitutes a brief overview of regular expressions. You can find more details in Python's documentation. This source of additional help can become quite dense and hard to follow, though, so you might also want to review the tutorial for further insights.
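Python is one of the languages that uses a different name: its standard library module is called re. The following minimal sketch shows the basic workflow of compiling a pattern and then searching with it; the sample strings and the variable name digits are made up for illustration.

```python
import re

# Compile the pattern once so that it can be reused.
# \d+ means "one or more decimal digits".
digits = re.compile(r"\d+")

# search() returns the first match, or None when nothing matches.
first = digits.search("Order 66 shipped")
print(first.group())  # 66

# findall() returns every non-overlapping match as a list of strings.
every = digits.findall("3 apples and 12 pears")
print(every)  # ['3', '12']
```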

Defining special characters using escapes in pattern matching

Character escapes usually define a special character of some sort, very often a control character. You escape a character using the backslash (\), which means that if you want to search for a backslash, you must use two backslashes in a row (\\). The character in question follows the escape. Consequently, \t signals that you want to look for a tab character. (Be careful with \b: in Python, it matches a backspace character only inside square brackets, as in [\b]; elsewhere it marks a word boundary.) Programming languages standardize these characters in several ways:
  • Control character: Provides access to control characters such as tab (\t), newline (\n), and carriage return (\r). Note that the \n character (which has a value of \u000A) is different from the \r character (which has a value of \u000D).
  • Numeric character: Defines a character based on numeric value. The common types include octal (\nnn), hexadecimal (\xnn), and Unicode (\unnnn). In each case, you replace the n with the numeric value of the character, such as \u0041 for a capital letter A in Unicode. Note that you must supply the correct number of digits and use 0s to fill out the code.
  • Escaped special character: Specifies that the regular expression compiler should view a special character, such as ( or [, as a literal character rather than as a special character. For example, \( would specify an opening parenthesis rather than the start of a subexpression.
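The three kinds of escapes in the preceding list can be tried out with a short Python sketch. It assumes Python's re module, and the sample strings are invented for illustration. Raw strings (the r prefix) keep Python itself from interpreting the backslashes before the regular expression compiler sees them.

```python
import re

# Control character: \t matches a tab.
fields = re.split(r"\t", "name\tage")
print(fields)  # ['name', 'age']

# A literal backslash takes two backslashes in the pattern.
path = "C:\\temp\\data.csv"  # The actual text is C:\temp\data.csv
backslashes = re.findall(r"\\", path)
print(len(backslashes))  # 2

# Numeric character: \x41 (hexadecimal) and \u0041 (Unicode) both mean A.
both_a = re.fullmatch(r"\x41\u0041", "AA") is not None
print(both_a)  # True

# Escaped special character: \( and \) match literal parentheses.
in_parens = re.findall(r"\(\d+\)", "call (555) now")
print(in_parens)  # ['(555)']
```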

Defining wildcard characters in pattern matching

A wildcard character can define a kind of character, but never a specific character. You use wildcard characters to specify any digit or any character at all. The following list tells you about the common wildcard characters. Your language may not support all these characters, or it may define characters in addition to those listed. Here's what the following characters match with:
Character   Matches
.           Any character (with the possible exception of the newline character or other control characters)
\w          Any word character
\W          Any nonword character
\s          Any whitespace character
\S          Any non-whitespace character
\d          Any decimal digit
\D          Any character that is not a decimal digit
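To see the wildcard characters in the table at work, consider this small Python sketch. The sample string is invented, and the results shown assume Python's re module with its default settings.

```python
import re

sample = "Room 42, floor_3!"

# \d matches each decimal digit individually.
found_digits = re.findall(r"\d", sample)
print(found_digits)  # ['4', '2', '3']

# \w+ matches runs of word characters (letters, digits, and underscore).
found_words = re.findall(r"\w+", sample)
print(found_words)  # ['Room', '42', 'floor_3']

# \s matches whitespace; \W matches anything that is not a word character.
found_spaces = re.findall(r"\s", sample)
found_nonwords = re.findall(r"\W", sample)
print(found_spaces)    # [' ', ' ']
print(found_nonwords)  # [' ', ',', ' ', '!']

# . stands for any single character (other than a newline by default).
found_room = re.findall(r"R..m", sample)
print(found_room)  # ['Room']
```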

Working with anchors in pattern matching

Anchors define where in the target data a match may occur. For example, you may want to work with only the start or end of the target data. Each programming language appears to implement some special conditions with regard to anchors, but they all adhere to the basic syntax (when the language supports the anchor). The following table defines the commonly used anchors, along with the repetition and alternation symbols that usually accompany them. (Strictly speaking, only ^ and $ are anchors; *, +, ?, {m}, and {m,n} are quantifiers, and | is the alternation operator.)
Symbol                  What It Does
^                       Looks at the start of the string.
$                       Looks at the end of the string.
*                       Matches zero or more occurrences of the preceding character.
+                       Matches one or more occurrences of the preceding character. The character must appear at least once.
?                       Matches zero or one occurrence of the preceding character.
{m}                     Requires exactly m occurrences of the preceding character for a match.
{m,n}                   Requires from m to n occurrences of the preceding character for a match.
expression|expression   Performs an OR search, in which the regular expression compiler locates either one expression or the other and counts it as a match.

You may find some of these symbols difficult to figure out. Matching means defining a particular condition that the data must meet. For example, consider the pattern h.t, which matches hit and hot, but not hoot or heat, because the . wildcard stands for exactly one character. If you instead want to match hoot and heat as well, use h.*t, because the * symbol lets the preceding . repeat any number of times (including zero, so h.*t also matches ht). The ? symbol works differently from the ? used in file-name wildcards: it makes the preceding character optional, so colou?r matches both color and colour. Using the right symbol is essential to obtaining the desired result.
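The following Python sketch checks how these symbols behave in practice. It assumes Python's re module, where ? makes the preceding character optional and a single arbitrary character is written with the . wildcard. The word list and the helper name matching are hypothetical.

```python
import re

words = ["ht", "hit", "hot", "hoot", "heat", "that"]


def matching(pattern):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]


print(matching(r"h.t"))     # ['hit', 'hot']
print(matching(r"h.*t"))    # ['ht', 'hit', 'hot', 'hoot', 'heat']
print(matching(r"h.+t"))    # ['hit', 'hot', 'hoot', 'heat']
print(matching(r"h.{2}t"))  # ['hoot', 'heat']
print(matching(r"h?t"))     # ['ht'] (only t or ht can match in full)

# Anchors restrict where a match may occur when you use search().
ends_with_at = [w for w in words if re.search(r"at$", w)]
starts_with_t = [w for w in words if re.search(r"^t", w)]
print(ends_with_at)   # ['heat', 'that']
print(starts_with_t)  # ['that']

# Alternation accepts either expression.
either = re.findall(r"hit|hot", "hit hat hot")
print(either)  # ['hit', 'hot']
```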

Delineating subexpressions using grouping constructs

A grouping construct tells the regular expression compiler to treat a series of characters as a group. For example, the grouping construct [a-z] (often called a character class) tells the regular expression compiler to look for a single lowercase character in the range a through z.

However, the grouping construct [az] (without the dash between a and z) tells the regular expression compiler to look for just the letters a and z, but nothing in between, and the grouping construct [^a-z] tells the regular expression compiler to look for any character except the lowercase letters a through z. The following list describes the commonly used grouping constructs. The letters x and y and the word expression in this list are placeholders.

Construct       What It Means
[x]             Look for a single character from the characters specified by x.
[x-y]           Search for a single character from the range of characters specified by x and y.
[^expression]   Locate any single character not found in the character expression.
(expression)    Define a regular expression group. For example, ab{3} would match the letter a and then three copies of the letter b, that is, abbb. However, (ab){3} would match three copies of the expression ab: ababab.
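The examples in the table can be verified with a few lines of Python. This sketch assumes Python's re module; the sample string aZbz is invented for illustration.

```python
import re

# Without parentheses, {3} applies only to the letter b.
print(re.fullmatch(r"ab{3}", "abbb") is not None)      # True

# With parentheses, {3} applies to the whole group ab.
print(re.fullmatch(r"(ab){3}", "ababab") is not None)  # True
print(re.fullmatch(r"(ab){3}", "abbb") is not None)    # False

# Bracketed constructs each match a single character.
in_range = re.findall(r"[a-z]", "aZbz")
only_a_or_z = re.findall(r"[az]", "aZbz")
not_lowercase = re.findall(r"[^a-z]", "aZbz")
print(in_range)       # ['a', 'b', 'z']
print(only_a_or_z)    # ['a', 'z']
print(not_lowercase)  # ['Z']
```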

About This Article

This article is from the book: 

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.