KNOW-HOW BIBLIO: Using Regular Expressions to extract desired data

This article tries to demonstrate how Regular Expression (RE) can help us on the basis of regular work with data and messy text data. According to Wikipedia, "A regular expression is a sequence of characters that specifies a search pattern."[1] REs are such patterns that are applied in modifying, finding, and preprocessing text data. There are some sophisticated tools for using RE, such as R, and Python etc. In this article, Google Sheets will be used to show how REs work.

Stephen Cole Kleene, a mathematician, was the introducer or REs. It was originated in 1951.[2] REs are abbreviated as Regex which finds, extracts, matches, and replaces text. In this article, only the extract function has been discussed. More about Regex can be found in web resources like Wiki. There are a huge number of Regex syntaxes that are used as characters. But some important characters have been shown here. Some important operations of Regex are:

Character classes
Anchors
Escaped character
Grouping
Quantification

In this article, no broad discussion is made but some useful operations have been explored. Learners can go through the Wiki sources for more insights. Now, let's work with Regex.

#1 Basic extractions

First, we can extract any text (name, phrases etc.) using the simple Regex function in Google Sheets. We need to formulate =REGEXEXTRACT(A12, "Example Library"); it will give results the desired data. As we can see in the formula bar, the name of the library is written within a double quotation, so in that case, it will extract whatever we put within the quotation; it can be anything, names, digits etc.

#2 Extracting phone numbers or PIN codes from any text

In many cases, we find unorganized data. So, we can cope with this one. any cell contains both text and numbers then we can use the function =REGEXEXTRACT(A12,"\d+"). It can be explained as:

\d Digit. Matches any character (0-9).

+ Quantifier. Matches One or more than the preceding token.

We could have used [0-9]+ within the bracket. It will result from the same.

We may find multiple strings of numbers, then we need to change the pattern. The earlier pattern only will extract the first string that is PIN, not the Mobile number. So, we can use this pattern: =REGEXEXTRACT(A12,"Mob:.\d+"). It can be explained as:

Mob: is case sensitive, we can put it in the starting point; then it will extract all the text regarding Mob:.

. Dot. Matches any character (because there is a space between: and 0)

\d Digit. Matches any character (0-9).

+ Quantifier. Matches One or more than the preceding token.

This is more or like the previous one except for the starting point. In the same manner, we may use [0-9]+

#3 Extracting e-mail ids

Now, here comes the most important task. We may face problems while extracting the email ids from a certain text. Then we can use the following function.

In this worksheet (Column A), commonly used types of email ids have been shown. To extract email ids from diverse text, we may use the function (pattern) =REGEXETRACT(A3, "[a-z0-9._]+@+[a-z]+.[a-z]+.[a-z]+"). As the pattern was in the B1, so this has been locked by $B$1. The expressions within the double quotations will extract any email ids. This can be explained as:

[ Character set. Match any character in the set.
   a-z Range. Matches a character in the range a-z.
   0-9 Range. Matches a character in the range 0-9.
   . Character. Matches "." character.
   - Character. Matches "-" character.
]
   + Quantifier. Matches 1 or more of preceding token
@ Character. Matches "@" character.
   + Quantifier. Matches 1 or more of preceding token
[ Character set. Match any character in the set.
   a-z Range. Matches a character in the range a-z.
]
+ Quantifier. Matches 1 or more of preceding token
. Dot. Matches any character
[ Character set. Match any character in the set.
   a-z Range. Matches a character in the range a-z.
]
+ Quantifier. Matches 1 or more of preceding token
. Dot. Matches any character
[ Character set. Match any character in the set.
   a-z Range. Matches a character in the range a-z.
]
+ Quantifier. Matches 1 or more of preceding token

This can be more complex than the previous two patterns. But we may be able to work adequately. This is how we easily can extract email ids from any text.

[This article is for educational purposes]

Share it:

14 July 2021

Using Regular Expressions to extract desired data

#1 Basic extractions

#2 Extracting phone numbers or PIN codes from any text

#3 Extracting e-mail ids