Welcome to the regex world

I know, it might sound a bit strange, but I really love regex! When I say that, the person I am speaking with usually makes a weird face. It is quite common to think of regex as evil black boxes that no one can understand or use easily. Nothing could be further from the truth, and I am sure I will convince you!

Why should I learn them?

Once you have learned how to use regex, you will never go back. It is like riding a bicycle. They are super useful: with them you can perform a lot of otherwise really time-consuming find and replace operations, and that alone makes them worth learning. On top of that, you will master string processing in a way that will make you really productive! I love using the Atom editor, which also supports regex find and replace. I think that Atom and regex make a really good team.

Before starting, I have to warn you that there are a lot of flavours of regex! I will use the Python one, but you can take the same knowledge to the dialect you need: the notions are the same, with a few syntax changes.

To follow this tutorial I suggest you use the regex101 online editor, which is really useful for trying out or troubleshooting regex. Be sure to select the Python dialect on the left to be 100% aligned with the content of this guide! Ok, let’s start!

Basics

The character . matches any single character (except a newline, by default)! For example, you can match the words word and rest, or the string 1234, with the same regex, that is:

....

Most other characters can be matched by just typing them. What if you want to match the dot character itself? Characters with a special meaning, like the dot, have to be escaped. To do that, just add a backslash as a prefix, like this:

\.
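If you prefer to try this in Python instead of regex101, here is a minimal sketch using the standard re module. The variable names are just picked for this example.

```python
import re

# Four dots: any four characters (newlines excluded by default).
four_any = re.compile(r"....")
for text in ["word", "rest", "1234"]:
    print(text, bool(four_any.fullmatch(text)))  # True for all three

# An escaped dot matches only a literal dot.
literal_dot = re.compile(r"\.")
print(bool(literal_dot.fullmatch(".")))  # True
print(bool(literal_dot.fullmatch("a")))  # False
```

The r prefix makes a raw string, so the backslash reaches the regex engine untouched.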

Regex come with shortcuts (usually called quantifiers) that save you from repeating every single character. These shortcuts must be placed to the right of the character they apply to. They are:

  • + one or more
  • * zero or more
  • {n} exactly n
  • {m,n} from m to n
  • {m,} m or more
  • ? zero or one

If you want to match every string with at least one character, you can then do:

.+

If you want to match every string that has from three to five characters, you can do:

.{3,5}

If you want to match both the string test and tests you can do:

tests?
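The quantifier examples above can be checked quickly in Python. This is only a sketch: matches_fully is a small helper made up for this example, built on re.fullmatch so that the whole string has to match.

```python
import re

def matches_fully(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches_fully(r".+", "a"))           # True: one or more characters
print(matches_fully(r".+", ""))            # False: the empty string has none
print(matches_fully(r".{3,5}", "abcd"))    # True: four characters
print(matches_fully(r".{3,5}", "ab"))      # False: too short
print(matches_fully(r".{3,5}", "abcdef"))  # False: too long
print(matches_fully(r"tests?", "test"))    # True: the final s is optional
print(matches_fully(r"tests?", "tests"))   # True
```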

There is one last basic thing to know. You can use a boolean or with the | character. You can match the word pizza or hotdog (unfortunately not both!) with the following regex:

pizza|hotdog
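Here is a small Python sketch of the same alternation. The sample sentence is invented for the example.

```python
import re

food = re.compile(r"pizza|hotdog")

print(bool(food.fullmatch("pizza")))        # True
print(bool(food.fullmatch("hotdog")))       # True
print(bool(food.fullmatch("pizzahotdog")))  # False: one or the other, not both

# findall() collects every non-overlapping match in the text.
found = food.findall("one pizza and one hotdog")
print(found)  # ['pizza', 'hotdog']
```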

Sets

Now things are starting to get more interesting. You can match both the word test and Test by doing the following:

[Tt]est

All the characters between the square brackets are treated as a set, and the set matches any one of them. Of course, you can use all the operators shown before with sets.

You can also say that you want to match everything but the specified characters! To do that, insert the negation character, that is ^, as the first character inside the square brackets.

You can also specify a range of characters without explicitly writing them all! For example, if you want a string composed of four letters from a to z, followed by a single character that is anything but a letter from a to z, you can do the following (it matches, for example, the string test! with its exclamation mark):

[a-z]{4}[^a-z]

Be careful! The character ^ outside a set means “start of the string”, and $ means “end of the string”. If you want to match a word like the one before, but as a full match (not allowing anything before or after it), you can do the following:

^[a-z]{4}[^a-z]$
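To see what the anchors change, here is a minimal Python sketch that runs the two versions of the pattern with re.search, which looks for a match anywhere in the text. The sample strings are invented for the example.

```python
import re

case_set = re.compile(r"[Tt]est")
print(bool(case_set.fullmatch("Test")))  # True
print(bool(case_set.fullmatch("best")))  # False: b is not in the set

unanchored = re.compile(r"[a-z]{4}[^a-z]")
anchored = re.compile(r"^[a-z]{4}[^a-z]$")

print(bool(unanchored.search("test!")))         # True
print(bool(unanchored.search("a test! here")))  # True: found in the middle
print(bool(anchored.search("test!")))           # True
print(bool(anchored.search("a test! here")))    # False: extra text around it
```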

Special escaped characters

We spoke about the escaped dot, which matches exactly a dot. There are other characters that take on a special meaning when prefixed with a backslash. Each one matches a single character unless specified otherwise (using for example +, * and so on); the exceptions are \b, \B, \A and \Z, which match a position and not a character. It is worth noticing that many of these special escaped characters take the opposite meaning when uppercase. Here we go:

  • \n newline
  • \r carriage return
  • \t tab
  • \0 null character
  • \s whitespace and \S everything but whitespace
  • \w word character (letter, digit or underscore) and \W everything but a word character
  • \d digit and \D everything but a digit
  • \b word boundary and \B everything but a word boundary
  • \A start of the string (yes, like ^)
  • \Z end of the string (yes, like $, although $ also matches just before a trailing newline)

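There are two other escaped characters that are really rare, but you might need them. They are hexadecimal characters, written with \x, and octal characters, written with a backslash followed by the octal digits. For example, for the hexadecimal character aa you have to write \xaa, and for the octal character 101 you have to write \101.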
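The escapes above can be tried on a sample string in Python. This is only a sketch and the text is made up for the example; it contains a real tab character between shipped and on.

```python
import re

text = "Order 66 shipped\ton 2024-05-04"

digits = re.findall(r"\d+", text)
print(digits)  # ['66', '2024', '05', '04']

words = re.findall(r"\w+", text)
print(words)   # ['Order', '66', 'shipped', 'on', '2024', '05', '04']

# \s+ matches the spaces and the tab alike.
chunks = re.split(r"\s+", text)
print(chunks)  # ['Order', '66', 'shipped', 'on', '2024-05-04']

# \b keeps "on" from matching inside a longer word.
whole_word_on = re.findall(r"\bon\b", text)
print(whole_word_on)  # ['on']

# \D removes everything that is not a digit.
only_digits = re.sub(r"\D", "", "2024-05-04")
print(only_digits)  # 20240504

# Hexadecimal 41 and octal 101 are both the letter A.
hex_a = re.fullmatch(r"\x41", "A")
octal_a = re.fullmatch(r"\101", "A")
print(bool(hex_a), bool(octal_a))  # True True
```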

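You have already learned a lot of tools! For example, you are now able to create your own regex for detecting a valid email address. It might be something like this (it’s not perfect, just to give you an idea):

^[0-9A-Za-z_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9.-]+$

Just a little note. You will often find [A-z] used as a shortcut for all uppercase and lowercase letters, but it is not equivalent to [a-zA-Z]: the range from A to z also contains the characters [, \, ], ^, _ and `, which sit between Z and a in the ASCII table. That is why I used the longer but correct notation A-Za-z. Note also that the hyphen is the last character of each set, so that it is read as a literal hyphen and not as the middle of a range.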
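Here is a minimal Python sketch that tries an email pattern of this shape. The addresses are invented for the example, and the hyphen sits last inside each set so that it is read as a literal character. The sketch also checks how [A-z] and [a-zA-Z] treat the character ^, which sits between Z and a in the ASCII table.

```python
import re

# Rough email check: an illustration, not a full validator.
EMAIL = re.compile(r"^[0-9A-Za-z_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9.-]+$")

for candidate in ["john.doe+news@example.com", "no-at-sign.example.com", "a@b"]:
    print(candidate, bool(EMAIL.match(candidate)))
# Only the first candidate is accepted.

# The A-z range also covers some punctuation, A-Za-z does not.
wide_range = re.fullmatch(r"[A-z]", "^")
letters_only = re.fullmatch(r"[a-zA-Z]", "^")
print(bool(wide_range), bool(letters_only))  # True False
```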

Groups

Groups are delimited by an open and a closed parenthesis, and they can be of various types:

  • (abc) capturing group: you can later access the captured value by the number of the group (the first group with \1, the second with \2, and so on).
  • (?P<name>abc) named group: you can later access the captured value, which is abc in this case, through the name name.
  • (?:abc) non-capturing group: it matches everything inside, but you cannot access the value later.
  • (?#abc) comment: it is ignored by the engine.

For example, you can have a regex matching the same word written twice (e.g. goofygoofy) with the following syntax:

^([A-Za-z]+)\1$

You can have a form with a name and a surname separated by a comma, and access them later using named groups:

^(?P<name>[A-Z][a-z]*),(?P<surname>[A-Z][a-z]*)$
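Here is how the captured values can be read back in Python. This is only a sketch: the group names name and surname and the sample strings are picked for this example.

```python
import re

# Numbered group: \1 must repeat exactly what group 1 captured.
doubled = re.compile(r"^([A-Za-z]+)\1$")
print(bool(doubled.match("goofygoofy")))  # True
print(bool(doubled.match("goofydaffy")))  # False

# Named groups are read back by name after the match.
person = re.compile(r"^(?P<name>[A-Z][a-z]*),(?P<surname>[A-Z][a-z]*)$")
m = person.match("Ada,Lovelace")
print(m.group("name"), m.group("surname"))  # Ada Lovelace

# A non-capturing group matches but leaves no value behind.
mixed = re.match(r"(?:ab)+(c)", "ababc")
print(mixed.groups())  # ('c',)
```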

Lookahead and lookbehind

Ok, we are near the end of our journey. There is one topic left that makes you a real regex guru: lookaheads and lookbehinds.

They have the same syntax as groups, that is a pair of parentheses, but they are a really powerful tool. Besides being powerful, they are also really heavy to use, hence the usual quote:

With great power comes great responsibility

Spider-Man

Ok, sorry for that.

The first one is the positive lookahead. The syntax is (?=abc). This means that the group content abc is matched without consuming characters! For example, you can check that a string is made of exactly three characters that are not digits, check that the same three characters are none of a, b or c, and then capture everything. How can you check the same three characters twice? Using the positive lookahead, which in fact does not move, it just looks!

(?=^\D{3}$)(?=^[^abc]{3}).*

With that regex you can match the string def, but not dea or defg. Cool, isn’t it?
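A quick way to verify those three cases is a Python sketch like the following. lookahead_result is a helper name made up for this example.

```python
import re

pattern = re.compile(r"(?=^\D{3}$)(?=^[^abc]{3}).*")

def lookahead_result(text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(lookahead_result("def"))   # def
print(lookahead_result("dea"))   # None: a is one of the forbidden letters
print(lookahead_result("defg"))  # None: four characters, not three
print(lookahead_result("d3f"))   # None: it contains a digit
```

Both lookaheads start from the same position, so the final .* is the only part that actually consumes the three characters.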

Instead of looking ahead, you can look behind, with a positive lookbehind. The syntax is (?<=abc). Keep in mind that in Python the content of a lookbehind must have a fixed length. You can write the previous expression in the following equivalent way:

(?=^\D{3}$).*(?<=^[^abc]{3})
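The lookbehind version can be checked in the same way. In this minimal sketch lookbehind_result is a made-up helper name; the second part shows a simpler, invented example that picks up the digits that come right after a dollar sign.

```python
import re

pattern = re.compile(r"(?=^\D{3}$).*(?<=^[^abc]{3})")

def lookbehind_result(text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(lookbehind_result("def"))   # def
print(lookbehind_result("dea"))   # None
print(lookbehind_result("defg"))  # None

# The lookbehind checks for the $ sign without including it in the match.
prices = re.findall(r"(?<=\$)\d+", "it costs $100, not 200")
print(prices)  # ['100']
```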

If you grasp those two, nothing can stop you! In fact, the last two are the negative lookahead (the same as the positive one, but its content must not match) and the negative lookbehind. Their syntax is the following:

  • Negative lookahead: (?!abc)
  • Negative lookbehind: (?<!abc)
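Here is a small Python sketch of the two negative versions. The sample strings are invented for the example.

```python
import re

# Negative lookahead: foo only when it is NOT followed by bar.
not_before_bar = re.findall(r"foo(?!bar)", "foobar foobaz")
print(not_before_bar)  # ['foo'], the one taken from foobaz

# Negative lookbehind: a number only when it is NOT preceded by a $ sign.
# The \b stops the match from starting in the middle of 100.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "$100 and 200")
print(not_after_dollar)  # ['200']
```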

Conclusions

That was intense. I hope that you had the chance to learn something new, or at least to discover a few tricks that you didn’t know before. Let me know in the comments if you liked this article, see you in the next post!
