
Python Regex in a nutshell

Published May 25, 2020

Regular expressions are one of the tools that make programming easier, and Python is no exception. In this article, I write about Python regex in particular and how I manage to keep a handle on it, since it is very easy to forget.

Let me start with a definition of regular expressions as I understand them. A regular expression is a tool that lets us search a string of data using a pattern that matches the information we seek. Imagine it like this: your boss has a large, messy and overwhelmingly confusing string of data, and she has instructed you to fetch all the emails in it. Instead of looking up the emails one after the other in a 5000-line string, all you need to do is define a regular expression pattern that matches an email address and use it to get all the emails in that string.

There are four key things that I keep in mind about Python regex: METACHARACTERS, SPECIAL SEQUENCES, CHARACTER CLASSES and GROUPS.

METACHARACTERS
They are characters that make regex more powerful than plain string methods. Examples include ".", "^", "$", "+", "*", "?", "{}" and "|".
"." matches any character except a newline, so the pattern "..." will match any three characters with no newlines
"^" matches the beginning of a string
"$" matches the end of a string
"+" means one or more of the pattern it follows
"*" means zero or more of the pattern it follows
"?" means zero or one of the pattern it follows
"{}" holds a numerical value that indicates quantity, e.g. "a{3}" matches exactly three "a"s
"i|j" matches either of the two options, i or j

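As a rough illustration of the metacharacters listed above, here is a small sketch using Python's built-in re module. The sample strings are made up for this example and are not from any real data.

```python
import re

# "." matches any single character except a newline
dots = re.findall(r"...", "abcdef\ngh")

# "^" anchors the match at the start of the string, "$" at the end
starts_with_py = bool(re.search(r"^Py", "Python"))
ends_with_on = bool(re.search(r"on$", "Python"))

# "*" is zero or more, "+" is one or more, "?" is zero or one
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"colou?r", "color colour")

# "{}" gives an exact count; "|" picks either alternative
three_a = re.findall(r"a{3}", "aa aaa")
either = re.findall(r"cat|dog", "cat, dog, bird")

print(dots, star, plus, optional, three_a, either)
```

Note how "ab*" still matches a lone "a", while "ab+" needs at least one "b".
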
SPECIAL SEQUENCES
Regular expressions provide some special sequences for matching patterns in a string. They include \d, \s and \w, which match digits, whitespace and word characters respectively. To make these sequences do the opposite, we just need to make them upper case: \D, \S and \W match everything except digits, whitespace and word characters respectively.
There are more special sequences: \A and \Z match the beginning and end of a string respectively, and \b matches the empty string at a word boundary, for example between a \w and a \W character.
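
To make the special sequences a bit more concrete, here is a minimal sketch. The sample text is invented for illustration only.

```python
import re

text = "Order 66 shipped on day 7"  # made-up sample input

digits = re.findall(r"\d+", text)      # runs of digits
non_digits = re.findall(r"\D+", text)  # runs of anything but digits
words = re.findall(r"\w+", text)       # runs of word characters
spaces = re.findall(r"\s", text)       # individual whitespace characters

# \b marks a word boundary, so only the standalone "day" matches
whole_word = re.findall(r"\bday\b", "day daylight today")

# \A and \Z anchor at the very start and very end of the string
at_start = bool(re.search(r"\AOrder", text))
at_end = bool(re.search(r"7\Z", text))

print(digits, words, whole_word)
```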

CHARACTER CLASSES
Character classes define a pool of characters, any one of which qualifies as a match. For instance, if I create a class like this [A-Z0-9a-z] it means I want to match any upper case letter, any lower case letter or any digit in the rough string. If you put most metacharacters, e.g. "$" or ".", within a character class, they behave like normal characters. The metacharacter "^" is the exception: it negates the class, but only if it comes at the beginning. For example: write a character class that matches any character that is not a digit.
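import re
pattern = r'[^0-9]'
string = "abc123def"  # sample input, chosen only for illustration
# re.search returns only the first match; re.findall returns every match
match = re.search(pattern, string)
non_digits = re.findall(pattern, string)
The code above uses the character class to make selections from the string. re.search stops at the first non-digit character, while re.findall selects every character in the string except the digits.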
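
The behaviour of metacharacters inside a character class can be checked with a short sketch like the one below. The input strings are assumptions made up for this example.

```python
import re

# Inside a class, "." and "$" are treated as literal characters
literal = re.findall(r"[.$]", "cost: $4.50")

# "^" at the start of a class negates it
not_digits = re.findall(r"[^0-9]", "a1b2")

# "^" anywhere else in a class is just a literal caret
caret = re.findall(r"[0-9^]", "2^3")

print(literal, not_digits, caret)
```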

GROUPS
Groups are a way of combining single metacharacters, characters and character classes. For instance, in order to write a pattern that matches the email address "codegenius1010@gmail.com", I will need 3 groups: one for "codegenius1010", one for "gmail" and one for ".com".
Let's write them one after the other and then put them together:
codegenius1010: ([\w.-]+)
gmail: ([\w.-]+)
.com: (\.[\w.]+)
Those are the groups we need to match our email. But is that all? No. We need to put a constant "@" between the first and second groups because all emails have it. The dot at the start of the third group is escaped as "\." so that it matches a literal dot rather than any character.
Note that we added "+" as a quantifier inside each of the groups because we know the match can occur one or more times.
Now let's put everything together:

Task: write a Python regex to extract an email from a very rough random string

import re
pattern = r"([\w.-]+)@([\w.-]+)(\.[\w.]+)"
# sample input, made up for illustration
rough_string = "xx!!codegenius1010@gmail.com??yy"
email = re.search(pattern, rough_string)
if email:
    print(email.group())
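
As a follow-up sketch, the individual groups can be read back from the match object, and re.finditer can collect every email in a string rather than just the first. This assumes the three-group pattern with the dot before the last part escaped. The sample strings and addresses are invented for illustration.

```python
import re

# Three groups: user name, domain name, and the dotted suffix
pattern = r"([\w.-]+)@([\w.-]+)(\.[\w.]+)"

# Made-up rough string containing one address
rough_string = "xx codegenius1010@gmail.com yy"
match = re.search(pattern, rough_string)
user, domain, suffix = match.groups()  # one value per group

# Made-up string containing two addresses
many = "contact: a.b@example.org, c_d@mail.co.uk!"
# finditer yields a match object per hit; group() is the whole match
all_emails = [m.group() for m in re.finditer(pattern, many)]

print(user, domain, suffix)
print(all_emails)
```

re.finditer is used here instead of re.findall because, when a pattern contains groups, re.findall returns tuples of the groups rather than the full matched text.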

That's about all I have on Python regular expressions for now. Please feel free to ask questions in the comment box. Thanks.

Discover and read more posts from Babatunde
2 Replies
zendannyy
4 years ago

This is good, I had previously used
([a-zA-Z0-9._%&+-]+)([a-zA-Z0-9.-]+)(\.[a-zA-Z0-9]{2,3})

but I think your pattern is simpler

Babatunde
4 years ago

Hello zendanny, thanks for the feedback.