TheDeveloperBlog.com


Python Re Match, Search Examples

Regular expressions. These are tiny programs that process text. In Python we access regular expressions through the re module. We call methods like re.match().


With methods, such as match() and search(), we run these little programs. More advanced methods like groupdict can process groups. Findall handles multiple matches. It returns a list.


Match example. This program uses a regular expression in a loop. It applies a for-loop over the elements in a list. And in the loop body, we call re.match().

Then: We test this call for success. If it was successful, groups() returns a tuple containing the text captured by each parenthesized group in the pattern.

Pattern: This uses metacharacters to describe what strings can be matched. The "\w" means "word character." The plus means "one or more."

Tip: Much of the power of regular expressions comes from patterns. We cover Python methods (like re.match) and these metacharacters.

Based on:

Python 3

Python program that uses match

import re

# Sample strings.
words = ["dog dot", "do don't", "dumb-dumb", "no match"]

# Loop.
for element in words:
    # Match if two words starting with letter d.
    m = re.match(r"(d\w+)\W(d\w+)", element)

    # See if success.
    if m:
        print(m.groups())

Output

('dog', 'dot')
('do', 'don')
('dumb', 'dumb')

Pattern details

Pattern: (d\w+)\W(d\w+)

d        Lowercase letter d.
\w+      One or more word characters.
\W       A non-word character.

Search. This method is different from match. Both apply a pattern. But search attempts this at all possible starting points in the string. Match just tries the first starting point.

So: Search scans through the input string and tries to match at any location. In this example, search succeeds but match fails.

Python program that uses search

import re

# Input.
value = "voorheesville"

m = re.search("(vi.*)", value)
if m:
    # This is reached.
    print("search:", m.group(1))

m = re.match("(vi.*)", value)
if m:
    # This is not reached.
    print("match:", m.group(1))

Output

search: ville

Pattern details

Pattern: (vi.*)

vi       The lowercase letters v and i together.
.*       Zero or more characters of any type.

Split. The re.split() method accepts a pattern argument. This pattern specifies the delimiter. With it, we can use any text that matches a pattern as the delimiter to separate text data.

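Here: We split the string on runs of one or more non-digit characters. Because the input begins with non-digit characters, the first element of the result is an empty string. The regular expression is described after the script output.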

Tip: A split() method is also available directly on a string. That method does not accept regular expressions, only a literal separator. It is simpler.

Python program that uses split

import re

# Input string.
value = "one 1 two 2 three 3"

# Separate on one or more non-digit characters.
result = re.split(r"\D+", value)

# Print results.
for element in result:
    print(element)

Output

1
2
3

Pattern details

Pattern: \D+

\D+      One or more non-digit characters.
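
Sketch: The tip about the string split() method can be illustrated with a short comparison. This is a minimal example, not part of the original program. The sample string and variable names are made up for illustration. It suggests that str.split() treats its argument as literal text, while re.split() treats it as a pattern.

```python
import re

# Hypothetical sample input: fields separated by commas and uneven spaces.
value = "a, b,c ,  d"

# str.split uses a literal separator, so the surrounding spaces remain.
plain = value.split(",")

# re.split uses a pattern: optional whitespace around a comma.
by_pattern = re.split(r"\s*,\s*", value)

print(plain)
print(by_pattern)
```

When the delimiter is fixed text, the string method is usually enough. The re version is worth its cost only when the delimiter varies.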

Findall. This is similar to split(). Findall accepts a pattern that indicates which strings to return in a list. It is like split() but we specify matching parts, not delimiters.

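Here: We scan a string for the letter d or p followed by one or more word characters. In this input each match falls at the start of a word. The "p" at the end of "map" is skipped because no word character follows it.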

Python program that uses findall

import re

# Input.
value = "abc 123 def 456 dot map pat"

# Find all words starting with d or p.
matches = re.findall(r"[dp]\w+", value)

# Print result.
print(matches)

Output

['def', 'dot', 'pat']

Pattern details

Pattern: [dp]\w+

[dp]     A lowercase d, or a lowercase p.
\w+      One or more word characters.
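
Sketch: The pattern above can also begin a match in the middle of a word. If only whole words starting with d or p are wanted, a word boundary "\b" can be placed in front. The input string below is invented for illustration and is not from the original program.

```python
import re

# Hypothetical input: "adopt" contains a d that is not at a word start.
value = "adopt dot map pat"

# Without a boundary, a match may begin inside a word.
loose = re.findall(r"[dp]\w+", value)

# With \b, a match must begin at the start of a word.
strict = re.findall(r"\b[dp]\w+", value)

print(loose)
print(strict)
```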

Finditer. Unlike re.findall, which returns strings, finditer returns matches. For each match, we call methods like start() or end(). And we can access the value of the match with group().

Python program that uses finditer

import re

value = "123 456 7890"

# Loop over all matches found.
for m in re.finditer(r"\d+", value):
    print(m.group(0))
    print("start index:", m.start())

Output

123
start index: 0
456
start index: 4
7890
start index: 8
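
Sketch: The end() method mentioned above works together with start(). This small example assumes the same input string. It collects the start and end of each match and checks that slicing the input with them gives back the matched text.

```python
import re

value = "123 456 7890"

spans = []
for m in re.finditer(r"\d+", value):
    # start() is inclusive and end() is exclusive, like slice indexes.
    assert value[m.start():m.end()] == m.group(0)
    spans.append((m.group(0), m.start(), m.end()))

print(spans)
```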

Start, end. We can use special characters in an expression to match the start and end of a string. For the start, we use the character "^" and for the end, we use the "$" sign.

Here: We loop over a list of strings and call re.match. We detect the strings that start or end with a digit, which is written as "\d" in the pattern.

Tip: The match method tests from the leftmost part of the string. So to test the end, we use ".*" to handle these initial characters.

Python program that tests starts, ends

import re

items = ["123", "4cat", "dog5", "6mouse"]
for element in items:

    # See if string starts in digit.
    m = re.match(r"^\d", element)
    if m:
        print("START:", element)

    # See if string ends in digit.
    m = re.match(r".*\d$", element)
    if m:
        print("  END:", element)

Output

START: 123
  END: 123
START: 4cat
  END: dog5
START: 6mouse

Pattern details

^\d     Match at the start, check for single digit.
.*\d$   Check for zero or more of any char.
        Check for single digit.
        Match at the end.
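
Sketch: Because search() tries every starting point, it can test the end of a string without the leading ".*". This is an alternative to the program above, not a replacement for it. The list names are chosen here for illustration.

```python
import re

items = ["123", "4cat", "dog5", "6mouse"]

# search() scans forward, so "\d$" alone finds a trailing digit.
ends_with_digit = [s for s in items if re.search(r"\d$", s)]

# match() already starts at the beginning, so "^" is optional here.
starts_with_digit = [s for s in items if re.match(r"\d", s)]

print(ends_with_digit)
print(starts_with_digit)
```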

Or, repeats. Here we match strings that start with three word characters or three dashes. And the final three characters must be digits. Note that "\w" accepts letters, digits and underscores. We use non-capturing groups with the "?:" syntax.

And: We use the "{3}" codes to require three repetitions of word characters or hyphens.

Finally: We specify digit characters with the code "\d" and the metacharacter "$" to require the end of the string.

Python that uses re, expressions, repeats, or

import re

values = ["cat100", "---200", "xxxyyy", "jjj", "box4000", "tent500"]
for v in values:

    # Require 3 letters OR 3 dashes.
    # ... Also require 3 digits.
    m = re.match(r"(?:(?:\w{3})|(?:\-{3}))\d\d\d$", v)
    if m:
        print("  OK:", v)
    else:
        print("FAIL:", v)

Output

  OK: cat100
  OK: ---200
FAIL: xxxyyy
FAIL: jjj
FAIL: box4000
FAIL: tent500

Pattern details

(?:    The start of a non-capturing group.
\w{3}  Three word characters.
|      Logical or: a group within the chain must match.
\-     An escaped hyphen.
\d     A digit.
$      The end of the string.
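
Sketch: Since "\w" also accepts digits and underscores, a string like "123456" passes the pattern above. If only letters are intended, a character class can replace "\w". This variant is an assumption about intent, not the original pattern. It uses fullmatch(), so no "$" is needed.

```python
import re

# Hypothetical test values chosen to show the difference from "\w".
values = ["cat100", "---200", "123456", "a_b789"]

# Three ASCII letters OR three hyphens, then exactly three digits.
pattern = re.compile(r"(?:[A-Za-z]{3}|-{3})\d{3}")

# fullmatch() requires the whole string to match the pattern.
results = {v: bool(pattern.fullmatch(v)) for v in values}
print(results)
```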

Sub method. The re.sub method can apply a function or lambda to each match found in a string. We specify a pattern and a function that receives a match object and returns the replacement text. And we can process matches in any way.
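
Sketch: A minimal example of re.sub with a function and with a lambda. The input string and the helper name double are invented for illustration.

```python
import re

# Hypothetical input.
value = "cat 10, dog 25"

def double(m):
    # m is a match object; group(0) is the matched text.
    return str(int(m.group(0)) * 2)

# Pass a function: it is called once for each match.
result = re.sub(r"\d+", double, value)

# A lambda works the same way.
upper = re.sub(r"[a-z]+", lambda m: m.group(0).upper(), value)

print(result)
print(upper)
```

The function must return a string. Here the numbers are converted to int, doubled, and converted back.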


Named groups. A regular expression can have named groups. This makes it easier to retrieve those groups after calling match(). But it makes the pattern more complex.

Here: We can get the first name with the string "first" and the group() method. We use "last" for the last name.

Python that uses named groups

import re

# A string.
name = "Clyde Griffiths"

# Match with named groups.
m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

# Print groups using names as id.
if m:
    print(m.group("first"))
    print(m.group("last"))

Output

Clyde
Griffiths

Pattern details

Pattern:         (?P<first>\w+)\W+(?P<last>\w+)

(?P<first>\w+)   First named group.
\W+              One or more non-word characters.
(?P<last>\w+)    Second named group.

Groupdict. A regular expression with named groups can fill a dictionary. This is done with the groupdict() method. In the dictionary, each group name is a key.

And: Each value is the data matched by the regular expression. So we receive a key-value store based on groups.

Here: With groupdict, later code no longer needs to refer to the original regular expression. It works with the matched data in dictionary form.

Python that uses groupdict

import re

name = "Roberta Alden"

# Match names.
m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

if m:
    # Get dict.
    d = m.groupdict()

    # Loop over dictionary with for-loop.
    for t in d:
        print("  key:", t)
        print("value:", d[t])

Output

  key: first
value: Roberta
  key: last
value: Alden
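
Sketch: groupdict() combines well with finditer() when a string holds several records. The input below is invented for illustration. It reuses the named-group pattern from the program above.

```python
import re

# Hypothetical input with two names.
text = "Clyde Griffiths; Roberta Alden"

pattern = re.compile(r"(?P<first>\w+)\W+(?P<last>\w+)")

# Build one dictionary per match.
people = [m.groupdict() for m in pattern.finditer(text)]
print(people)
```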

Performance. Regular expressions often hinder performance in programs. I tested the in-operator on a string against the re.search method. This searches the input string for the letter "x."

Result: I found that the in-operator was much faster than the re.search method. For searching with no pattern, prefer the in-operator.

However: The re.search method has much more power. It evaluates a pattern to search. We should choose the simplest method possible.

Python that tests re.search

import time
import re

# Avoid the name "input", which would shadow the built-in function.
text = "max"

if "x" in text:
    print(1)

if re.search("x", text):
    print(2)

print(time.time())

# Version 1: in.
c = 0
i = 0
while i < 1000000:
    if "x" in text:
        c += 1
    i += 1

print(time.time())

# Version 2: re.search.
i = 0
while i < 1000000:
    if re.search("x", text):
        c += 1
    i += 1

print(time.time())

Output

1
2
1381081435.177
1381081435.615 [in        = 0.438 s]
1381081437.224 [re.search = 1.609 s]
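
Sketch: The timeit module is a common alternative to printing time.time() for this kind of measurement. The example below assumes the same input string. The loop count is reduced, and a precompiled pattern is added as a third variant that the original test did not include. No timings are shown because they depend on the machine.

```python
import re
import timeit

text = "max"
compiled = re.compile("x")

# Each call runs the callable `number` times and returns total seconds.
n = 100000
t_in = timeit.timeit(lambda: "x" in text, number=n)
t_search = timeit.timeit(lambda: re.search("x", text), number=n)
t_compiled = timeit.timeit(lambda: compiled.search(text), number=n)

print("in:             ", t_in)
print("re.search:      ", t_search)
print("compiled.search:", t_compiled)
```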

Re.match performance. In another test I rewrote a method that uses re.match to use if-statements and a for-loop. It became much faster.
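
Sketch: The method from that test is not shown here, so the example below is hypothetical. It checks whether a string starts with three ASCII digits, once with re.match and once with if-statements and a for-loop. Both function names are invented for illustration.

```python
import re

def starts_with_three_digits_re(s):
    # Regular expression version.
    return re.match(r"[0-9]{3}", s) is not None

def starts_with_three_digits_loop(s):
    # Hand-written version: if-statements and a for-loop.
    if len(s) < 3:
        return False
    for ch in s[:3]:
        if not ("0" <= ch <= "9"):
            return False
    return True

# Both versions should agree on every sample.
samples = ["123abc", "12", "a123", "9876", ""]
for s in samples:
    print(s, starts_with_three_digits_re(s), starts_with_three_digits_loop(s))
```

Whether the loop version is actually faster depends on the pattern and the input, so it is worth measuring before rewriting.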


Word count. We implement a simple word-counting routine. We use re.findall and count non-whitespace sequences in a string. And then we return the length of the resulting list.


Tip: Implementing small methods, like word counting ones, will help us learn to use Python in a versatile way.
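
Sketch: The word-counting routine described above can be written in a few lines. This is a minimal version based on that description. The function name word_count and the sample sentence are assumptions.

```python
import re

def word_count(value):
    # Find all runs of non-whitespace characters.
    words = re.findall(r"\S+", value)
    # The number of runs is the word count.
    return len(words)

print(word_count("To be or not to be"))
```

Punctuation attached to a word counts as part of that word, so "be," and "be" are each one sequence.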


A summary. A regular expression is often hard to write correctly. But when finished, it is usually shorter than the equivalent hand-written logic and can be simpler to maintain. It describes a specific type of logic.


Text processing. The re module handles only text processing, in a concise way. We can search and match strings based on patterns. Performance suffers when regular expressions are used excessively.