The Regular Expression Library


My original plan was to write about the rules of the C++ Core Guidelines for the regex and chrono libraries, but apart from the subsection title, there is no content available. I have already written a few posts about time functionality, so that part is done. Today, I fill the gap and write about the regex library.

 


 

Okay, here are my rules for regular expressions.

Only use a Regular Expression if you have to

Regular expressions are a powerful, but sometimes expensive and complicated, machinery for working with text. When the interface of std::string or the algorithms of the Standard Template Library can do the job, use them.

Okay, but when should you use regular expressions? Here are the typical use-cases.

Use Cases for Regular Expressions

  • Check if a text matches a text pattern: std::regex_match
  • Search for a text pattern in a text: std::regex_search
  • Replace a text pattern with a text: std::regex_replace
  • Iterate through all text patterns in a text: std::regex_iterator and std::regex_token_iterator

I hope you noticed it. The operations work on text patterns and not on text.

First, you should use raw strings to write your regular expression.

Use Raw Strings for Regular Expressions

First of all, for the sake of simplicity, I will break the previous rule.

The regular expression for the text C++ is quite ugly: C\\+\\+. You have to use two backslashes for each + sign. First, the + sign is a special character in a regular expression. Second, the backslash is a special character in a string literal. Therefore, one backslash escapes the + sign, and the other backslash escapes the backslash.
By using a raw string literal, the second backslash is no longer necessary, because the backslash is not interpreted in a raw string.

The following short example may not convince you.

std::string regExpr("C\\+\\+");
std::string regExprRaw(R"(C\+\+)");

 

Both strings stand for a regular expression that matches the text C++. In particular, the raw string R"(C\+\+)" is quite ugly to read. R"(Raw String)" delimits the raw string. By the way, regular expressions and path names on Windows such as "C:\temp\newFile.txt" are typical use cases for raw strings.

Imagine you want to search for a floating-point number in a text, which you identify by the following sequence of characters: Tabulator FloatingPointNumber Tabulator \\DELIMITER. Here is a concrete example of this pattern: "\t5.5\t\\DELIMITER".

The following program uses a regular expression encoded in a string and in a raw string to match this pattern.

// regexSearchFloatingPoint.cpp

#include <regex>
#include <iostream>
#include <string>

int main(){

    std::cout << std::endl;

    std::string text = "A text with floating pointer number \t5.5\t\\DELIMITER and more text.";
    std::cout << text << std::endl;
    
    std::cout << std::endl;

    std::regex rgx("\\t[0-9]+\\.[0-9]+\\t\\\\DELIMITER");          // (1) 
    std::regex rgxRaw(R"(\t[0-9]+\.[0-9]+\t\\DELIMITER)");         // (2) 

    if (std::regex_search(text, rgx)) std::cout << "found with rgx" << std::endl;
    if (std::regex_search(text, rgxRaw)) std::cout << "found with rgxRaw" << std::endl;

    std::cout << std::endl;

}

The regular expression rgx("\\t[0-9]+\\.[0-9]+\\t\\\\DELIMITER") is pretty ugly. To get n backslashes into the regular expression (line 1), you have to write 2 * n backslashes in the string literal. In contrast, a raw string lets you write the pattern you are looking for directly in the regular expression: rgxRaw(R"(\t[0-9]+\.[0-9]+\t\\DELIMITER)") (line 2). The subexpression [0-9]+\.[0-9]+ of the regular expression stands for a floating-point number: at least one digit [0-9]+, followed by a dot \., followed by at least one digit [0-9]+.

Just for completeness, here is the output of the program.

[Screenshot: output of regexSearchFloatingPoint.cpp]

Honestly, this example was rather simple. Most of the time, you want to analyse your match result.

For Further Analysis Use Your match_results

Using a regular expression typically consists of three steps. This holds for both std::regex_search and std::regex_match.

  1. Define the regular expression.
  2. Store the result of the search.
  3. Analyse the result.

Let's see what that means. This time, I want to find the first e-mail address in a text. The following regular expression (RFC 5322 Official Standard) for an e-mail address does not find all e-mail addresses, because they are very irregular.

 	
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

 

For readability, I made a line break in the regular expression. The first line matches the local part and the second line the domain part of the e-mail address. My program uses a simpler regular expression for matching an e-mail address. It's not perfect, but it will do its job. Additionally, I want to capture the local part and the domain part of the e-mail address separately.

Here we are:

// regexSearchEmail.cpp

#include <regex>
#include <iostream>
#include <string>

int main(){

  std::cout << std::endl;

  // Placeholder address: the original one was replaced by the page's spam protection.
  std::string emailText = "A text with an email address: user@example.com.";

  // (1) 
  std::string regExprStr(R"(([\w.%+-]+)@([\w.-]+\.[a-zA-Z]{2,4}))");
  std::regex rgx(regExprStr);

  // (2)
  std::smatch smatch;

  if (std::regex_search(emailText, smatch, rgx)){
      
    // (3)  

    std::cout << "Text: " << emailText << std::endl;
    std::cout << std::endl;
    std::cout << "Before the email address: " << smatch.prefix() << std::endl;
    std::cout << "After the email address: " << smatch.suffix() << std::endl;
    std::cout << std::endl;
    std::cout << "Length of email address: " << smatch.length() << std::endl;
    std::cout << std::endl;
    std::cout << "Email address: " << smatch[0] << std::endl;          // (6)
    std::cout << "Local part: " << smatch[1] << std::endl;             // (4)
    std::cout << "Domain name: " << smatch[2] << std::endl;            // (5)

  }

  std::cout << std::endl;

}

 

Lines 1, 2, and 3 mark the beginning of the three typical steps of using a regular expression. The regular expression in line 1 needs a few additional words.

Here it is: ([\w.%+-]+)@([\w.-]+\.[a-zA-Z]{2,4})

  • [\w.%+-]+: At least one of the following characters: "\w", ".", "%", "+", or "-". "\w" stands for a word character (a letter, a digit, or an underscore).
  • [\w.-]+\.[a-zA-Z]{2,4}: At least one of "\w", ".", or "-", followed by a dot ".", followed by 2 to 4 characters from the range a-z or the range A-Z.
  • (...)@(...): The parentheses stand for a capture group. They allow you to identify a submatch in a match. The first capture group (line 4) is the local part of the e-mail address. The second capture group (line 5) is the domain part of the e-mail address. You can address the entire match with the 0-th capture group (line 6).

 

The output of the program shows the detailed analysis.

[Screenshot: output of regexSearchEmail.cpp]

What's next?

I'm not done. There is more to write about regular expressions in my next post. I will write about various types of text and about iterating through all matches.

 

 
