The Regular Expression Library

Contents[Show]

My original plan was to write about the rules of the C++ Core Guidelines for the regex and chrono library, but besides the subsection title, no content is available. I already wrote a few posts about time functionality. So I'm done. Today, I fill the gap and write about the regex library.

 

concept 18290 1280

 

Okay, here are my rules for regular expressions.

 

Rainer D 6 P2 540x540Modernes C++ Mentoring

Be part of my mentoring programs:

 

 

 

 

Do you want to stay informed about my mentoring programs: Subscribe via E-Mail.

Only use a Regular Expression if you have to

Regular expressions are powerful but also sometimes expensive and complicated machinery to work with text. When the interface of a std::string or the algorithms of the Standard Template Library can do the job, use them.  

Okay, but when should you use regular expressions? Here are the typical use cases.

Use-Case for Regular Expressions

  • Check if a text matches a text pattern: std::regex_match
  • Search for a text pattern in a text: std::regex_search
  • Replace a text pattern with a text: std::regex_replace
  • Iterate through all text patterns in a text: std::regex_iterator and std::regex_token_iterator

I hope you noticed it. The operations work on text patterns and not on text.

First, you should use raw strings to write your regular expression.

Use Raw Strings for Regular Expressions

First of all, for simplicity purposes, I will break the previous rule.

The regular expression for the C++ text is quite ugly: C\\+\\+. You have to use two backslashes for each + sign. First, the + sign is a unique character in a regular expression. Second, the backslash is a special character in a string. Therefore one backslash escapes the + sign; the other backslash escapes the backslash.
By using a raw string literal, the second backslash is not necessary anymore because the backslash is not interpreted in the string.

The following short example may not convince you.

std::string regExpr("C\\+\\+");
std::string regExprRaw(R"(C\+\+)");

 

Both strings stand for regular expression, which matches the text C++. In particular, the raw string R"(C\+\+) is quite ugly to read. R"(Raw String)" delimits the raw string. By the way, regular expressions and path names on windows "C:\temp\newFile.txt" are typical use cases for raw strings.

Imagine you want to search for a floating-point number in a text, which you identify by the following sequence of signs: Tabulator FloatingPointNumber Tabulator \\DELIMITER. Here is a concrete example for this pattern: "\t5.5\t\\DELIMITER".

The following program uses a regular expression encoded in a string and a raw string to match this pattern.

// regexSearchFloatingPoint.cpp

#include <regex>
#include <iostream>
#include <string>

int main(){

    std::cout << std::endl;

    std::string text = "A text with floating pointer number \t5.5\t\\DELIMITER and more text.";
    std::cout << text << std::endl;
    
    std::cout << std::endl;

    std::regex rgx("\\t[0-9]+\\.[0-9]+\\t\\\\DELIMITER");          // (1) 
    std::regex rgxRaw(R"(\t[0-9]+\.[0-9]+\t\\DELIMITER)");         // (2) 

    if (std::regex_search(text, rgx)) std::cout << "found with rgx" << std::endl;
    if (std::regex_search(text, rgxRaw)) std::cout << "found with rgxRaw" << std::endl;

    std::cout << std::endl;

}

 

The regular expression rgx("\\t[0-9]+\\.[0-9]+\\t\\\\DELIMITER") is pretty ugly. To find n "\"-symbols (line 1), you have to write 2 * n "\"-symbols. In contrast, using a raw string to define a regular expression makes it possible to express the pattern you are looking for directly in the regular expression: rgxRaw(R"(\t[0-9]+\.[0-9]+\t\\DELIMITER)") (line 2). The subexpression [0-9]+\.[0-9]+ of the regular expression stands for a floating point number: at least one number [0-9]+ followed by a dot \. followed by at least one number [0-9]+. 

Just for completeness, the output of the program.

regexSearchFloatingPoint

Honestly, this example was relatively simple. Most of the time, you want to analyze your match result.

For further analysis, use your match_result

Using a regular expression typically consists of three steps. This holds for std::regex_search, and std::regex_match.

  1. Define the regular expression.
  2. Store the result of the search.
  3. Analyze the result.

Let's see what that means—this time I want to find the first e-mail address in a text. The following regular expression (RFC 5322 Official Standard) for an e-mail address finds not all e-mail addresses because they are very irregular.

 	
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[az0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x2\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

 

For readability, I made a line break in the regular expression. The first line matches the local part, and the second line the domain part of the e-mail address. My program uses a more straightforward regular expression for matching an e-mail address. It's not perfect, but it will do its job. Additionally, I want to match the local part and the domain part of my e-mail address.

Here we are:

// regexSearchEmail.cpp

#include <regex>
#include <iostream>
#include <string>

int main(){

  std::cout << std::endl;

  std::string emailText = "A text with an email address: This email address is being protected from spambots. You need JavaScript enabled to view it..";

  // (1) 
  std::string regExprStr(R"(([\w.%+-]+)@([\w.-]+\.[a-zA-Z]{2,4}))");
  std::regex rgx(regExprStr);

  // (2)
  std::smatch smatch;

  if (std::regex_search(emailText, smatch, rgx)){
      
    // (3)  

    std::cout << "Text: " << emailText << std::endl;
    std::cout << std::endl;
    std::cout << "Before the email address: " << smatch.prefix() << std::endl;
    std::cout << "After the email address: " << smatch.suffix() << std::endl;
    std::cout << std::endl;
    std::cout << "Length of email adress: " << smatch.length() << std::endl;
    std::cout << std::endl;
    std::cout << "Email address: " << smatch[0] << std::endl;          // (6)
    std::cout << "Local part: " << smatch[1] << std::endl;             // (4)
    std::cout << "Domain name: " << smatch[2] << std::endl;            // (5)

  }

  std::cout << std::endl;

}

 

Lines 1, 2, and 3 begin the three typical steps of using a regular expression. The regular expression in line 2 needs a few additional words.

Here it is:([\w.%+-]+)@([\w.-]+\.[a-zA-Z]{2,4})

  • [\w.%+-]+: At least one of the following characters: "\w", ".", "%", "+", or "-". "\w" stands for a word character.
  • [\w.-]+\.[a-zA-Z]{2,4}: At least one of a "\w", ".", "-", followed by a dot ".", followed by 2 - 4 characters from the range a-z or A-Z.
  • (...)@(...): The round braces stand for a capture group. They allow you to identify a sub match in a match. The first capture (line 4) group is the local part of an address. The second capture group (line 5) is the domain part of the e-mail address. You can address the match with the 0th capture group (line 6).

 

The output of the program shows a detailed analysis.

regexSearchEmail

What's next?

I'm not done. There is more to write about regular expressions in my next post. I write about various types of text and iterate through all matches.

 

 

Thanks a lot to my Patreon Supporters: Matt Braun, Roman Postanciuc, Tobias Zindl, G Prvulovic, Reinhold Dröge, Abernitzke, Frank Grimm, Sakib, Broeserl, António Pina, Sergey Agafyin, Андрей Бурмистров, Jake, GS, Lawton Shoemake, Animus24, Jozo Leko, John Breland, Venkat Nandam, Jose Francisco, Douglas Tinkham, Kuchlong Kuchlong, Robert Blanch, Truels Wissneth, Kris Kafka, Mario Luoni, Friedrich Huber, lennonli, Pramod Tikare Muralidhara, Peter Ware, Daniel Hufschläger, Alessandro Pezzato, Bob Perry, Satish Vangipuram, Andi Ireland, Richard Ohnemus, Michael Dunsky, Leo Goodstadt, John Wiederhirn, Yacob Cohen-Arazi, Florian Tischler, Robin Furness, Michael Young, Holger Detering, Bernd Mühlhaus, Matthieu Bolt, Stephen Kelley, Kyle Dean, Tusar Palauri, Dmitry Farberov, Juan Dent, George Liao, Daniel Ceperley, Jon T Hess, Stephen Totten, Wolfgang Fütterer, Matthias Grün, Phillip Diekmann, Ben Atakora, Ann Shatoff, and Rob North.

 

Thanks, in particular, to Jon Hess, Lakshman, Christian Wittenhorst, Sherhy Pyton, Dendi Suhubdy, Sudhakar Belagurusamy, Richard Sargeant, Rusty Fleming, John Nebel, Mipko, Alicja Kaminska, and Slavko Radman.

 

 

My special thanks to Embarcadero CBUIDER STUDIO FINAL ICONS 1024 Small

 

My special thanks to PVS-Studio PVC Logo

 

My special thanks to Tipi.build tipi.build logo

 

My special thanks to Take Up code TakeUpCode 450 60

 

Seminars

I'm happy to give online seminars or face-to-face seminars worldwide. Please call me if you have any questions.

Bookable (Online)

German

Standard Seminars (English/German)

Here is a compilation of my standard seminars. These seminars are only meant to give you a first orientation.

  • C++ - The Core Language
  • C++ - The Standard Library
  • C++ - Compact
  • C++11 and C++14
  • Concurrency with Modern C++
  • Design Pattern and Architectural Pattern with C++
  • Embedded Programming with Modern C++
  • Generic Programming (Templates) with C++

New

  • Clean Code with Modern C++
  • C++20

Contact Me

Modernes C++,

RainerGrimmDunkelBlauSmall

Stay Informed about my Mentoring

 

Mentoring

English Books

Course: Modern C++ Concurrency in Practice

Course: C++ Standard Library including C++14 & C++17

Course: Embedded Programming with Modern C++

Course: Generic Programming (Templates)

Course: C++ Fundamentals for Professionals

Course: The All-in-One Guide to C++20

Course: Master Software Design Patterns and Architecture in C++

Subscribe to the newsletter (+ pdf bundle)

All tags

Blog archive

Source Code

Visitors

Today 4007

Yesterday 4371

Week 39814

Month 169939

All 12057705

Currently are 216 guests and no members online

Kubik-Rubik Joomla! Extensions

Latest comments