{"id":5745,"date":"2019-07-18T05:59:58","date_gmt":"2019-07-18T05:59:58","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/more-rules-to-the-regular-expression-library\/"},"modified":"2023-06-26T10:04:51","modified_gmt":"2023-06-26T10:04:51","slug":"more-rules-to-the-regular-expression-library","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/more-rules-to-the-regular-expression-library\/","title":{"rendered":"More Rules about the Regular Expression Library"},"content":{"rendered":"<p>There is more to write about the usage of regular expressions than I wrote in my last post <a href=\"https:\/\/www.modernescpp.com\/index.php\/regular-expressions\">The Regular Expression Library<\/a>. Let&#8217;s continue.<\/p>\n<p><!--more--><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5741\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/antique-hand-knowledge-207681.jpg\" alt=\"antique hand knowledge 207681\" width=\"600\" height=\"542\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/antique-hand-knowledge-207681.jpg 1172w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/antique-hand-knowledge-207681-300x271.jpg 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/antique-hand-knowledge-207681-1024x924.jpg 1024w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/antique-hand-knowledge-207681-768x693.jpg 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<h2>&nbsp;<\/h2>\n<h2>The text determines the regular expression, the result, and the capture groups<\/h2>\n<p>First of all, the type of text determines the character type of the regular expression, the type of the search result, and the type of the capture group. Of course, my argument also holds if other parts of the regex machinery are applied to text. 
Okay, that sounds worse than it is. A capture group is a subexpression of your search result that you define with parentheses. I already wrote about it in my last post&nbsp;<a href=\"https:\/\/www.modernescpp.com\/index.php\/regular-expressions\">The Regular Expression Library<\/a>.<\/p>\n<p>The table lists all the types depending on the text type.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5742\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/CCoreGuidelinesMoreRulesToRegexNew.jpg\" alt=\"CCoreGuidelinesMoreRulesToRegexNew\" width=\"550\" height=\"113\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/CCoreGuidelinesMoreRulesToRegexNew.jpg 1003w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/CCoreGuidelinesMoreRulesToRegexNew-300x62.jpg 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/CCoreGuidelinesMoreRulesToRegexNew-768x159.jpg 768w\" sizes=\"auto, (max-width: 550px) 100vw, 550px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Here is an example of all the variations of <span style=\"font-family: courier new, courier;\">std::regex_search<\/span> depending on the text type.<\/p>\n<div style=\"background: #f0f3f3; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #0099ff; font-style: italic;\">\/\/ search.cpp<\/span>\r\n\r\n<span style=\"color: #009999;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;regex&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;string&gt;<\/span>\r\n\r\n<span style=\"color: #007788; font-weight: bold;\">int<\/span> <span style=\"color: #cc00ff;\">main<\/span>(){\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: 
#555555;\">::<\/span>endl;\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ regular expression for time<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>regex crgx(<span style=\"color: #cc3300;\">\"([01]?[0-9]|2[0-3]):[0-5][0-9]\"<\/span>);\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ const char*<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"const char*\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  std<span style=\"color: #555555;\">::<\/span>cmatch cmatch;\r\n\r\n  <span style=\"color: #006699; font-weight: bold;\">const<\/span> <span style=\"color: #007788; font-weight: bold;\">char<\/span><span style=\"color: #555555;\">*<\/span> ctime{<span style=\"color: #cc3300;\">\"Now it is 23:10.\"<\/span>};\r\n\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (std<span style=\"color: #555555;\">::<\/span>regex_search(ctime, cmatch, crgx)){\r\n\r\n     std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> ctime <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n     std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Time: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> cmatch[<span style=\"color: #ff6600;\">0<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n   }\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ std::string<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: 
#555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"std::string\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  std<span style=\"color: #555555;\">::<\/span>smatch smatch;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>string stime{<span style=\"color: #cc3300;\">\"Now it is 23:25.\"<\/span>};\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (std<span style=\"color: #555555;\">::<\/span>regex_search(stime, smatch, crgx)){\r\n\r\n    std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> stime <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n    std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Time: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> smatch[<span style=\"color: #ff6600;\">0<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  }\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ regular expression holder for time<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>wregex wrgx(<span style=\"color: #cc3300;\">L\"([01]?[0-9]|2[0-3]):[0-5][0-9]\"<\/span>);\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ const wchar_t*<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"const wchar_t* \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  std<span style=\"color: #555555;\">::<\/span>wcmatch wcmatch;\r\n\r\n  <span style=\"color: #006699; font-weight: 
bold;\">const<\/span> <span style=\"color: #007788; font-weight: bold;\">wchar_t<\/span><span style=\"color: #555555;\">*<\/span> wctime{<span style=\"color: #cc3300;\">L\"Now it is 23:47.\"<\/span>};\r\n\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (std<span style=\"color: #555555;\">::<\/span>regex_search(wctime, wcmatch, wrgx)){\r\n\r\n       std<span style=\"color: #555555;\">::<\/span>wcout <span style=\"color: #555555;\">&lt;&lt;<\/span> wctime <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n       std<span style=\"color: #555555;\">::<\/span>wcout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Time: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> wcmatch[<span style=\"color: #ff6600;\">0<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  }\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ std::wstring<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"std::wstring\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  std<span style=\"color: #555555;\">::<\/span>wsmatch wsmatch;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>wstring  wstime{<span style=\"color: #cc3300;\">L\"Now it is 00:03.\"<\/span>};\r\n\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (std<span style=\"color: #555555;\">::<\/span>regex_search(wstime, wsmatch, wrgx)){\r\n\r\n    std<span style=\"color: #555555;\">::<\/span>wcout <span style=\"color: #555555;\">&lt;&lt;<\/span> wstime <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span 
style=\"color: #555555;\">::<\/span>endl;\r\n    std<span style=\"color: #555555;\">::<\/span>wcout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Time: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> wsmatch[<span style=\"color: #ff6600;\">0<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  }\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n}\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>I used, in turn, a <span style=\"font-family: courier new, courier;\">const char*<\/span>, a <span style=\"font-family: courier new, courier;\">std::string<\/span>, a <span style=\"font-family: courier new, courier;\">const wchar_t*<\/span>, and finally, a <span style=\"font-family: courier new, courier;\">std::wstring<\/span> as text. Because the code is almost the same in all four variations, I will only refer to <span style=\"font-family: courier new, courier;\">std::string<\/span> for the rest of this post.<\/p>\n<p>The text contains a substring that stands for a time expression. Thanks to the regular expression <span style=\"color: #000000;\">&#8220;<span style=\"font-family: courier new, courier;\">([01]?[0-9]|2[0-3]):[0-5][0-9]<\/span>&#8221;, I can search for it. The regular expression defines a time format consisting of an hour and a minute, separated by a colon. 
Here are the hour and minute parts:<br \/><\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\">hour: <span style=\"font-family: courier new, courier;\">[01]?[0-9]|2[0-3]<\/span>: <br \/><\/span>\n<ul>\n<li><span style=\"font-family: courier new, courier;\">[01]?<\/span>: 0 or 1 (optional)<\/li>\n<li><span style=\"font-family: courier new, courier;\">[0-9]<\/span>: a digit from 0 to 9<\/li>\n<li><span style=\"font-family: courier new, courier;\">|<\/span>: stands for &#8220;or&#8221;<\/li>\n<li><span style=\"font-family: courier new, courier;\">2[0-3]<\/span>: 2 followed by a digit from 0 to 3<\/li>\n<\/ul>\n<\/li>\n<li>minute: <span style=\"font-family: courier new, courier;\">[0-5][0-9]<\/span>: a digit from 0 to 5 followed by a digit from 0 to 9<\/li>\n<\/ul>\n<p>Finally, here is the output of the program.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5743\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/search.PNG\" alt=\"search\" width=\"300\" height=\"326\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/search.PNG 1223w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/search-276x300.png 276w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/search-942x1024.png 942w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/search-768x835.png 768w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2>Use <span style=\"font-family: courier new, courier;\">regex_iterator<\/span> or <span style=\"font-family: courier new, courier;\">regex_token_iterator<\/span> for repeated search<\/h2>\n<p>Don&#8217;t call <span style=\"font-family: courier new, courier;\">std::regex_search<\/span> repeatedly because you can quickly lose word boundaries or 
have empty hits. Use <span style=\"font-family: courier new, courier;\">std::regex_iterator<\/span> or <span style=\"font-family: courier new, courier;\">std::regex_token_iterator<\/span> for repeated search instead. <span style=\"font-family: courier new, courier;\">std::regex_token_iterator<\/span> allows you to address the individual capture groups of each match or the text between the matches.<\/p>\n<p>The &#8220;Hello World&#8221; of repeated search with regex is to count how often each word appears in a text. Here is the corresponding program.<\/p>\n<div style=\"background: #f0f3f3; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #0099ff; font-style: italic;\">\/\/ wordCount.cpp<\/span>\r\n\r\n<span style=\"color: #009999;\">#include &lt;algorithm&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;cstdlib&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;fstream&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;iterator&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;regex&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;sstream&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;string&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;map&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;unordered_map&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;utility&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;vector&gt;<\/span>\r\n\r\n<span style=\"color: #006699; font-weight: bold;\">using<\/span> str2Int <span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>unordered_map<span style=\"color: #555555;\">&lt;<\/span>std<span style=\"color: #555555;\">::<\/span>string, std<span style=\"color: #555555;\">::<\/span><span style=\"color: #007788; font-weight: bold;\">size_t<\/span><span style=\"color: #555555;\">&gt;<\/span>;          <span style=\"color: #0099ff; font-style: italic;\">\/\/ (1)<\/span>\r\n<span style=\"color: #006699; font-weight: bold;\">using<\/span> intAndWords <span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>pair<span style=\"color: 
#555555;\">&lt;<\/span>std<span style=\"color: #555555;\">::<\/span><span style=\"color: #007788; font-weight: bold;\">size_t<\/span>, std<span style=\"color: #555555;\">::<\/span>vector<span style=\"color: #555555;\">&lt;<\/span>std<span style=\"color: #555555;\">::<\/span>string<span style=\"color: #555555;\">&gt;&gt;<\/span>;\r\n<span style=\"color: #006699; font-weight: bold;\">using<\/span> int2Words<span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>map<span style=\"color: #555555;\">&lt;<\/span>std<span style=\"color: #555555;\">::<\/span><span style=\"color: #007788; font-weight: bold;\">size_t<\/span>,std<span style=\"color: #555555;\">::<\/span>vector<span style=\"color: #555555;\">&lt;<\/span>std<span style=\"color: #555555;\">::<\/span>string<span style=\"color: #555555;\">&gt;&gt;<\/span>; \r\n\r\n\r\n<span style=\"color: #0099ff; font-style: italic;\">\/\/ count the frequency of each word<\/span>\r\nstr2Int <span style=\"color: #cc00ff;\">wordCount<\/span>(<span style=\"color: #006699; font-weight: bold;\">const<\/span> std<span style=\"color: #555555;\">::<\/span>string <span style=\"color: #555555;\">&amp;<\/span>text) {\r\n  std<span style=\"color: #555555;\">::<\/span>regex wordReg(R<span style=\"color: #cc3300;\">\"(\\w+)\"<\/span>);                                        <span style=\"color: #0099ff; font-style: italic;\">\/\/ (2)<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>sregex_iterator wordItBegin(text.begin(), text.end(), wordReg); <span style=\"color: #0099ff; font-style: italic;\">\/\/ (3)<\/span>\r\n  <span style=\"color: #006699; font-weight: bold;\">const<\/span> std<span style=\"color: #555555;\">::<\/span>sregex_iterator wordItEnd;\r\n  str2Int allWords;\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (; wordItBegin <span style=\"color: #555555;\">!=<\/span> wordItEnd; <span style=\"color: #555555;\">++<\/span>wordItBegin) {\r\n    <span style=\"color: 
#555555;\">++<\/span>allWords[wordItBegin<span style=\"color: #555555;\">-&gt;<\/span>str()];\r\n  }\r\n  <span style=\"color: #006699; font-weight: bold;\">return<\/span> allWords;\r\n}\r\n\r\n<span style=\"color: #0099ff; font-style: italic;\">\/\/ get to each frequency the words<\/span>\r\nint2Words <span style=\"color: #cc00ff;\">frequencyOfWords<\/span>(str2Int <span style=\"color: #555555;\">&amp;<\/span>wordCount) {\r\n  int2Words freq2Words;\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> wordIt <span style=\"color: #555555;\">:<\/span> wordCount) {\r\n    <span style=\"color: #006699; font-weight: bold;\">auto<\/span> freqWord <span style=\"color: #555555;\">=<\/span> wordIt.second;\r\n    <span style=\"color: #006699; font-weight: bold;\">if<\/span> (freq2Words.find(freqWord) <span style=\"color: #555555;\">==<\/span> freq2Words.end()) {\r\n      freq2Words.insert(intAndWords(freqWord, {wordIt.first}));\r\n    } <span style=\"color: #006699; font-weight: bold;\">else<\/span> {\r\n      freq2Words[freqWord].push_back(wordIt.first);\r\n    }\r\n  }\r\n  <span style=\"color: #006699; font-weight: bold;\">return<\/span> freq2Words;\r\n}\r\n\r\n<span style=\"color: #007788; font-weight: bold;\">int<\/span> <span style=\"color: #cc00ff;\">main<\/span>(<span style=\"color: #007788; font-weight: bold;\">int<\/span> argc, <span style=\"color: #007788; font-weight: bold;\">char<\/span> <span style=\"color: #555555;\">*<\/span>argv[]) {\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ get the filename<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>string myFile;\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (argc <span style=\"color: #555555;\">==<\/span> <span style=\"color: 
#ff6600;\">2<\/span>) {\r\n    myFile <span style=\"color: #555555;\">=<\/span> {argv[<span style=\"color: #ff6600;\">1<\/span>]};\r\n  } <span style=\"color: #006699; font-weight: bold;\">else<\/span> {\r\n    std<span style=\"color: #555555;\">::<\/span>cerr <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Filename missing !\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n    exit(EXIT_FAILURE);\r\n  }\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ open the file<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>ifstream file(myFile, std<span style=\"color: #555555;\">::<\/span>ios<span style=\"color: #555555;\">::<\/span>in);\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (<span style=\"color: #555555;\">!<\/span>file) {\r\n    std<span style=\"color: #555555;\">::<\/span>cerr <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Can't open file \"<\/span> <span style=\"color: #555555;\">+<\/span> myFile <span style=\"color: #555555;\">+<\/span> <span style=\"color: #cc3300;\">\"!\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n    exit(EXIT_FAILURE);\r\n  }\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ read the file<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>stringstream buffer;\r\n  buffer <span style=\"color: #555555;\">&lt;&lt;<\/span> file.rdbuf();\r\n  std<span style=\"color: #555555;\">::<\/span>string text(buffer.str());\r\n\r\n  <span style=\"color: #0099ff; font-style: italic;\">\/\/ get the frequency of each word<\/span>\r\n  <span style=\"color: #006699; font-weight: bold;\">auto<\/span> allWords <span style=\"color: #555555;\">=<\/span> wordCount(text);                                     \r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: 
#555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"The first 20 (key, value)-pairs: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  <span style=\"color: #006699; font-weight: bold;\">auto<\/span> end <span style=\"color: #555555;\">=<\/span> allWords.begin();\r\n  std<span style=\"color: #555555;\">::<\/span>advance(end, <span style=\"color: #ff6600;\">20<\/span>);\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> pair <span style=\"color: #555555;\">=<\/span> allWords.begin(); pair <span style=\"color: #555555;\">!=<\/span> end; <span style=\"color: #555555;\">++<\/span>pair) {            <span style=\"color: #0099ff; font-style: italic;\">\/\/ (4)<\/span>\r\n    std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"(\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> pair<span style=\"color: #555555;\">-&gt;<\/span>first <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\": \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> pair<span style=\"color: #555555;\">-&gt;<\/span>second <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\")\"<\/span>;\r\n  }\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"allWords[Web]: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> allWords[<span style=\"color: #cc3300;\">\"Web\"<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: 
#555555;\">::<\/span>endl;      <span style=\"color: #0099ff; font-style: italic;\">\/\/ (5)<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"allWords[The]: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> allWords[<span style=\"color: #cc3300;\">\"The\"<\/span>] <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Number of unique words: \"<\/span>;\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> allWords.size() <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;                              <span style=\"color: #0099ff; font-style: italic;\">\/\/ (6)<\/span>\r\n\r\n  <span style=\"color: #007788; font-weight: bold;\">size_t<\/span> sumWords <span style=\"color: #555555;\">=<\/span> <span style=\"color: #ff6600;\">0<\/span>;\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> wordIt <span style=\"color: #555555;\">:<\/span> allWords)\r\n    sumWords <span style=\"color: #555555;\">+=<\/span> wordIt.second;\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Total number of words: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> sumWords <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span 
style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n  <span style=\"color: #006699; font-weight: bold;\">auto<\/span> allFreq <span style=\"color: #555555;\">=<\/span> frequencyOfWords(allWords);                           \r\n\r\n                                                                       <span style=\"color: #0099ff; font-style: italic;\">\/\/ (7)<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"Number of different frequencies: \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> allFreq.size() <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"All frequencies: \"<\/span>;                                    <span style=\"color: #0099ff; font-style: italic;\">\/\/ (8)<\/span>\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> freqIt <span style=\"color: #555555;\">:<\/span> allFreq)\r\n    std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> freqIt.first <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" \"<\/span>;\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"The most frequently used word(s): \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span 
style=\"color: #555555;\">::<\/span>endl;      <span style=\"color: #0099ff; font-style: italic;\">\/\/ (9)<\/span>\r\n  <span style=\"color: #006699; font-weight: bold;\">auto<\/span> atTheEnd <span style=\"color: #555555;\">=<\/span> allFreq.rbegin();\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> atTheEnd<span style=\"color: #555555;\">-&gt;<\/span>first <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" :\"<\/span>;\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> word <span style=\"color: #555555;\">:<\/span> atTheEnd<span style=\"color: #555555;\">-&gt;<\/span>second)\r\n    std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> word <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" \"<\/span>;\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"<\/span><span style=\"color: #cc3300; font-weight: bold;\">\\n\\n<\/span><span style=\"color: #cc3300;\">\"<\/span>;\r\n\r\n                                                                       <span style=\"color: #0099ff; font-style: italic;\">\/\/ (10)<\/span>\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"All words which appear more than 1000 times:\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n  <span style=\"color: #006699; font-weight: bold;\">auto<\/span> biggerIt <span style=\"color: #555555;\">=<\/span>\r\n      std<span style=\"color: #555555;\">::<\/span>find_if(allFreq.begin(), allFreq.end(),\r\n                   [](intAndWords iAndW) { <span style=\"color: #006699; font-weight: bold;\">return<\/span> iAndW.first 
<span style=\"color: #555555;\">&gt;<\/span> <span style=\"color: #ff6600;\">1000<\/span>; });\r\n  <span style=\"color: #006699; font-weight: bold;\">if<\/span> (biggerIt <span style=\"color: #555555;\">==<\/span> allFreq.end()) {\r\n    std<span style=\"color: #555555;\">::<\/span>cerr <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\"No word appears more than 1000 times !\"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n    exit(EXIT_FAILURE);\r\n  } <span style=\"color: #006699; font-weight: bold;\">else<\/span> {\r\n    <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> allFreqIt <span style=\"color: #555555;\">=<\/span> biggerIt; allFreqIt <span style=\"color: #555555;\">!=<\/span> allFreq.end(); <span style=\"color: #555555;\">++<\/span>allFreqIt) {\r\n      std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> allFreqIt<span style=\"color: #555555;\">-&gt;<\/span>first <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" :\"<\/span>;\r\n      <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #006699; font-weight: bold;\">auto<\/span> word <span style=\"color: #555555;\">:<\/span> allFreqIt<span style=\"color: #555555;\">-&gt;<\/span>second)\r\n        std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> word <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" \"<\/span>;\r\n      std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n    }\r\n  }\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: 
#555555;\">::<\/span>endl;\r\n}\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>To make the program easier to follow, I marked the relevant lines with numbered comments.<\/p>\n<p>The <span style=\"font-family: courier new, courier;\">using<\/span> declarations starting in line 1 save me some typing. The function <span style=\"font-family: courier new, courier;\">wordCount<\/span> determines the frequency of each word, and the function <span style=\"font-family: courier new, courier;\">frequencyOfWords<\/span> returns, for each frequency, all words that occur with that frequency. What is a word? Line 2 defines it with the regular expression <span style=\"font-family: courier new, courier;\">\\w+<\/span> (one or more word characters), and line 3 uses it in a <span style=\"font-family: courier new, courier;\">std::sregex_iterator<\/span>. Let&#8217;s see which questions I can answer with the two functions.<\/p>\n<ul>\n<li>Line 4: first 20 (key, value)-pairs<\/li>\n<li>Line 5: frequency of the words &#8220;Web&#8221; and &#8220;The&#8221;<\/li>\n<li>Line 6: number of unique words<\/li>\n<li>Line 7: number of different frequencies<\/li>\n<li>Line 8: all appearing frequencies<\/li>\n<li>Line 9: the most frequently used word(s)<\/li>\n<li>Line 10: words that appear more than 1000 times<\/li>\n<\/ul>\n<p>Now, I need a lengthy text. 
Of course, I will use Grimm&#8217;s fairy tales from <a href=\"https:\/\/www.gutenberg.org\/\">Project Gutenberg<\/a>. Here is the output:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5744\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/wordCount.png\" alt=\"wordCount\" width=\"500\" height=\"684\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/wordCount.png 741w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2019\/07\/wordCount-219x300.png 219w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><\/p>\n<h2>What&#8217;s next?<\/h2>\n<p>I&#8217;m almost done with the regex functionality in C++, but I have one more guideline in mind that often makes repeated search easier: search not for the text patterns but for the delimiters of the text patterns. I call this a negative search.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There is more to write about the usage of regular expressions than I wrote in my last post The Regular Expression Library. 
Let&#8217;s continue.<\/p>\n","protected":false},"author":21,"featured_media":5741,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[372],"tags":[469],"class_list":["post-5745","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-modern-c","tag-regular-expressions"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5745","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=5745"}],"version-history":[{"count":1,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5745\/revisions"}],"predecessor-version":[{"id":6779,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5745\/revisions\/6779"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/5741"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=5745"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=5745"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=5745"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}