{"id":5052,"date":"2016-11-29T21:23:12","date_gmt":"2016-11-29T21:23:12","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/buckets-capacity-and-load-factor\/"},"modified":"2023-06-26T12:34:42","modified_gmt":"2023-06-26T12:34:42","slug":"buckets-capacity-and-load-factor","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/buckets-capacity-and-load-factor\/","title":{"rendered":"Buckets, Capacity, and Load Factor"},"content":{"rendered":"<p>The hash function maps a potentially infinite number of keys on a finite number of buckets. What is the strategy of the C++ runtime and how can you tailor it to your needs, that is what this article is all about.<\/p>\n<p><!--more--><\/p>\n<p>&nbsp;<\/p>\n<p>In order not to lose the big picture I want to explicitly stress the point. This article is about<span style=\"font-family: courier new,courier;\"> std::unordered_set<\/span> but observations hold also for the other three unordered associative containers in C++11. If you want to have a closer look, you should read my previous post <a href=\"https:\/\/www.modernescpp.com\/index.php\/hash-tables\">Hash tables.<\/a><\/p>\n<\/p>\n<h2>Rehashing<\/h2>\n<p>The hash function decides to which bucket the key goes. Because the hash function reduces a potentially infinite number of keys on a finite number of buckets, various keys may go to the same bucket. This event is called a collision. The keys in each bucket are typically stored as a singly linked list. With this knowledge in mind, you can quite simply reason how fast the access time on a key is in an unordered associative container. <strong>The application of the hash function is a constant operation; the search of the key in the singly linked list is a linear operation.<\/strong> Therefore, the goal of the hash function is to produce few collisions.<\/p>\n<p>The number of buckets is called the capacity, and the average number of elements of each bucket is the load factor. The C++ runtime creates, by default, new buckets if the load factor goes beyond 1. This process is called rehashing and you can explicitly trigger it by setting the capacity of the unordered associative container to a higher value. Once more, the few new terms are on the spot.<\/p>\n<ul>\n<li><strong>Capacity:<\/strong> Number of buckets<\/li>\n<li><strong><span style=\"color: #000000;\">Load factor:<\/span><\/strong> Average number of elements (keys) per bucket<\/li>\n<li><strong>Rehashing<\/strong>: Creation of new buckets<\/li>\n<\/ul>\n<p>You can read and adjust these characteristics.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50\r\n51\r\n52\r\n53\r\n54\r\n55\r\n56\r\n57\r\n58\r\n59\r\n60\r\n61\r\n62\r\n63\r\n64\r\n65<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\">\/\/ rehash.cpp\r\n\r\n<span style=\"color: #008000;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #008000;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #008000;\">#include &lt;unordered_set&gt;<\/span>\r\n\r\nvoid getInfo(const std::unordered_set&lt;int&gt;&amp; hash){\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.bucket_count(): \"<\/span> &lt;&lt; hash.bucket_count() &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.load_factor(): \"<\/span> &lt;&lt; hash.load_factor() &lt;&lt; std::endl;\r\n}\r\n\r\nvoid fillHash(std::unordered_set&lt;int&gt;&amp; h,int n){\r\n  std::random_device seed;\r\n  \/\/ default generator\r\n  std::mt19937 engine(seed());\r\n  \/\/ get random numbers 0 - 1000\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(0,1000);\r\n\r\n  <span style=\"color: #0000ff;\">for<\/span> ( int i=1; i&lt;= n; ++i){\r\n    h.insert(uniformDist(engine));\r\n  }\r\n}\r\n\r\nint main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::unordered_set&lt;int&gt; hash;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.max_load_factor(): \"<\/span> &lt;&lt; hash.max_load_factor() &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  getInfo(hash);\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  \/\/ only to be sure\r\n  hash.insert(500);\r\n  \/\/ get the bucket of 500\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.bucket(500): \"<\/span> &lt;&lt; hash.bucket(500) &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  \/\/ add 100 elements\r\n  fillHash(hash,100);\r\n  getInfo(hash);\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  \/\/ at least 500 buckets\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.rehash(500): \"<\/span> &lt;&lt; std::endl;\r\n  hash.rehash(500);\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  getInfo(hash);\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  \/\/ get the bucket of 500\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"hash.bucket(500): \"<\/span> &lt;&lt; hash.bucket(500) &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>At first to the helper functions <span style=\"font-family: courier new,courier;\">getInfo<\/span> (lines 7 &#8211; 10) and <span style=\"font-family: courier new,courier;\">fillHash<\/span> (lines 12 &#8211; 22). The function <span style=\"font-family: courier new,courier;\">getInfo<\/span> returns for the <span style=\"font-family: courier new,courier;\">std::unordered_set<\/span> the number of buckets and the load factor. The function <span style=\"font-family: courier new,courier;\">fillHash<\/span> fills the unordered associative container with n arbitrarily created integers.<\/p>\n<p>It is interesting to compare the execution of the program on Linux and Windows. I used on Linux the GCC; I used on Windows the cl.exe compiler. I translated the program without optimization.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5050\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashLinux.png\" alt=\"rehashLinux\" width=\"469\" height=\"404\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashLinux.png 469w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashLinux-300x258.png 300w\" sizes=\"auto, (max-width: 469px) 100vw, 469px\" \/><\/p>\n<p>At first, I&#8217;m interested in the empty container&#8217;s maximum load factor (line 29). If the <span style=\"font-family: courier new,courier;\">std::unordered_set<\/span> exceeds the maximum load factor, a rehashing will take place. The maximum load factor is 1. The GCC initially starts with 11 buckets, Windows starts with 8 buckets. Of course, the load factor is 0. Putting the key 500 (line 38) into the hash table will go to bucket five on Linux and to bucket six on Windows. In line 45, I added 100 keys to the hash table. Afterward, Linux has 97, and Windows has 512 buckets. Linux has less than 100 buckets because I got a few identical keys. Linux now has a load factor near 1, but Windows is about 0.2. Therefore, I can put a lot more elements into the Windows hash table without the need for rehashing. I trigger the rehashing on Linux by requesting at least 500 buckets (line 51). Here you see the difference. On Linux, I get new buckets, and the keys must be newly distributed. But this is not happening on Windows because 512 is bigger than 500. In the end, the key 500 (line 61) is in a different bucket.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5051\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashWindows.PNG\" alt=\"rehashWindows\" width=\"668\" height=\"494\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashWindows.PNG 668w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/11\/rehashWindows-300x222.png 300w\" sizes=\"auto, (max-width: 668px) 100vw, 668px\" \/><\/p>\n<p>It isn&#8217;t easy to conclude the observed behavior on Linux and Windows. In the unoptimized case, it seems that Linux optimizes memory consumption and Windows for performance. But I want to give one piece of advice for using unordered associative containers.<\/p>\n<h3>My rule of thumb<\/h3>\n<p>If you know how large your&nbsp;unordered associative container will become, start with a <em>reasonable<\/em> number of buckets. Therefore, you can spare a lot of expensive rehashings because each rehashing includes memory allocation and the new distribution of all keys. The question is, what is <em>reasonable?<\/em> My rule of thumb is to start with a bucket size similar to the number of your keys. Therefore, your load factor is close to 1.<\/p>\n<p>I&#8217;m looking forward to a discussion about my rule of thumb. If possible, I will add them to this post.<\/p>\n<h2>What&#8217;s next?<\/h2>\n<p>POD&#8217;s stands for <strong>P<\/strong>lain <strong>O<\/strong>ld <strong>D<\/strong>ata. These are data types that have a C standard layout. Therefore, you can directly manipulate them with the C functions <span style=\"font-family: courier new,courier;\">memcpy,<\/span> <span style=\"font-family: courier new,courier;\">memmove, memcmp,<\/span> or <span style=\"font-family: courier new,courier;\">memset<\/span>. With C++11 even instances of classes can be PODs. I will write in the <a href=\"https:\/\/www.modernescpp.com\/index.php\/generalized-plain-old-data\">next post<\/a><span id=\"transmark\"><\/span>, about which requirements must hold for the class.<a href=\"https:\/\/www.modernescpp.com\/index.php\/blog\/plain-old-data\"><br \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The hash function maps a potentially infinite number of keys on a finite number of buckets. What is the strategy of the C++ runtime and how can you tailor it to your needs, that is what this article is all about.<\/p>\n","protected":false},"author":21,"featured_media":5050,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[364],"tags":[472],"class_list":["post-5052","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-embedded","tag-associative-containers"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5052","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=5052"}],"version-history":[{"count":1,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5052\/revisions"}],"predecessor-version":[{"id":6920,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5052\/revisions\/6920"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/5050"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=5052"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=5052"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=5052"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}