How can the containsKey() method of a Java Hash Table be O(1)? [duplicate]

Solutions:


The reason for hash access being O(1) is actually quite different from array access.

Arrays are contiguous memory areas. To read element 15, you only have to multiply the element size by 15 and add the start address. Both addition and multiplication are O(1) (it takes just as long to add two big numbers as two small numbers), and since computers provide O(1) memory access, the overall complexity is still O(1).
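That address calculation can be made concrete. This is only a conceptual sketch (the JVM hides the actual pointer arithmetic, and the base address and element size here are made-up values), but the cost model is exactly as described: one multiply and one add, regardless of the index.

```java
// Conceptual sketch of what array indexing costs under the hood.
// The base address and element size are hypothetical illustration values.
public class ArrayAccess {
    public static void main(String[] args) {
        long base = 0x1000;       // hypothetical start address of the array
        long elementSize = 8;     // e.g. a long occupies 8 bytes
        int index = 15;
        // O(1): one multiplication and one addition, no matter the index.
        long address = base + index * elementSize;
        System.out.println(address); // 4096 + 120 = 4216
    }
}
```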

A hash works quite differently. It stores things at predictable places, but those indexed places are not user-visible. A string that you put into a hashtable isn’t stored at the address you specify; instead the table takes the content of that string and computes a fitting address by hashing that content to a number taken from a small set of possibilities. Then it stores the value to that place, and if you ask for that key again, it re-computes the hashed value and looks up that cell.
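A minimal sketch of that computation (not the real `HashMap` internals, and the capacity of 16 is an arbitrary illustration value): the key's content determines its hash code, which is then reduced to a valid index into the small set of available slots.

```java
// Sketch: how a key's content determines where its value goes.
public class HashIndex {
    public static void main(String[] args) {
        int capacity = 16;           // the small set of possible places
        String key = "hello";
        // hashCode() is derived from the string's content, so the same
        // string always produces the same number...
        int h = key.hashCode();
        // ...which is reduced to a valid array index. Asking for the key
        // again re-runs this same computation and lands on the same cell.
        int index = Math.floorMod(h, capacity);
        System.out.println(index);   // same value on every run
    }
}
```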

Since the set of possible hash values is smaller than the set of possible keys, you can have collisions, so that you might have to spend a little more time to find the right value when more than one of them has been put into the same bucket, but it can be shown that this happens infrequently and doesn’t affect the overall amortized complexity, which is still O(1).

So you see that an array can find things fast because you tell it where to load from; a hash table returns things fast because it knows where it put them, and can reconstruct that computation efficiently.

In Java, every object has a hashCode() method that returns a 32-bit integer value, which is always the same for the same object. In the simplest version, a hashtable would just contain an array of size 2^32, and a key-value pair would be stored at the index corresponding to the hash code of the key. Since array access by index is O(1), hashtable access by key (for storing or retrieving) can also be O(1).
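This "simplest version" is a direct-address table. A real 2^32-entry array would need tens of gigabytes, so the toy sketch below shrinks the key space to a single byte (256 slots) to make the idea runnable; the principle is the same.

```java
// Toy direct-address table: each key maps straight to one array slot.
// Uses byte keys (256 slots) since a 2^32-slot array is impractical.
public class DirectAddressTable {
    private final String[] slots = new String[256];

    void put(byte key, String value) {
        slots[key & 0xFF] = value;   // O(1): one array write
    }

    String get(byte key) {
        return slots[key & 0xFF];    // O(1): one array read
    }

    public static void main(String[] args) {
        DirectAddressTable t = new DirectAddressTable();
        t.put((byte) 42, "answer");
        System.out.println(t.get((byte) 42)); // prints "answer"
    }
}
```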

Of course it is a bit more complex in reality. First, you can always have collisions, that is, two different objects giving the same hashcode. So the items are not stored directly in the array; rather, each array index contains a “bucket”, which is an ordinary list of key-value pairs. (In the Java hashtable the buckets are implemented as linked lists.) You have to search through the bucket to find the item, and this search will be O(n), but unless your hashtable contains an extreme number of items (or your hash algorithm is bad), the items will be distributed pretty evenly across the array, and each bucket will contain only a few items. (Only one in the best case.)
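A toy separate-chaining table can make this concrete. This is a sketch, not `java.util.HashMap` (which adds load-factor resizing, bit-mixing of hash codes, and tree-based buckets), but it shows the two-step lookup: an O(1) hop to the bucket, then a short scan inside it.

```java
import java.util.LinkedList;

// Toy separate-chaining hashtable: each array slot holds a "bucket"
// (a linked list of key-value pairs), as described above.
public class ChainedTable {
    static class Entry {
        final String key;
        String value;
        Entry(String k, String v) { key = k; value = v; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(String key, String value) {
        LinkedList<Entry> bucket = buckets[indexFor(key)]; // O(1) hop
        for (Entry e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite
        }
        bucket.add(new Entry(key, value));
    }

    String get(String key) {
        // O(1) to reach the bucket; the scan inside is short when entries
        // are spread evenly, and degrades to O(n) only if everything collides.
        for (Entry e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(8);
        t.put("one", "1");
        t.put("two", "2");
        System.out.println(t.get("two")); // prints "2"
    }
}
```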

Second, you will not initially create an array of size 2^32, since that would be a crazy waste of space. Instead you initially create a smaller array, where each entry maps to multiple hashcodes. This will of course lead to higher risk of collision. You keep track of the number of entries, and when they reach a certain threshold you double the size of the array, and then re-distribute the items. Of course this will also have a performance cost. There is some design tradeoff in deciding when to resize the array. The bigger the array relative to the number of items, the fewer collisions and hence better performance, but also more waste of space.
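The threshold described here is usually expressed as a "load factor" (the ratio of entries to slots; 0.75 is the default in `java.util.HashMap`). A small simulation of that policy shows where the occasional O(n) resize lands as items are added:

```java
// Simulates the resize policy described above: count entries and
// double the array capacity when the count crosses capacity * loadFactor.
public class ResizePolicy {
    public static void main(String[] args) {
        int capacity = 16;
        double loadFactor = 0.75;  // resize when 3/4 full (HashMap's default)
        int size = 0;

        for (int i = 0; i < 100; i++) {
            size++;                              // one entry added: O(1)
            if (size > capacity * loadFactor) {
                capacity *= 2;                   // the occasional O(n) rehash
                System.out.println("resize to " + capacity + " at size " + size);
            }
        }
        // Prints: resize to 32 at size 13, to 64 at size 25, to 128 at size 49.
    }
}
```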

So finding an item is O(n) in the worst case, where all items happen to end up in the same bucket, but O(1) in the common case, given a well-behaved hash function. (Since anybody can override hashCode(), this is of course not guaranteed: if you write int hashCode() { return 17; } you get worst-case performance consistently.) And if the number of items grows much larger than the hash size, the buckets start to grow and again you get O(n) lookup. On 32-bit systems you would run out of memory before this ever happened, but with 64-bit memory it could theoretically be an issue.
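That degenerate case is easy to reproduce. The key class below is a deliberately bad example: its constant hashCode() forces every entry into one bucket, so the map remains correct but every lookup falls back to scanning. (Modern HashMap mitigates this somewhat by converting very large buckets to balanced trees, giving O(log n) rather than O(n).)

```java
import java.util.HashMap;
import java.util.Map;

// Demonstrates the degenerate case: a constant hashCode() sends every
// key to the same bucket, so lookups must scan instead of jumping.
public class BadHash {
    static class Key {
        final int id;
        Key(int id) { this.id = id; }
        @Override public int hashCode() { return 17; }  // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<Key, Integer> m = new HashMap<>();
        for (int i = 0; i < 1000; i++) m.put(new Key(i), i);
        // Still correct, just slow: each get() searches one giant bucket.
        System.out.println(m.get(new Key(500))); // prints 500
    }
}
```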

Adding an item is also O(1) in the common case, but O(n) if the add triggers a resize of the array. However, the aggregate cost of the resize operations is predictable and proportional to the number of items, so the amortized cost for adds is still O(1). This is not the case for the worst case with lookups: if we are unlucky and all items end up in the same bucket, every lookup has the worst-case performance, and there is no way to amortize that cost.

Of course both the worst case and the common or average case may be relevant. In a real-time system, it is pretty important to know the worst-case performance of an operation. For most business applications, the average case is the more important metric.

When talking about measurements (even abstract measurements such as “algorithmic complexity”) you should always specify exactly what you are measuring, otherwise what you say is completely meaningless.

In this particular case, you are simply saying “hash tables are O(1)”, but you are not saying what exactly you are measuring.

In particular, accessing a value by key in a (properly designed) hash table has

  • worst-case step complexity of O(n) (or more precisely, worst-case step complexity of whatever data structure is used for the “bucket”, which usually is a simple linked list)
  • amortized worst-case step complexity of O(1)

In other words, your whole confusion is due to the fact that you are talking about the worst-case and the others are talking about the amortized worst-case, but except for @Kilian Foth nobody bothered to mention that.

The argument is similar to the one for why adding an element to a dynamically-sized array is O(n) worst-case and O(1) amortized worst-case. @JacquesB explains how this amortization works for hash tables.
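The dynamic-array argument can be instrumented directly. In the sketch below the capacity doubles on each overflow, so the total number of element copies across all resizes is 1 + 2 + 4 + ... < 2n, which is why each append costs O(1) amortized even though an individual append can cost O(n).

```java
// Sketch of the dynamic-array analogy: an append is O(n) when a resize
// fires, but doubling keeps the total copy work proportional to n.
public class GrowableArray {
    private int[] data = new int[1];
    private int size = 0;
    long totalCopies = 0;  // instrumented to exhibit the amortized bound

    void add(int x) {
        if (size == data.length) {           // full: double and copy, O(n)
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            totalCopies += size;
            data = bigger;
        }
        data[size++] = x;                    // the common O(1) path
    }

    public static void main(String[] args) {
        GrowableArray a = new GrowableArray();
        for (int i = 0; i < 1024; i++) a.add(i);
        // Copies over all resizes: 1 + 2 + 4 + ... + 512 = 1023 < 2 * 1024,
        // i.e. less than two copies per add on average.
        System.out.println(a.totalCopies); // prints 1023
    }
}
```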
