
🗂️ Hash Tables


The Library Index Analogy

At a library with 100,000 books, finding "Python Programming" by scanning every shelf would take hours.

But with an index card:

"Python Programming" → Shelf 42, Row B, Position 8

You go directly to the book. No searching required!

Hash tables work the same way.

A hash function converts your key ("Python Programming") into an address (Shelf 42, Position 8). Instead of searching, you calculate where to look.


How Hashing Works

The Magic Formula

Step 1: Take a key
        "apple"

Step 2: Hash function converts it to a number
        hash("apple") = 12345678

Step 3: Mod by array size to get index
        12345678 % 10 = 8

Step 4: Store/retrieve at that index
        table[8] = ("apple", value)
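
The four steps above can be sketched in Python using the built-in `hash` (the table size of 10 and the helper names are just for illustration; this toy version ignores collisions):

```python
# A minimal sketch of the hash-then-mod addressing scheme.
TABLE_SIZE = 10
table = [None] * TABLE_SIZE

def index_for(key):
    """Map a key to a slot: hash the key, then mod by the table size."""
    return hash(key) % TABLE_SIZE

def put(key, value):
    table[index_for(key)] = (key, value)  # ignores collisions for simplicity

def get(key):
    entry = table[index_for(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None

put("apple", 1)
print(get("apple"))  # 1
```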

Why It's O(1)

Without hash table (linear search):
  "Where is 'apple'?"
  Check index 0... not here
  Check index 1... not here
  ...
  Check index 8... found!
  Time: O(n)

With hash table:
  hash("apple") → index 8
  table[8] → found!
  Time: O(1)

One calculation, direct access. No searching.


The Hash Function

A good hash function should:

  1. Be deterministic: Same key → same hash
  2. Be fast: O(1) computation
  3. Spread evenly: Distribute keys across all indices

Good hash function:
  "apple"  → index 8
  "banana" → index 3
  "cherry" → index 7
  (Spread out nicely)

Bad hash function:
  "apple"  → index 5
  "banana" → index 5
  "cherry" → index 5
  (All collide!)
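
To make the contrast concrete, here is a toy polynomial string hash next to a deliberately bad one (both functions are illustrative sketches, not from any library):

```python
def good_hash(key, table_size):
    """Polynomial rolling hash: every character shifts the running value."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h

def bad_hash(key, table_size):
    """Ignores everything but the first character: many keys collide."""
    return ord(key[0]) % table_size

keys = ["apple", "apricot", "avocado"]
# bad_hash sends all three keys to the same slot; good_hash spreads them out
print([bad_hash(k, 10) for k in keys])
print([good_hash(k, 10) for k in keys])
```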

Collisions: When Two Keys Map to the Same Index

The Problem

hash("apple")  % 10 = 5
hash("orange") % 10 = 5

Both want index 5! Now what?

Solution 1: Chaining

Each index holds a list of items:

Index 5: [("apple", 1), ("orange", 2)]

Lookup "orange":
  1. hash("orange") → 5
  2. Go to index 5
  3. Search the list for "orange"
  4. Found!

If collisions are rare, lists stay short → still fast.
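
A minimal chained hash table following the lookup steps above (the class and method names are my own, not a standard API):

```python
class ChainedHashTable:
    """Each bucket is a list of (key, value) pairs; collisions extend the list."""

    def __init__(self, size=10):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:             # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # new key: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("apple", 1)
t.put("orange", 2)
print(t.get("orange"))  # 2
```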

Solution 2: Open Addressing

If spot is taken, find the next empty spot:

"apple" → index 5 (stored)
"orange" → index 5 (taken!) → try 6 → empty → stored at 6

Lookup "orange":
  1. hash("orange") → 5
  2. Check index 5: "apple" (not it)
  3. Check index 6: "orange" (found!)
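
Linear probing, the simplest open-addressing scheme, can be sketched like this (insert-and-lookup only; real implementations also need tombstones to support deletion):

```python
SIZE = 10
slots = [None] * SIZE  # each slot holds None or a (key, value) pair

def probe_put(key, value):
    """Start at the hashed slot; walk forward until a free slot is found."""
    i = hash(key) % SIZE
    for _ in range(SIZE):
        if slots[i] is None or slots[i][0] == key:
            slots[i] = (key, value)
            return
        i = (i + 1) % SIZE  # wrap around at the end of the array
    raise RuntimeError("table full")

def probe_get(key):
    """Walk the same probe sequence until the key or an empty slot appears."""
    i = hash(key) % SIZE
    for _ in range(SIZE):
        if slots[i] is None:
            raise KeyError(key)
        if slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % SIZE
    raise KeyError(key)

probe_put("apple", 1)
probe_put("orange", 2)
print(probe_get("orange"))  # 2
```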

Load Factor: The Key to Performance

Load Factor = number of items / table size

Low load factor: slots are mostly empty, collisions are rare, operations stay fast.
High load factor: slots fill up, collisions multiply, operations slow down.

When the load factor exceeds an implementation-defined threshold, many hash tables resize (often by growing the backing array) and rehash entries.
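
A sketch of the grow-and-rehash step (the 0.75 threshold and doubling are common choices, e.g. Java's HashMap defaults, but implementations vary):

```python
class ResizingTable:
    """Chained table that doubles its bucket array when load factor > 0.75."""

    LOAD_FACTOR_LIMIT = 0.75

    def __init__(self):
        self.buckets = [[] for _ in range(4)]
        self.count = 0

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._resize()

    def _resize(self):
        """Double the array and re-insert every entry at its new index."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

t = ResizingTable()
for i in range(20):
    t.put(f"key{i}", i)
print(len(t.buckets))  # grew well beyond the initial 4 buckets
```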


Time Complexity

| Operation | Average | Worst (many collisions) |
|-----------|---------|-------------------------|
| Insert    | O(1)    | O(n)                    |
| Lookup    | O(1)    | O(n)                    |
| Delete    | O(1)    | O(n)                    |

Average case is O(1) - that's the whole point!

Worst case is uncommon in typical use, but it can happen with lots of collisions (for example, due to poor hashing, adversarial inputs, or very high load factors).


Hash Table in Different Languages

| Language   | Hash Table      | Set             |
|------------|-----------------|-----------------|
| Python     | `dict`          | `set`           |
| JavaScript | `Object`, `Map` | `Set`           |
| Java       | `HashMap`       | `HashSet`       |
| C++        | `unordered_map` | `unordered_set` |

They're everywhere because they're so useful!


What Can Be a Key?

Keys generally need to be hashable (often meaning immutable):

✓ Valid keys:
  - Strings: "hello"
  - Numbers (in many languages)
  - Tuples: (1, 2, 3) (as long as their elements are hashable)
  - Frozen sets

✗ Invalid keys:
  - Lists: [1, 2, 3] (mutable!)
  - Dicts: {"a": 1} (mutable!)
  - Sets: {1, 2, 3} (mutable!)

Why? If a key changes after being stored, its hash changes, and we can't find it anymore!


Common Use Cases

1. Counting Frequencies

"How many times does each word appear?"

for word in text:
    counts[word] = counts.get(word, 0) + 1
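
A runnable version of the pattern above; Python's standard library also offers `collections.Counter` for exactly this job:

```python
from collections import Counter

text = "the cat and the dog and the bird"

# Manual counting with a plain dict
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])  # 3
print(Counter(text.split()) == Counter(counts))  # True: same result
```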

2. Caching/Memoization

"Have I computed this before?"

if input in cache:
    return cache[input]
result = expensive_computation(input)
cache[input] = result

3. Deduplication

"Remove duplicates"

unique = set(items)  # Hash set removes duplicates in O(n)
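
Two common variants: `set` when order doesn't matter, `dict.fromkeys` when it does (Python dicts preserve insertion order since 3.7):

```python
items = [3, 1, 2, 3, 1]

unique_any_order = set(items)                 # {1, 2, 3}, order not guaranteed
unique_in_order = list(dict.fromkeys(items))  # first-seen order preserved

print(unique_in_order)  # [3, 1, 2]
```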

4. Two Sum Pattern

"Find two numbers that sum to target"

seen = set()
for num in array:
    complement = target - num
    if complement in seen:
        return [complement, num]
    seen.add(num)
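
Wrapped into a complete function (this one-pass set approach is the standard O(n) solution):

```python
def two_sum(array, target):
    """Return a pair [a, b] from array with a + b == target, or None."""
    seen = set()
    for num in array:
        complement = target - num
        if complement in seen:  # its partner was already visited: O(1) check
            return [complement, num]
        seen.add(num)
    return None

print(two_sum([2, 7, 11, 15], 9))  # [2, 7]
```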

Hash Table vs Other Structures

| Structure          | Lookup   | Insert   | Ordered?     |
|--------------------|----------|----------|--------------|
| Hash Table         | O(1)     | O(1)     | No           |
| Binary Search Tree | O(log n) | O(log n) | Yes          |
| Sorted Array       | O(log n) | O(n)     | Yes          |
| Linked List        | O(n)     | O(1)     | By insertion |

Use hash table when:

  • You need fast lookup by key
  • Order doesn't matter

Use tree when:

  • You need elements in sorted order
  • You need range queries ("all keys between A and B")

Common Mistakes

1. Using Mutable Keys

key = [1, 2, 3]
hash_table[key] = "value"  # TypeError: unhashable type: 'list'

Solution: Use tuples for compound keys.
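
For example, a grid position as a compound key (a small illustrative sketch):

```python
# A mutable list fails as a key; the equivalent tuple works fine.
grid = {}

pos = (2, 3)                   # tuple: immutable, hashable
grid[pos] = "treasure"

try:
    grid[[2, 3]] = "treasure"  # list: mutable, not hashable
except TypeError as e:
    print(e)                   # unhashable type: 'list'

print(grid[(2, 3)])  # treasure
```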

2. Assuming Order

Some language implementations preserve insertion order (for example, modern Python dicts), but don't rely on hash tables being ordered in general.

3. Ignoring Worst Case

If a hash function behaves poorly, or if resizing/collision-handling is ineffective, performance can degrade toward O(n).


FAQ

Q: Why is lookup often O(1) on average?

Collisions can cause chains. But with a good hash function and a low load factor, collisions are rare enough that the average cost is O(1).

Q: Can two different keys have the same hash?

Yes! That's a collision. Hash tables handle this with chaining or open addressing.

Q: When should I use a hash table vs a set?

Hash table: key-value pairs (dictionary)
Set: just keys (membership testing)

Q: How big should my hash table be?

Start reasonably sized and let it grow. Most implementations handle resizing automatically.

Q: Is Python dict a hash table?

Yes! Python's dict is implemented as a hash table internally.


Summary

Hash tables provide O(1) average-case operations by converting keys into array indices via hashing.

Key Takeaways:

  • Hash function maps keys to indices
  • O(1) lookup, insert, delete (average)
  • Collisions handled via chaining or open addressing
  • Keys generally need to be hashable
  • Keep load factor low for performance
  • Foundation for dicts and sets in every language

Hash tables are the most frequently used data structure in practice - master them!
