The Evolution of Redis Arrays: An Architectural Analysis for Large-Scale Data Processing

Hello everyone! I recently came across an interesting article on Hacker News, written by Oran Agra, one of Redis’s core developers, titled “Redis array: short story of a long development process.” This wasn’t just a story about adding a new feature; it was a testament to the dedication of developers who tackled 25-year-old legacy code while safeguarding performance and stability, and who reformatted a massive codebase overnight.

Today, based on this article, we’ll dive deep into how the Array data structure has evolved within Redis and what lessons we can learn for designing large-scale systems.

1. The Problem: The Shackle of 25-Year-Old Legacy Code

Redis’s LIST data structure internally uses QuickList: a doubly linked list whose nodes are compact ziplists (listpacks in recent versions), combining the memory efficiency of ziplist with the flexibility of linkedlist. However, when dealing with massive lists containing tens of millions of elements, memory fragmentation and cache misses were causing significant performance degradation.

Specifically, when processing array-type data, the existing structure had the following bottlenecks:

  • Memory Overhead: Additional memory usage due to pointer connections.
  • Sequential Access Cost: Latency caused by inefficient use of cache lines.
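The memory-overhead point above can be made concrete with a small Python sketch (not Redis internals, just an analogy): a contiguous buffer stores one machine word per element, while a doubly linked list pays for two extra pointers in every node. The `Node` class below is a hypothetical stand-in for a linked-list node.

```python
import sys
from array import array

# Contiguous storage: one 8-byte machine-word integer per element.
contiguous = array('q', range(1000))

# Linked storage: each node carries a value plus prev/next pointers,
# mirroring the per-node overhead of a doubly linked list.
class Node:
    __slots__ = ("value", "prev", "next")
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

nodes = [Node(i) for i in range(1000)]

contiguous_bytes = contiguous.itemsize * len(contiguous)
linked_bytes = sum(sys.getsizeof(n) for n in nodes)

print(f"contiguous: {contiguous_bytes} bytes")
print(f"linked:     {linked_bytes} bytes")
```

On CPython the node objects alone take several times the space of the contiguous buffer, before even counting the values they point to; the same pressure is what drives structures like QuickList toward contiguous blocks.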

To address this, the development team decided to overhaul the internal structure at the C language level. The biggest challenge here was the “legacy code that had to be changed.”

2. The Solution: Formatting a 25M-line Codebase

The most impressive part of the article was “Formatting a 25M-line codebase overnight.” Formatting and refactoring 25 million lines of code involved more than technical challenges; it demanded strategy akin to a game of chess.

2.1. Preparations for Refactoring

The biggest fear in large-scale refactoring is “regression.” Modifying the array structure could affect hundreds of Redis commands (LPUSH, RPUSH, LINDEX, and so on).

To mitigate this, the team adopted the following approach:

  1. Expand Test Coverage: Ensure existing commands pass unit tests.
  2. Strengthen CI/CD Pipeline: Implement benchmarking scripts to immediately detect performance degradation upon code changes.
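The second point, automated regression detection, can be sketched in a few lines of Python. This is not the Redis team’s actual CI tooling, just a minimal illustration of the idea: benchmark a baseline and a candidate, and flag the change if the candidate is meaningfully slower. The workload functions and the 10% tolerance are assumptions for the example.

```python
import time

def benchmark(fn, iterations=5):
    """Return the best wall-clock time over several runs (reduces noise)."""
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def check_regression(baseline_fn, candidate_fn, tolerance=1.10):
    """Flag the candidate as a regression if it is >10% slower than baseline."""
    baseline = benchmark(baseline_fn)
    candidate = benchmark(candidate_fn)
    return candidate > baseline * tolerance

# Hypothetical workloads standing in for "before" and "after" a code change.
fast = lambda: sum(range(10_000))
slow = lambda: time.sleep(0.01)

print("regression detected:", check_regression(fast, slow))
```

Hooking a check like this into the CI pipeline means a performance-degrading commit fails the build immediately instead of being discovered in production.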

2.2. The New Structure of Redis Arrays

The improved Array structure moved beyond simply allocating memory and was modified to maximize data locality. The core principle was “maximizing the use of contiguous memory blocks while allowing for segmentation and management when necessary.”

This yielded the following benefits:

  • Improved CPU Cache Hit Rate: Significantly increased L1/L2 cache hit rates due to contiguous memory access.
  • Memory Savings: Reduced overall memory footprint by eliminating unnecessary pointers.
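The locality benefit can be felt even from Python, though interpreter overhead dominates the numbers, so treat this as a rough analogy rather than a measurement of Redis itself: summing a contiguous buffer versus chasing pointers through a linked chain of the same values.

```python
import time
from array import array

N = 100_000

# Contiguous block: values sit next to each other in memory.
block = array('q', range(N))

# Pointer-chased chain: each element is reached via a reference.
class Node:
    __slots__ = ("value", "next")
    def __init__(self, value):
        self.value = value
        self.next = None

head = None
for i in reversed(range(N)):
    node = Node(i)
    node.next = head
    head = node

start = time.perf_counter()
block_sum = sum(block)
block_time = time.perf_counter() - start

start = time.perf_counter()
chain_sum, node = 0, head
while node is not None:
    chain_sum += node.value
    node = node.next
chain_time = time.perf_counter() - start

print(f"contiguous sum: {block_time:.6f}s, linked sum: {chain_time:.6f}s")
```

At the C level, where Redis lives, the gap is driven directly by cache-line utilization: sequential reads over a contiguous block let the prefetcher do its job, while pointer chasing defeats it.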

3. Practical Guide: Efficient Array Usage in Redis

Now that we’ve covered the theoretical background, let’s look at how to apply it in practice with code.

3.1. Problems with Existing List Usage

First, let’s consider the traditional way of adding tens of millions of items to a list. This operates on QuickList, and as the number of items grows, so does the number of pointer hops required to traverse it.

# Traditional method (QuickList based)
# Add 10,000,000 items: each iteration spawns a redis-cli process and a
# full network round trip, so this is slow and wasteful at scale
for i in {1..10000000}; do
  redis-cli LPUSH my_huge_list "item:$i"
done
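For bulk loads like the loop above, Redis documents a mass-insertion approach: generate commands in the RESP protocol and feed them to redis-cli --pipe in a single pass. The sketch below builds such a payload in Python; the key name and item format are just the ones from the shell example, and it assumes ASCII arguments (for arbitrary data you would encode to bytes and use byte lengths).

```python
def gen_resp(command_args):
    """Encode one command in the RESP protocol used for mass insertion.
    Assumes ASCII args; use byte lengths for arbitrary binary data."""
    parts = [f"*{len(command_args)}\r\n"]
    for arg in command_args:
        parts.append(f"${len(arg)}\r\n{arg}\r\n")
    return "".join(parts)

def mass_insert_payload(key, count):
    """Build a payload suitable for: python gen.py | redis-cli --pipe"""
    return "".join(
        gen_resp(["LPUSH", key, f"item:{i}"]) for i in range(1, count + 1)
    )

payload = mass_insert_payload("my_huge_list", 3)
print(payload)
```

Piping the generated protocol through a single connection avoids the per-item process spawn and round trip entirely, which is why the official docs recommend it over client loops for bulk loading.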

3.2. Optimization using Streams and Hashes

The internal improvements to Redis Arrays are transparent to users, but when designing we still need to consider data size and access patterns. If simple sequential storage is all that’s needed, simply running the latest version of Redis will provide the benefits.

However, if you need to search or modify data within the array, it’s advisable to use HASH instead of LIST.

import redis
import time

# Assumes a Redis server running locally on the default port.
r = redis.Redis(host='localhost', port=6379, db=0)

# Scenario: Storing Log Data (Large Scale)
# Both functions batch commands through a pipeline so the comparison
# measures the data structures, not per-command network round trips.

# 1. Using List (for sequential storage)
def push_to_list(count):
    start = time.time()
    pipe = r.pipeline()
    for i in range(count):
        pipe.lpush("logs:timeline", f"log_entry_{i}")
    pipe.execute()
    print(f"List pushed {count} items in {time.time() - start:.4f}s")

# 2. Using Hash (for search and modification)
def push_to_hash(count):
    start = time.time()
    pipe = r.pipeline()
    for i in range(count):
        pipe.hset("logs:details", f"entry_{i}", f"log_content_{i}")
    pipe.execute()
    print(f"Hash pushed {count} items in {time.time() - start:.4f}s")

if __name__ == "__main__":
    # Test inserting 100,000 data points
    push_to_list(100000)
    push_to_hash(100000)

Execution Result Analysis: In recent Redis versions (7.x and above), the internal Array structure is optimized, making LPUSH very fast. However, if you frequently need to retrieve data at a specific index, LINDEX has a complexity of O(N) (proportional to the distance from the nearer end of the list), making the O(1) approach using HGET much more advantageous.
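The O(N) versus O(1) trade-off can be made tangible with a toy cost model (pure Python, not Redis code): counting how many nodes a linked list must walk to reach an index, versus the constant cost of an average-case hash lookup.

```python
# Toy model of the two access patterns: indexing into a linked list walks
# nodes one by one (O(N)), while a hash lookup jumps straight to the
# entry (O(1) on average).

def lindex_steps(length, index):
    """Nodes traversed to reach `index`, walking from the nearer end
    (as Redis does for LINDEX)."""
    return min(index, length - 1 - index) + 1

def hget_steps():
    """A hash lookup costs a constant number of steps on average."""
    return 1

n = 10_000_000
print("LINDEX middle element:", lindex_steps(n, n // 2), "steps")
print("HGET any field:       ", hget_steps(), "step")
```

For a ten-million-element list, reading the middle element costs millions of traversal steps, while the hash lookup stays flat no matter how large the dataset grows, which is why access pattern, not just insert speed, should drive the choice of structure.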

4. Conclusion: The Harmony of Development Culture and Technology

The development process of Redis Arrays offers us important lessons:

  1. Performance Isn’t Free: Improving 25-year-old code requires commensurate refactoring and testing costs.
  2. Investment in Tools: This work was possible due to automated tools and a CI/CD environment capable of formatting 25 million lines of code.

When we design systems, we need to go beyond simply asking “Is it fast?” and consider “How can we achieve maintainable performance?” As the Redis team demonstrated, sometimes we must not shy away from large-scale improvements that shake the foundations of the architecture.

Thank you!
