
Experiments with Redis Bloom and the Text from "Moby Dick"

This repository contains a small example that uses the text of the book "Moby Dick" as a data source to compare and contrast the benefits and tradeoffs of different Redis data structures. We'll use the deterministic Set data structure, and four probabilistic data structures: Hyperloglog, Bloom Filter, Top-K and Count-Min Sketch. The probabilistic data structures generally use hashing to be more memory efficient, at the cost of some accuracy and of the ability to recall the exact data added to them.

The Set and Hyperloglog are built into core Redis; to use the Bloom Filter, Top-K and Count-Min Sketch data structures, you'll need Redis Stack or the RedisBloom module.
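
To make the recall tradeoff concrete: a Set can return every member added to it, while a Bloom Filter can only answer approximate membership questions. Here's a minimal sketch using redis-py against a local Redis Stack instance (the demo: keys are illustrative only and aren't used by this project):

import redis

# Assumes Redis Stack is listening locally on the default port.
client = redis.Redis(decode_responses=True)
client.delete("demo:set", "demo:bloom")

client.sadd("demo:set", "whale", "ship", "harpoon")
client.bf().create("demo:bloom", 0.01, 1000)
client.bf().madd("demo:bloom", "whale", "ship", "harpoon")

# The Set can enumerate everything it holds...
print(client.smembers("demo:set"))  # {'whale', 'ship', 'harpoon'}

# ...but the Bloom Filter can only say whether it has
# (probably) seen a given item before.
print(client.bf().exists("demo:bloom", "whale"))   # 1: probably added
print(client.bf().exists("demo:bloom", "kraken"))  # 0: definitely not added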

Prerequisites

To try out this project yourself, you'll need:

  • Redis Stack (install it locally, use a free cloud instance, or use the supplied Docker Compose file).
  • Docker Desktop (if using the Docker Compose file).
  • git command line tools.
  • Python version 3.6 or higher (I've tested the code using Python version 3.10.6).

Get the Code

To get the code, clone the repo to your machine:

$ git clone https://github.com/redis-developer/redisbloom-moby-dick.git
$ cd redisbloom-moby-dick/python

Starting Redis Stack

If you're using Docker, start Redis Stack as follows:

$ docker-compose up -d

Redis Stack will start up and Redis will listen on the default port 6379.

Once you're done, you can stop Redis Stack like this:

$ docker-compose down

If you're running Redis Stack locally, but installed it through a package manager, make sure that it is running. See the instructions for your package manager for details.

If you are running Redis Stack in the cloud, you won't need to start or stop it... but you will need to set an environment variable that tells the application code where to connect to Redis. Set the value of the REDIS_URL environment variable to a valid Redis connection URL before starting the application. For example:

$ export REDIS_URL=redis://default:sssssh@redis.mydomain.com:6390

Application Setup

To set up the application, first install the dependencies:

$ pip3 install -r requirements.txt

Running the Application

Start the application as follows:

$ python3 main.py

The application will log each word from the file moby_dick_just_words.txt to the console as it updates the following data structures in Redis with that word:

  • A Set whose key is mobydick:words:set.
  • A Bloom Filter whose key is mobydick:words:bloom.
  • A Hyperloglog whose key is mobydick:words:hyperloglog.
  • A Top-K whose key is mobydick:words:topk.
  • A Count-Min Sketch whose key is mobydick:words:cms.

The Set and Hyperloglog are built into Redis; the Bloom Filter, Top-K and Count-Min Sketch are additional capabilities added by the RedisBloom module that is part of Redis Stack.
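
If you want to check that these keys were created, one quick way (not part of the project code, and assuming a redis-py client connected as in the How it Works section below) is to ask Redis for each key's type:

for key in [
    "mobydick:words:set",
    "mobydick:words:bloom",
    "mobydick:words:hyperloglog",
    "mobydick:words:topk",
    "mobydick:words:cms",
]:
    # TYPE returns "set" for the Set, "string" for the Hyperloglog
    # (which Redis stores as a string internally), and module-defined
    # names for the RedisBloom data structures.
    print(key, client.type(key))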

Once all the words have been loaded into these data structures, the code then prints out some summary statistics about each. Here's some example output:

There are 18271 distinct words in the Redis Set.
The Redis Set uses 941.318359375kb of memory.
The Redis Hyperloglog counted 18221 distinct words.
The Redis Hyperloglog uses 14.0703125kb of memory.
The Redis Bloom Filter uses 27.0703125kb of memory.
The Count-Min Sketch uses 109.4296875kb of memory.

The top 10 words by Top-K are:
 {'the': 583, 'and': 291, 'of': 289, 'i': 284, 'whale': 282, 'a': 282, 'to': 281, 'what': 13, 'you': 12, 'look': 12}

The Count-Min sketch counts for each of those words are:
 {'the': 14179, 'and': 6322, 'of': 6464, 'i': 1988, 'whale': 922, 'a': 4614, 'to': 4518, 'what': 564, 'you': 879, 'look': 207}

As the Hyperloglog, Bloom Filter, Top-K and Count-Min Sketch are probabilistic data structures, your output for these may vary. You should always see 18271 distinct words in the Redis Set though, as this is a deterministic data structure.

How it Works

All of the code is contained in a single file: main.py. This uses the redis-py library, and first connects either to a local instance of Redis or to one specified as a Redis connection URL in the REDIS_URL environment variable:

import os

import redis
from dotenv import load_dotenv

# Load settings from a .env file, if one is present.
load_dotenv()

REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379")
REDIS_SET_KEY = "mobydick:words:set"
REDIS_BLOOM_KEY = "mobydick:words:bloom"
REDIS_HLL_KEY = "mobydick:words:hyperloglog"
REDIS_TOPK_KEY = "mobydick:words:topk"
REDIS_CMS_KEY = "mobydick:words:cms"

# Create a client and connect to Redis...
client = redis.from_url(REDIS_URL, decode_responses=True)

# ...and create a pipeline for bulk operations.
pipe = client.pipeline()
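
A redis-py pipeline queues commands on the client side and only sends them to Redis when execute() is called, which returns one result per queued command. This saves a network round trip per word when loading the whole book. Here's a minimal illustration (the demo:counter key is just for this sketch, not part of the project):

demo_pipe = client.pipeline()
demo_pipe.incr("demo:counter")
demo_pipe.incr("demo:counter")

# Nothing has been sent to Redis yet; execute() flushes the queue
# and returns the result of each INCR in order.
results = demo_pipe.execute()
print(results)  # e.g. [1, 2]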

Next, we clean up after any previous run by deleting the Redis keys for each of the five data structures.

pipe.delete(REDIS_SET_KEY)
pipe.delete(REDIS_BLOOM_KEY)
pipe.delete(REDIS_HLL_KEY)
pipe.delete(REDIS_TOPK_KEY)
pipe.delete(REDIS_CMS_KEY)

There's no initialization / setup step required for the Set or Hyperloglog, but we do need to initialize the Bloom Filter, Top-K and Count-Min Sketch:

# Create a Bloom Filter with a 1% false positive error rate and
# an expected capacity of 20,000 items.
bloom_filter = pipe.bf()
bloom_filter.create(REDIS_BLOOM_KEY, 0.01, 20000)

# Reserve a Top-K tracking the top 10 items, using a sketch
# 8 wide and 7 deep with a decay of 0.9.
top_k = pipe.topk()
top_k.reserve(REDIS_TOPK_KEY, 10, 8, 7, 0.9)

# Initialize a Count-Min Sketch from a target error of 0.1% of all
# counted items, with a 1% probability of exceeding that error.
cms = pipe.cms()
cms.initbyprob(REDIS_CMS_KEY, 0.001, 0.01)
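
The Bloom Filter's memory use follows from those two parameters. Using the standard sizing formula m = -n ln(p) / (ln 2)^2 for the number of bits required, a quick back-of-the-envelope check (plain Python, nothing project-specific) shows why the filter lands in the tens of kilobytes in the example output earlier:

import math

n = 20000  # expected capacity
p = 0.01   # desired false positive rate

# Number of bits a Bloom Filter needs for n items at error rate p.
bits = -n * math.log(p) / (math.log(2) ** 2)
print(f"{bits / 8 / 1024:.1f} KB")  # ~23.4 KB, close to the ~27kb reported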

Next, we'll load all of the words from the word file provided and add them to each of the data structures. We'll do that with Python's with open(...), building the list of words by splitting each line on whitespace:

with open("../moby_dick_just_words.txt", "r") as fi:
    for line in fi:
        for raw_word in line.split():
            # Normalize the word to lower case before storing it.
            word = raw_word.strip().lower()
            bloom_filter.add(REDIS_BLOOM_KEY, word)
            pipe.sadd(REDIS_SET_KEY, word)
            pipe.pfadd(REDIS_HLL_KEY, word)
            top_k.add(REDIS_TOPK_KEY, word)
            cms.incrby(REDIS_CMS_KEY, [word], [1])
            print(word)

# Flush the queued commands to Redis so that the statistics
# below read fully populated data structures.
pipe.execute()

Before loading each word into the various data structures, we normalize it by converting it to lower case, so that Whale and whale are seen as the same word. Having done that, we use the appropriate Redis command for each data structure:

  • SADD adds a member to a Set. If the member already exists in the Set then nothing happens, as Sets cannot contain duplicates.
  • BF.ADD adds an item to a Bloom Filter. If the filter thinks the item already exists then nothing happens, and because this is a probabilistic data structure that judgement may occasionally be wrong (a false positive).
  • PFADD adds an item to a Hyperloglog, potentially updating the count of distinct items seen. As this is a probabilistic data structure and hash collisions may occur, the count is an approximation.
  • TOPK.ADD adds an item to a Top-K, giving it an initial score of 1. If the item already exists, its score is incremented by 1. As this is a probabilistic data structure and hash collisions may occur, the scores are approximations and not entirely accurate.
  • CMS.INCRBY increases the count for an item by a given increment; multiple items can be incremented in a single call.
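
Because the Set is exact and the Bloom Filter is approximate, you can compare their answers for the same word once the data is loaded. Here's a small sketch (assuming the client from above; the word chosen is arbitrary):

word = "leviathan"

# SISMEMBER gives an exact answer from the Set.
in_set = client.sismember(REDIS_SET_KEY, word)

# BF.EXISTS gives a probabilistic answer: 0 means "definitely not
# added", 1 means "probably added", subject to the 1% false
# positive rate the filter was created with.
in_bloom = client.bf().exists(REDIS_BLOOM_KEY, word)

print(f"Set: {in_set}, Bloom Filter: {in_bloom}")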

Finally, let's check out some statistics about each of the data structures...

print(f"There are {client.scard(REDIS_SET_KEY)} distinct words in the Redis Set.")
print(f"The Redis Set uses {client.memory_usage(REDIS_SET_KEY) / 1024}kb of memory.")
print(f"The Redis Hyperloglog counted {client.pfcount(REDIS_HLL_KEY)} distinct words.")
print(
    f"The Redis Hyperloglog uses {client.memory_usage(REDIS_HLL_KEY) / 1024}kb of memory."
)
print(
    f"The Redis Bloom Filter uses {client.memory_usage(REDIS_BLOOM_KEY) / 1024}kb of memory."
)
print(f"The Count-Min Sketch uses {client.memory_usage(REDIS_CMS_KEY) / 1024}kb of memory.")

# Queue a TOPK.LIST WITHCOUNT call, then run the pipeline; the
# result is a flat list of alternating items and counts.
top_k.list(REDIS_TOPK_KEY, withcount=True)
pipe_result = pipe.execute()
top_k_list = pipe_result[-1]

# Separate the words (even indices) from their counts (odd indices).
words = top_k_list[::2]
freq = top_k_list[1::2]

top_k_frequency_dict = dict(zip(words, freq))

print("\nThe top 10 words by Top K are:\n", top_k_frequency_dict)

cms_word_count = client.cms().query(REDIS_CMS_KEY, *words)
cms_frequency_dict = dict(zip(words, cms_word_count))

print("\nThe Count-Min sketch counts for each of those words are:\n", cms_frequency_dict)

  • The SCARD command gives us the cardinality, or number of elements, in a Set. As the Set keeps all of the data, it takes up more memory than the Hyperloglog or Bloom Filter, but returns an accurate word count.
  • PFCOUNT returns an approximation of the number of distinct items added to the Hyperloglog.
  • TOPK.LIST with the WITHCOUNT modifier returns the full list of items in the Top-K along with their approximated counts / scores.
  • CMS.QUERY returns the count for one or more items in a sketch.
  • Using the MEMORY USAGE command, we can see how much memory the Set, Hyperloglog, Bloom Filter, Top-K and Count-Min Sketch take up in Redis. MEMORY USAGE returns the memory used in bytes, so we divide by 1024 to get kilobytes.
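
As a closing experiment, it's easy to quantify the Hyperloglog's accuracy by comparing its estimate with the exact Set cardinality (a sketch, assuming the client from above):

exact = client.scard(REDIS_SET_KEY)
estimate = client.pfcount(REDIS_HLL_KEY)

# Redis Hyperloglogs have a standard error of about 0.81%, so the
# estimate should land within roughly 1% of the exact count.
error_pct = abs(exact - estimate) / exact * 100
print(f"Exact: {exact}, estimate: {estimate}, error: {error_pct:.2f}%")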