When a website or application receives millions of visits daily, counting unique visitors (UV) becomes a technical challenge. It's impractical to directly insert a record for each visit into the database and then run a slow `SELECT COUNT(DISTINCT user_id)` query at midnight. This would overwhelm the database and cause response times to skyrocket. Redis not only easily handles UV statistics for millions of visits but also keeps memory consumption incredibly low while maintaining extremely high speed.
The core of UV counting is deduplication. There are three main approaches to solving this problem with Redis, each offering trade-offs between accuracy, memory consumption, and computational complexity, suitable for different scenarios.
The most intuitive approach is to use a Redis Set. Each user identifier (such as user ID or device ID) is an element in the set. Leveraging the set's inherent deduplication property, the UV can be obtained by using the `SCARD` command to get the set size. This method is simple, intuitive, and the results are absolutely accurate.
```python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def log_visit(user_id):
    """Record a user visit."""
    today_key = "uv:" + "2023-10-27"  # organize keys by date
    # SADD adds user_id to the set; duplicates are silently ignored
    r.sadd(today_key, user_id)

def get_uv(date):
    """Get the UV for the specified date."""
    key = "uv:" + date
    # SCARD returns the cardinality of the set, i.e., the number of unique users
    return r.scard(key)

# Simulate some visits
log_visit("user_123")
log_visit("user_456")
log_visit("user_123")  # a repeat visit is not counted again

print(get_uv("2023-10-27"))  # Output: 2
```
The advantage of the Set approach is absolute accuracy and simple commands. However, its fatal weakness lies in memory consumption. Each user ID needs to be stored in memory as a string. Assuming you have 1 million unique users, with each user ID occupying an average of 20 bytes, storing only one day's data would require nearly 20MB of memory. If you need to retain data for multiple days, the overhead will increase linearly. Therefore, the Set approach is more suitable for scenarios with a controllable total number of users (e.g., below 100,000).
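Before choosing the Set approach, it helps to run the memory math. A quick back-of-envelope sketch (a hypothetical helper; this is a lower bound that ignores Redis's per-element overhead, so real usage is somewhat higher):

```python
def set_uv_memory_mb(num_users, avg_id_bytes=20):
    """Rough lower bound for the Set approach: every unique ID is stored
    once as a string. Ignores Redis's per-element overhead."""
    return num_users * avg_id_bytes / 1024 / 1024

print(set_uv_memory_mb(1_000_000))       # ~19 MB for a single day's key
print(set_uv_memory_mb(1_000_000) * 30)  # retention grows linearly with days
```

The linear growth with both users and retained days is what pushes larger sites toward Bitmap or HyperLogLog.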
To maximize memory compression, the second approach, Bitmap, comes into play. Its idea is ingenious: instead of storing the user ID itself, it maps each user to an integer offset (e.g., hashing the user ID and taking the modulo). A bitmap is essentially a very long binary bit array, with each user occupying only one bit. If a user has visited a site, its corresponding bit is set to 1; otherwise, it's 0. Finally, by counting the number of bits with a value of 1 in the bitmap, you can obtain the UV (unique visitors).
```python
def log_visit_bitmap(user_id):
    """Record a visit with a Bitmap."""
    today_key = "uv_bitmap:" + "2023-10-27"
    # Hash user_id and take the modulo to map it into [0, 2**32).
    # Note: Python's built-in hash() is salted per process, so use a stable
    # hash function in production if multiple processes write the same bitmap.
    offset = hash(user_id) % (2 ** 32)
    # SETBIT sets the bit at that offset to 1
    r.setbit(today_key, offset, 1)

def get_uv_bitmap(date):
    """BITCOUNT counts the number of 1 bits in the bitmap."""
    key = "uv_bitmap:" + date
    return r.bitcount(key)

# Log some visits
log_visit_bitmap("user_123")
log_visit_bitmap("user_456")
log_visit_bitmap("user_123")  # the same bit is set again; the count is unaffected

print(get_uv_bitmap("2023-10-27"))  # Output: 2
```
Bitmaps are extremely memory efficient. Counting visits from 100 million users only requires approximately `100,000,000 / 8 / 1024 / 1024 ≈ 12MB` of memory. However, it has two prerequisites: first, you need to be able to stably map users to a range of integers; second, if the user ID space is sparse (i.e., the total number of users is extremely large but the actual proportion of users accessing the site is very small), some memory may be wasted. The Bitmap scheme is suitable for scenarios where user IDs are digitizable and the total number is clear (such as a registered user ID system).
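One caveat worth making concrete: Python's built-in `hash()` is salted per process (via `PYTHONHASHSEED`), so the same user could land on different bits on different app servers, inflating the count. A fixed digest avoids this. `stable_offset` below is a hypothetical helper, and modulo collisions are still possible, causing a slight undercount:

```python
import hashlib

BITMAP_SPACE = 2 ** 32  # assumed offset range, matching the example above

def stable_offset(user_id: str) -> int:
    """Map a user ID to a reproducible bit offset.
    A cryptographic digest yields the same offset in every process,
    unlike the built-in hash(), which is salted per interpreter run."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BITMAP_SPACE
```

With a stable mapping in place, `r.setbit(key, stable_offset(user_id), 1)` behaves identically across all writers.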
When the data volume truly reaches massive levels (tens of millions or even hundreds of millions), and you can tolerate a tiny error in the results, Redis's HyperLogLog data structure becomes the ultimate weapon. HyperLogLog is a probabilistic algorithm that uses minimal memory (at most about 12KB per key) to estimate the cardinality of a set, i.e., the number of distinct elements, with a standard error of about 0.81%.
```python
def log_visit_hll(user_id):
    """Record a visit with HyperLogLog."""
    today_key = "uv_hll:" + "2023-10-27"
    # PFADD adds the element to the HyperLogLog
    r.pfadd(today_key, user_id)

def get_uv_hll(date):
    """PFCOUNT returns the estimated cardinality."""
    key = "uv_hll:" + date
    return r.pfcount(key)

# The HyperLogLog interface is as simple as Set's
log_visit_hll("user_123")
log_visit_hll("user_456")
log_visit_hll("user_123")

print(get_uv_hll("2023-10-27"))  # an estimate close to 2; PFCOUNT always returns an integer
```
The magic of HyperLogLog is that whether you add 10,000 or 100 million elements, it uses at most about 12KB of memory. This is revolutionary for scenarios that require long-term statistics over massive UVs (such as the historical UV of an entire site). You can easily keep the daily UV data of the past 365 days resident in memory for a total of only about `365 * 12KB ≈ 4.3MB`. The cost is that it returns an estimate rather than an exact number, but for most macro-level analysis (traffic trends, channel comparison), an error under 1% is entirely acceptable.
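Daily HyperLogLog keys can also be unioned into weekly or monthly totals with `PFMERGE`, without re-scanning any raw data. A minimal sketch, with `range_uv` and the destination key name as illustrative assumptions:

```python
def hll_key(date_str):
    """Daily key, matching the uv_hll:<date> scheme used above."""
    return "uv_hll:" + date_str

def range_uv(r, dates):
    """Estimate the distinct-visitor count across several days.
    PFMERGE unions the daily HLLs into a scratch key; PFCOUNT reads it."""
    dest = "uv_hll:range:" + dates[0] + ":" + dates[-1]
    r.pfmerge(dest, *[hll_key(d) for d in dates])
    r.expire(dest, 3600)  # scratch key: let it expire after an hour
    return r.pfcount(dest)
```

Passing several keys directly to `PFCOUNT` (`r.pfcount(k1, k2, ...)`) also returns the union's cardinality without creating a destination key, at the cost of recomputing the merge on every call.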
In practical engineering, a robust UV statistics system often combines these solutions: HyperLogLog counts the real-time and historical UV of the entire site for monitoring the big picture; Bitmap counts the precise UV of a specific activity page, where the number of active users is usually bounded and exact numbers matter; and Set counts visits from small groups such as VIP users. Key design also deserves attention: organize keys by date (e.g., `uv:20231027`) so daily statistics are easy to compute, and set expiration times so data does not grow indefinitely.
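The dated-key convention plus an expiry can be sketched like this (the helper names and the 30-day retention window are assumptions for illustration):

```python
from datetime import date

RETENTION_SECONDS = 30 * 24 * 3600  # assumed retention window: 30 days

def uv_key(prefix, day=None):
    """Build a dated key such as uv:20231027."""
    day = day or date.today()
    return prefix + ":" + day.strftime("%Y%m%d")

def log_visit_with_ttl(r, user_id, day=None):
    """Record a visit and make sure the daily key eventually expires.
    EXPIRE refreshes the TTL on every write, so a key lives for
    RETENTION_SECONDS after its last write -- fine for daily keys."""
    key = uv_key("uv", day)
    r.sadd(key, user_id)
    r.expire(key, RETENTION_SECONDS)
```

The same key-building helper works for the `uv_bitmap:` and `uv_hll:` prefixes, keeping the naming scheme consistent across all three approaches.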
Synchronizing this data from Redis to a persistent database (such as MySQL) for further analysis is also common practice. A simple script run every morning can read the previous day's UV (with `GET` if the value was pre-computed into a string key, or with `PFCOUNT`, `SCARD`, or `BITCOUNT` to compute it on the spot) and write it to the database. The whole process has almost no impact on the online Redis service.
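A minimal sketch of such a nightly job, assuming the `uv_hll:<YYYY-MM-DD>` keys from earlier and leaving the actual database write as a placeholder:

```python
from datetime import date, timedelta

def yesterday_str(today=None):
    """Date string for the day being archived, matching the YYYY-MM-DD keys."""
    d = (today or date.today()) - timedelta(days=1)
    return d.isoformat()

def archive_uv(r, today=None):
    """Read yesterday's estimated UV from Redis.
    PFCOUNT on a single key is O(1), so this puts essentially
    no load on the online service."""
    day = yesterday_str(today)
    uv = r.pfcount("uv_hll:" + day)
    # Persist with your DB driver of choice, e.g.:
    # cursor.execute("INSERT INTO daily_uv (day, uv) VALUES (%s, %s)", (day, uv))
    return day, uv
```

Run it from cron shortly after midnight; since the daily key is no longer being written, the archived value is final.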
Therefore, when facing millions or more visits per day, Redis's diverse data structures provide a complete UV statistics toolkit, from exact counting to estimation, all memory-efficient and fast under high concurrency. You no longer need to worry about slow `COUNT(DISTINCT)` queries or the storage cost of massive raw data.