Incident on Python 2 to 3 Migration: Redis Cache and Pickle

1st January 2020 was the official sunset date of Python 2. At NewsCred (now Welcome) we had a bunch of services written in Python 2 that we wanted to migrate to Python 3 as soon as we can, since the sunset date was approaching. In late November 2019, my team (Public API) took the first initiative to migrate one Flask service from Python 2.7.9 to 3.8.0. As you can guess, it didn’t go butter smooth, and I’m here to tell the tale.

That big migration PR got reviewed by 4 experienced engineers. We had good test coverage. We created integration tests from log replay. We did extensive manual testing before production deployment. But as soon as it hit the production, we started getting PagerDuty alerts!

A lot of API requests were getting HTTP 500 responses. Without much investigation, out of instinct we reverted the migration PR. Though the frequency went down by a big margin, we were still getting some 500s.

Here’s Why!

From the logs we saw, after the migration PR deployment we started getting the following exception:

...
...
  File "/.../werkzeug/contrib/cache.py", line 631, in get
    return self.load_object(self._client.get(self.key_prefix + key))
  File "/.../werkzeug/contrib/cache.py", line 621, in load_object
    return pickle.loads(value[1:])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 180662: ordinal not in range(128)

After the deployment of the revert PR, we got the following exception:

...
...
  File "/...werkzeug/contrib/cache.py", line 598, in get
    return self.load_object(self._client.get(self.key_prefix + key))
  File "/.../werkzeug/contrib/cache.py", line 588, in load_object
    return pickle.loads(value[1:])
ValueError: unsupported pickle protocol: 4

So the underlying issue was with pickle not being able to de-serialize the cached object. Let’s dive a little deeper.

Cache Setup

We used Redis for response caching in that service. The setup looked something like this:

1
2
3
4
5
6
7
from werkzeug.contrib.cache import RedisCache

cache = RedisCache(host=config.CACHE_HOST, port=config.CACHE_PORT)

cache.set(key, value, timeout=config.CACHE_TIMEOUT)

value = cache.get(key)

So we used the RedisCache client from the werkzeug. (Underneath it uses the redis-py, this information is not relevant here though).

Where Things Went Wrong

RedisCache uses pickle to serialize objects before writing to redis:

1
2
3
4
5
6
7
8
def dump_object(self, value):
    """Dumps an object into a string for redis.  By default it serializes
    integers as regular string and pickle dumps everything else.
    """
    t = type(value)
    if t in integer_types:
        return str(value).encode('ascii')
    return b'!' + pickle.dumps(value)

And de-serialize objects after reading from redis:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
def load_object(self, value):
    """The reversal of :meth:`dump_object`.  This might be called with
    None.
    """
    if value is None:
        return None
    if value.startswith(b"!"):
        try:
            return pickle.loads(value[1:])
        except pickle.PickleError:
            return None
    try:
        return int(value)
    except ValueError:
        # before 0.8 we did not have serialization.  Still support that.
        return value

Implementation of the default pickle protocol in Python 2.7.9:

1
2
if protocol is None:
    protocol = 0

And the implementation of the default pickle protocol in Python 3.8.0:

1
2
if protocol is None:
    protocol = DEFAULT_PROTOCOL

The value of the DEFAULT_PROTOCOL:

1
DEFAULT_PROTOCOL = 4

And there’s a docstring telling:

The optional *protocol* argument tells the pickler to use the
given protocol; supported protocols are 0, 1, 2, 3 and 4.  The
default protocol is 4. It was introduced in Python 3.4, it is
incompatible with previous versions.

As the default protocol of pickle got changed from Python 2 to 3, the cache objects those were serialized (set) using Python 2, couldn’t be de-serialized (get) using Python 3 due to the unicode conversion error.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 180662: ordinal not in range(128)

Thus we were getting that influx of 500s after the migration PR deployment.

While the service was running in Python 3, some data got cached, after reverting to Python 2, pickle couldn’t read those cached data since protocol 4 was unknown to it, and we were getting some 500s.

ValueError: unsupported pickle protocol: 4

Why This Wasn’t detected Before Deployment

  1. Though we had good test coverage, caching scenarios were mocked in the unit tests.
  2. The subset of data we used for integration tests didn’t have any unicode entry.
  3. We didn’t test manually with unicode data.

Hence it caught us off guard!

How We Solved the Issue

We re-reverted the revert PR, and instantly after the deployment, we deleted the already cached data from redis. Since most of the consumers of that service were kind of background workers with high timeouts, performance spike from the mass cache invalidation didn’t cause much trouble.

Around 2 years after the incident, when I look back, definitely can realize - this kind of incidents teach us invaluable lessons!

Load Comments?