1st January 2020 was the official sunset date of Python 2. At NewsCred (now Welcome) we had a bunch of services written in Python 2 that we wanted to migrate to Python 3 as soon as we could, since the sunset date was approaching. In late November 2019, my team (Public API) took the first initiative and migrated one Flask service from Python 2.7.9 to 3.8.0. As you can guess, it didn’t go buttery smooth, and I’m here to tell the tale.
That big migration PR got reviewed by 4 experienced engineers. We had good test coverage. We created integration tests from log replay. We did extensive manual testing before the production deployment. But as soon as it hit production, we started getting PagerDuty alerts!
A lot of API requests were getting HTTP 500 responses. Without much investigation, out of instinct, we reverted the migration PR. Though the frequency went down by a big margin, we were still getting some 500s.
Here’s Why!
From the logs we saw that after the migration PR deployment we started getting the following exception:
...
...
File "/.../werkzeug/contrib/cache.py", line 631, in get
return self.load_object(self._client.get(self.key_prefix + key))
File "/.../werkzeug/contrib/cache.py", line 621, in load_object
return pickle.loads(value[1:])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 180662: ordinal not in range(128)
After the deployment of the revert PR, we got the following exception:
...
...
File "/...werkzeug/contrib/cache.py", line 598, in get
return self.load_object(self._client.get(self.key_prefix + key))
File "/.../werkzeug/contrib/cache.py", line 588, in load_object
return pickle.loads(value[1:])
ValueError: unsupported pickle protocol: 4
So the underlying issue was that pickle was not able to de-serialize the cached objects. Let’s dive a little deeper.
Cache Setup
We used Redis for response caching in that service. The setup looked something like this:
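Roughly like the sketch below (the connection details, key prefix, and the cached_response helper are illustrative, not the actual service code):

```python
from werkzeug.contrib.cache import RedisCache

# Response cache backed by Redis; connection details and prefix are illustrative.
cache = RedisCache(host="localhost", port=6379, default_timeout=300, key_prefix="api:")

def cached_response(key, compute):
    """Hypothetical helper: return the cached value for `key`,
    computing and caching it on a miss."""
    value = cache.get(key)      # RedisCache.load_object() -> pickle.loads()
    if value is None:
        value = compute()
        cache.set(key, value)   # RedisCache.dump_object() -> pickle.dumps()
    return value
```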
So we used the RedisCache client from werkzeug. (Underneath it uses redis-py, though that detail is not relevant here.)
Where Things Went Wrong
RedisCache uses pickle to serialize objects before writing to redis:
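The relevant bit, abridged from werkzeug.contrib.cache (the exact lines may differ slightly by werkzeug version):

```python
# werkzeug/contrib/cache.py -- RedisCache (abridged)
def dump_object(self, value):
    # Integers are stored as plain ASCII strings; everything else is pickled
    # with a "!" marker in front.
    t = type(value)
    if t in integer_types:  # integer_types comes from werkzeug's compat helpers
        return str(value).encode("ascii")
    # No protocol argument, so pickle.dumps() uses the interpreter's default protocol.
    return b"!" + pickle.dumps(value)
```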
And to de-serialize objects after reading from redis:
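Again abridged, this is the load_object that shows up in the tracebacks above:

```python
# werkzeug/contrib/cache.py -- RedisCache (abridged)
def load_object(self, value):
    # The reversal of dump_object(); may be called with None on a cache miss.
    if value is None:
        return None
    if value.startswith(b"!"):
        try:
            return pickle.loads(value[1:])   # the line from the tracebacks above
        except pickle.PickleError:
            return None
    try:
        return int(value)
    except ValueError:
        return value
```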
The implementation of the default pickle protocol in Python 2.7.9:
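Abridged from Lib/pickle.py in CPython 2.7:

```python
# Lib/pickle.py (Python 2.7.x, abridged)
class Pickler:

    def __init__(self, file, protocol=None):
        # With no explicit protocol, Python 2 falls back to protocol 0;
        # the highest protocol it knows is 2.
        if protocol is None:
            protocol = 0
        elif protocol < 0:
            protocol = HIGHEST_PROTOCOL
        elif not 0 <= protocol <= HIGHEST_PROTOCOL:
            raise ValueError("pickle protocol must be <= %d" % HIGHEST_PROTOCOL)
```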
And the implementation of the default pickle protocol in Python 3.8.0:
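Abridged from Lib/pickle.py in CPython 3.8:

```python
# Lib/pickle.py (Python 3.8.x, abridged)
class _Pickler:

    def __init__(self, file, protocol=None, *, fix_imports=True,
                 buffer_callback=None):
        # With no explicit protocol, Python 3 falls back to DEFAULT_PROTOCOL.
        if protocol is None:
            protocol = DEFAULT_PROTOCOL
        if protocol < 0:
            protocol = HIGHEST_PROTOCOL
        elif not 0 <= protocol <= HIGHEST_PROTOCOL:
            raise ValueError("pickle protocol must be <= %d" % HIGHEST_PROTOCOL)
```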
The value of DEFAULT_PROTOCOL:
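From the same Lib/pickle.py in 3.8:

```python
# Lib/pickle.py (Python 3.8.x)
HIGHEST_PROTOCOL = 5
DEFAULT_PROTOCOL = 4
```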
And there’s a docstring that says:
The optional *protocol* argument tells the pickler to use the
given protocol; supported protocols are 0, 1, 2, 3 and 4. The
default protocol is 4. It was introduced in Python 3.4, it is
incompatible with previous versions.
As the default pickle protocol changed from Python 2 to 3, the cache objects that were serialized (set) using Python 2 couldn’t be de-serialized (get) using Python 3, due to the unicode conversion error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 180662: ordinal not in range(128)
Thus we were getting that influx of 500s after the migration PR deployment.
While the service was running on Python 3, some data got cached with the new protocol. After reverting to Python 2, pickle couldn’t read that cached data since protocol 4 was unknown to it, and we were getting some 500s:
ValueError: unsupported pickle protocol: 4
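A minimal Python 2 sketch of that second failure mode; the first two bytes of a protocol-4 pickle are enough to trigger it:

```python
# Python 2.7 -- reproduction sketch.
import pickle

# Every protocol-4 pickle written by Python 3.8 starts with the PROTO opcode
# (\x80) followed by the protocol number (\x04).  Python 2 only understands
# protocols 0 to 2, so it refuses right away:
#   ValueError: unsupported pickle protocol: 4
pickle.loads('\x80\x04')
```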
Why This Wasn’t Detected Before Deployment
- Though we had good test coverage, caching scenarios were mocked in the unit tests.
- The subset of data we used for integration tests didn’t have any unicode entries.
- We didn’t test manually with unicode data.
Hence it caught us off guard!
How We Solved the Issue
We re-reverted the revert PR, and immediately after the deployment we deleted the already cached data from redis. Since most of the consumers of that service were background workers with high timeouts, the performance spike from the mass cache invalidation didn’t cause much trouble.
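With werkzeug’s RedisCache that cleanup boils down to something like the snippet below, assuming the same cache object and key prefix as in the setup sketch above:

```python
# One-off cleanup sketch: drop everything cached under the old pickle protocol.
# `cache` is the RedisCache instance from the setup sketch above.
cache.clear()   # deletes all keys under key_prefix in Redis
```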
Around 2 years after the incident, when I look back, I can definitely say that incidents like this teach us invaluable lessons!