The gRPC quickstart guides do a very good job of helping developers get started using gRPC in their projects, but production deployment is not discussed and the available resources are sparse.
In this article we discuss a common deployment scenario and how to navigate around the pitfalls.
Infrastructure Overview
In our example, we have a monolithic Ruby on Rails application that is running in production and actively maintained by a team of developers. To develop a new feature, we chose to build a microservice in Python and leverage its readily available machine learning libraries. The Rails app communicates with the Python microservice via gRPC. Each instance of the Rails app maintains one active gRPC connection to the Python microservice (a channel, in gRPC's terminology).
We chose to use a standard Layer 3 load balancer to scale the Python microservice horizontally. (Note: this might not make sense for all use cases; if your application is data-transfer heavy, a proxy-like Layer 3 LB might become a bottleneck.)
The Python microservice is built as a Docker container, and we use a red-black deployment strategy to retire older versions safely.
Connection Keepalive
gRPC uses keepalive pings to check whether a channel is currently working by sending HTTP/2 pings over the transport. They are sent periodically, and if a ping is not acknowledged by the peer within a certain timeout period, the transport is disconnected [1]. However, the default keepalive interval is two hours, and pings are only sent on the server side as a sane default. This means that when an old gRPC server is retired by a newer deployment, or simply becomes faulty, the client only finds out when it tries and fails to send a request. As a default value this works well, since it prevents the network from being flooded with useless pings when services don't need them. In practice, however, we often want to detect interruptions earlier: a long keepalive interval degrades the service for the first few users who try to access it while the gRPC client is still unaware of the server failure.
When this happens, an error like one of the following usually appears on the client side:
GRPC::Unavailable: 14:OS Error
or
GRPC::Unavailable: 14:Connect Failed
We can make gRPC detect unavailable services more quickly by adjusting the keepalive options. On the server side, we change the server initializer code to include keepalive options:
server = grpc.server(
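Filling this out, a minimal sketch of a Python gRPC server constructed with keepalive options could look like the following. The specific values, the port, and the PredictionServicer name are illustrative assumptions, not the article's actual configuration; the option keys themselves are standard gRPC core channel arguments.

from concurrent import futures
import grpc

# Illustrative values -- tune them for your own deployment.
keepalive_options = [
    # Send a keepalive ping every 30 seconds instead of the 2-hour default.
    ('grpc.keepalive_time_ms', 30000),
    # Drop the connection if a ping is not acknowledged within 10 seconds.
    ('grpc.keepalive_timeout_ms', 10000),
    # Allow keepalive pings even when there are no in-flight RPCs.
    ('grpc.keepalive_permit_without_calls', 1),
    # Do not cap the number of pings sent while the connection carries no data.
    ('grpc.http2.max_pings_without_data', 0),
]

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=keepalive_options,
)
# Hypothetical servicer registration for the machine learning service:
# prediction_pb2_grpc.add_PredictionServicer_to_server(PredictionServicer(), server)
server.add_insecure_port('[::]:50051')
server.start()
server.wait_for_termination()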
On the client side, we change the client initializer code to include keepalive options:
def connect!
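Filling this out, a minimal sketch of a Ruby client that keeps one stub (and therefore one channel) per Rails process and passes matching keepalive channel arguments could look like the following. The class structure, the stub class name, the host environment variable, and the values are illustrative assumptions; channel_args is the grpc gem's mechanism for passing gRPC core channel arguments.

require 'grpc'
require 'prediction_services_pb'  # hypothetical generated code for the Python service

class PredictionClient
  # Memoize the stub so each Rails process holds a single channel.
  def self.stub
    @stub ||= connect!
  end

  def self.connect!
    Prediction::Service::Stub.new(  # hypothetical generated stub class
      ENV.fetch('PREDICTION_SERVICE_HOST', 'localhost:50051'),
      :this_channel_is_insecure,
      channel_args: {
        # Illustrative values -- ping every 30 seconds ...
        'grpc.keepalive_time_ms' => 30000,
        # ... and give the server 10 seconds to acknowledge.
        'grpc.keepalive_timeout_ms' => 10000,
        # Allow keepalive pings even when no RPCs are in flight.
        'grpc.keepalive_permit_without_calls' => 1
      }
    )
  end
end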
These options allow the client and the server to detect connection interruptions more quickly and re-establish a new connection through the load balancer.