PHP-AMQP RabbitMQ Connection Timeouts

As some of you know, I've been working extensively with a new LAMP stack (Linux, AMQP, MongoDB, PHP) for several years now.

In my current assignment, I've designed and developed a system that uses both local and remote AMQP services for CRUD transactions containing HIPAA data.  The remote system is located on AWS, where my back-end has its own cluster, servicing client requests from the internet via a node.js application and providing access to non-identifying data.

The local system, which we host across two locations for data redundancy, handles patient data requests that arrive from the mobile clients via the AWS client, and also provides universal login services to the rest of the company's internet applications.

This article is the first of two that discuss work-around solutions for problems I encountered using RabbitMQ as a distribution vector between two discrete systems.

It deals with RabbitMQ connection time-outs over an AMQP SSL/TLS connection: what causes the time-out, why it still occurs after keep-alive has been configured under RabbitMQ, and the work-around that prevents further connection time-outs.

Each request sent from AWS is built at the time of the request: an AMQP client is instantiated and the request is published to the remote service. There is no persistent (permanent) connection between AWS and the data center.

When the brokers have been inactive for at least 90 minutes, the next connection request from the node.js client crashes the PHP consumer/broker/daemon (RPC queue), leaving this behind in the php_errors log:

[19-Feb-2016 11:38:42 America/Los_Angeles] PHP Fatal error: Uncaught exception 'PhpAmqpLib\Exception\AMQPRuntimeException' with message 'Broken pipe or closed connection' in /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php:190
Stack trace:
#0 /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php(336): PhpAmqpLib\Wire\IO\StreamIO->write('?????????????')
#1 /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php(457): PhpAmqpLib\Connection\AbstractConnection->write('?????????????')
#2 /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(223): PhpAmqpLib\Connection\AbstractConnection->send_channel_method_frame(1, Array, Object(PhpAmqpLib\Wire\AMQPWriter))
#3 /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Channel/AMQPChannel.php(259): PhpAmqpLib\Channel\AbstractChannel->send_method_frame(Array, Object(PhpAmqpLib\Wire\AMQPWriter))
#4 /home/atladmin/aura-ds/lib/vendor/videla in /home/atladmin/aura-ds/lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php on line 190

At first glance, this error appears to be raised by a bug in the php-amqp library that I'm calling.

Note:  I've since updated from the videlalvaro-phpamqplib library to the php-amqp library.  Read on...

When this exception is thrown, the PHP broker daemon dies; no errors are left behind in the RabbitMQ logs.

A couple of other things to note: we discovered that this error can only be triggered by a call from the node.js client.  In testing, I'd let the brokers run without activity for several days and then publish requests from a PHP test script without reproducing the failure.  Additionally, and logically, any activity within a 90-minute window will prevent this exception.

The root cause, after some digging, isn't in the php-amqp libraries, as the exception dump would have you believe, but rather in the PHP function socket_import_stream().  This function, which the php-amqp library invokes, simply does not support SSL connections.  (Bug report here.)

Within the (videlalvaro version) php-amqp library, the relevant code throwing the exception is in the file:

lib/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php

I filed a problem report with the library maintainers here, requesting that the return value of the resource allocation call be evaluated before the keep-alive value is passed to the AMQP resource, thus avoiding the exception.
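
The kind of check I asked for can be sketched roughly as follows. This is a minimal illustration and not the library's code: the function name tryEnableKeepalive() is hypothetical, and it assumes PHP's sockets extension is available. The idea is simply to test what socket_import_stream() returns before handing the result to socket_set_option().

```php
<?php
/**
 * Hypothetical helper (not part of any AMQP library): try to enable
 * TCP keep-alive on a stream and report failure instead of throwing.
 */
function tryEnableKeepalive($stream): bool
{
    // The sockets extension may not be loaded at all.
    if (!function_exists('socket_import_stream')) {
        return false;
    }

    // socket_import_stream() returns false (with a warning, suppressed
    // here) when the stream cannot be represented as a socket.
    $socket = @socket_import_stream($stream);
    if ($socket === false || $socket === null) {
        return false;
    }

    // Only now is it safe to set the keep-alive option.
    return socket_set_option($socket, SOL_SOCKET, SO_KEEPALIVE, 1);
}
```

With a guard like this, a caller can log that keep-alive could not be enabled and carry on, rather than passing an invalid value down to the socket layer.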

However, this doesn't fix my code, nor does it prevent the node.js client from crashing the PHP brokers after 90 minutes of inactivity.

As of this writing, I have six different persistent daemons (brokers) deployed.  Standard within each of these is a ping event call that returns a Boolean(true) value on success.  (A failure shows up as a timeout.)

I wrote a script that reads my XML configuration for the names and locations of the broker daemons, instantiates a client class object, and publishes a ping event to each of the daemons.  The script is invoked by the cron daemon every hour.  Successful ping results are written to the framework's console log, while errors are written to the logger and alert events are pushed to the framework event manager.
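
The shape of that script can be sketched as below. Everything specific here is an assumption made for illustration: the XML layout, the function names loadBrokers() and pingBrokers(), the queue names, and the $ping callable, which stands in for my client class publishing the ping event and waiting for the reply.

```php
<?php
/**
 * Read broker names and queue names from an XML configuration string.
 * The <broker name="..." queue="..."/> layout is hypothetical.
 */
function loadBrokers(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Unable to parse broker configuration');
    }

    $brokers = [];
    foreach ($doc->broker as $broker) {
        $brokers[(string) $broker['name']] = (string) $broker['queue'];
    }
    return $brokers;
}

/**
 * Ping every broker and return a map of broker name => true/false.
 * $ping is a stand-in for the real client call: it takes a queue name
 * and returns true on success. A timeout is assumed to surface as an
 * exception or as any return value other than true.
 */
function pingBrokers(array $brokers, callable $ping): array
{
    $results = [];
    foreach ($brokers as $name => $queue) {
        try {
            $results[$name] = ($ping($queue) === true);
        } catch (Throwable $e) {
            $results[$name] = false;
        }
    }
    return $results;
}

// Example run with a fake ping that only the "rpc_session" queue answers.
$config = '<brokers>'
    . '<broker name="session" queue="rpc_session"/>'
    . '<broker name="patient" queue="rpc_patient"/>'
    . '</brokers>';

$results = pingBrokers(loadBrokers($config), function (string $queue): bool {
    return $queue === 'rpc_session';
});

foreach ($results as $name => $ok) {
    // The real script logs successes and raises alert events on failure.
    echo $name, ': ', $ok ? 'ok' : 'FAILED', PHP_EOL;
}
```

Scheduling it hourly takes a single crontab entry along the lines of "0 * * * * php /path/to/ping_brokers.php", where the path is a placeholder. Hourly is comfortably inside the 90-minute window, so the brokers never sit idle long enough to trigger the failure.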

This script provides a low-overhead work-around that keeps the daemons up and running.