Diagnostic Logging with the NGINX JavaScript Module

NGINX | September 29, 2020

Liam CrillySr Director, Product Management

Troubleshooting in Production Without Tuning the Error Log

Editor – This blog is one of several that discuss logging with NGINX and NGINX Plus. Please also see:

It’s also one of many blogs about use cases for the NGINX JavaScript module. For the complete list, see Use Cases for the NGINX JavaScript Module.

NGINX helps organizations of all sizes to run their mission‑critical websites, applications, and APIs. Regardless of your scale and choice of deployment infrastructure, running in production is not easy. In this article we talk about just one of the hard things about a production deployment – logging. More specifically, we discuss the balancing act of collecting the right amount of detailed logs for troubleshooting without being swamped with unnecessary data.

Logging Basics

NGINX provides two different logging mechanisms: access logging for client requests, and error logging for when things go wrong. These mechanisms are available in both the HTTP and Stream (TCP/UDP) modules, but here we focus on HTTP traffic. (There is also a third logging mechanism which uses the debug severity level, but we won’t discuss it here.)

A typical, default NGINX logging configuration looks like this.

The log_format directive describes the contents and structure of the log entries created when the access_log directive is included in the configuration. The example above is an extension of the Common Log Format (CLF) used by many web servers. With the error_log directive, you specify the severity level of the messages to log, but not the content or format of entries, which are fixed. More on that in the next section.

Other noteworthy aspects of NGINX logging include:

Logging directives are automatically inherited by lower‑level configuration contexts. For example, an access_log directive at the http context is applied to all server{} blocks.
A logging directive in a child context overrides inherited directives.
Multiple logging directives may exist in the same context. For example, two access_log directives might be used to create both a standard CLF log file and a second, more detailed log.

The Reality of Logging in Production

In general terms, we want to use the access log to provide analytics and usage statistics, and use the error log for failure detection and troubleshooting. But running a production system is seldom that simple. Here are some common challenges:

Access logs lack sufficient detail for troubleshooting
Error logs disclose good detail at the info severity level but that is too verbose for normal operations
The error log format is fixed and so cannot be customized to include variables of particular interest
Entries in the error log don’t include the context of the request and are difficult to match with the corresponding access log entry

What’s more, changing the NGINX configuration to add or remove logging detail in a production environment may also require going through a change‑control process, and redeploying the configuration. Entirely safe, but somewhat cumbersome when troubleshooting a live issue such as, “why am I seeing a spike in 4xx/5xx errors?”. This is of course magnified when there are multiple NGINX instances handling the same traffic across a cluster.

Using a Second Access Log for Errors

Customizing the format of the access log to enrich the data collected for each request is a common approach for enhanced analytics, but doesn’t scale well for diagnostics or troubleshooting. Asking the main access log to do two jobs is a contrived solution, because we typically want a lot more information for troubleshooting than we do for regular analytics. Adding numerous variables to the main access log can dramatically increase log volumes with data that is only occasionally useful.

Instead we can use a second access log and write to it only when we encounter an error that warrants debugging. The access_log directive supports conditional logging with the if parameter – requests are only logged when the specified variable evaluates to a non‑zero, non‑empty value.

With this configuration, we pass the $status variable through a map block to determine the value of the $is_error variable, which is then evaluated by the access_log directive’s if parameter. When $is_error evaluates to 1 we write a special log entry to the access_debug.log file.

However, this configuration doesn’t detect errors encountered during request processing that are ultimately resolved, and therefore result in status 200 OK. One such example is when NGINX is load balancing between multiple upstream servers. If an error is encountered with the selected server then NGINX passes the request to the next server under the conditions configured by the proxy_next_upstream directive. As long as one of the upstream servers responds successfully, then the client receives a successful response, which gets logged with status 200. However, the user experience is likely to be poor due to the retries, and it may not be immediately obvious that an upstream server is unhealthy. After all, we logged a 200.

When NGINX attempts to proxy to multiple upstream servers, their addresses are all captured in the $upstream_addr variable. The same is true for other $upstream_* variables, for example $upstream_status which captures the response code from each attempted server. So when we see multiple entries in these variables, we know that something bad happened – we probably have a problem with at least one of the upstream servers.

How about we also write to access_debug.log when the request was proxied to multiple upstream servers?

Here we use another map block to produce a new variable ($multi_upstreams) whose value depends on the presence of a comma (,) in $upstream_status. A comma means there is more than one status code, and therefore more than one upstream server was encountered. This new variable determines the value of $is_error when $status is not one of the listed error codes.

At this point we have our regular access log and a special access_debug.log file that contains erroneous requests, but we haven’t yet defined the access_debug log format. Let’s now ensure that we have all the data we need in the access_debug.log file to help us diagnose problems.

JSON Logging with JavaScript

Getting diagnostic data into access_debug.log is not difficult. NGINX provides over 100 variables related to HTTP processing, and we can define a special log_format directive that captures as many of them as we need. However, there are a few downsides to building out a naïve log format for this purpose.

It is a custom format – you need to train a log parser how to read it
Entries can be very long and hard for humans to read during live troubleshooting
You continuously need to refer to the log format in order to interpret entries
It is not possible to log non‑deterministic values such as “all request headers”

We can address these challenges by writing log entries in a structured format such as JSON, using the NGINX JavaScript module<.htmla> (njs). JSON format is also widely supported by log processing systems such as Splunk, LogStash, Graylog, and Loggly. By offloading the log_format syntax to a JavaScript function, we benefit from native JSON syntax, with access to all of the NGINX variables and additional data from the njs ‘r’ object.

The js_import directive specifies the file containing the JavaScript code and imports it as a module. The code itself can be found here. Whenever we write an access log entry that uses the access_debug log format, the $json_debug_log variable is evaluated. This variable is evaluated by executing the debugLog JavaScript function as defined by the js_set directive.

This combination of JavaScript code and NGINX configuration produces diagnostic logs that look like this.

The JSON format enables us to have separate objects for information related to the overall HTTP connection (including SSL/TLS), request, upstreams, and response. Notice how the first upstream (10.37.0.71) returned status 504 (Gateway Timeout) before NGINX tried the next upstream (10.37.0.72), which responded successfully. The half‑second timeout (reported as response_time in the first element of the upstreams object) accounts for most of the overall latency for this successful response (reported as elapsed_time in the connection object).

Here is another example of a (truncated) log entry, for a client error caused by an expired client certificate.

Summary

By generating rich diagnostic data only when we encounter an error, we enable real‑time troubleshooting without needing to perform any reconfiguration. Successful requests are not impacted because the JavaScript code runs only when we detect an error at the logging phase, after the last byte was sent to the client.

The complete configuration is available on GitHub – why not try it in your environment? If you’re not already running NGINX Plus, start a free 30-day trial today or contact us to discuss your use cases.