BLOG

Detecting PhantomJS Based Visitors

Published January 22, 2015


These days, many web security incidents involve automation. Web-scraping, password reuse, and click-fraud attacks are perpetrated by adversaries trying to mimic real users, and thus will attempt to look like they are coming from a browser. As a website owner, you want to ensure you serve humans, and as a web service provider you want programmatic access to your content to go through your API instead of being scraped through your heavier and less stable web interface.

Assuming that you have basic checks for cURL-like visitors, the next reasonable step is to ensure that visitors are using real, UI-driven browsers — and not headless browsers like PhantomJS and SlimerJS.

In this article, we’re going to demonstrate some techniques for identifying visits by PhantomJS. We decided to focus on PhantomJS because it is the most popular headless browser environment, but many of the concepts that we’ll cover are applicable to SlimerJS and other tools.

NOTE: The techniques presented in this article are applicable to both PhantomJS 1.x and 2.x, unless explicitly mentioned. First up: is it possible to detect PhantomJS without even responding to it?

HTTP stack

As you may be aware, PhantomJS is built on the Qt framework. The way Qt implements the HTTP stack makes it stick out from other modern browsers.

First, let’s take a look at Chrome, which sends out the following headers:

PhantomJS

In PhantomJS, however, the same HTTP request looks like this:

 
PhantomJS

You’ll notice the PhantomJS headers are distinct from Chrome (and, as it turns out, all other modern browsers) in a few subtle ways:

  • The Host header appears last
  • The Connection header value is mixed case
  • The only Accept-Encoding value is gzip
  • The User-Agent contains “PhantomJS”

Checking for these HTTP header aberrations on the server, it should be possible to identify a PhantomJS browser.

But, is it safe to believe these values? If an adversary uses a proxy to rewrite headers in front of the headless browser, they could modify those headers to appear like a normal modern browser instead.

Looks like tackling this problem purely on the server is not a silver bullet. So let’s take a look at what can be done on the client, using PhantomJS’s JavaScript environment.

Client-side User-Agent Check

We may not be able to trust the User-Agent value as delivered via HTTP, but what about on the client?

PhantomJS

Unfortunately, it is similarly trivial to change user-agent header and navigator.userAgent values in PhantomJS, so this might not be enough.

Plugins

navigator.plugins contains an array of plugins that are present within the browser. Typical plugin values include Flash, ActiveX, support for Java applets, and the “Default Browser Helper”, which is a plugin that indicates whether this browser is the default browser in OS X. In our research, most fresh installs of common browsers include at least one default plugin — even on mobile.

This is unlike PhantomJS, which doesn’t implement any plugins, nor does it provide a way to add one (using the PhantomJS API).

The following check might then be useful:

PhantomJS

On the other hand, it’s fairly trivial to spoof this plugin array by modifying the PhantomJS JavaScript environment before the page is loaded.

It’s also not difficult to imagine a custom build of PhantomJS with real, implemented plugins. This is easier than it sounds because the Qt framework on which PhantomJS is built provides a native API for implementing plugins.

Timing

Another point of interest is how PhantomJS suppresses JavaScript dialogs:

PhantomJS

After measuring several times, it appears that if the alert dialog is suppressed within 15 milliseconds, the browser is probably not being controlled by a human. But using this approach means bothering real users with an alert they’ll manually have to close.

Global Properties

PhantomJS 1.x exposes two properties on the global object:

PhantomJS

However, these properties are part of an experimental feature and may change in the future.

Lack of JavaScript Engine Features

PhantomJS 1.x and 2.x currently use out-of-date WebKit engines, which means there are browser features that exist in newer browsers that do not exist in PhantomJS. This extends to the JavaScript engine — whereby some native properties and methods are different or absent in PhantomJS.

One such method is Function.prototype.bind, which is missing in PhantomJS 1.x and older. The following example checks whether bind is present, and that it has not been spoofed in the executing environment.

PhantomJS

This code is a little too tricky to explain in detail here, but you can find out more from our presentation.

Stack Traces

Errors thrown by JavaScript code evaluated by PhantomJS via the evaluate command contain a uniquely identifiable stack trace, from which we can identify the headless browser.

For example, suppose that PhantomJS calls evaluate on the following code:

PhantomJS

Note that this example uses a custom indexOfString() function, left as an exercise for the reader, since the native String.prototype.indexOf can be spoofed by PhantomJS to always return a negative result.

Now, how do you get a PhantomJS script to evaluate this code? One technique is to override some frequently used DOM API functions that are likely to be called. For example, the code below overrides document.querySelectorAll to inspect the browser’s stack trace:

Summary

In this article we’ve looked at 7 different techniques for identifying PhantomJS, both on the server and by executing code in PhantomJS’s client JavaScript environment. By combining the detection results with a strong feedback mechanism — for example, rendering a dynamic page inert, or invalidating the current session cookie — you can introduce a solid hurdle for PhantomJS visitors. Always keep in mind however, that these techniques are not infallible, and a sophisticated adversary will get through eventually.

To learn more, we recommend watching this recording of our presentation from AppSec USA 2014 (slides). We’ve also put together a GitHub repository of example implementations — and possible circumventions — of the techniques presented here.

Thanks for reading, and happy hunting.

Contributors

Sergey Shekyan – @sshekyan
Ben Vinegar – @bentlegen
Bei Zhang – @ikarienator