Understanding and Mitigating the Impact of Web Robot and IoT Traffic on Web Systems

From Knoesis wiki


This article is based on work supported by the National Science Foundation (NSF) under Grant No. 1464104 (http://nsf.gov/awardsearch/showAward?AWD_ID=1464104). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

Introduction & Motivation

The kind of information shared on the Web has shifted dramatically over the past decade and a half. Compared to Web pages that mainly hosted static material during the early 2000s, modern Web sites are full of dynamic content in the form of in-the-moment news articles, opinions, and social information. Consequently, Web robots or crawlers, software agents that automatically submit HTTP requests for Web content without any human intervention, have been steadily rising in both sophistication and volume. The latest industry reports suggest that six out of ten Web requests originate from bots, and this proportion is expected to rise even higher with the advent of the Internet of Things. The figure below illustrates the rapid increase in the volume of robot traffic reported by various scientific and commercial studies over the past decade:


Present efforts have identified how the behavioral and statistical characteristics of Web robot traffic stand in contrast to traffic generated by humans. Unfortunately, current methods for optimizing the response rate, power consumption, and other performance aspects of Web systems rely on traffic exhibiting human-like characteristics. For example, caches are a critical tool for improving system response times and minimizing energy usage, but they require traffic to display human behavioral patterns that robots do not show. Given that robots are the main form of Web traffic today and are likely to rise to even higher levels in the future, they threaten the performance, efficiency, and scalability of all kinds of Web systems, from single servers to server farms and large-scale clouds.

This NSF-funded project will devise essential methods for understanding the complexity of Web robot traffic and its impact on Web systems. The research activities will operationalize our existing knowledge about the features and behaviors of robot requests to devise novel classification methods and to build a robot traffic generator and a robot-resilient caching system. The results of the project stand to transform how Web systems of all kinds are designed and optimized so that the performance and energy costs associated with servicing robot requests are mitigated. The results also lay a foundation for building analytic models of the demand robots impose based on the specific architecture of a system, and for developing open-source plugins that control robot activity and their access to information in unprecedented ways.


Graduate Students: Nathan Rude, Ning Xie
Undergraduate Students: Logan Rickert
Faculty: Derek Doran (PI)


Robot detection

Contemporary offline and real-time robot detection techniques suffer from several limitations. Offline detection approaches based on: (i) simple textual analysis can be easily obfuscated by Web robots; (ii) identifying common characteristics of robots (such as the percentage of image requests and response codes) may not generalize across server domains; and (iii) training machine learning models, such as a naive Bayes net or Hidden Markov Model, over traffic statistics may see degraded performance as the statistical profiles of robots and humans evolve continuously. Contemporary real-time detection approaches subject active sessions to tests such as checking for the execution of embedded code or requesting hidden resources on an HTML page. These tests, however, are designed to identify specific ill-behaviors, such as robots participating in a distributed denial-of-service attack. They also require extensive modifications to Web servers and lack the ability to identify more general classes of ill-behaving robots. Furthermore, although offline and real-time approaches are synergistic and offer complementary benefits, they have been developed independently, with disparate underlying philosophies. Finally, real-time approaches are often carefully engineered systems that require costly Web server modifications that organizations may not wish to invest in. These drawbacks call for a simple integrated approach to online and offline detection that can reliably identify Web robots despite the continuous evolution of their traffic.

We propose a novel approach to detect Web robots both offline and in real time. The approach relies on training analytical models over the request patterns of robots and humans for different types of resources, encoded as discrete-time Markov chain (DTMC) models. The approach works well in offline settings where, in a post-mortem analysis, users are interested in capturing realistic and complete samples of Web robot traffic. The figure below illustrates its superior F_1 measure (the harmonic mean of precision and recall) against a number of baseline multinomial predictors used as the core learning algorithm in other methods.
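As a minimal sketch of this idea (the resource-type symbols, smoothing constant, and toy sessions below are illustrative placeholders, not the project's actual configuration), one DTMC can be estimated per traffic class and a session labeled by whichever chain assigns it the higher log-likelihood:

```python
from collections import defaultdict
import math

def train_dtmc(sessions, smoothing=1.0):
    """Estimate a first-order DTMC from sessions, each a list of
    resource-type symbols. Laplace smoothing avoids zero probabilities."""
    counts = defaultdict(lambda: defaultdict(float))
    states = set()
    for s in sessions:
        states.update(s)
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1.0
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        probs[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return probs

def log_likelihood(dtmc, session, floor=1e-9):
    """Sum log transition probabilities; unseen transitions get a floor."""
    ll = 0.0
    for a, b in zip(session, session[1:]):
        ll += math.log(dtmc.get(a, {}).get(b, floor) or floor)
    return ll

def classify(session, robot_dtmc, human_dtmc):
    """Label a session by whichever chain explains it better."""
    if log_likelihood(robot_dtmc, session) >= log_likelihood(human_dtmc, session):
        return "robot"
    return "human"
```

For example, a chain trained on sessions that repeatedly fetch images will assign far higher likelihood to a monotonous image-only session than a chain trained on varied human browsing.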


Our proposed detection approach is adaptable to real-time settings, where a system wishes to identify whether an active session belongs to a Web robot. This classification must be performed as early as possible so that the system can take an alternative action on the session (e.g., admit its requests with low priority, block access to specific files, or block the robot entirely). Our adaptation algorithm performs well in a number of settings, and we are able to identify robot sessions within a small number (k < 20) of requests.
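The real-time adaptation can be sketched as a sequential test: update both chains' log-likelihoods as each request arrives, and decide as soon as one model is decisively more likely or a request budget is reached. The margin and budget values below are illustrative assumptions, not tuned thresholds, and the DTMCs are assumed to have been trained offline as above.

```python
import math

def realtime_classify(request_stream, robot_dtmc, human_dtmc,
                      margin=3.0, k_max=20, floor=1e-9):
    """Incrementally score an active session against both chains and
    decide once one model is decisively more likely (or at k_max).
    Returns (label, number_of_requests_seen_before_deciding)."""
    ll_robot = ll_human = 0.0
    prev = None
    k = 0
    for k, req in enumerate(request_stream, start=1):
        if prev is not None:
            ll_robot += math.log(robot_dtmc.get(prev, {}).get(req, floor) or floor)
            ll_human += math.log(human_dtmc.get(prev, {}).get(req, floor) or floor)
        prev = req
        diff = ll_robot - ll_human
        if abs(diff) >= margin or k >= k_max:
            return ("robot" if diff >= 0 else "human"), k
    return ("robot" if ll_robot >= ll_human else "human"), k
```

A session whose transitions strongly favor one chain is decided after only a handful of requests, matching the early-decision goal above.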


Ongoing work seeks to develop algorithms for identifying malicious Web robots. One promising approach, pursued with international collaborators, integrates fuzzy logic with Markov clustering algorithms to separate bots from humans and, among the robots found, to flag those that are malicious.


Initial results are promising, but our work is ongoing:


Traffic Generation

With robots and IoT devices representing a significant proportion of all Web traffic, a realistic generator for this class of traffic is necessary to better evaluate the capacity, scalability, and other performance characteristics of Web systems. Although generators exist for many important types of Internet traffic, including streaming media, peer-to-peer file sharing, and malicious attackers, two distinguishing features of robot and IoT traffic (henceforth referred to simply as “robot traffic”) demand a new generation approach:
 Multi-faceted nature: There exists a wide variety of Web robots on the Internet, each one exhibiting behaviors and characteristics determined by its individual programming. Some robots are designed to exclusively fetch images, for example, while others prefer to only request documents. Beyond this resource-type preference, the inter-request time, the duration of requests, the request frequency, and other traits also vary among different types of robots (we postulate that the heavy-tailed characteristics of robot traffic may be rooted in this behavioral variation, rather than in the resource size and client think time distributions that have been traced as the genesis of heavy-tailed distributions in human-generated traffic). One direction of our future research is to examine and eventually classify the different kinds of robots, allowing a traffic generator to account for this multi-faceted nature in greater detail.
 Sequence Dependency: Imagine the structure of a Web site or Web system as a hyperlink graph where nodes are resources (files) and an edge connects one resource to another appearing on the same HTML page; there are typically many ways for resources to be connected. It may be intuitive to believe that a resource linked to a high-degree node on this graph is requested with higher frequency, or that a resource is likely to be requested if a nearby resource was recently requested. Yet whether this hyperlink graph establishes a sequence dependency in resource requests for Web robot traffic is unknown. Human request patterns are driven by the hyperlink structure, but robot traffic may not be, since robots may visit a site with a priori information about which resources they want to request and where those resources are located.
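One way to probe the sequence-dependency question is to measure how often consecutive requests in a session actually traverse a hyperlink edge. A minimal sketch, with a hypothetical hyperlink graph and sessions:

```python
def link_follow_ratio(session, hyperlink_graph):
    """Fraction of consecutive request pairs (a, b) in a session where b
    is hyperlinked from a. Low values suggest the client is not navigating
    the site's link structure, a pattern we might expect of many robots."""
    pairs = list(zip(session, session[1:]))
    if not pairs:
        return 0.0
    followed = sum(1 for a, b in pairs if b in hyperlink_graph.get(a, set()))
    return followed / len(pairs)
```

Aggregating this ratio over known human sessions versus known robot sessions would give a direct empirical test of the hypothesis stated above.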

Related Existing Traffic Models

The Wikipedia article History of network traffic models provides a basic introduction to some of the network traffic features these models consider, such as self-similarity and heavy-tailed patterns. The article Traffic generation model describes a variety of traffic generation models that are currently used in practice. In general, present approaches use a stochastic model of traffic flows or data sources in a communication network. But as the Web continues its rapid evolution, it is reasonable to expect many new kinds of robots with widely varying functions and behaviors to emerge. Stochastic models, which require many assumptions about the statistical regularity of requests and behaviors, may thus be challenging to adapt to robot traffic. We need a novel model that can adapt to changes in robot traffic behavior and directly generate “real requests”, ideally as an end-to-end model.

Promising Alternative Ideas

To build a novel generator of Web robot traffic, we note that the most essential feature of a generator is a pair of models of what is requested and when. Traditional stochastic models tie these two notions together, for example by drawing a resource request as a function of the previous series of requests made and the rates of these requests. For robot traffic, however, it may be the case that the when and the what are generated by independent processes. This is because the rate at which robot requests are submitted may be periodic and consistent, such as an image indexer that submits requests during its session at a constant rate and returns to check for updates at the same time every day or week. The intensity with which requests are made could therefore be independent of what is requested. For example, two image indexers may exhibit similar what patterns even though their profiles of when requests are submitted are completely different.

With some evidence indicating that there is no strong relationship between the when and the what, it is reasonable to build the when and what models separately and combine them at the end.
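Such a combined generator can be sketched by sampling the two models independently and merging the draws into a single timestamped trace. The component models below, a constant-rate exponential when process and a biased i.i.d. what draw, are stand-in assumptions for illustration, not models fitted to real traffic:

```python
import random

def generate_session(what_model, when_model, n_requests, seed=None):
    """Combine independent 'what' and 'when' models into one trace.
    Each trace entry is a (timestamp, resource_type) pair."""
    rng = random.Random(seed)
    t = 0.0
    trace = []
    for _ in range(n_requests):
        t += when_model(rng)              # sample the next inter-arrival gap
        trace.append((t, what_model(rng)))  # sample the resource type independently
    return trace

# Placeholder component models (assumptions, not fitted to real data):
# a constant-rate arrival process and an image-heavy i.i.d. type draw,
# loosely mimicking the image-indexer example above.
constant_rate = lambda rng: rng.expovariate(2.0)  # roughly 2 requests/sec
image_indexer = lambda rng: rng.choices(["img", "web"], weights=[9, 1])[0]
```

Swapping in an LSTM-based what model (next section) or a periodic when model requires changing only the corresponding callable.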

The What Model

Since Web traffic can be considered a sequence of requests, ideas from sequence generation models can be borrowed for the robot traffic generation model. After examining many sequence generation cases and models, one inspirational article stands out: Composing Music With Recurrent Neural Networks. There, the author feeds many .mid-format music files to an LSTM (Long Short-Term Memory) network, trains it, and uses it to compose new music. The music composition task is very similar to our robot traffic generation problem, and it is the model we are investigating for our What model.

So what is an LSTM, and why are we choosing it? An LSTM (Long Short-Term Memory) network is a specific variant of the recurrent neural network that is better at “memorizing” long-term dependencies.

Here is a rough sketch about how to use an LSTM to generate robot traffic:

1. Extract features of each requested resource. Candidate features may include the size, type, frequency, and dependency of the requested resource.

2. Continuously feed the features of the requested resources, in order, into the LSTM, forming a request sequence, and train the LSTM model on this sequence.

3. To start generating, feed the trained LSTM model a start request and predict the next request. Feed the “next” request it just predicted back into the model, then predict the next request again. By continuing to feed each predicted request back into the LSTM model, we obtain a generated request sequence.
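The feedback loop in step 3 can be sketched as follows. The stub predictor here (a fixed cycle of resource types, chosen purely for illustration) stands in for the trained LSTM's one-step prediction; a real LSTM would also carry hidden state across steps rather than looking only at the last request.

```python
def generate_requests(predict_next, start_request, length):
    """Step 3 as code: repeatedly feed the model's own prediction back in
    to autoregressively grow a request sequence of the desired length."""
    seq = [start_request]
    for _ in range(length - 1):
        seq.append(predict_next(seq[-1]))
    return seq

# Stub predictor (an assumption for illustration): cycle through types
# the way a simple crawler might alternate page and embedded-resource fetches.
cycle = {"web": "img", "img": "doc", "doc": "web"}
stub_lstm = lambda prev: cycle[prev]
```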

Current Thoughts...

However, robot traffic is not exactly the same as music composition: it contains more than just one sequence, and there are overlapping problems to solve. Using an LSTM, it is possible to generate a request sequence for each robot, or for each specific kind of robot, but how to organize these separate sequences and combine them into the integrated traffic remains an open question. While the LSTM model can predict what the next robot request would be, we also need a when model to tell us when the next request happens. The when model is still undetermined.

Robot-resilient Web caching

Because robots and humans exhibit different traffic characteristics and behaviors, caching policies that use heuristic rules to admit and evict resources based on human behaviors are unlikely to yield high hit ratios over robot requests. Furthermore, predictive policies that learn behavioral patterns may perform poorly over present-day mixtures of robot and human traffic because they are unable to learn patterns within pure streams of requests from either type of traffic (see figure below). To overcome these obstacles, an innovative dual-caching architecture is used that incorporates independent caches for robot and human traffic. Dual-caches enable the use of separate caching policies and prediction algorithms that are compatible with human and robot traffic, respectively.
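The dual-cache idea can be sketched as two independent caches behind a traffic classifier. LRU is used for both sides here purely for brevity; the architecture's point is that each side may run its own policy and prediction model. The session-ID-based classifier is a hypothetical stand-in for the detection methods described earlier.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache that only tracks hit/miss counts."""
    def __init__(self, capacity):
        self.capacity, self.store = capacity, OrderedDict()
        self.hits = self.misses = 0

    def request(self, resource):
        if resource in self.store:
            self.store.move_to_end(resource)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.store[resource] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least recently used

class DualCache:
    """Route each request to an independent cache per traffic class, so
    robot requests cannot pollute the human cache (and vice versa)."""
    def __init__(self, robot_capacity, human_capacity, classifier):
        self.caches = {"robot": LRUCache(robot_capacity),
                       "human": LRUCache(human_capacity)}
        self.classifier = classifier

    def request(self, session_id, resource):
        self.caches[self.classifier(session_id)].request(resource)
```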

Parallel coordinate comparison.png

Ideally, Web caches should be equipped to predict the exact resource that will be requested next by a Web robot session. This is not feasible due to the large set of resources available on a Web server. Even predicting the extension of the next resource may require a model to predict one type out of hundreds, a task that is challenging for a lightweight classifier to perform in real time. Instead we follow previous work <ref name="robotAnalysis" /> and cluster resources into types. Predicting the next type of resource may provide a smarter alternative since the popularity of robot requests exhibits a power-law tail <ref name="detectingRobots" /> and, as such, the most popular resources of a predicted type are the ones likely to be requested next. The resource types used can be seen in Table 1.

Table 1: Breakdown of Resource Types
Class Extensions
text txt, xml, sty, tex, cpp, java
web asp, jsp, cgi, php, html, htm, css, js
img tiff, ico, raw, pgm, gif, bmp, png, jpeg, jpg
doc xls, xlsx, doc, docx, ppt, pptx, pdf, ps, dvi
av avi, mp3, wvm, mpg, wmv, wav
prog exe, dll, dat, msi, jar
compressed zip, rar, gzip, tar, gz, 7z
malformed request strings that are not well-formed
noExtention request for directory contents
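Table 1 can be operationalized as a simple lookup from a request path to its resource class. Treating any unknown extension as malformed, as done below, is a simplification of the real request parser:

```python
# Extension sets transcribed from Table 1.
RESOURCE_CLASSES = {
    "text": {"txt", "xml", "sty", "tex", "cpp", "java"},
    "web": {"asp", "jsp", "cgi", "php", "html", "htm", "css", "js"},
    "img": {"tiff", "ico", "raw", "pgm", "gif", "bmp", "png", "jpeg", "jpg"},
    "doc": {"xls", "xlsx", "doc", "docx", "ppt", "pptx", "pdf", "ps", "dvi"},
    "av": {"avi", "mp3", "wvm", "mpg", "wmv", "wav"},
    "prog": {"exe", "dll", "dat", "msi", "jar"},
    "compressed": {"zip", "rar", "gzip", "tar", "gz", "7z"},
}

def resource_class(path):
    """Map a requested path to its Table 1 class."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return "noExtention"  # directory request (class name as in Table 1)
    ext = name.rsplit(".", 1)[-1].lower()
    for cls, exts in RESOURCE_CLASSES.items():
        if ext in exts:
            return cls
    return "malformed"  # simplification: unrecognized extensions land here
```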

Predicting Cache Resources

Request Features

To predict the type of a Web robot request, we consider algorithms that try to predict the type of the n-th resource requested given the sequence of the past n - 1 request types. A training record is denoted r_i = (v_i, l_i), where v_i is the ordered sequence of the past n - 1 requests and l_i = x_n is the type of resource requested after the sequence v_i. The following figure shows an example with n = 10. The first record is composed of the first nine requests, and its class label is the tenth request; the second record is composed of the second through tenth requests, and its label is given by the eleventh request. The trained predictor maintains a history of the previous n - 1 requests and, based on this history, generates the predicted label for the next request.
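Constructing these sliding-window records r_i = (v_i, l_i) from a session's type sequence can be sketched as:

```python
def make_records(type_sequence, n=10):
    """Build training records from a session's request-type sequence:
    each record pairs a window of n-1 past types (v_i) with the type
    that immediately follows it (the label l_i)."""
    records = []
    for i in range(len(type_sequence) - n + 1):
        window = type_sequence[i:i + n - 1]
        label = type_sequence[i + n - 1]
        records.append((window, label))
    return records
```

With n = 10 this yields exactly the records described above: the first nine types labeled by the tenth, the second through tenth labeled by the eleventh, and so on.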

Elman Neural Network

Prediction Algorithms

We want to be able to predict the behavior of robot and human traffic. In order to do this, we consider algorithms that utilize resource types requested in a web session and the order the requests occur. Specifically we investigate the following algorithms:

  • Resource order
    • Discrete-Time Markov Chain (DTMC)
  • Resource type (features)
    • Multinomial Logistic Regression (MLR)
    • Decision Trees (J48)
    • Naive Bayes (NB)
    • Support Vector Machines (SVM)
    • Neural Nets
  • Resource order and type combined
    • Elman Neural Network (ENN)

The graph above depicts the design of an ENN. Similar to a traditional Neural Network, the ENN has a recurrent connection from its hidden layer to a context layer which saves the state of the current hidden layer. This state is fed back into the hidden layer on the next iteration to capture the temporal aspects of the data.
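The context-layer mechanics can be sketched in a few lines of plain Python; the weight matrices below are toy values, not trained parameters, and biases are omitted for brevity.

```python
import math

def elman_step(x, context, W_xh, W_ch, W_hy):
    """One Elman step: the hidden layer sees the current input x plus the
    previous hidden state (the context layer), then emits an output."""
    hidden = [math.tanh(sum(wx * xi for wx, xi in zip(row_x, x)) +
                        sum(wc * ci for wc, ci in zip(row_c, context)))
              for row_x, row_c in zip(W_xh, W_ch)]
    output = [sum(w * h for w, h in zip(row, hidden)) for row in W_hy]
    return output, hidden  # the new hidden state becomes the next context

def elman_run(sequence, W_xh, W_ch, W_hy, hidden_size):
    """Run a sequence through the network, threading the context along."""
    context = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        y, context = elman_step(x, context, W_xh, W_ch, W_hy)
        outputs.append(y)
    return outputs
```

Feeding the same input twice produces different outputs because the context layer carries the first step's hidden state into the second, which is exactly the temporal memory described above.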

The performance of the previous algorithms can be seen in the graph below.


This graph shows that the feature-based algorithms outperformed both the DTMC and the ENN on the WSU data. With the UConn data, however, the feature-based algorithms had performance comparable to the DTMC, and the ENN outperformed all other algorithms.

We examine the rationale behind these results by first determining the impact that request order had on prediction. The two figures below show the Markov models for WSU and UConn, respectively. These models show the probability that the next resource will be of a given type, given the current resource type.

WSU markov.png
UConn markov.png

For the WSU model we observe a prominent diagonal. This implies that the DTMC learns that the next resource type to predict should be the same as the current resource type. However, this is not an accurate representation of Web traffic on a server: Web traffic is more dynamic and cannot be reduced to sequential requests for the same resource type. The UConn model, with its higher complexity, offers more information that an order-based algorithm can use for predictions. We see the effect of this when comparing the performance of the feature-based algorithms to the DTMC. For WSU, the resource order carried no predictive power, and thus the DTMC's performance suffered. With the UConn data, however, the DTMC was able to learn sufficient patterns to give performance comparable to the feature-based algorithms.

The Elman Neural Network, which utilizes both resource types and their order, appears to be a robust solution for predicting the next type of resource to be requested. We observed that for certain datasets the request order does not assist with predictions, and the performance of the ENN reflected this: for the WSU data, the ENN performed slightly worse than the feature-based algorithms. However, when the request order did carry predictive power, the ENN exploited it and achieved significantly better performance than the other algorithms.

Thus the ENN is a suitable algorithm for predicting the next resource type to precache. By building two models, one for robots and one for humans, we can leverage the benefits of a dual-caching scheme, with two separate prediction models determining the behavior of humans and robots independently.


  • N. Rude and D. Doran. “Request Type Prediction for Web Robot and Internet of Things Traffic”, Proc. of IEEE Intl. Conference on Machine Learning and Applications, Miami, FL, Dec. 2015.
  • D. Doran and S. Gokhale. “An Integrated Method for Offline and Real-time Web Robot Detection”, under review.
  • M. Zabihi, R. Sadeghi, and D. Doran. “Unsupervised Detection of Benign and Malicious Web Robots and Crawlers”, in preparation.