Understanding and Mitigating the Impact of Web Robot and IoT Traffic on Web Systems


== Introduction & Motivation ==

The kind of information shared on the Web has shifted dramatically over the past decade and a half. Compared to Web pages that mainly hosted static material during the early 2000s, modern Web sites are full of dynamic content in the form of in-the-moment news articles, opinions, and social information. Consequently, Web robots or crawlers, which are software agents that automatically submit HTTP requests for Web content without any human intervention, have been steadily rising in both sophistication [CITE] and volume [CITE].

Present efforts have identified how the behavioral and statistical characteristics of Web robot traffic [1–8, 12–14] stand in contrast to traffic generated by humans [4, 8, 12, 15]. Unfortunately, current methods for optimizing the response rate [16, 17], power consumption [18–20], and other performance aspects of Web systems rely on traffic exhibiting human-like characteristics. For example, caches are a critical tool for improving system response times [21] and minimizing energy usage [22, 23], but they require traffic to display human behavioral patterns [24–29] that robots do not show [4, 30]. Because robots are now the main form of Web traffic and are likely to rise to even higher levels in the future, they stand to threaten the performance, efficiency, and scalability of all kinds of Web systems, from single servers to farms and large-scale clouds.

To mitigate the effects of robot traffic on Web systems, one approach may be to go on the ‘attack’ and devise blacklists or gateways that stop robot requests from ever reaching a Web system. However, robots play an essential role in the Web ecosystem. They are principally responsible for collecting and analyzing the content that powers Web search [31] and information aggregator services [32]. They may also be employed by useful software systems to submit and collect data through a variety of Web service APIs [33]. Finally, as the ‘Internet of Things’ concept [34] becomes a reality, all kinds of physical devices will use robots to submit data through the Web. Thus, so long as they behave well [13] and are not malicious (for example, part of an offensive botnet [35]), robot requests should reach any Web system unobstructed. Administrators familiar with the operation and information hosted on a Web system may then decide how it responds to robot requests.

The results of this project stand to transform how Web systems of all kinds are designed and optimized so that the performance and energy costs associated with servicing robot requests are mitigated. They also lay a foundation for building analytic models of the demand robots impose given the specific architecture of a system, and for developing open-source plugins that control robot activity and access to information in unprecedented ways.

The intellectual merits of the proposed research lie in: (i) the innovative use of (un)supervised learners on Web logs that automatically infer profiles of robots based on their functionality and demand; (ii) a novel approach to generate realistic streams of robot requests; and (iii) the development of a new caching architecture and policy that capably handles streams of traffic with any level of robot requests.

== Overview ==

== Research Tasks ==

This work is broken down into the following parts:

=== Robot classification ===

=== Traffic Generation ===

==== Robot arrival process ====

==== Assigning resource requests ====

=== Robot-resilient Web caching ===


Because robots and humans exhibit different traffic characteristics and behaviors, caching policies that use heuristic rules to admit and evict resources based on human behaviors [24–29] are unlikely to yield high hit ratios over robot requests. Furthermore, predictive policies that learn behavioral patterns may perform poorly over present-day mixtures of robot and human traffic because the mixture prevents them from learning the patterns present in pure streams of requests from either type of traffic. To overcome these obstacles, we use a ''dual-caching'' architecture that maintains independent caches for robot and human traffic. Dual caches enable the use of separate caching policies compatible with human [CITE] and robot traffic.
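Below is a minimal sketch of the dual-cache front end, assuming a hypothetical robot detector (<code>is_robot</code>) and two cache objects exposing <code>get</code>/<code>admit</code> operations; the names, interface, and policy comments are illustrative assumptions rather than the deployed implementation.

<syntaxhighlight lang="python">
# Illustrative sketch of a dual-cache front end: requests flagged as robot
# traffic are served from a cache tuned for robot behavior, all other
# requests from a cache tuned for human behavior. The detector and both
# policies are placeholders (assumptions), not the system described above.

class DualCache:
    def __init__(self, human_cache, robot_cache, is_robot):
        self.human_cache = human_cache   # e.g., a policy built for human request patterns
        self.robot_cache = robot_cache   # e.g., a type-prediction-driven policy for robots
        self.is_robot = is_robot         # callable: request -> bool (robot detector)

    def get(self, request, fetch_from_origin):
        cache = self.robot_cache if self.is_robot(request) else self.human_cache
        resource = cache.get(request.url)
        if resource is None:             # miss: fetch and let the policy decide admission
            resource = fetch_from_origin(request.url)
            cache.admit(request.url, resource)
        return resource
</syntaxhighlight>

Keeping the detector and the two policies behind a single front end lets each cache be tuned or replaced independently as the mix of robot and human traffic changes.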

Ideally, a Web cache would predict the exact resource that a Web robot session will request next. This is not feasible because of the large number of resources available on a Web server. Even predicting just the extension of the next resource would require a model to choose among hundreds of classes, a task that is challenging for a lightweight classifier to perform in real time. Instead, we follow previous work <ref name="robotAnalysis" /> and cluster resources into types. Predicting the next type of resource is a more tractable alternative: because the popularity of robot requests exhibits a power-law tail <ref name="detectingRobots" />, the most popular resources of the predicted type are the ones most likely to be requested next. The resource types used are listed in Table 1.

{| class="wikitable" style="text-align: center"
|+ Table 1: Breakdown of Resource Types
! Class !! Extensions
|-
| text || txt, xml, sty, tex, cpp, java
|-
| web || asp, jsp, cgi, php, html, htm, css, js
|-
| img || tiff, ico, raw, pgm, gif, bmp, png, jpeg, jpg
|-
| doc || xls, xlsx, doc, docx, ppt, pptx, pdf, ps, dvi
|-
| av || avi, mp3, wvm, mpg, wmv, wav
|-
| prog || exe, dll, dat, msi, jar
|-
| compressed || zip, rar, gzip, tar, gz, 7z
|-
| malformed || request strings that are not well-formed
|-
| noExtension || request for directory contents
|}
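To illustrate how requests can be clustered into the types of Table 1, the following sketch maps a requested path to its class by file extension; the handling of unknown extensions is an assumption, since Table 1 defines no separate class for them.

<syntaxhighlight lang="python">
# Map a requested path to one of the resource types in Table 1.
# The extension lists follow the table; an ill-formed request string is
# "malformed" and a request with no extension (e.g., a directory) is
# "noExtension".
import os

RESOURCE_TYPES = {
    "text": {"txt", "xml", "sty", "tex", "cpp", "java"},
    "web": {"asp", "jsp", "cgi", "php", "html", "htm", "css", "js"},
    "img": {"tiff", "ico", "raw", "pgm", "gif", "bmp", "png", "jpeg", "jpg"},
    "doc": {"xls", "xlsx", "doc", "docx", "ppt", "pptx", "pdf", "ps", "dvi"},
    "av": {"avi", "mp3", "wvm", "mpg", "wmv", "wav"},
    "prog": {"exe", "dll", "dat", "msi", "jar"},
    "compressed": {"zip", "rar", "gzip", "tar", "gz", "7z"},
}

def resource_type(path):
    """Return the Table 1 class for a requested path."""
    if not path or " " in path:           # crude check for an ill-formed request string
        return "malformed"
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if not ext:
        return "noExtension"
    for type_name, extensions in RESOURCE_TYPES.items():
        if ext in extensions:
            return type_name
    return "noExtension"                  # assumption: unknown extensions have no class in Table 1

print(resource_type("/images/logo.png"))  # -> img
print(resource_type("/docs/"))            # -> noExtension
</syntaxhighlight>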

==== Predicting Cache Resources ====

To predict the type of a Web robot request, we consider algorithms that try to predict the type of the ''n''<sup>th</sup> resource requested given a sequence of the past ''n'' - 1 request types. A training record is denoted ''r<sub>i</sub> = (v<sub>i</sub>,l<sub>i</sub>)'', where ''v<sub>i</sub>'' is the ordered sequence of the past ''n'' - 1 request types and ''l<sub>i</sub> = x<sub>n</sub>'' is the type of resource requested after the sequence ''v<sub>i</sub>''. Figure [ID] shows an example with ''n'' = 10. The first record is composed of the first nine requests, and its class label is the tenth request; the second record is composed of the second through the tenth requests, and its label is given by the eleventh request. The trained predictor maintains a history of the previous ''n'' - 1 requests and, based on this history, generates the predicted label for the next request.

[[File:PredData.png|300px|center]]
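A short sketch of how such training records could be assembled from a session's sequence of request types follows; the window length ''n'', the function name, and the toy session are illustrative assumptions.

<syntaxhighlight lang="python">
# Build training records r_i = (v_i, l_i) from a session's sequence of
# request types: v_i is a window of the previous n-1 types and l_i is the
# type that follows it. The sequence and n below are toy values.

def build_records(type_sequence, n=10):
    records = []
    for i in range(len(type_sequence) - n + 1):
        v_i = tuple(type_sequence[i : i + n - 1])   # previous n-1 request types
        l_i = type_sequence[i + n - 1]              # type of the n-th request (class label)
        records.append((v_i, l_i))
    return records

# Toy session: with n = 3, the first record is (('web', 'img'), 'img'), etc.
session = ["web", "img", "img", "doc", "web", "web"]
for v_i, l_i in build_records(session, n=3):
    print(v_i, "->", l_i)
</syntaxhighlight>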

==== Cache Design ====

* Predictive
* Cloud-based
* Replacement policies (adaptive LRU); a baseline LRU sketch is given below
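As a baseline for the replacement-policy item above, here is a plain LRU sketch using the same <code>get</code>/<code>admit</code> interface assumed in the dual-cache sketch earlier; the adaptive variant explored in this project, which would bias admission and eviction using the predicted resource type, is not shown.

<syntaxhighlight lang="python">
# Minimal LRU replacement policy used as a baseline; an adaptive variant
# would additionally weight entries by the predicted next resource type.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()       # key -> resource, ordered by recency

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)      # mark as most recently used
        return self.entries[key]

    def admit(self, key, resource):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = resource
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
</syntaxhighlight>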


== Acknowledgement ==

This article is based on work supported by the National Science Foundation (NSF) under Grant No. 1464104. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

== References ==

<ref name="robotAnalysis">D. Doran, “Detection, classification, and workload analysis of web robots,” Ph.D. dissertation, University of Connecticut, 2014.</ref>
<ref name="detectingRobots">D. Doran and S. Gokhale, “Detecting Web Robots Using Resource Request Patterns,” in Proc. of the Intl. Conference on Machine Learning and Applications, 2012, pp. 7–12.</ref>
<references/>