I recently had to deal with some bad bots and scrapers on a LAMP stack with Varnish at the edge. I decided to deal with it in Varnish, as I always try to handle as many tasks at the edge as I can and leave Apache to serve PHP.

So I thought about the problem a bit and decided to use a token bucket; nothing unusual about that. However, I went a bit further and decided that some pages are 'worth' more than others, i.e. they are more sensitive: for example, accessing the homepage vs accessing account pages. This required patching the source of the throttle mod (vsthrottle) so you can pass the 'cost' of a page, and more than 1 token is removed from the bucket instead of the default of 1. For now it just logs the request and rejects it with a 429, but I intend to send a user that is exceeding the request rate to a different backend server that will give them fake data to devalue their scraping.

You could also detect user agent strings or other patterns and use them as a multiplier, so a bad user agent multiplies the tokens to be removed by, say, 5. You could do the same with cookies too.
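As a rough sketch of the multiplier idea, the fragment below could sit in vcl_recv after the per-URL weights are set and before the throttle check. The user agent pattern, the cookie name `PHPSESSID` and the factors 5 and 2 are assumptions for illustration only, and it relies on the `sensitivity` variable from the sample config.

```vcl
# Fragment for vcl_recv: place after the URL weights, before the is_denied() check.
# The patterns and factors below are illustrative assumptions, not tested rules.

# Missing or script-like user agent: multiply the cost by 5.
if (!req.http.User-Agent ||
    req.http.User-Agent ~ "(?i)(curl|wget|python-requests|scrapy)") {
        var.set_int("sensitivity", var.get_int("sensitivity") * 5);
}

# No session cookie (PHPSESSID is an assumed cookie name): double the cost.
if (req.http.Cookie !~ "PHPSESSID=") {
        var.set_int("sensitivity", var.get_int("sensitivity") * 2);
}
```

With these numbers, an /account request (30 tokens) from curl with no session cookie would cost 30 * 5 * 2 = 300 tokens, which is more than the whole 150 token bucket, so it should be refused straight away.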

Sample config below:

vcl 4.0;
import var;
import vsthrottle;
import std;


# Default backend definition. Set this to point to your content server.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {

        # Set weights on pages using regex patterns; the default cost is 1 token.
        var.set_int("sensitivity", 1);

        if (req.url ~ "^/browse/?") {
                var.set_int("sensitivity", 10);
        } elsif (req.url ~ "^/stats/?") {
                var.set_int("sensitivity", 20);
        } elsif (req.url ~ "^/account/?") {
                var.set_int("sensitivity", 30);
        }

        # Now let's see if they have enough credit in their token bucket to ask for this page.
        # The bucket is set to 150 tokens, measured over 10 seconds, so at 30 tokens
        # each a client gets 5 /account requests per window.
        # Note: passing the cost as the second argument needs the patched vsthrottle;
        # the stock is_denied() always removes 1 token.

        if (vsthrottle.is_denied(client.identity, var.get_int("sensitivity"), 150, 10s)) {

                # Client has exceeded its credit limit; let's do things like:
                # set req.backend_hint = fakedataserver;
                # maybe set an HTTP header on the request so it shows up in the Apache logs?
                # 180 = syslog facility local6, severity warning
                std.syslog(180, "RECV: " + req.http.host + req.url + " " + client.identity);
                return (synth(429, "Too Many Requests"));
        }

}

sub vcl_backend_response {
}

sub vcl_deliver {
}
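
The fake data idea is not implemented yet, but a minimal sketch of how it might look in VCL 4.0 is below. The backend name `fakedataserver` comes from the comment in the config; its host and port, and the `X-Throttled` header name, are placeholders I made up for illustration. It still needs the patched vsthrottle for the cost argument, and it would replace the 429 branch in vcl_recv.

```vcl
# Hypothetical decoy backend that serves fake data; host and port are placeholders.
backend fakedataserver {
    .host = "127.0.0.1";
    .port = "8081";
}

sub vcl_recv {
        if (vsthrottle.is_denied(client.identity, var.get_int("sensitivity"), 150, 10s)) {
                # Tag the request so it can be picked out in the backend access logs.
                set req.http.X-Throttled = "1";
                # Route this request to the decoy instead of the real content server.
                set req.backend_hint = fakedataserver;
                # Bypass the cache so decoy responses are never stored or served to normal users.
                return (pass);
        }
}
```

The `return (pass)` matters here: without it the decoy response could be cached under the same URL and handed to legitimate visitors.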
