go-away

Self-hosted abuse detection and rule enforcement against low-effort mass AI scraping and bots. Uses conventional non-nuclear options.

go-away sits in between your site and the Internet / upstream proxy.

Rules select incoming requests to be actioned or challenged, so that suspicious requests can be filtered.

The tool is designed to be highly flexible, so the operator can minimize the impact on legitimate users while surgically targeting heavy endpoints or scrapers.

Challenges can be transparent (not shown to the user; these depend on a backend or other logic), non-JavaScript (these test common browser properties), or custom JavaScript (anything from Proof of Work to fingerprinting or Captcha is supported).

See the Why do this? section for the challenges and reasoning behind this tool.

This documentation and go-away are in active development. See the What's left? section for a breakdown.

Check this README for a general introduction. An in-depth Wiki is available and being improved.

Support

If you have a suggestion or an issue, feel free to open a New Issue on the repository.

Pull Requests are encouraged and desired.

For real-time chat and other support, join #go-away on Libera.Chat via IRC [WebIRC]. The channel may not be monitored at all times, so feel free to ping the operators there.

Code Mirrors

Source code is automatically pushed to the following mirrors. Packages are also mirrored on Codeberg and GitHub.

GammaSpectra.live

Codeberg

GitHub

sourcehut

Note that issues and pull requests should be opened on the main forge.

Installation and Setup

See the Installation page on the Wiki for all the details.

go-away can be run directly from the command line, via pre-built containers, or via containers you build yourself.

Features

Rich rule matching

Common Expression Language (CEL) is used to allow arbitrary selection of client properties, rather than being limited to regex. Boolean operators are supported.

Templates can be defined in the Policy to allow reuse of such conditions on rule matching. Challenges can also be gated behind conditions.

See the CEL Language Definition for the syntax.

Rules and conditions are evaluated with this environment:

remoteAddress (net.IP) - Connecting client remote address from headers or properties
  remoteAddress.network(networkName string) bool - Check whether a given IP is listed on the underlying defined network
  remoteAddress.network(networkCIDR string) bool - Check whether a given IP is listed on the CIDR
host (string) - HTTP Host
method (string) - HTTP Method/Verb
userAgent (string) - HTTP User-Agent header
path (string) - HTTP request Path
query (map[string]string) - HTTP request Query arguments
headers (map[string]string) - HTTP request headers
fp (map[string]string) - Available fingerprints
  
Only available when TLS is enabled
   fp.ja3n (string) JA3N TLS Fingerprint
   fp.ja4 (string) JA4 TLS Fingerprint
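
As a rough illustration of what a CIDR check such as remoteAddress.network(networkCIDR) amounts to, here is a minimal Go sketch built on the standard net/netip package. The helper name inCIDR is hypothetical and is not part of go-away's API; it only shows the underlying containment test.

```go
package main

import (
	"fmt"
	"net/netip"
)

// inCIDR reports whether addr falls inside the given CIDR prefix.
// Hypothetical helper for illustration; not go-away's implementation.
func inCIDR(addr, cidr string) (bool, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return false, fmt.Errorf("parse address: %w", err)
	}
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, fmt.Errorf("parse prefix: %w", err)
	}
	// Unmap converts IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) to plain
	// IPv4 so that they can match IPv4 prefixes.
	return prefix.Contains(ip.Unmap()), nil
}

func main() {
	ok, err := inCIDR("203.0.113.7", "203.0.113.0/24")
	fmt.Println(ok, err)
}
```

The addresses used here come from the ranges reserved for documentation, not from any real network list.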

Package path

You can change the path where challenges are served, as well as the package name, if you don't want go-away's presence to be easily discoverable.

No source code editing or forking necessary!

Simply pass a new absolute path via the cmdline path argument, like so: --path "/.goaway_example"

Page template and customization support

Internal or external templates can be loaded to customize the look of the challenge or error page. Additionally, themes can be configured to change the look of these quickly.

These templates are included by default:

  • anubis: An anubis-like themed challenge.
  • forgejo: Uses the Forgejo template and assets from your own instance. Supports specifying themes like forgejo-auto, forgejo-light and forgejo-dark.

External templates for your site can be loaded by specifying a full path to the .gohtml file. See embed/templates/ for examples to follow.

If a config.yml file is specified, you can alter the language and strings in the templates, or add footer links, directly from it.

Some templates support themes. Specify one either via the config.yml file or via the challenge-template-theme cmdline argument.

Most templates support overriding the logo. Specify it either via the config.yml file or via the challenge-template-logo cmdline argument.

Feel free to change the existing templates or bring your own, and to alter any logos or styling. It's yours to adapt!

Advanced actions

In addition to the common PASS / CHALLENGE / DENY rules, go-away offers more actions, and further ones can be added via code.

See the Rule Actions page on the Wiki.

Multiple challenge matching

Several challenges can be offered as options for rules. Users who have already passed one of the listed challenges are then not affected.

For example:

  - name: standard-browser
    action: challenge
    settings:
      challenges: [http-cookie-check, preload-link, meta-refresh, resource-load, js-pow-sha256]
    conditions:
      - '($is-generic-browser)'

With this rule, the user is first checked against a backend, and then a few browser challenges are attempted.

In this case, processing would stop at meta-refresh because of how the earlier challenges behave: the cookie check and preload link are silent, so they may fail and processing continues, while meta-refresh requires displaying a challenge page.

A client that has previously passed any of the listed challenges is allowed through, including resource-load and js-pow-sha256, which were not offered in this case.

Non-Javascript challenges

Several challenges that do not require JavaScript are offered. Some target the HTTP stack, others target general browser behavior, and others consult a backend service.

These can be used for light checking of requests, which eliminates most low-effort scraping.

See Transparent challenges and Non-JavaScript challenges on the Wiki for more information.

Custom JavaScript / WASM challenges

A WASM interface for server-side proof generation and checking is offered. We provide js-pow-sha256 as an example of one.

You can implement Captchas or other browser fingerprinting tests within this interface.

See Custom JavaScript challenges on the Wiki for more information.
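
To illustrate the general idea behind a SHA-256 proof of work, here is a minimal Go sketch. It is not go-away's js-pow-sha256 implementation or its WASM interface: the function names, the challenge string, and the "leading zero bits" difficulty rule are all assumptions chosen for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the zero bits at the start of a digest.
func leadingZeroBits(sum [sha256.Size]byte) int {
	n := 0
	for _, b := range sum {
		if b != 0 {
			return n + bits.LeadingZeros8(b)
		}
		n += 8
	}
	return n
}

// verify checks that SHA-256(challenge || nonce) has at least `difficulty`
// leading zero bits. This side is cheap: a single hash.
func verify(challenge string, nonce uint64, difficulty int) bool {
	buf := binary.BigEndian.AppendUint64([]byte(challenge), nonce)
	return leadingZeroBits(sha256.Sum256(buf)) >= difficulty
}

// solve brute-forces nonces until one verifies. This side is expensive:
// about 2^difficulty hashes on average.
func solve(challenge string, difficulty int) uint64 {
	for nonce := uint64(0); ; nonce++ {
		if verify(challenge, nonce, difficulty) {
			return nonce
		}
	}
}

func main() {
	nonce := solve("example-challenge", 12)
	fmt.Println(nonce, verify("example-challenge", nonce, 12))
}
```

The asymmetry is the point of the technique: checking a proof costs one hash, while producing one costs many, which makes high-rate scraping more expensive than normal browsing.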

Upstream PROXY support

Support for HAProxy PROXY protocol can be enabled.

This allows sending the client IP without altering the connection or HTTP headers.

Supported by HAProxy, Caddy, nginx and others.
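
For context, version 1 of the PROXY protocol prepends a single text line of the form "PROXY TCP4 <src> <dst> <srcport> <dstport>" followed by CRLF to the connection. The Go sketch below parses such a line to recover the client address. It is only an illustration and not go-away's parser; the function name parseProxyV1 is hypothetical, and "PROXY UNKNOWN" lines and the binary version 2 format are deliberately not handled.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"strconv"
	"strings"
)

// parseProxyV1 returns the original client address and port from a PROXY
// protocol v1 header line. Illustrative only.
func parseProxyV1(line string) (netip.AddrPort, error) {
	var none netip.AddrPort
	rest, ok := strings.CutSuffix(line, "\r\n")
	if !ok {
		return none, errors.New("header must end with CRLF")
	}
	// Fields: PROXY, protocol, source addr, dest addr, source port, dest port.
	f := strings.Split(rest, " ")
	if len(f) != 6 || f[0] != "PROXY" || (f[1] != "TCP4" && f[1] != "TCP6") {
		return none, errors.New("not a PROXY v1 TCP4/TCP6 header")
	}
	src, err := netip.ParseAddr(f[2])
	if err != nil {
		return none, fmt.Errorf("source address: %w", err)
	}
	port, err := strconv.ParseUint(f[4], 10, 16)
	if err != nil {
		return none, fmt.Errorf("source port: %w", err)
	}
	return netip.AddrPortFrom(src, uint16(port)), nil
}

func main() {
	client, err := parseProxyV1("PROXY TCP4 192.0.2.1 198.51.100.1 56324 443\r\n")
	fmt.Println(client, err)
}
```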

Automatic TLS support and HTTP/2 support

You can enable automatic certificate generation and TLS for the site via any ACME directory, which enables HTTP/2.

Without TLS, HTTP/2 cleartext is supported, but you will need to configure the upstream proxy to send this protocol (h2c:// on Caddy for example).

TLS Fingerprinting

When running with TLS via autocert, TLS Fingerprinting of the incoming client is done.

The resulting fingerprint can be targeted in conditions or other application logic.

Read more about JA3 and JA4.

Network range and automated filtering

Some specific search spiders do follow robots.txt and are well behaved. However, many actors can reuse user agents, so the origin network ranges must be checked.

The samples provide example network range fetching and rules for Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot.

Network ranges can be loaded via fetched JSON / TXT / HTML pages, or via lists. You can filter these using jq or a regex.

Example for jq:

  aws-cloud:
    - url: https://ip-ranges.amazonaws.com/ip-ranges.json
      jq-path: '(.prefixes[] | select(has("ip_prefix")) | .ip_prefix), (.prefixes[] | select(has("ipv6_prefix")) | .ipv6_prefix)'

Example for regex:

  cloudflare:
    - url: https://www.cloudflare.com/ips-v4
      regex: "(?P<prefix>[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+/[0-9]+)"
    - url: https://www.cloudflare.com/ips-v6
      regex: "(?P<prefix>[0-9a-f:]+::/[0-9]+)"
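
To show how a regex with a named prefix group can pull network ranges out of a fetched page, here is a minimal Go sketch. It is an assumption about the general approach rather than go-away's actual loader: the function name extractPrefixes is hypothetical, and the sample input uses documentation-only address ranges instead of real provider data.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"regexp"
)

// extractPrefixes returns every valid CIDR captured by the named group
// "prefix" in body. Matches that are not valid prefixes are skipped.
func extractPrefixes(pattern, body string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	idx := re.SubexpIndex("prefix")
	if idx < 0 {
		return nil, errors.New(`pattern has no "prefix" group`)
	}
	var out []string
	for _, m := range re.FindAllStringSubmatch(body, -1) {
		p, err := netip.ParsePrefix(m[idx])
		if err != nil {
			continue // the regex matched, but it is not a valid CIDR
		}
		out = append(out, p.String())
	}
	return out, nil
}

func main() {
	// Same pattern as the IPv4 example above, without the YAML escaping.
	pattern := `(?P<prefix>[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/[0-9]+)`
	got, err := extractPrefixes(pattern, "192.0.2.0/24\n198.51.100.0/24\n")
	fmt.Println(got, err)
}
```

Validating each capture with netip.ParsePrefix keeps a loose regex from letting malformed entries such as 999.0.0.0/8 into the range list.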

Multiple backend support

Multiple backends are supported. Rules specific to a backend can be defined, and conditions and rules can match on the backend as well.

Subdomain wildcards like *.example.com, or full fallback wildcard * are supported.

This allows one instance to run multiple domains or subdomains.
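
A minimal Go sketch of how host patterns like these could be matched is shown below. The exact matching rules go-away applies are not described here, so this is an assumption: in this sketch *.example.com matches any depth of subdomain but not example.com itself, and the function name matchHost is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// matchHost reports whether host matches pattern, where pattern is an exact
// name, a subdomain wildcard like "*.example.com", or the fallback "*".
// Hypothetical helper for illustration; not go-away's implementation.
func matchHost(pattern, host string) bool {
	pattern, host = strings.ToLower(pattern), strings.ToLower(host)
	switch {
	case pattern == "*":
		return true
	case strings.HasPrefix(pattern, "*."):
		// Keep the leading dot so "*.example.com" cannot match
		// "badexample.com".
		suffix := pattern[1:]
		return len(host) > len(suffix) && strings.HasSuffix(host, suffix)
	default:
		return pattern == host
	}
}

func main() {
	fmt.Println(matchHost("*.example.com", "git.example.com"))
	fmt.Println(matchHost("*.example.com", "badexample.com"))
	fmt.Println(matchHost("*", "anything.test"))
}
```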

IPv6 Happy Eyeballs challenge retry

In case a client connects over IPv4 first then IPv6 due to Fast Fallback / Happy Eyeballs, the challenge will automatically be retried.

This is tracked by tagging challenges with a readable flag indicating the type of address.
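
As a small illustration of classifying a client by address type, here is a Go sketch using net/netip. The flag values "ipv4" and "ipv6" and the function name addressFamily are hypothetical; the actual flag go-away attaches to challenges is not specified here.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addressFamily returns a readable tag for the address type.
// IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) are treated as IPv4.
func addressFamily(addr string) (string, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return "", err
	}
	if ip.Unmap().Is4() {
		return "ipv4", nil
	}
	return "ipv6", nil
}

func main() {
	for _, a := range []string{"192.0.2.1", "2001:db8::1"} {
		tag, err := addressFamily(a)
		fmt.Println(a, tag, err)
	}
}
```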

Example policies

Forgejo

The policy file at examples/forgejo.yml provides a ready template to be used on your own Forgejo instance.

Important notes:

  • Edit the http-cookie-check challenge, as this will fetch the listed backend with the given session cookie to check for user login.
  • Adjust the desired blocked networks or others. A template list of network ranges is provided, feel free to remove these if not needed.
  • Check the conditions and base rules to change your challenges offered and other ordering.
  • By default Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot are allowed by user agent and network ranges.

Generic

The policy file at examples/generic.yml provides a baseline to place on any site, that can be modified to fit your needs.

Important notes:

  • Edit the homesite rule, as it's targeted to pages you always want to have available, like landing pages.
  • Edit the is-static-asset condition or the allow-static-resources rule to allow static file access as necessary.
  • If you have an API, add a PASS rule targeting it.
  • Check the conditions and base rules to change your challenges offered and other ordering.
  • Add or modify rules to target specific pages on your site as desired.
  • By default Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot are allowed by user agent and network ranges.

Snippets

You can define snippets to be included. YAML anchors/aliases are supported.

See examples/snippets/ for some defaults including indexer bots, challenges and other general matches.

Why do this?

In the past few years this small git instance has been hit by waves and waves of scraping. This was usually fought off with ad hoc user-agent blocks for bots that did not follow robots.txt, until the past half year, when low-effort mass scraping became more prominent.

Recently these networks have gone from using residential IP blocks to sending several hundred requests per second.

If the server gets sluggish, more requests pile up. Even when denied, they keep scraping for weeks. It is effectively spray-and-pray scraping: collect now, process later.

At one point, about 300 Mbit/s of incoming requests (not including the responses) was hitting the server, all of them for nonsense URLs or per-commit archive/bundle downloads.

If AI is so smart, why not just git clone the repositories?


Initially I deployed Anubis, and yeah, it does work!

This tool started as a way to replace Anubis, because it was not as featureful as desired and its impact on users was too high.

go-away may not be as straightforward to configure as Anubis, but this trade-off was chosen to reduce the impact on legitimate users, and it offers many more options to dynamically target new waves.

Can't scrapers adapt?

Yes, they can. At the moment their spray-and-pray approach is cheap for them.

If they have to start adding an active browser to their scraping, that makes their collection expensive and slow.

This would more or less eliminate high-rate, low-effort passive scraping and replace it with an active model.

go-away offers a highly configurable set of challenges and rules that you can adapt to new ways.

What's left?

go-away has most of the desired features from the original checklist that was made in its development. However, a few points are left before go-away can be called v1.0.0:

  • Several parts of the code are going through a refactor, which won't impact end users or operators.
  • Documentation is lacking, and a more extensive version with inline examples is in the works.
  • Policy file syntax is going to stay mostly unchanged, except in the challenges definition section.
  • Allow end users to pick fallback challenges if any fail, especially with custom ones.
  • Replace the Anubis-like default template with a dedicated one.
  • Define strings and multi-language support for quick modification by operators without custom templates.
  • Have highly tested paths that match examples.
  • Caching of temporary fetches, for example, network ranges.
  • Allow live and dynamic policy reloading.
  • Multiple domains / subdomains -> one backend handling, CEL rules for backends
  • Merge all rules and conditions into one large AST for higher performance.
  • Explore exposing a module for direct Caddy usage.
  • A more defined way of picking HTTP/HTTPS listeners and certificates.
  • Expose metrics for challenge solve rates and acting on them.
    • Metrics for common network ranges / AS / useragent

Other Similar Projects

| Project | Source Code | Description | Method |
|---|---|---|---|
| Anubis | GitHub, Go / MIT | Proxy that uses JavaScript proof of work to weigh requests based on simple match rules | JavaScript PoW (SHA-256) |
| powxy | lindenii.runxiyu.org, Go / BSD 2-Clause | Powxy is a reverse proxy that protects your upstream service by challenging clients with proof-of-work. | JavaScript PoW (SHA-256) with manual program |
| PoW! Bot Deterrent | SequentialRead, Go / GPL v3.0 | A proof-of-work based bot deterrent. Lightweight, self-hosted and copyleft licensed. | JavaScript PoW (WASM scrypt) |
| CSSWAF | GitHub, Go / MIT | A CSS-based NoJS Anti-BOT WAF (Proof of Concept) | Non-JS CSS Subresource loading order |
| anticrawl | humungus.tedunangst.com, Go / None | Go http handler / proxy for regex based rules | Non-JS manual Challenge/Response |
| ngx_http_js_challenge_module | GitHub, C / GPL v3.0 | Simple javascript proof-of-work based access for Nginx with virtually no overhead. | JavaScript Challenge |
| haproxy-protection | GitGud, Lua / GPL v3.0 | HAProxy configuration and lua scripts allowing a challenge-response page where users solve a captcha and/or proof-of-work. | JavaScript Challenge / Captcha |

Development

This Go package can be used as a command at git.gammaspectra.live/git/go-away/cmd/go-away or as a library under git.gammaspectra.live/git/go-away/lib.
