Self-hosted abuse detection and rule enforcement against low-effort mass AI scraping and bots, using conventional, non-nuclear options.
go-away sits between your site and the Internet / upstream proxy.
Incoming requests are matched against rules and can be actioned or challenged to filter out suspicious traffic.
The tool is designed to be highly flexible so the operator can minimize the impact on legitimate users while surgically targeting heavy endpoints or scrapers.
Challenges can be transparent (not shown to the user; depends on the backend or other logic), non-JavaScript (tests common browser properties), or custom JavaScript (anything from Proof of Work to fingerprinting or Captcha is supported).
See the Why do this? section for the challenges and reasoning behind this tool.
This documentation and go-away are in active development. See the What's left? section for a breakdown.
Check this README for a general introduction. An in-depth Wiki is available and being improved.
If you have a suggestion or issue, feel free to open a New Issue on the repository.
Pull Requests are encouraged and desired.
For real-time chat and other support, join IRC on #go-away on Libera.Chat [WebIRC]. The channel may not be monitored at all times; feel free to ping the operators there.
Source code is automatically pushed to the following mirrors. Packages are also mirrored on Codeberg and GitHub.
Note that issues or pull requests should be opened on the main Forge.
See the Installation page on the Wiki for all the details.
go-away can be run directly from the command line, via pre-built containers, or via containers you build yourself.
Common Expression Language (CEL) is used to allow arbitrary selection of client properties, not limited to regex matching. Boolean operators are supported.
Templates can be defined in the Policy to allow reuse of such conditions on rule matching. Challenges can also be gated behind conditions.
See the CEL Language Definition for the syntax.
Rules and conditions are served with this environment:
remoteAddress (net.IP) - Connecting client remote address from headers or properties
remoteAddress.network(networkName string) bool - Check whether the client IP is within the named defined network
remoteAddress.network(networkCIDR string) bool - Check whether the client IP is within the given CIDR
host (string) - HTTP Host
method (string) - HTTP Method/Verb
userAgent (string) - HTTP User-Agent header
path (string) - HTTP request Path
query (map[string]string) - HTTP request Query arguments
headers (map[string]string) - HTTP request headers
fp (map[string]string) - Available fingerprints (only available when TLS is enabled)
fp.ja3n (string) - JA3N TLS Fingerprint
fp.ja4 (string) - JA4 TLS Fingerprint
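As an illustration, here is a hypothetical pair of conditions combining these properties (the endpoint paths and the "trusted-networks" name are placeholders, not defaults shipped with go-away):

```yaml
conditions:
  # Match curl-like clients probing a heavy endpoint
  - 'userAgent.startsWith("curl/") && path.startsWith("/archive/")'
  # Match clients outside a named network list that send a download query argument
  - '!remoteAddress.network("trusted-networks") && "download" in query'
```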
You can modify the path where challenges are served and the package name if you don't want go-away's presence to be easily discoverable.
No source code editing or forking necessary!
Simply pass a new absolute path via the `--path` cmdline argument, like so: `--path "/.goaway_example"`
Internal or external templates can be loaded to customize the look of the challenge or error page. Additionally, themes can be configured to change the look of these quickly.
These templates are included by default:
- `anubis`: An Anubis-like themed challenge.
- `forgejo`: Uses the Forgejo template and assets from your own instance. Supports specifying themes like `forgejo-auto`, `forgejo-light` and `forgejo-dark`.
External templates for your site can be loaded by specifying a full path to the `.gohtml` file. See embed/templates/ for examples to follow.
You can alter the language and the strings used in the templates directly from the config.yml file, or add footer links there directly.
Some templates support themes. Specify the theme either via the config.yml file or via the `challenge-template-theme` cmdline argument.
Most templates support overriding the logo. Specify the logo either via the config.yml file or via the `challenge-template-logo` cmdline argument.
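As a sketch, a config.yml snippet could set both at once. The key names below simply mirror the cmdline arguments and are an assumption; check the Wiki for the authoritative schema:

```yaml
# Illustrative only; verify the exact key names against the Wiki.
challenge-template: forgejo
challenge-template-theme: forgejo-auto
challenge-template-logo: /assets/img/logo.png
```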
Feel free to make any changes to existing templates or bring your own, and alter any logos or styling; it's yours to adapt!
In addition to the common PASS / CHALLENGE / DENY rules, go-away offers more actions, and further ones can be added via code.
See the Rule Actions page on the Wiki.
Several challenges can be offered as options on a rule. This allows users who have previously passed one of the offered challenges to not be affected.
For example:
```yaml
- name: standard-browser
  action: challenge
  settings:
    challenges: [http-cookie-check, preload-link, meta-refresh, resource-load, js-pow-sha256]
  conditions:
    - '($is-generic-browser)'
```
This rule first checks the user against a backend, then attempts a few browser challenges.
In this case processing would stop at `meta-refresh` due to the behavior of the earlier challenges (the cookie check and preload link are silent, so they may fail and continue, while meta-refresh requires displaying a challenge page).
Any of the listed challenges having been passed in the past will allow the client through, including `resource-load` and `js-pow-sha256`, which are not offered on this pass.
Several challenges that do not require JavaScript are offered: some target the HTTP stack, others general browser behavior, and others consult a backend service.
These can be used for light checking of requests and eliminate most low-effort scraping.
See Transparent challenges and Non-JavaScript challenges on the Wiki for more information.
A WASM interface for server-side proof generation and checking is offered; `js-pow-sha256` is provided as an example.
You can implement Captchas or other browser fingerprinting tests within this interface.
See Custom JavaScript challenges on the Wiki for more information.
Support for HAProxy PROXY protocol can be enabled.
This allows sending the client IP without altering the connection or HTTP headers.
Supported by HAProxy, Caddy, nginx and others.
You can enable automatic certificate generation and TLS for the site via any ACME directory, which enables HTTP/2.
Without TLS, HTTP/2 cleartext is supported, but you will need to configure the upstream proxy to send this protocol (`h2c://` on Caddy, for example).
When running with TLS via autocert, TLS fingerprinting of incoming clients is performed.
Fingerprints can be targeted in conditions or other application logic.
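For example, a hypothetical rule denying one specific JA4 fingerprint could look like this (the rule name and fingerprint value are placeholders; substitute values observed from abusive clients in your logs):

```yaml
- name: deny-bad-tls-client
  action: deny
  conditions:
    # Placeholder JA4 value; not a real known-bad fingerprint
    - 'fp.ja4 == "t13d1516h2_8daaf6152771_02713d6af862"'
```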
Some specific search spiders do follow robots.txt and are well behaved. However, any actor can reuse their user agents, so the origin network ranges must be checked as well.
The samples provide example network range fetching and rules for Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot.
Network ranges can be loaded via fetched JSON / TXT / HTML pages, or via lists. You can filter these using jq or a regex.
Example for jq:
```yaml
aws-cloud:
  - url: https://ip-ranges.amazonaws.com/ip-ranges.json
    jq-path: '(.prefixes[] | select(has("ip_prefix")) | .ip_prefix), (.prefixes[] | select(has("ipv6_prefix")) | .ipv6_prefix)'
```
Example for regex:
```yaml
cloudflare:
  - url: https://www.cloudflare.com/ips-v4
    regex: "(?P<prefix>[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+/[0-9]+)"
  - url: https://www.cloudflare.com/ips-v6
    regex: "(?P<prefix>[0-9a-f:]+::/[0-9]+)"
```
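Once defined, these named networks can be referenced from rule conditions via `remoteAddress.network`. A hypothetical rule challenging traffic from the ranges above:

```yaml
- name: challenge-cloud-ranges
  action: challenge
  settings:
    challenges: [js-pow-sha256]
  conditions:
    # Matches clients connecting from either set of ranges defined above
    - 'remoteAddress.network("aws-cloud") || remoteAddress.network("cloudflare")'
```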
Multiple backends are supported; backend-specific rules can be defined, and conditions and rules can match on the backend as well.
Subdomain wildcards like `*.example.com`, or the full fallback wildcard `*`, are supported.
This allows one instance to run multiple domains or subdomains.
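As a sketch, a host-to-backend mapping could look like the following. The key names and URLs are illustrative assumptions; see the Installation page on the Wiki for the authoritative syntax:

```yaml
# Illustrative mapping only; verify the exact schema against the Wiki.
backends:
  git.example.com: http://localhost:3000
  "*.example.com": http://localhost:8080
  "*": http://localhost:9000
```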
If a client connects over IPv4 first and then over IPv6 due to Fast Fallback / Happy Eyeballs, the challenge will automatically be retried.
This is tracked by tagging challenges with a readable flag indicating the type of address.
The policy file at examples/forgejo.yml provides a ready template to be used on your own Forgejo instance.
Important notes:
- Edit the `http-cookie-check` challenge, as this will fetch the listed backend with the given session cookie to check for user login.
- Adjust the desired blocked networks or others. A template list of network ranges is provided; feel free to remove these if not needed.
- Check the conditions and base rules to change your challenges offered and other ordering.
- By default Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot are allowed by useragent and network ranges.
The policy file at examples/generic.yml provides a baseline for any site, which can be modified to fit your needs.
Important notes:
- Edit the `homesite` rule, as it targets pages you always want available, like landing pages.
- Edit the `is-static-asset` condition or the `allow-static-resources` rule to allow static file access as necessary.
- If you have an API, add a PASS rule targeting it (see the sketch after this list).
- Check the conditions and base rules to change your challenges offered and other ordering.
- Add or modify rules to target specific pages on your site as desired.
- By default Googlebot / Bingbot / DuckDuckBot / Kagibot / Qwantbot / Yandexbot are allowed by useragent and network ranges.
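A minimal sketch of such a PASS rule, assuming your API lives under /api/ (the rule name and path prefix are placeholders):

```yaml
- name: allow-api
  action: pass
  conditions:
    # Adjust the prefix to match your API routes
    - 'path.startsWith("/api/")'
```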
You can define snippets to be included. YAML anchors/aliases are supported.
See examples/snippets/ for some defaults including indexer bots, challenges and other general matches.
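For instance, a hypothetical snippet reusing a challenge list through a YAML anchor (the key and rule names are illustrative, not part of a shipped schema):

```yaml
# Define the list once with an anchor...
browser-challenges: &browser-challenges [http-cookie-check, meta-refresh, js-pow-sha256]

rules:
  - name: heavy-endpoint
    action: challenge
    settings:
      # ...and reuse it anywhere via the alias
      challenges: *browser-challenges
```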
In the past few years this small git instance has been hit by wave after wave of scraping. This was usually fought off with ad-hoc useragent blocks for bots that did not follow robots.txt, until the past half year, when low-effort mass scraping became more prominent.
Recently these networks have gone from using residential IP blocks to sending several hundred requests per second.
If the server gets sluggish, more requests pile up. Even when denied, they keep scraping for weeks afterward: effectively spray-and-pray scraping, to be processed later.
At some point about 300 Mbit/s of incoming requests (not including the responses) was hitting the server, all of it nonsense URLs or archive/bundle downloads for every single commit.
If AI is so smart, why not just git clone the repositories?
- Wikimedia has posted about How crawlers impact the operations of the Wikimedia projects [01/04/2025]
- Xe (Anubis creator) has written about similar frustrations in several blogposts:
  - Amazon's AI crawler is making my git server unstable [01/17/2025]
  - Anubis works [04/12/2025]
- Drew DeVault (sourcehut) has posted several articles and outages regarding the same issues:
  - Drew Blog: Please stop externalizing your costs directly into my face [17/03/2025] (fun tidbit: I'm the one quoted as having the feedback discussion interrupted to deal with bots!)
  - sourcehut status: LLM crawlers continue to DDoS SourceHut [17/03/2025]
  - sourcehut Blog: You cannot have our user's data [15/04/2025]
- Others were also suffering at the same time [1] [2] [3] [4] [5].
Initially I deployed Anubis, and yeah, it does work!
This tool started as a way to replace Anubis, which was not as featureful as desired and whose impact on users was too high.
go-away may not be as straightforward to configure as Anubis, but this trade-off was chosen to reduce the impact on legitimate users, and it offers many more options to dynamically target new waves.
Yes, they can. At the moment their spray-and-pray approach is cheap for them.
If they have to start including an active browser in their scraping, their collection becomes expensive and slow.
This would more or less eliminate the high-rate, low-effort passive scraping and replace it with an active model.
go-away offers a highly configurable set of challenges and rules that you can adapt as new waves and techniques appear.
go-away has most of the desired features from the original checklist made during its development. However, a few points remain before go-away can be called v1.0.0:
- Several parts of the code are going through a refactor, which won't impact end users or operators.
- Documentation is lacking; more extensive documentation with inline examples is in the works.
- Policy file syntax is going to stay mostly unchanged, except in the challenges definition section.
- Allow end users to pick fallback challenges if any fail, especially with custom ones.
- Replace the Anubis-like default template with our own.
- Define strings and multi-language support for quick modification by operators without custom templates.
- Have highly tested paths that match examples.
- Caching of temporary fetches, for example, network ranges.
- Allow live and dynamic policy reloading.
- Multiple domains / subdomains -> one backend handling, CEL rules for backends
- Merge all rules and conditions into one large AST for higher performance.
- Explore exposing a module for direct Caddy usage.
- A more defined way of picking HTTP / HTTPS listeners and certificates.
- Expose metrics for challenge solve rates and allow acting on them.
- Metrics for common network ranges / AS / useragents.
| Project | Source Code | Description | Method |
|---|---|---|---|
| Anubis | Go / MIT | Proxy that uses JavaScript proof of work to weigh requests based on simple match rules | JavaScript PoW (SHA-256) |
| powxy | Go / BSD 2-Clause | Reverse proxy that protects your upstream service by challenging clients with proof-of-work | JavaScript PoW (SHA-256) with manual program |
| PoW! Bot Deterrent | Go / GPL v3.0 | A proof-of-work based bot deterrent. Lightweight, self-hosted and copyleft licensed | JavaScript PoW (WASM scrypt) |
| CSSWAF | Go / MIT | A CSS-based NoJS Anti-BOT WAF (Proof of Concept) | Non-JS CSS subresource loading order |
| anticrawl | Go / None | Go HTTP handler / proxy for regex-based rules | Non-JS manual Challenge/Response |
| ngx_http_js_challenge_module | C / GPL v3.0 | Simple JavaScript proof-of-work based access for Nginx with virtually no overhead | JavaScript Challenge |
| haproxy-protection | Lua / GPL v3.0 | HAProxy configuration and Lua scripts allowing a challenge-response page where users solve a captcha and/or proof-of-work | JavaScript Challenge / Captcha |
This Go package can be used as a command via `git.gammaspectra.live/git/go-away/cmd/go-away` or as a library under `git.gammaspectra.live/git/go-away/lib`.