Skip to content

custom domains - middleware and verification #2145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

Soxasora
Copy link
Member

@Soxasora Soxasora commented May 1, 2025

Description

Part of #1942
Focuses on custom domain insertion, providing a working form and verification process via pgboss.
Uses ACM with LocalStack to provide certificates CRUD simulation.

Screenshots

TBD

Additional Context

#1958 is being split to facilitate review and build greater stuff on top of it.
Doesn't contain Auth Sync or custom branding.

This PR provides:

  • custom domains
    -- context
    -- crud, resolvers, fragments
    -- form with basic UI/UX
    -- ACM certificates support
    -- local DNS server via dnsmasq support

  • middleware
    -- custom domain detection via endpoint
    -- doesn't apply referrals yet (no rewrites here)

  • domain verification
    -- job every 5 minutes
    -- AWS error handling
    -- log every verification attempt and its steps
    -- normalized schema for Domain, Record, Attempt, Certificate

  • triggers
    -- on territory stop, HOLD the domain
    -- on territory takeover, delete the domain
    -- on domain/domainCertificate deletion, delete from ACM via pgboss job

  • workers
    -- full domain verification
    -- clear domains that have been on HOLD for more than 30 days
    -- delete certificate from ACM (job)

Checklist

Are your changes backwards compatible? Please answer below:
Yes, everything is an addition. An exception is the way we handle S3 via localstack, now the container is called AWS and also allows ACM simulations other than S3.

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7, QA OK, domain verification handles errors correctly and follows its step, middleware correctly detects a custom domain and obtains its connected sub.

For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes, it uses standard SN colors and styles that fits with theme changes.

Did you introduce any new environment variables? If so, call them out explicitly here:
An endpoint for localstack that can be contacted by our worker container
LOCALSTACK_ENDPOINT=http://aws:4566

local DNS server via dnsmasq, with fallback to 1.1.1.1 in both dnsmasq and code.
DOMAINS_DNS_SERVER=172.20.0.2:53

Soxasora added 2 commits May 1, 2025 18:38
- ACM support
- custom domains crud, resolvers, fragments
- custom domains form, guidelines
- custom domains context
- domain verification every 5 minutes via pgboss
- domain validation schema
- basic custom domains middleware, to be completed
- TODOs tracings
- CustomDomain -> Domain
- DomainVerification table
- CNAME, TXT, SSL verification types
- WIP DomainVerification upsert
@Soxasora Soxasora force-pushed the custom_domains_base branch from cd46aeb to 5e80c3f Compare May 2, 2025 23:33
@Soxasora Soxasora closed this May 6, 2025
@Soxasora Soxasora reopened this May 6, 2025
Soxasora added 9 commits May 6, 2025 14:51
- use DomainVerificationStatus enum for domains and records
- adapt Territory Form UI to new schema
- return 'records' as an object with its types
- wip: prepare for attempts and certificate usage for prisma
fix:
- fix setDomain mutation transaction
- fix schema typedefs

enhance:
- DNS records guidelines with flex-wrap for longer records

cleanup:
- add comments to worker
- remove console.log on validation values
@Soxasora Soxasora force-pushed the custom_domains_base branch from 775a966 to 82a71f5 Compare May 8, 2025 23:33
@Soxasora Soxasora marked this pull request as ready for review May 8, 2025 23:55
@huumn
Copy link
Member

huumn commented May 14, 2025

https://github.com/stackernews/stacker.news/blob/c7321351bfef74dc2cfa9d0c2542bb3fbbc5db2c/docs/dev/custom-domains-base.md

So to QA this I need to use real DNS? I suspect we'll want to mock DNS somehow, like we mock ACM etc, going forward to make this trivial to iterate on.

Copy link
Member

@huumn huumn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a very casual high level first pass to get acquainted. It looks very good.

Thank you for separating it into its own PR. Also thanks to @ekzyis for his first review which I'm sure led to lots of improvements.

I left some questions and nits. I'll make another high level pass tomorrow. Then maybe again depending on when we can test this in a self-contained way.

Comment on lines 136 to 137
-- !QUIRK: If someone else takes over the sub, the new guy needs to delete the domain
CREATE OR REPLACE FUNCTION hold_domain_and_delete_certificate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is weird. I guess it's not clear to us whether the transfer is from an alt to another alt or to another person entirely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can actually just remove the custom domains settings and certificate if the owner is different on re-activation.

And, instead, on transfer, we retain settings

Comment on lines 19 to 30
// make sure to delete the certificate from ACM if the sub is stopped, if we have it.
if (nextStatus === 'STOPPED' && sub.domain?.certificate?.certificateArn) {
await deleteDomainCertificate(sub.domain.certificate.certificateArn)
}

await models.sub.update({
include: { user: true },
where: {
name: subName
},
data: {
status: nextBillingWithGrace(sub) >= new Date() ? 'GRACE' : 'STOPPED',
status: nextStatus,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm hard to tell if this is the right behavior unless we automatically re-add the cert when it's paid.

Copy link
Member Author

@Soxasora Soxasora May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, a new certificate from ACM is fast and free, what's not really good (for the territory owner) is that they will be subjected again to ACM DNS validation.

Maybe we don't need to remove the certificate at all as long as the domain remains the same but this is also somewhat connected to what you said on the previous comment:

I guess it's not clear to us whether the transfer is from an alt to another alt or to another person entirely.

Soxasora added 2 commits May 17, 2025 08:07
… HOLD

handle territory changes via triggers
- on territory stop, HOLD the domain
- on territory takeover from another user, delete the domain and its associated records

handle ACM certificates via trigger
- on domain/domainCertificate deletion, ask ACM to delete the certificate via a pgboss job; removes the need to ask ACM in multiple places

clear domains that have been on HOLD for more than 30 days, check every midnight via pgboss schedule

use 'domains' profile for worker jobs
Comment on lines +185 to +186
-- everytime a domain (relation) or a domainCertificate is deleted, also ask ACM to delete the certificate
CREATE TRIGGER trigger_ask_acm_to_delete_certificate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of jumping around the code to ask ACM to delete a certificate (mostly in shady places), a trigger on domain/domainCertificate deletion offloads this job to our worker via pgboss.

Comment on lines +135 to +139
-- SCENARIO: Territory got stopped after grace period
-- HOLD the domain when the sub is stopped
-- this is to prevent the domain from being used by another sub;
-- won't delete anything but it will require a new verification attempt if the sub is resumed
CREATE OR REPLACE FUNCTION hold_domain_on_sub_stop()
Copy link
Member Author

@Soxasora Soxasora May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of deleting a domain from a territory, we put it on HOLD so that the original owner can come back without losing any customization or their ACM certificate.

The user will inevitably need to re-verify the domain to make it active again.
edit: well it can be avoided by triggering ACTIVE when a territory goes up again but imo it's not ideal

Comment on lines +153 to +157
-- SCENARIO: Territory got taken over by a different user
-- clear the domain when the sub is taken over by a different user;
-- this will delete the domain, its certificates, verification attempts, DNS records and brandings.
-- will also trigger a request to ACM to delete the certificate because the domain is being deleted
CREATE OR REPLACE FUNCTION clear_domain_on_sub_takeover()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we only HOLD the domain on territory expiration, I think it's fair to delete any domain information, with its certificate (thanks to trigger_ask_acm_to_delete_certificate), if the territory gets resumed by another user instead of the original owner.

Comment on lines 114 to 115
-- Clear domains that have been on HOLD for 30 days or more every midnight.
INSERT INTO pgboss.schedule (name, cron, timezone) VALUES ('clearLongHeldDomains', '0 0 * * *', 'America/Chicago') ON CONFLICT DO NOTHING;
Copy link
Member Author

@Soxasora Soxasora May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been wanting to add this for a long time but didn't know if we actually need it, regardless, now we check every midnight if a domain has been on HOLD for 30 days or more.

There might be a better way to do this so this might get iterated?

The primary reason for this is to delete unused certificates. In the future, it would ideally be coupled with routine checks on domain ownership just to be safer.

edit: still the same but encapsulated in a function for plpgsql

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants