include likely country of origin as abuse signal

jahooma · jahooma · commit b6b169cda285 · 2026-04-22T16:36:15.000-07:00
diff --git a/web/src/server/free-session/abuse-detection.ts b/web/src/server/free-session/abuse-detection.ts
@@ -271,6 +271,38 @@ export async function identifyBotSuspects(params: {
       score += 15
     }
 
+    // --- Region signal (corroborating, scored only when stacked with usage) ---
+    // The free tier is intended for users in approved regions: English-speaking
+    // (US, UK, Canada, Australia, NZ, Ireland) and western-European markets.
+    // We have no IP data, so region is inferred from email provider and the
+    // unicode characters in the display name. CJK indicators (Chinese/Japanese/
+    // Korean Unicode in name, Chinese-provider emails, .edu.cn domains) are
+    // the only signal we can detect reliably, and empirically our abuse
+    // clusters are overwhelmingly from these provider pools. Diaspora users
+    // from approved regions may trip this flag, so it only contributes to the
+    // score when combined with heavy usage (the combination, not the region
+    // alone, is what justifies the score bump).
+    const hasCjkName =
+      !!s.name &&
+      /[一-鿿぀-ヿ가-힯]/.test(s.name)
+    const hasChineseDomain =
+      !!s.email &&
+      /@(qq|163|126|sina|sina\.cn|foxmail|aliyun|139|yeah|tom)\.(com|cn|net)$/i.test(
+        s.email,
+      )
+    const hasCnEduDomain = !!s.email && /\.edu\.cn$/i.test(s.email)
+    const nonApprovedRegion =
+      hasCjkName || hasChineseDomain || hasCnEduDomain
+    if (nonApprovedRegion) {
+      const reasons: string[] = []
+      if (hasCjkName) reasons.push('cjk-name')
+      if (hasChineseDomain) reasons.push('cn-provider')
+      if (hasCnEduDomain) reasons.push('cn-edu')
+      flags.push(`non-approved-region[${reasons.join(',')}]`)
+      if (msgs24h >= 500) score += 40
+      else if (msgs24h >= 300) score += 25
+    }
+
     // --- Email/handle pattern flags (purely informational) ---
     // These are too noisy in isolation (many real users have digits in their
     // email, use plus-aliases for privacy, or sign up via duck.com). They're
diff --git a/web/src/server/free-session/abuse-review.ts b/web/src/server/free-session/abuse-review.ts
@@ -50,6 +50,8 @@ A very young GitHub account (gh_age < 7d, especially < 1d) combined with heavy u
 
 Conversely, a GitHub account older than ~30 days is meaningful counter-evidence. The "day-1 of coding = day-1 of GitHub" pattern that makes fresh-GH such a strong bot signal doesn't apply once the GH predates the codebuff account by a month or more. gh_age ≥ 30d + a moderate quiet gap (≥4h) + any agent diversity reads like an excited power user, not a bot. Don't tier these as HIGH unless there's a genuinely unambiguous per-account signal (true near-continuous activity, see below).
 
+The free tier is intended for users in approved regions: English-speaking (US, UK, Canada, Australia, NZ, Ireland) and western-European markets. We have no IP geolocation, so region is inferred heuristically — the \`non-approved-region[...]\` flag fires when the account has a CJK-character display name (\`cjk-name\`), a Chinese email provider (\`cn-provider\` — qq.com, 163.com, 126.com, sina.com, foxmail.com, aliyun.com, 139.com, yeah.net, tom.com), or a \`.edu.cn\` domain (\`cn-edu\`). Empirically our abuse clusters are overwhelmingly from these provider pools, and heavy free-tier usage from them strongly correlates with VPN-based farming. BUT real diaspora developers from approved regions exist and trip this flag too. So: region alone is NEVER grounds for a ban. Treat it as corroborating evidence that RAISES confidence when stacked with heavy usage (msgs_24h ≥ 300) or other bot signals — a \`non-approved-region\` user with \`very-heavy\` usage on a young account is TIER 1; the same user with established-GH + low usage + diverse-agents stays in TIER 2.
+
 Creation-cluster membership is a WEAK signal on its own. The detector is purely temporal — accounts created within 30 minutes of each other. At normal signup volume, unrelated real users routinely land in the same window (product launches, HN/Reddit posts, timezone-aligned bursts). A cluster is only actionable when its members share a concrete cross-account pattern: matching email-local stems or digit siblings (\`v6apiworker\` / \`v8apiworker\`), a shared uncommon domain (\`@mail.hnust.edu.cn\`), sequential-number naming, or near-identical msgs_24h / distinct_hours footprints across multiple members. Absent such a shared pattern, treat a cluster list as background noise and tier members purely on their per-account signals. When you do use a cluster as evidence, name the shared pattern explicitly — "cluster sharing the \`vNNapiworker\` stem", not "member of 5-account creation cluster".
 
 Produce a markdown report with two sections: