Detectors Overview
Detectors are the core building blocks of wuming. Each detector scans input text for a specific type of PII and returns a list of matches with position, confidence, and metadata.
Supported Locales and Patterns
| Locale | Detector | PII Type | Confidence | Validation |
|---|---|---|---|---|
| Global | Email | EMAIL | 0.95 | Regex |
| Global | Credit Card | CREDIT_CARD | 0.95 | Luhn algorithm |
| Global | IBAN | IBAN | 0.95 | ISO 13616 mod-97 |
| Global | IP Address | IP_ADDRESS | 0.90 | IPv4 octet range, IPv6 regex |
| Global | URL | URL | 0.90 | Regex |
| Global | MAC Address | MAC_ADDRESS | 0.90 | Regex |
| US | SSN | NATIONAL_ID | 0.85 | Area/group/serial rules |
| US | EIN | TAX_ID | 0.85 | IRS prefix validation |
| US | ITIN | TAX_ID | 0.80 | Group range validation |
| US | Phone | PHONE | 0.80 | NANP format |
| US | Passport | PASSPORT | 0.60 | Regex |
| US | ZIP Code | POSTAL_CODE | 0.60 | Regex (5 or 5+4) |
| US | Medicare | HEALTH_ID | 0.80 | MBI pattern |
| NL | BSN | NATIONAL_ID | 0.90 | 11-proof checksum |
| NL | Phone | PHONE | 0.85 | Dutch format |
| NL | Postal Code | POSTAL_CODE | 0.90 | Format + invalid combos |
| NL | KvK | TAX_ID | 0.60-0.90 | Context-boosted |
| NL | ID Document | NATIONAL_ID / PASSPORT | 0.70 | Pattern matching |
| EU | VAT Number | TAX_ID | 0.90 | Country prefix regex |
| EU | Passport MRZ | PASSPORT | 0.95 | ICAO 9303 TD3 format |
| GB | NIN | NATIONAL_ID | 0.85 | HMRC prefix rules |
| GB | NHS Number | HEALTH_ID | 0.90 | Mod-11 check digit |
| GB | UTR | TAX_ID | 0.55-0.85 | Context-boosted |
| GB | Phone | PHONE | 0.85 | UK format |
| GB | Postcode | POSTAL_CODE | 0.90 | UK format regex |
| DE | Steuer-ID | TAX_ID | 0.85 | ISO 7064 Mod 11,10 |
| DE | ID Card | NATIONAL_ID | 0.75 | Weighted checksum (7,3,1) |
| DE | Sozialversicherung | NATIONAL_ID | 0.75 | Date validation |
| DE | Phone | PHONE | 0.85 | German format |
| DE | PLZ | POSTAL_CODE | 0.60-0.85 | Range + context boost |
| FR | NIR | NATIONAL_ID | 0.90 | Mod-97 control key |
| FR | NIF | TAX_ID | 0.70 | Regex |
| FR | ID Card | NATIONAL_ID | 0.65 | Old (12-digit) + new (9-char) |
| FR | Phone | PHONE | 0.85 | French format |
| FR | Postal Code | POSTAL_CODE | 0.60 | Department prefix validation |
| BR | CPF | NATIONAL_ID | 0.90 | Check digits |
| BR | CNPJ | TAX_ID | 0.90 | Check digits |
| BR | Phone | PHONE | 0.85 | Brazilian format |
| BR | CEP | POSTAL_CODE | 0.60 | Regex |
| BR | PIS/PASEP | NATIONAL_ID | 0.80 | Check digit |
| BR | CNH | DRIVERS_LICENSE | 0.75 | Pattern matching |
| JP | My Number | NATIONAL_ID | 0.90 | Check digit |
| JP | Corporate Number | TAX_ID | 0.90 | Check digit |
| JP | Phone | PHONE | 0.85 | Japanese format |
| JP | Postal Code | POSTAL_CODE | 0.60 | Regex |
| JP | Passport | PASSPORT | 0.60 | Regex |
| IN | Aadhaar | NATIONAL_ID | 0.90 | Verhoeff checksum |
| IN | PAN | TAX_ID | 0.85 | Pattern matching |
| IN | Phone | PHONE | 0.85 | Indian format |
| IN | PIN Code | POSTAL_CODE | 0.60 | Regex |
| IN | Passport | PASSPORT | 0.60 | Regex |
| IN | GSTIN | TAX_ID | 0.85 | Check digit |
| CN | Resident ID | NATIONAL_ID | 0.90 | Check digit (GB 11643) |
| CN | Phone | PHONE | 0.85 | Chinese format |
| CN | Postal Code | POSTAL_CODE | 0.60 | Regex |
| CN | Passport | PASSPORT | 0.60 | Regex |
| CN | USCC | TAX_ID | 0.85 | Pattern matching |
| KR | RRN | NATIONAL_ID | 0.90 | Check digit |
| KR | Phone | PHONE | 0.85 | Korean format |
| KR | Postal Code | POSTAL_CODE | 0.60 | Regex |
| KR | Passport | PASSPORT | 0.60 | Regex |
| AU | TFN | TAX_ID | 0.85 | Weighted checksum |
| AU | Medicare | HEALTH_ID | 0.80 | Check digit |
| AU | ABN | TAX_ID | 0.85 | Weighted checksum |
| AU | Phone | PHONE | 0.85 | Australian format |
| AU | Postcode | POSTAL_CODE | 0.60 | Regex |
| CA | SIN | NATIONAL_ID | 0.90 | Luhn check digit |
| CA | Phone | PHONE | 0.85 | NANP format |
| CA | Postal Code | POSTAL_CODE | 0.60 | Canadian format |
| CA | Passport | PASSPORT | 0.60 | Regex |
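Several rows in the table rely on checksum validation. As a rough illustration, the sketch below implements two of them: the Luhn check (listed for credit cards and the Canadian SIN) and the Dutch 11-proof (listed for the BSN). The function names `luhnValid` and `bsnValid` are hypothetical and not taken from wuming; both assume the input has already been reduced to ASCII digits, and `bsnValid` assumes the 9-digit form.

```go
package main

import "fmt"

// luhnValid reports whether a string of ASCII digits passes the Luhn check.
// Hypothetical helper for illustration; not part of wuming's API.
func luhnValid(digits string) bool {
	if len(digits) < 2 {
		return false
	}
	sum := 0
	double := false // the rightmost digit is never doubled
	for i := len(digits) - 1; i >= 0; i-- {
		c := digits[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9 // same as adding the two digits of the product
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

// bsnValid applies the 11-proof to a 9-digit BSN: the digits are weighted
// 9, 8, 7, 6, 5, 4, 3, 2 and -1, and the weighted sum must be divisible by 11.
func bsnValid(digits string) bool {
	if len(digits) != 9 {
		return false
	}
	weights := [9]int{9, 8, 7, 6, 5, 4, 3, 2, -1}
	sum := 0
	for i := 0; i < 9; i++ {
		c := digits[i]
		if c < '0' || c > '9' {
			return false
		}
		sum += int(c-'0') * weights[i]
	}
	return sum%11 == 0
}

func main() {
	fmt.Println(luhnValid("4111111111111111")) // true
	fmt.Println(luhnValid("4111111111111112")) // false
	fmt.Println(bsnValid("111222333"))         // true
	fmt.Println(bsnValid("123456789"))         // false
}
```

A candidate that matches the regex but fails the checksum can be discarded outright, which is why checksum-backed detectors can report higher confidence than regex-only ones.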
The Detector Interface
Every detector implements the port.Detector interface:
    type Detector interface {
        Detect(ctx context.Context, text string) ([]model.Match, error)
        Name() string
        Locales() []string
        PIITypes() []model.PIIType
    }
- Detect: Scans the input text and returns all matches found.
- Name: Returns a unique identifier (e.g., "nl/bsn", "common/email").
- Locales: Returns which locales this detector supports. An empty slice means the detector is locale-independent (global).
- PIITypes: Returns which PII types this detector can find.
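To make the four methods concrete, here is a minimal sketch of a regex-only detector for MAC addresses. It is self-contained, so `PIIType`, `Match`, and `Detector` are local stand-ins for wuming's `model` and `port` packages. The field names on `Match`, the detector name "common/mac_address", the regex, and the `scan` helper are all assumptions made for illustration, not the library's actual definitions.

```go
package main

import (
	"context"
	"fmt"
	"regexp"
)

// PIIType and Match are local stand-ins for wuming's model package;
// the real types and field names may differ.
type PIIType string

type Match struct {
	Type       PIIType
	Value      string
	Start, End int // byte offsets into the input text
	Confidence float64
}

// Detector mirrors the interface described above, using the local types.
type Detector interface {
	Detect(ctx context.Context, text string) ([]Match, error)
	Name() string
	Locales() []string
	PIITypes() []PIIType
}

// macDetector is a hypothetical regex-only detector for colon-separated MACs.
type macDetector struct{ re *regexp.Regexp }

func newMACDetector() *macDetector {
	return &macDetector{re: regexp.MustCompile(`\b[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}\b`)}
}

func (d *macDetector) Name() string        { return "common/mac_address" } // assumed name
func (d *macDetector) Locales() []string   { return nil }                  // empty: locale-independent
func (d *macDetector) PIITypes() []PIIType { return []PIIType{"MAC_ADDRESS"} }

func (d *macDetector) Detect(ctx context.Context, text string) ([]Match, error) {
	// Respect cancellation before doing any work.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	var out []Match
	for _, loc := range d.re.FindAllStringIndex(text, -1) {
		out = append(out, Match{
			Type:       "MAC_ADDRESS",
			Value:      text[loc[0]:loc[1]],
			Start:      loc[0],
			End:        loc[1],
			Confidence: 0.90, // regex-only score, as in the table above
		})
	}
	return out, nil
}

// Compile-time check that macDetector satisfies the interface.
var _ Detector = (*macDetector)(nil)

// scan is a small convenience wrapper that runs a detector without a deadline.
func scan(d Detector, text string) ([]Match, error) {
	return d.Detect(context.Background(), text)
}

func main() {
	matches, err := scan(newMACDetector(), "eth0 00:1A:2B:3C:4D:5E up")
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		fmt.Printf("%s %q [%d:%d] %.2f\n", m.Type, m.Value, m.Start, m.End, m.Confidence)
	}
}
```

Returning a nil slice from `Locales` follows the rule above that an empty slice marks a detector as global.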
Confidence Scoring
Each match carries a confidence score between 0.0 and 1.0:
- 0.90-0.95: High confidence, validated by checksum or strong structural rules (e.g., credit card with Luhn, IBAN with mod-97, BSN with 11-proof)
- 0.80-0.89: Good confidence, structural validation but no checksum (e.g., SSN area rules, phone formats)
- 0.60-0.79: Moderate confidence, pattern-based with limited validation (e.g., ZIP codes, passport numbers, bare KvK numbers)
- Below 0.60: Low confidence, context-dependent patterns
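Some detectors use context boosting: if keywords like "BSN", "KvK", "PLZ", or "UTR" appear near a candidate match, the confidence is increased (for example, a KvK number ranges from 0.60 bare to 0.90 with context). This reduces false positives for ambiguous patterns like bare digit sequences, because a match without supporting context keeps its lower score and can be filtered out.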
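Context boosting can be sketched as a keyword search in a window around the candidate. The function below is an illustration only: `boostConfidence`, the 20-byte window, and the plain substring match are assumptions, and wuming's detectors may use different rules. The 0.60 and 0.90 values come from the KvK row in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// boostConfidence returns boosted when any keyword occurs within window bytes
// before or after the candidate span [start, end); otherwise it returns base.
// Keywords must be lowercase; matching is case-insensitive substring search.
// Hypothetical helper for illustration; not part of wuming's API.
func boostConfidence(text string, start, end int, keywords []string, window int, base, boosted float64) float64 {
	lo := start - window
	if lo < 0 {
		lo = 0
	}
	hi := end + window
	if hi > len(text) {
		hi = len(text)
	}
	// Look only at the text around the candidate, not the candidate itself.
	nearby := strings.ToLower(text[lo:start] + " " + text[end:hi])
	for _, kw := range keywords {
		if strings.Contains(nearby, kw) {
			return boosted
		}
	}
	return base
}

func main() {
	text := "KvK nummer: 12345678"
	start := strings.Index(text, "12345678")
	score := boostConfidence(text, start, start+8, []string{"kvk"}, 20, 0.60, 0.90)
	fmt.Println(score) // prints 0.9
}
```

The same eight digits in "order 12345678 shipped" would keep the base score of 0.60, since no keyword appears in the window.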
Use WithConfidenceThreshold() to filter out matches below a desired confidence level.
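The signature of WithConfidenceThreshold() is not shown here, so the sketch below only illustrates the filtering behavior it describes: matches scoring below the threshold are dropped. `Match` and `filterByConfidence` are local, hypothetical names, and treating the threshold as inclusive is an assumption.

```go
package main

import "fmt"

// Match is a local stand-in for wuming's match type; real fields may differ.
type Match struct {
	Value      string
	Confidence float64
}

// filterByConfidence keeps matches whose confidence is at least threshold.
// Hypothetical helper showing the effect of a confidence threshold.
func filterByConfidence(matches []Match, threshold float64) []Match {
	kept := make([]Match, 0, len(matches))
	for _, m := range matches {
		if m.Confidence >= threshold {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	matches := []Match{
		{Value: "4111111111111111", Confidence: 0.95}, // credit card, Luhn-validated
		{Value: "123-45-6789", Confidence: 0.85},      // US SSN
		{Value: "90210", Confidence: 0.60},            // US ZIP code
	}
	for _, m := range filterByConfidence(matches, 0.85) {
		fmt.Println(m.Value, m.Confidence)
	}
}
```

With a threshold of 0.85, the ZIP code match is dropped while the credit card and SSN matches remain.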