data-sanitization: protect credentials and personal data from accidental exposure

Sensitive data (credentials, PII, PHI, and other private information) ends up in logs more often than it should.


data-sanitization masks or removes sensitive field values before they leave your application.

Use it in log pipelines, request handlers, and error reporters to catch what might otherwise slip through.

It matches field names across objects, arrays, and strings, and lets you extend the built-in defaults with your own patterns for PII, PHI, or any domain-specific fields.

Before / After

const input = {
  username: 'mark',
  password: 'super-secret',
  api_key: 'sk_live_abc123',
};

sanitizeData(input);
// => { username: 'mark', password: '**********', api_key: '**********' }

Highlights

  • Zero runtime dependencies, with compiled JS and full TypeScript declarations
  • Sanitizes nested structures at any depth, preserving types and class instances
  • Handles circular references safely
  • Sanitization errors never expose the original input payload

Installation

npm install data-sanitization
yarn add data-sanitization
pnpm add data-sanitization
bun add data-sanitization

Importing

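// Pick one of the following forms.

// ES modules, named imports
import { sanitizeData, DataSanitizationError } from 'data-sanitization';

// ES modules, default import
import sanitizeData from 'data-sanitization';

// CommonJS
const { sanitizeData } = require('data-sanitization');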

Usage

Quick start

import { sanitizeData } from 'data-sanitization';

const input = {
  username: 'mark',
  password: 'super-secret',
  api_key: 'sk_live_abc123',
};

const result = sanitizeData(input);
// => { username: 'mark', password: '**********', api_key: '**********' }

Sanitize a string

Pass a string directly and sanitizeData returns a sanitized copy of it. This is useful for sanitizing serialized data before logging, for example a raw request body, a form-encoded payload, or a JSON string you have not yet parsed:

sanitizeData('{"password":"secret","username":"mark"}');
// => '{"password":"**********","username":"mark"}'

sanitizeData('password=secret&username=mark');
// => 'password=**********&username=mark'

Parse JSON strings

By default, string inputs are sanitized using text-based pattern matching. This works for most cases, but it cannot detect numeric-valued sensitive fields:

sanitizeData('{"password":12345,"username":"mark"}');
// => '{"password":12345,"username":"mark"}' (numeric value not masked)

Setting parseJsonStrings: true parses the JSON first and sanitizes it the same way an object would be, which handles numeric values correctly:

sanitizeData('{"password":12345,"username":"mark"}', {
  parseJsonStrings: true,
});
// => '{"password":9999999999,"username":"mark"}'

Tip

parseJsonStrings: true is also 3–4× faster for JSON string inputs than the default text-based approach. The tradeoff is that output is re-serialized with JSON.stringify, which does not preserve original whitespace or formatting.
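The parse-first path can be pictured with a minimal sketch. This is not the library's implementation: `maskParsedJson`, `walk`, and the hard-coded constants are hypothetical names chosen to mirror the documented defaults. It shows why a numeric value is caught once the string is parsed, and why the original whitespace is lost.

```typescript
// Hypothetical sketch of the "parse first" path; not the library's actual code.
const PATTERNS = ['apikey', 'api_key', 'password', 'secret', 'token'];
const STRING_MASK = '**********';
const NUMERIC_MASK = 9999999999;

const isSensitive = (key: string): boolean =>
  PATTERNS.some((pattern) => key.toLowerCase().includes(pattern));

// Mask a value found on a sensitive key, keeping numbers as numbers.
const maskValue = (value: unknown): unknown =>
  typeof value === 'number' ? NUMERIC_MASK : STRING_MASK;

function walk(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => walk(item));
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = isSensitive(key) ? maskValue(child) : walk(child);
    }
    return out;
  }
  return value;
}

// Parse, sanitize by key name, then re-serialize with JSON.stringify.
// This sketch returns invalid JSON untouched, which a real sanitizer
// should not do; it is only here to keep the example short.
function maskParsedJson(input: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(input);
  } catch {
    return input;
  }
  return JSON.stringify(walk(parsed));
}

const compact = maskParsedJson('{ "password": 12345,\n  "username": "mark" }');
// compact === '{"password":9999999999,"username":"mark"}'
// The spaces and the newline from the input are gone after re-serialization.
```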

Remove fields instead of masking

sanitizeData(
  { password: 'secret', token: 'abc', username: 'mark' },
  { removeMatches: true },
);
// => { username: 'mark' }

Sanitize PII and PHI with custom patterns

Use customPatterns to mask fields that are sensitive for your domain, such as PII or PHI fields.

import { sanitizeData } from 'data-sanitization';

const sensitivePatterns = [
  'address',
  'date_of_birth',
  'email',
  'emergency_contact',
  'full_name',
  'health_card',
  'ip_address',
  'medications',
  'phone',
  'postal_code',
  'ssn',
];

const patient = {
  accountId: 'acct_123',
  full_name: 'Avery Example',
  email: 'avery@example.com',
  phone: '+1-555-0100',
  date_of_birth: '1989-04-12',
  health_card: 'HC-1234-5678',
  medications: ['example-medication'],
};

sanitizeData(patient, {
  customPatterns: sensitivePatterns,
  useDefaultPatterns: false,
});
// => {
//   accountId: 'acct_123',
//   full_name: '**********',
//   email: '**********',
//   phone: '**********',
//   date_of_birth: '**********',
//   health_card: '**********',
//   medications: '**********',
// }

Use removeMatches with the same patterns to remove those fields instead of masking them.

sanitizeData(patient, {
  customPatterns: sensitivePatterns,
  removeMatches: true,
  useDefaultPatterns: false,
});
// => { accountId: 'acct_123' }

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `patternMask` | `string` | `**********` | String used to replace matched string field values |
| `numericMask` | `number` | `9999999999` | Number used to replace matched number field values |
| `removeMatches` | `boolean` | `false` | Remove matched fields entirely instead of masking |
| `scanStringValues` | `boolean` | `true` | Scan string values on non-sensitive keys for embedded patterns. Applies to object input and to string input when `parseJsonStrings` is enabled; has no effect on raw string input. |
| `parseJsonStrings` | `boolean` | `false` | Parse valid JSON string inputs as structured data and sanitize by field name. Re-serializes with `JSON.stringify`, discarding original whitespace. |
| `customPatterns` | `string[]` | `[]` | Additional field name patterns to match |
| `customMatchers` | `DataSanitizationMatcher[]` | `[]` | Additional regex matchers for custom string formats |
| `useDefaultPatterns` | `boolean` | `true` | Set to `false` to use only your custom patterns, ignoring the built-in defaults. |
| `useDefaultMatchers` | `boolean` | `true` | Set to `false` to use only your custom matchers, ignoring the built-in defaults. |

Default patterns

The following field name patterns are matched by default using a case-insensitive substring match:

  • apikey
  • api_key
  • password
  • secret
  • token

A field named db_password or client_secret_key would also match because these patterns match as substrings.
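As a rough illustration of what a case-insensitive substring match implies, here is a small sketch. `matchesPattern` is a hypothetical helper, not an export of the package, and the library may match keys differently internally.

```typescript
// Assumed defaults, copied from the list above.
const defaultPatterns = ['apikey', 'api_key', 'password', 'secret', 'token'];

// Case-insensitive substring test: the pattern may appear anywhere in the key.
function matchesPattern(key: string, patterns: string[] = defaultPatterns): boolean {
  const lowered = key.toLowerCase();
  return patterns.some((pattern) => lowered.includes(pattern.toLowerCase()));
}

matchesPattern('db_password');       // true
matchesPattern('Client_Secret_Key'); // true
matchesPattern('username');          // false
matchesPattern('tokenizer_version'); // true: substring matching can over-match
```

The last call is worth noting: a broad pattern such as token also matches unrelated keys that happen to contain it, so short custom patterns deserve a quick check against your real field names.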

Default matchers

Three matchers are included by default:

  • JSON matcher: matches "fieldName":"value" patterns in JSON and JSON-like strings
  • Escaped JSON matcher: matches \"fieldName\":\"value\" patterns in JSON embedded inside JSON string values
  • Form-encoded matcher: matches fieldName=value and fieldName:value patterns in URL-encoded and similarly delimited strings
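To make the matcher idea concrete, the sketch below builds two regexes in the spirit of the JSON and form-encoded matchers and applies them to the strings from the earlier examples. The regexes, the `Matcher` type, and `maskText` are assumptions written for illustration; the built-in matchers are likely more thorough (for instance, this JSON regex does not handle escaped quotes inside a value).

```typescript
// Hypothetical regexes for illustration; the library's built-in matchers may differ.
type Matcher = (pattern: string) => RegExp;

// "fieldName":"value", with the pattern allowed anywhere inside the field name
const jsonMatcher: Matcher = (pattern) =>
  new RegExp(`("[^"]*${pattern}[^"]*"\\s*:\\s*")[^"]*(")`, 'gi');

// fieldName=value or fieldName:value, terminated by & or end of input
const formMatcher: Matcher = (pattern) =>
  new RegExp(`(${pattern}\\s*[=:]\\s*)[^&\\s]*(&|$)`, 'gi');

function maskText(
  input: string,
  patterns: string[],
  matchers: Matcher[],
  mask = '**********',
): string {
  let output = input;
  for (const pattern of patterns) {
    for (const matcher of matchers) {
      // Groups 1 and 2 are kept; only the value between them is replaced.
      output = output.replace(
        matcher(pattern),
        (_match: string, before: string, after: string) => `${before}${mask}${after}`,
      );
    }
  }
  return output;
}

const patterns = ['password', 'token'];
const matchers = [jsonMatcher, formMatcher];

const json = maskText('{"password":"secret","username":"mark"}', patterns, matchers);
// json === '{"password":"**********","username":"mark"}'

const form = maskText('password=secret&username=mark', patterns, matchers);
// form === 'password=**********&username=mark'
```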

Custom patterns and matchers

Use customPatterns to add field names on top of the defaults, or use useDefaultPatterns: false to replace the defaults entirely:

import { sanitizeData } from 'data-sanitization';

const data = {
  username: 'mark',
  ssn: '123-45-6789',
  credit_card: '4111111111111111',
};

// Add to the built-in defaults
sanitizeData(data, {
  customPatterns: ['ssn', 'credit_card'],
});
// => { username: 'mark', ssn: '**********', credit_card: '**********' }

// Use only specific patterns, ignoring the defaults
sanitizeData(data, {
  customPatterns: ['ssn'],
  useDefaultPatterns: false,
});
// => { username: 'mark', ssn: '**********', credit_card: '4111111111111111' }

// Use a different mask string
sanitizeData(data, {
  customPatterns: ['ssn', 'credit_card'],
  patternMask: '[REDACTED]',
});
// => { username: 'mark', ssn: '[REDACTED]', credit_card: '[REDACTED]' }

Number-typed sensitive values are masked with numericMask to preserve the field's type:

sanitizeData({ password: 12345, username: 'mark' });
// => { password: 9999999999, username: 'mark' }

sanitizeData({ password: 12345, username: 'mark' }, { numericMask: 0 });
// => { password: 0, username: 'mark' }

For custom data formats, provide a DataSanitizationMatcher: a function that takes a pattern string and returns a global, case-insensitive RegExp. The regex must define two capture groups. The first captures the field name and its leading delimiter, the second captures the trailing delimiter. Both are written back as $1 and $2, and only the value between them is replaced.

const headerMatcher = (pattern: string) =>
  new RegExp(`(${pattern}:\\s*).+?(\\n|$)`, 'gi');

sanitizeData('authorization: Bearer abc123\nuser: mark', {
  customMatchers: [headerMatcher],
  customPatterns: ['authorization'],
  useDefaultMatchers: false,
});
// => 'authorization: **********\nuser: mark'
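Because a matcher with the wrong flags or too few capture groups can silently fail to mask anything, a small pre-flight check in your own test suite can help. The helper below is a hypothetical sketch, not part of the package; `checkMatcher` and `countCaptureGroups` are names invented here, and the rules it checks are taken from the description above.

```typescript
type Matcher = (pattern: string) => RegExp;

// Count capture groups by matching the empty string against `source|`.
// The empty alternative always matches, so the result length is groups + 1.
function countCaptureGroups(regex: RegExp): number {
  const probe = new RegExp(`${regex.source}|`, regex.flags).exec('');
  return probe === null ? 0 : probe.length - 1;
}

// Hypothetical pre-flight check; returns a list of problems (empty when OK).
function checkMatcher(matcher: Matcher, samplePattern = 'sample'): string[] {
  const regex = matcher(samplePattern);
  const problems: string[] = [];
  if (!regex.global) problems.push('missing "g" flag');
  if (!regex.ignoreCase) problems.push('missing "i" flag');
  if (countCaptureGroups(regex) < 2) problems.push('needs capture groups 1 and 2');
  return problems;
}

// The header matcher from the example above passes.
const headerMatcher: Matcher = (pattern) =>
  new RegExp(`(${pattern}:\\s*).+?(\\n|$)`, 'gi');
const headerProblems = checkMatcher(headerMatcher);
// headerProblems => []

// A matcher with no flags and no groups is reported.
const brokenMatcher: Matcher = (pattern) => new RegExp(`${pattern}=.*`);
const brokenProblems = checkMatcher(brokenMatcher);
// brokenProblems => ['missing "g" flag', 'missing "i" flag', 'needs capture groups 1 and 2']
```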

Error handling

sanitizeData throws a DataSanitizationError when:

  • The input is not a string, object, or null.
  • An unexpected error occurs during sanitization.

import { sanitizeData, DataSanitizationError } from 'data-sanitization';

try {
  sanitizeData(123 as any);
} catch (error) {
  if (error instanceof DataSanitizationError) {
    console.error(error.message); // 'Invalid data type'
    console.error(error.details); // { inputType: 'number' }
  }
}

Error details are limited to safe diagnostic metadata and do not include the original input payload.
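In a log pipeline, the same principle applies to your own catch block: if sanitization throws, log a placeholder rather than the raw input. The wrapper below is one possible sketch. `sanitizeForLog` and the `Sanitizer` type are hypothetical, and a stand-in function replaces sanitizeData so the example runs without the package installed.

```typescript
// Hypothetical helper: never let a sanitization failure leak the raw payload.
type Sanitizer = (input: unknown) => unknown;

function sanitizeForLog(sanitize: Sanitizer, input: unknown): unknown {
  try {
    return sanitize(input);
  } catch (error) {
    // Report only the error name, never `input` itself.
    const reason = error instanceof Error ? error.name : 'UnknownError';
    return { sanitizationFailed: true, reason };
  }
}

// Stand-in for sanitizeData that always fails, to exercise the fallback.
const throwingSanitizer: Sanitizer = () => {
  throw new TypeError('Invalid data type');
};

const fallback = sanitizeForLog(throwingSanitizer, { password: 'super-secret' });
// fallback => { sanitizationFailed: true, reason: 'TypeError' }
```

In an application, the first argument would be a thin wrapper that calls sanitizeData with your options.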

How it works

sanitizeData dispatches on the input type and applies the configured patterns and matchers accordingly:

  1. String input is sanitized directly via regex replacement with the configured matchers.
  2. Object input is sanitized recursively by key name without JSON serialization. Sensitive keys are masked or removed regardless of whether their values are strings, numbers, arrays, objects, or other primitives.
  3. Plain nested objects and arrays are cloned as they are sanitized. Non-plain object instances are preserved without modification to avoid corrupting their prototypes.
  4. Null input is accepted and returns null.
  5. For object input, each pattern is matched case-insensitively against key names. By default (scanStringValues: true), string values on non-sensitive keys are also scanned, which catches credentials embedded in log messages or other free-text fields.
  6. For string input, each pattern is tested against each matcher to find and replace sensitive values in the raw string directly.
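The object path described in steps 2, 3, and 5 can be approximated as follows. This is a simplified sketch under stated assumptions, not the library's code: `sanitizeValue`, `SketchOptions`, and `isPlainObject` are hypothetical names, circular references are handled with a WeakMap of already-cloned nodes, and string-value scanning is left out for brevity.

```typescript
// Hypothetical sketch of key-based recursion; not the library's actual code.
interface SketchOptions {
  patterns: string[];
  patternMask: string;
  numericMask: number;
  removeMatches: boolean;
}

const isPlainObject = (value: unknown): value is Record<string, unknown> => {
  if (value === null || typeof value !== 'object') return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
};

function sanitizeValue(
  value: unknown,
  options: SketchOptions,
  seen: WeakMap<object, unknown> = new WeakMap(),
): unknown {
  if (Array.isArray(value)) {
    if (seen.has(value)) return seen.get(value); // cycle: reuse the clone
    const clone: unknown[] = [];
    seen.set(value, clone);
    for (const item of value) clone.push(sanitizeValue(item, options, seen));
    return clone;
  }
  // Primitives and non-plain instances (Date, class objects) pass through as-is.
  if (!isPlainObject(value)) return value;
  if (seen.has(value)) return seen.get(value);

  const clone: Record<string, unknown> = {};
  seen.set(value, clone);
  for (const [key, child] of Object.entries(value)) {
    const sensitive = options.patterns.some((pattern) =>
      key.toLowerCase().includes(pattern.toLowerCase()),
    );
    if (!sensitive) {
      clone[key] = sanitizeValue(child, options, seen);
    } else if (!options.removeMatches) {
      // Numbers keep their type; every other value becomes the string mask.
      clone[key] = typeof child === 'number' ? options.numericMask : options.patternMask;
    }
    // With removeMatches, a sensitive key is simply not copied.
  }
  return clone;
}

const options: SketchOptions = {
  patterns: ['password', 'token'],
  patternMask: '**********',
  numericMask: 9999999999,
  removeMatches: false,
};

const session: Record<string, unknown> = {
  user: 'mark',
  auth: { token: 'abc', pin_password: 1234 },
};
session.self = session; // circular reference

const clean = sanitizeValue(session, options) as Record<string, unknown>;
// clean.auth => { token: '**********', pin_password: 9999999999 }
// clean.self === clean: the cycle points at the sanitized clone, not the original
```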

Performance

sanitizeData is designed for in-process sanitization of log payloads, request/response objects, and similar data before they leave your application. It is not designed for streaming pipelines or bulk batch processing of large files.

String-value scanning (scanStringValues: true, the default) adds overhead on object workloads. The cost depends on how many non-sensitive string fields the input has and how long they are. Rough throughput on a modern laptop (Apple M-series, Node.js 22):

| Workload | ops/s | ms/call | scan overhead |
| --- | --- | --- | --- |
| Shallow object (1 sensitive key) | ~464,000 | ~0.002 | ~18% |
| Log object, stack trace with credentials | ~46,000 | ~0.022 | ~88% |
| Log object, clean stack trace | ~318,000 | ~0.003 | ~18% |
| Object with 10KB non-sensitive string | ~200,000 | ~0.005 | ~68% |
| Large flat object (50 fields, 1 sensitive key) | ~82,000 | ~0.012 | ~10% |
| Array (1,000 items, 1 sensitive key each) | ~2,161 | ~0.46 | ~5% |
| Array (1,000,000 items, 1 sensitive key each) | ~1.7 | ~574 | ~4% |

Array workloads pay ~3–5% overhead regardless of size. The per-item pre-filter cost is negligible. The cost is most visible on individual objects with long non-sensitive string values such as stack traces or large text fields; a single 10KB non-sensitive string value incurs ~68% overhead.

Tip

Set scanStringValues: false when you control your data structure and know sensitive values only appear on sensitive-named keys. This removes the scan overhead listed in the table above.

Set parseJsonStrings: true when your string inputs are JSON. It is 3–4× faster than the default regex path and correctly masks numeric-valued sensitive fields.

On first call with a given set of options, sanitizeData compiles its regex set and caches the result by option fingerprint. Subsequent calls with the same options reuse the cache at no extra cost. This applies whether options are passed inline or as a variable, as long as the content is the same.
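The idea of caching by option content rather than by object identity can be sketched like this. Everything here is an assumption for illustration: `fingerprint`, `getCompiled`, and the use of JSON.stringify as the cache key are hypothetical, and the library's actual fingerprinting is not documented in this README.

```typescript
// Hypothetical sketch of caching compiled regexes by option content.
interface CacheableOptions {
  customPatterns?: string[];
  useDefaultPatterns?: boolean;
}

const DEFAULTS = ['apikey', 'api_key', 'password', 'secret', 'token'];
const compiledCache = new Map<string, RegExp[]>();
let compileCount = 0;

// Content-based key: two option objects with equal content share one entry.
function fingerprint(options: CacheableOptions): string {
  return JSON.stringify([
    options.useDefaultPatterns ?? true,
    options.customPatterns ?? [],
  ]);
}

function getCompiled(options: CacheableOptions = {}): RegExp[] {
  const key = fingerprint(options);
  const hit = compiledCache.get(key);
  if (hit !== undefined) return hit;

  compileCount += 1; // the expensive step, paid once per distinct fingerprint
  const useDefaults = options.useDefaultPatterns ?? true;
  const patterns = [...(useDefaults ? DEFAULTS : []), ...(options.customPatterns ?? [])];
  const compiled = patterns.map((pattern) => new RegExp(pattern, 'i'));
  compiledCache.set(key, compiled);
  return compiled;
}

const first = getCompiled({ customPatterns: ['ssn'] });
const second = getCompiled({ customPatterns: ['ssn'] }); // new object, same content
// first === second, and compileCount === 1

getCompiled({ customPatterns: ['ssn', 'user_42_field'] }); // different content
// compileCount === 2
```

Note that this sketch never evicts entries, so option sets that differ on every call would also grow the map without bound.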

Warning

Building customPatterns dynamically per call from variable data causes a cache miss on every call, so compilation runs on each request instead of being reused.

// Anti-pattern: patterns differ on every call, cache never hits
app.post('/log', (req) => {
  sanitizeData(req.body, {
    customPatterns: [...basePatterns, ...req.user.sensitiveFields],
  });
});

// Correct: build options once at startup (or per stable configuration)
const sanitizerOptions = {
  customPatterns: [...basePatterns, ...knownSensitiveFields],
};

app.post('/log', (req) => {
  sanitizeData(req.body, sanitizerOptions);
});

If dynamic options are unavoidable, set scanStringValues: false. This skips the string-scanning cache and avoids the fingerprinting overhead on every call.

When options must genuinely vary per call, each call pays the first-call compilation cost (~32× slower than a cached call).

For full benchmark tables, charts, and scaling analysis see docs/performance.md. To run the benchmarks:

yarn bench

Contributing

Bug reports and pull requests are welcome. Open an issue or PR on GitHub.

See docs/development.md for setup, build, test, and release instructions, and docs/ROADMAP.md for planned work.

License

MIT
