Detect and block toxic, hateful, violent, and other harmful content in both user inputs and AI outputs.

Content Policy

Detects and blocks harmful content categories using pattern matching and semantic analysis. Applies to both the user's prompt (input) and the AI's response (output).

const guard = new Guardian({
  content: {
    enabled:     true,
    sensitivity: 'medium',
    categories: {
      toxicity:    true,  // Offensive language, insults
      hate:        true,  // Hate speech targeting protected groups
      violence:    true,  // Threats, graphic violence
      selfHarm:    true,  // Content encouraging self-harm
      sexual:      false, // Explicit sexual content (disabled by default)
    },
  },
});

Configuration

Option               Type                        Default   Description
-------------------  --------------------------  --------  --------------------------
enabled              boolean                     false     Enable content policy
sensitivity          'low' | 'medium' | 'high'   'medium'  Detection aggressiveness
onInput              boolean                     true      Check user prompts
onOutput             boolean                     true      Check AI responses
categories.toxicity  boolean                     true      Offensive / toxic language
categories.hate      boolean                     true      Hate speech
categories.violence  boolean                     true      Violent content / threats
categories.selfHarm  boolean                     true      Self-harm encouragement
categories.sexual    boolean                     false     Explicit sexual content
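The onInput and onOutput flags can scope checking to one direction. A minimal sketch (assuming the same Guardian constructor as above) that screens only AI responses, e.g. when prompts are already moderated upstream:

```javascript
// Hypothetical configuration sketch: skip prompt checks, scan responses only.
// onInput / onOutput correspond to the options in the table above.
const guard = new Guardian({
  content: {
    enabled: true,
    sensitivity: 'high',
    onInput: false,  // user prompts are screened elsewhere
    onOutput: true,  // still scan what the model produces
  },
});
```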

Example

try {
  await guard.protect(callFn, '[harmful prompt]');
} catch (err) {
  if (err instanceof ContentPolicyError) {
    console.log(err.code);              // 'CONTENT_POLICY_VIOLATION'
    console.log(err.context.category);  // 'toxicity'
    console.log(err.context.score);     // 0.91
    console.log(err.context.direction); // 'input' | 'output'
 
    // Return a safe response to the user
    return Response.json({
      error: 'Your message violates our content policy.'
    }, { status: 400 });
  }

  // Re-throw errors unrelated to the content policy
  throw err;
}
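Since err.context.category names which policy was violated, you may want to return a category-specific message instead of a generic one. A sketch of such a mapping (the wording and the helper name are illustrative, not part of the library):

```javascript
// Hypothetical helper: map a blocked category to a user-facing message.
// Category keys match the policy table above.
function messageForCategory(category) {
  const messages = {
    toxicity: 'Please rephrase your message without offensive language.',
    hate: 'Your message violates our policy on hate speech.',
    violence: 'Your message violates our policy on violent content.',
    selfHarm: 'If you are struggling, please reach out to a support line.',
    sexual: 'Your message violates our policy on explicit content.',
  };
  // Fall back to a generic message for unknown categories.
  return messages[category] ?? 'Your message violates our content policy.';
}
```

In the catch block above, you could then respond with messageForCategory(err.context.category) rather than a fixed string.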

Result Metadata

const result = await guard.protect(callFn, safePrompt);
console.log(result.meta.contentPolicy);
// {
//   passed: true,
//   inputScore: { toxicity: 0.02, hate: 0.01, violence: 0.0, selfHarm: 0.0, sexual: 0.0 },
//   outputScore: { toxicity: 0.01, hate: 0.0, violence: 0.0, selfHarm: 0.0, sexual: 0.0 }
// }
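Even when a call passes, the per-category scores can be useful for logging near-threshold traffic. A small self-contained sketch (the helper name is an assumption, not a library export) that picks the highest-scoring category from a score object like inputScore or outputScore:

```javascript
// Hypothetical helper: find the highest-scoring category in a score object,
// e.g. to flag prompts that passed but came close to the threshold.
function topCategory(scores) {
  return Object.entries(scores).reduce(
    (top, [category, score]) => (score > top.score ? { category, score } : top),
    { category: null, score: -Infinity },
  );
}

topCategory({ toxicity: 0.02, hate: 0.01, violence: 0.0, selfHarm: 0.0, sexual: 0.0 });
// → { category: 'toxicity', score: 0.02 }
```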

Standalone Usage

import { analyzeContent } from '@edwinfom/ai-guard/content';
 
const report = await analyzeContent('I hate all people from [group]', {
  categories: { hate: true },
});
// { violations: [{ category: 'hate', score: 0.94, severity: 'high' }] }
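A report may contain several violations; when deciding how to respond you typically care about the most severe one. A sketch of such a reducer over the report shape shown above (the helper name is illustrative):

```javascript
// Hypothetical helper: pick the highest-scoring violation from an
// analyzeContent-style report ({ violations: [{ category, score, severity }] }),
// or null when the report is clean.
function worstViolation(report) {
  if (!report.violations || report.violations.length === 0) return null;
  return report.violations.reduce((worst, v) => (v.score > worst.score ? v : worst));
}

worstViolation({ violations: [{ category: 'hate', score: 0.94, severity: 'high' }] });
// → { category: 'hate', score: 0.94, severity: 'high' }
```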