Data Classification
Kimberlite provides 8 data classification levels to support multi-framework compliance across healthcare (HIPAA), privacy (GDPR), financial (PCI DSS, SOX), and security standards (ISO 27001, FedRAMP).
Why Classification Matters
Data classification determines:
- Encryption requirements - Which data must be encrypted at rest
- Access controls - Who can access the data
- Audit logging - What must be logged
- Retention policies - How long data must be kept
- Compliance scope - Which regulations apply
Without proper classification, you risk:
- HIPAA violations ($50K+ fines per violation)
- GDPR violations (fines of up to €20M or 4% of global annual revenue, whichever is higher)
- PCI DSS non-compliance (loss of payment processing)
- Data breaches from inadequate protection
Classification Levels (8 Total)
Kimberlite supports 8 classification levels, ordered from least to most restrictive:
Public < Deidentified < Confidential < PII < Financial < PCI < Sensitive < PHI
1. Protected Health Information (PHI)
Definition: Medical records and health information subject to HIPAA regulations.
Examples:
- Patient medical records
- Diagnoses and treatment plans
- Lab results and prescriptions
- Clinical notes
Compliance: HIPAA Privacy Rule, HIPAA Security Rule
Requirements:
- Encryption at rest (HIPAA § 164.312(a)(2)(iv))
- Audit logging (HIPAA § 164.312(b))
- Access controls (HIPAA § 164.312(a)(1))
- Minimum retention: 6 years, measured from the date of creation or the date last in effect, whichever is later (HIPAA § 164.530(j))
Applicable Frameworks: HIPAA, GDPR (as PII), ISO 27001, FedRAMP
2. Deidentified Data
Definition: Data stripped of all 18 HIPAA identifiers per Safe Harbor method.
Examples:
- Anonymized patient datasets
- Aggregate statistics
- De-identified cohort studies
Compliance: HIPAA Safe Harbor Method (§ 164.514(b)(2))
Requirements:
- No encryption required (no identifiers present)
- No audit logging required
- Can be used for research without consent
Applicable Frameworks: HIPAA only
Note: Must remove all 18 identifiers: names, addresses, dates (except year), phone numbers, fax numbers, email addresses, SSNs, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photos, and any other unique identifying numbers.
3. Personally Identifiable Information (PII)
Definition: Information that can identify an individual (GDPR Article 4).
Examples:
- Names, email addresses, phone numbers
- IP addresses, location data
- User profiles, customer contacts
- Online identifiers (cookies, device IDs)
Compliance: GDPR Articles 5-11 (lawfulness, consent, purpose limitation)
Requirements:
- Encryption at rest (GDPR Article 32(1)(a))
- Audit logging (GDPR Article 30)
- Access controls (GDPR Article 32(1)(b))
- Purpose-limited retention (GDPR Article 5(1)(e))
Data Subject Rights:
- Right to access (GDPR Article 15)
- Right to rectification (GDPR Article 16)
- Right to erasure/“right to be forgotten” (GDPR Article 17)
- Right to data portability (GDPR Article 20)
Applicable Frameworks: GDPR, ISO 27001, FedRAMP
4. Sensitive Personal Data
Definition: Special category data requiring explicit consent (GDPR Article 9).
Examples:
- Racial or ethnic origin
- Political opinions, religious beliefs
- Trade union membership
- Genetic data, biometric data
- Health data (not covered by HIPAA)
- Data concerning sex life or sexual orientation
Compliance: GDPR Article 9 (explicit consent required)
Requirements:
- Encryption at rest (GDPR Article 32)
- Audit logging (GDPR Article 30)
- Explicit consent (GDPR Article 9)
- Data Protection Impact Assessment (GDPR Article 35)
- Stricter access controls than regular PII
Processing Restrictions: Processing prohibited unless:
- Explicit consent obtained
- Processing necessary for employment/social security
- Legal claims or vital interests
- Legitimate activities of foundation/association
- Data made public by data subject
- Specific legal authorization
Applicable Frameworks: GDPR, ISO 27001, FedRAMP
5. Payment Card Industry Data (PCI)
Definition: Cardholder data subject to PCI DSS requirements.
Examples:
- Credit/debit card numbers (Primary Account Number - PAN)
- Cardholder name with PAN
- Expiration date with PAN
- Service code
Compliance: PCI DSS Requirements 1-12
Requirements:
- Encryption at rest (PCI DSS Requirement 3)
- Audit logging (PCI DSS Requirement 10)
- Access controls (PCI DSS Requirement 7)
- Network segmentation (PCI DSS Requirement 1)
- Audit log retention: at least 1 year, with the most recent 3 months immediately available (PCI DSS Requirement 10)
Critical Restrictions:
- NEVER store full magnetic stripe data
- NEVER store CVV/CVV2/CVC2 after authorization
- NEVER store PIN or PIN block
- Mask PAN when displaying (show last 4 digits only)
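The masking rule above can be sketched as a small helper. This is a minimal illustration, not part of Kimberlite's API: the function name `mask_pan` and its handling of separators and short inputs are assumptions made for the example.

```rust
/// Mask a primary account number (PAN) for display, keeping only the last 4 digits.
/// Hypothetical helper for illustration; not a Kimberlite function.
fn mask_pan(pan: &str) -> String {
    // Drop separators such as spaces and dashes; keep digits only.
    let digits: Vec<char> = pan.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.len() <= 4 {
        // Too short to be a real PAN: mask everything rather than reveal it.
        return "*".repeat(digits.len());
    }
    let visible_from = digits.len() - 4;
    digits
        .iter()
        .enumerate()
        .map(|(i, &d)| if i < visible_from { '*' } else { d })
        .collect()
}
```

Masking applies to display only. It does not replace encryption of the stored PAN.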
Applicable Frameworks: PCI DSS, GDPR (as PII), ISO 27001, FedRAMP
6. Financial Data
Definition: Financial records subject to Sarbanes-Oxley (SOX) regulations.
Examples:
- General ledger entries
- Financial statements
- Audit trails for financial transactions
- Balance sheets, income statements
- Revenue and expense records
Compliance: Sarbanes-Oxley Act § 302, § 404
Requirements:
- Encryption at rest
- Audit logging (SOX § 404)
- Immutable audit trails
- Internal controls documentation
- Minimum retention: 7 years (SOX § 802)
Applicability: Public companies and their partners
Applicable Frameworks: SOX, ISO 27001, FedRAMP
7. Confidential Data
Definition: Internal business data and trade secrets.
Examples:
- Proprietary algorithms
- Business strategies and plans
- Internal communications
- Employee records
- Trade secrets, intellectual property
Compliance: ISO 27001 Annex A.8 (Asset Management)
Requirements:
- Encryption at rest (best practice)
- Audit logging (best practice)
- Access restricted to authorized personnel
- Retention: Business-defined
Applicable Frameworks: ISO 27001, FedRAMP
8. Public Data
Definition: Publicly available information with no restrictions.
Examples:
- Public website content
- Press releases
- Published research papers
- Blog posts
- Public announcements
Compliance: No special requirements
Requirements:
- No encryption required
- No audit logging required
- Unrestricted access
Applicable Frameworks: None (no compliance restrictions)
Automatic Classification
Kimberlite can automatically infer data classification based on stream names:
use infer_from_stream_name;
use DataClass;
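// Illustrative arguments: stream names follow the keyword table below; the call shape is assumed.
// Healthcare patterns
assert_eq!(infer_from_stream_name("patient_medical_records"), DataClass::PHI);
assert_eq!(infer_from_stream_name("deidentified_patient_cohort"), DataClass::Deidentified);
// Privacy patterns
assert_eq!(infer_from_stream_name("user_email_addresses"), DataClass::PII);
assert_eq!(infer_from_stream_name("biometric_templates"), DataClass::Sensitive);
// Financial patterns
assert_eq!(infer_from_stream_name("credit_card_vault"), DataClass::PCI);
assert_eq!(infer_from_stream_name("general_ledger"), DataClass::Financial);
// General patterns
assert_eq!(infer_from_stream_name("public_blog_posts"), DataClass::Public);
assert_eq!(infer_from_stream_name("data_stream_1"), DataClass::Confidential);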
Pattern Keywords:
| Classification | Keywords |
|---|---|
| PHI | patient, medical, health, diagnosis, prescription, lab_result, clinical |
| Deidentified | deidentified, anonymized, aggregate |
| PCI | credit_card, payment, card_transaction, cvv, pan |
| Financial | financial, ledger, accounting, audit_trail, balance_sheet, revenue |
| Sensitive | biometric, genetic, racial, ethnic, political, religious, sexual_orientation, trade_union |
| PII | user, customer, person, profile, contact, email, phone, address |
| Confidential | internal, confidential, proprietary, trade_secret, strategy |
| Public | public, announcement, press_release, blog, documentation |
Safe Default: If no patterns match, Kimberlite defaults to Confidential (better to be too restrictive than too permissive).
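To make the keyword matching and the safe default concrete, here is a minimal, self-contained sketch. It is not Kimberlite's implementation: the function name `infer_class_from_name`, the whole-token matching on underscores, and the rule order (Deidentified checked before PHI, so that a name like deidentified_patient_cohort resolves to Deidentified) are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

/// True if any keyword appears as a whole underscore-delimited token run in `padded`.
fn has_any(padded: &str, keywords: &[&str]) -> bool {
    keywords.iter().any(|kw| padded.contains(format!("_{}_", kw).as_str()))
}

/// Hypothetical re-implementation of name-based inference (rule order is assumed).
fn infer_class_from_name(name: &str) -> DataClass {
    // Pad with underscores so "pan" matches "card_pan" but not "company".
    let padded = format!("_{}_", name.to_lowercase());
    if has_any(&padded, &["deidentified", "anonymized", "aggregate"]) {
        return DataClass::Deidentified;
    }
    if has_any(&padded, &["patient", "medical", "health", "diagnosis", "prescription", "lab_result", "clinical"]) {
        return DataClass::PHI;
    }
    if has_any(&padded, &["credit_card", "payment", "card_transaction", "cvv", "pan"]) {
        return DataClass::PCI;
    }
    if has_any(&padded, &["financial", "ledger", "accounting", "audit_trail", "balance_sheet", "revenue"]) {
        return DataClass::Financial;
    }
    if has_any(&padded, &["biometric", "genetic", "racial", "ethnic", "political", "religious", "sexual_orientation", "trade_union"]) {
        return DataClass::Sensitive;
    }
    if has_any(&padded, &["user", "customer", "person", "profile", "contact", "email", "phone", "address"]) {
        return DataClass::PII;
    }
    if has_any(&padded, &["internal", "confidential", "proprietary", "trade_secret", "strategy"]) {
        return DataClass::Confidential;
    }
    if has_any(&padded, &["public", "announcement", "press_release", "blog", "documentation"]) {
        return DataClass::Public;
    }
    // Safe default: no pattern matched, so stay restrictive.
    DataClass::Confidential
}
```

Matching whole tokens avoids accidental hits from short keywords such as pan, at the cost of missing plural forms. A real implementation may make a different trade-off.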
Content-Based Classification (ML)
Beyond stream name inference, Kimberlite provides content-based classification that scans actual field values to detect sensitive data patterns:
use ;
// Scan individual fields
let scanner = ContentScanner;
assert!;
assert!;
assert!;
assert!;
assert!;
// Classify content with confidence scoring
let result = classify_content;
assert_eq!;
assert!;
Detected Patterns:
| Pattern | Detection Method | Classification |
|---|---|---|
| SSN (XXX-XX-XXXX) | Regex | PII |
| Credit card numbers | Regex + Luhn check | PCI |
| Email addresses | Regex | PII |
| Medical terms | Keyword dictionary (ICD-10 codes, drug names, diagnoses) | PHI |
| Financial terms | Keyword dictionary (CUSIP, ISIN, account numbers) | Financial |
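The Luhn check listed for credit card numbers is a standard checksum. The sketch below shows it using only the standard library. The function name `passes_luhn` and the 13 to 19 digit length window are assumptions for this example, and the regex pre-filter mentioned in the table is replaced here by a simple character scan.

```rust
/// Returns true if `candidate` contains 13 to 19 digits (spaces and dashes allowed)
/// and those digits pass the Luhn checksum. Hypothetical helper, not a Kimberlite API.
fn passes_luhn(candidate: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in candidate.chars() {
        match c {
            '0'..='9' => digits.push(c as u32 - '0' as u32),
            ' ' | '-' => continue,
            _ => return false, // any other character disqualifies the candidate
        }
    }
    if !(13..=19).contains(&digits.len()) {
        return false;
    }
    // Walk from the rightmost digit; double every second digit and
    // subtract 9 when the doubled value exceeds 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

A passing checksum only means the digits are well formed. It cuts false positives from arbitrary digit runs but does not prove the value is a real card number.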
Confidence Scoring:
- Each field scan contributes a confidence weight (0.0-1.0)
- Multiple detections across fields increase overall confidence
- The higher_class() function ensures the most restrictive classification wins
- Results include matched_patterns for audit trail transparency
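One way to picture how per-field detections could roll up into a single result is sketched below. The types `Detection` and `ContentResult`, the signature given to `higher_class`, and the combination rule (one minus the product of the per-detection miss probabilities) are assumptions for illustration. The source does not state the actual formula Kimberlite uses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

/// One pattern hit in one field (hypothetical type).
struct Detection { class: DataClass, pattern: &'static str, weight: f64 }

/// Aggregated result (hypothetical type).
struct ContentResult { class: DataClass, confidence: f64, matched_patterns: Vec<&'static str> }

/// The more restrictive of two classes, using the derived declaration order.
fn higher_class(a: DataClass, b: DataClass) -> DataClass {
    a.max(b)
}

/// Combine detections: the most restrictive class wins, and confidence grows
/// with each additional detection (assumed rule: 1 - product of (1 - weight)).
fn combine(detections: &[Detection]) -> Option<ContentResult> {
    if detections.is_empty() {
        return None;
    }
    let mut class = DataClass::Public;
    let mut miss = 1.0;
    let mut matched_patterns = Vec::new();
    for d in detections {
        class = higher_class(class, d.class);
        miss *= 1.0 - d.weight.clamp(0.0, 1.0);
        matched_patterns.push(d.pattern);
    }
    Some(ContentResult { class, confidence: 1.0 - miss, matched_patterns })
}
```

Under this assumed rule, two independent detections of weight 0.5 give an overall confidence of 0.75, and the result never exceeds 1.0.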
Use Cases:
- Automated compliance tagging during data ingestion
- Classification validation — verify user-assigned classification matches content
- Data discovery — scan existing streams to identify unclassified sensitive data
- Breach assessment — quickly determine data classes affected by an incident
Validation
Kimberlite prevents users from under-classifying data:
use validate_user_classification;
use DataClass;
// ✅ Allowed: User matches inference
validate_user_classification; // true
// ✅ Allowed: User is MORE restrictive (safe)
validate_user_classification; // true
// ❌ Denied: User is LESS restrictive (dangerous!)
validate_user_classification; // false
validate_user_classification; // false
Restrictiveness Ordering:
Public (0) < Deidentified (1) < Confidential (2) < PII (3) <
Financial (4) < PCI (5) < Sensitive (6) < PHI (7)
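The ordering above reduces the validation rule to a rank comparison. The following sketch assumes a function `is_allowed(inferred, user)` that takes two classes. The real validate_user_classification signature is not shown in this page, so the name and argument order here are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass {
    Public = 0,
    Deidentified = 1,
    Confidential = 2,
    PII = 3,
    Financial = 4,
    PCI = 5,
    Sensitive = 6,
    PHI = 7,
}

impl DataClass {
    /// Numeric restrictiveness rank, matching the ordering listed above.
    fn rank(self) -> u8 {
        self as u8
    }
}

/// A user-chosen class is acceptable only if it is at least as restrictive
/// as the inferred class (hypothetical helper, not the Kimberlite function).
fn is_allowed(inferred: DataClass, user: DataClass) -> bool {
    user.rank() >= inferred.rank()
}
```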
Framework Mapping
| Classification | HIPAA | GDPR | PCI DSS | SOX | ISO 27001 | FedRAMP |
|---|---|---|---|---|---|---|
| PHI | ✓ | ✓ (PII) | - | - | ✓ | ✓ |
| Deidentified | ✓ | - | - | - | - | - |
| PII | - | ✓ | - | - | ✓ | ✓ |
| Sensitive | - | ✓ (Art 9) | - | - | ✓ | ✓ |
| PCI | - | ✓ (PII) | ✓ | - | ✓ | ✓ |
| Financial | - | - | - | ✓ | ✓ | ✓ |
| Confidential | - | - | - | - | ✓ | ✓ |
| Public | - | - | - | - | - | - |
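The mapping can also be expressed as a lookup, following the "Applicable Frameworks" line given for each level. This is a sketch under assumptions: the `Framework` enum and the function name `frameworks_for` are invented for the example and may not match Kimberlite's own types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Framework { Hipaa, Gdpr, PciDss, Sox, Iso27001, FedRamp }

/// Frameworks that apply to a classification level (hypothetical lookup).
fn frameworks_for(class: DataClass) -> &'static [Framework] {
    use Framework::*;
    match class {
        DataClass::PHI => &[Hipaa, Gdpr, Iso27001, FedRamp],
        DataClass::Deidentified => &[Hipaa],
        DataClass::PII => &[Gdpr, Iso27001, FedRamp],
        DataClass::Sensitive => &[Gdpr, Iso27001, FedRamp],
        DataClass::PCI => &[Gdpr, PciDss, Iso27001, FedRamp],
        DataClass::Financial => &[Sox, Iso27001, FedRamp],
        DataClass::Confidential => &[Iso27001, FedRamp],
        DataClass::Public => &[],
    }
}
```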
Usage Examples
Creating a Stream with Classification
use Kimberlite;
use ;
let kmb = open?;
// Healthcare stream
kmb.create_stream?;
// Payment stream
kmb.create_stream?;
// Public stream
kmb.create_stream?;
Querying Classification Requirements
use *;
use DataClass;
// Check encryption requirements
assert!; // true
assert!; // true
assert!; // false
// Check audit logging requirements
assert!; // true
assert!; // false
// Check consent requirements (GDPR Article 9)
assert!; // true
assert!; // false
// Get minimum retention periods
assert_eq!; // 6 years
assert_eq!; // 7 years
assert_eq!; // 1 year
assert_eq!; // No requirement
// Get applicable frameworks
let phi_frameworks = applicable_frameworks;
assert!;
assert!;
assert!;
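For readers who want to see the per-level requirements in one place, the sketch below collects them into a single lookup, using the values stated in the level descriptions. The `Requirements` struct, the `Level` enum, and `requirements_for` are hypothetical names, not Kimberlite's query API. The three-state `Level` is an assumption chosen to represent the "best practice" entries for Confidential data, and the 1-year PCI figure mirrors the retention comment in the example above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level { Required, Recommended, NotRequired }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Requirements {
    encryption_at_rest: Level,
    audit_logging: Level,
    explicit_consent: bool,
    /// Fixed minimum retention in years; None where no fixed minimum is stated.
    min_retention_years: Option<u32>,
}

/// Hypothetical lookup built from the level descriptions in this page.
fn requirements_for(class: DataClass) -> Requirements {
    use DataClass::*;
    // Encryption and audit logging share the same level for every class listed.
    let level = match class {
        PHI | PII | Sensitive | PCI | Financial => Level::Required,
        Confidential => Level::Recommended,
        Deidentified | Public => Level::NotRequired,
    };
    let min_retention_years = match class {
        PHI => Some(6),
        Financial => Some(7),
        PCI => Some(1),
        _ => None,
    };
    Requirements {
        encryption_at_rest: level,
        audit_logging: level,
        explicit_consent: class == Sensitive,
        min_retention_years,
    }
}
```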
Best Practices
1. Start Restrictive, Relax Later
Default to Confidential if unsure. It’s easier to relax restrictions later than to retroactively add them.
// Safe default
Confidential
// Don't guess - if it might be PHI, classify as PHI
PHI
2. Use Pattern-Based Naming
Name streams to match classification patterns:
// Good: Clear classification
"patient_medical_records" // → PHI
"deidentified_patient_cohort" // → Deidentified
"user_email_addresses" // → PII
"credit_card_vault" // → PCI
"public_blog_posts" // → Public
// Bad: Ambiguous
"data_stream_1" // → Confidential (default)
"records" // → Confidential (default)
3. Document Exceptions
If you override automatic classification, document why:
// Override: Marketing emails are public (user consent obtained)
create_stream?;
// ^ Documented in privacy policy: users opt-in to public marketing
4. Audit Classification Changes
Track when classifications change:
// Log classification changes for compliance audit trail
info!;
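As an illustration of what such a log entry might carry, the sketch below formats a classification change as a single structured line and flags downgrades. The `ClassificationChange` struct and its fields are assumptions for this example. A timestamp is left out so the output stays deterministic, and a real audit record would normally include one.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

/// One classification change event (hypothetical record shape).
struct ClassificationChange<'a> {
    stream: &'a str,
    from: DataClass,
    to: DataClass,
    changed_by: &'a str,
    reason: &'a str,
}

impl ClassificationChange<'_> {
    /// A change to a less restrictive class deserves extra review.
    fn is_downgrade(&self) -> bool {
        self.to < self.from
    }
}

impl fmt::Display for ClassificationChange<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "classification_change stream={} from={:?} to={:?} downgrade={} by={} reason={:?}",
            self.stream, self.from, self.to, self.is_downgrade(), self.changed_by, self.reason
        )
    }
}
```

The formatted line can be handed to whatever logging macro the application already uses.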
5. Multi-Framework Compliance
For data spanning multiple frameworks, use the most restrictive classification:
// Health data in EU: Both HIPAA and GDPR apply
// Use PHI (most restrictive)
PHI
Formal Verification
All data classification properties are formally verified with:
TLA+ Specification: specs/tla/compliance/MetaFramework.tla
- DataClassIntegrity invariant
- DataClassificationComplete theorem (all 8 levels defined)
- EncryptionEnforced theorem (encryption requirements verified)
- FrameworkMappingCorrect theorem (framework mappings verified)
Kani Proofs: crates/kimberlite-kernel/src/kani_proofs.rs
- Proof #31: Classification restrictiveness ordering
- Proof #32: Classification inference determinism
Property Tests: crates/kimberlite-kernel/src/classification.rs
- 11 unit tests covering all classification levels
- Inference pattern matching
- User classification validation
Migration Guide
Upgrading from 3-Level to 8-Level Classification
Before (v0.3.0):
PHI // Protected Health Information
NonPHI // Everything else
Deidentified // Anonymized data
After (v0.4.0):
PHI // Protected Health Information (unchanged)
Deidentified // Anonymized data (unchanged)
// NonPHI split into 6 categories:
PII // User data (GDPR)
Sensitive // Special category data (GDPR Art 9)
PCI // Payment card data (PCI DSS)
Financial // Financial records (SOX)
Confidential // Internal business data
Public // Publicly available data
Automatic Migration:
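All existing NonPHI streams are migrated to Public (least restrictive). Because Public carries no encryption or audit logging requirements, review every migrated stream promptly and reclassify it based on its content.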
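The migration rule described here can be sketched as a simple mapping. The names `LegacyClass`, `migrate`, and `needs_review` are hypothetical and exist only for this example. The mapping itself follows the before and after listings above.

```rust
/// The 3-level scheme used before v0.4.0 (type name is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LegacyClass { PHI, NonPHI, Deidentified }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataClass { Public, Deidentified, Confidential, PII, Financial, PCI, Sensitive, PHI }

/// Map a legacy class to its 8-level equivalent, as described in this guide.
fn migrate(legacy: LegacyClass) -> DataClass {
    match legacy {
        LegacyClass::PHI => DataClass::PHI,
        LegacyClass::Deidentified => DataClass::Deidentified,
        // NonPHI lands on Public and must be reviewed afterwards.
        LegacyClass::NonPHI => DataClass::Public,
    }
}

/// Streams that came from NonPHI need a manual reclassification pass.
fn needs_review(legacy: LegacyClass) -> bool {
    matches!(legacy, LegacyClass::NonPHI)
}
```

Running `needs_review` over the pre-migration inventory gives the list of streams to reclassify by hand.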
See Also
- Compliance Overview - Multi-framework compliance architecture
- Encryption - Encryption recipes
- Audit Trails - Audit trail recipes
- Security - Security and compliance operations
- Access Control - RBAC and field-level security
Key Takeaway: Proper data classification is the foundation of compliance. Kimberlite’s 8-level classification system ensures you meet HIPAA, GDPR, PCI DSS, SOX, ISO 27001, and FedRAMP requirements with formal verification guarantees.