Methodology

How We Score Ingredient Safety

Our methodology for the 34,000+ ingredient safety assessments powering theformulator.ai

Every formulation generated on theformulator.ai includes per-ingredient safety scores across six hazard axes. This page explains how those scores are calculated, what data sources feed them, and why we built a proprietary system instead of licensing an existing one.

Why existing safety databases fall short for formulators

The two most widely referenced ingredient safety databases — EWG Skin Deep and SkinSafe — were designed for consumers, not formulators. This creates fundamental problems when R&D teams try to use them as data inputs for formulation decisions.

EWG Skin Deep scores approximately 11,500 ingredients. Its methodology applies a data gap penalty: ingredients with limited safety studies receive elevated hazard scores. For a formulator working with novel or niche raw materials, this means an absence of safety data gets treated as evidence of harm. The scoring algorithm is opaque: its weights and methodology are not published. Exposure context (whether an ingredient appears in a leave-on serum at 2% or a rinse-off shampoo at 0.5%) is not factored into the score. The database is also consumer-facing, with framing that tends toward alarmism rather than risk-based assessment.

SkinSafe takes a different approach, focusing on allergen and irritant avoidance based on Mayo Clinic data. While clinically grounded for contact dermatitis, it does not cover the full hazard spectrum that regulatory affairs teams need — carcinogenicity, reproductive toxicity, and endocrine disruption are outside its scope.

Neither system provides per-market regulatory status. A formulator developing for both EU and China needs to know that an ingredient permitted in Europe may be prohibited or restricted under NMPA — this cross-market view does not exist in consumer-facing databases. Neither system updates automatically from primary regulatory sources. Neither system adjusts scores based on product type or exposure duration.

We needed a scoring system that:

Draws from the same primary sources regulatory authorities use
Treats absence of data honestly — as a data gap, not a hazard signal
Accounts for exposure context (leave-on vs rinse-off)
Covers the full hazard spectrum across six independent axes
Provides per-market regulatory intelligence alongside safety scores
Updates from primary sources automatically, not via periodic manual review

Six axes of hazard assessment

Each ingredient is scored 0–10 on six independent hazard axes. A composite score is calculated, but all six axis scores are always visible to the formulator — because a single number hides critical information.
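
A minimal sketch of how a six-axis score record might be represented, assuming integer axis scores on the 0-10 scale. The class name, field names, and the "elevated" threshold are illustrative assumptions, not the platform's actual schema. No composite is computed here because its formula is not given on this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AxisScores:
    """Hypothetical record holding one 0-10 score per hazard axis."""
    carcinogenicity: int = 0
    dart: int = 0
    sensitization: int = 0
    systemic_toxicity: int = 0
    irritation: int = 0
    endocrine_disruption: int = 0

    def __post_init__(self):
        # Reject anything outside the 0-10 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 10:
                raise ValueError(f"{f.name} must be in 0..10, got {value}")

    def elevated(self, threshold=4):
        """Return {axis: score} for axes at or above an assumed threshold."""
        return {f.name: getattr(self, f.name)
                for f in fields(self)
                if getattr(self, f.name) >= threshold}


# Axis values quoted for Salicylic Acid later on this page;
# axes the page does not mention are assumed to be 0.
salicylic_acid = AxisScores(irritation=7, dart=4, endocrine_disruption=4)
print(salicylic_acid.elevated())
```

Keeping the axes as separate fields, rather than a single number, is what lets a caller ask which axes are elevated.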

Carcinogenicity

What it measures

Evidence of cancer-causing potential from chronic or repeated exposure.

Primary sources

IARC Monographs (Group 1, 2A, 2B classifications), ECHA CLP Annex VI (H350, H351 hazard statements), NTP Report on Carcinogens.

Scoring approach

IARC Group 1 (known carcinogen) maps to the highest severity range. Group 2A (probable) and 2B (possible) map progressively lower. When multiple authoritative sources provide different classifications for the same ingredient, the highest score is retained — we never downgrade a carcinogenicity assessment. Any ingredient classified as a known carcinogen receives a composite score floor regardless of how clean its other five axes are.
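
The "highest classification wins" rule and the composite floor can be sketched as follows. The numeric severities and the floor value are placeholders chosen for illustration; the page states only the ordering (Group 1 above 2A above 2B), not the actual numbers.

```python
# Hypothetical severity values; only their ordering comes from the text.
IARC_SEVERITY = {"1": 10, "2A": 7, "2B": 4}

# Hypothetical composite floor for known carcinogens.
KNOWN_CARCINOGEN_FLOOR = 5.0


def carcinogenicity_score(classifications):
    """Keep the highest severity across sources; 0 if nothing is classified."""
    return max((IARC_SEVERITY.get(c, 0) for c in classifications), default=0)


def apply_carcinogen_floor(composite, carc_score):
    """A known carcinogen cannot fall below the floor, whatever the other axes say."""
    if carc_score >= IARC_SEVERITY["1"]:
        return max(composite, KNOWN_CARCINOGEN_FLOOR)
    return composite


# Two sources disagree: the more severe classification is retained.
score = carcinogenicity_score(["2B", "1"])
print(score, apply_carcinogen_floor(1.2, score))
```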

Developmental & Reproductive Toxicity (DART)

What it measures

Risk of harm to fertility, fetal development, or lactation.

Primary sources

ECHA CLP Annex VI (H360, H361, H362 hazard statements), SCCS opinions on specific cosmetic ingredients.

Scoring approach

H360 (may damage fertility or the unborn child) maps to the highest severity. H361 (suspected) maps to moderate severity. H362 (may cause harm to breastfed children) is scored based on exposure route relevance to cosmetic use.

Sensitization

What it measures

Potential to cause allergic contact dermatitis on repeated exposure.

Primary sources

ECHA CLP (H317 — skin sensitizer classification), NACDG (North American Contact Dermatitis Group) patch test prevalence data, CIR safety assessments.

Scoring approach

The ECHA H317 classification provides a regulatory baseline. NACDG prevalence rates add clinical weight — a sensitizer that causes positive patch test reactions in 5% of patients is scored higher than one affecting 0.3%. Strong sensitizers receive a composite score floor to prevent other clean axes from masking the risk.
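
One way to combine a regulatory baseline with clinical prevalence is sketched below. The baseline of 5 and the "one point per percentage point" weighting are assumptions made for illustration; the page does not publish the actual weights.

```python
def sensitization_score(has_h317, patch_test_prevalence_pct=None):
    """Combine an H317 baseline with patch test prevalence (illustrative weights)."""
    # Hypothetical baseline for a regulatory H317 classification.
    score = 5.0 if has_h317 else 0.0
    if patch_test_prevalence_pct is not None:
        # Hypothetical weighting: +1 point per percentage point of positive reactions.
        score += patch_test_prevalence_pct
    # Stay on the 0-10 axis scale.
    return min(score, 10.0)


# A sensitizer with 5% positive reactions outranks one with 0.3%.
print(sensitization_score(True, 5.0) > sensitization_score(True, 0.3))
```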

Systemic Toxicity

What it measures

Potential for organ damage from single or repeated exposure via dermal, oral, or inhalation routes.

Primary sources

ECHA CLP (H370, H371 for single exposure; H372, H373 for repeated exposure), SCCS safety opinions.

Scoring approach

"Causes organ damage" classifications map to high severity. "May cause organ damage" maps lower. Repeated-exposure classifications are scored based on exposure route relevance — an inhalation-only hazard is less relevant for a topical cream than for a spray product.

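Exposure route relevance could be modelled as a simple set intersection, as in this sketch. The product-to-route mapping and the 0.5 discount are hypothetical; the page says only that an inhalation-only hazard is less relevant for a topical cream than for a spray.

```python
# Hypothetical mapping from product format to the exposure routes it involves.
PRODUCT_ROUTES = {
    "topical_cream": {"dermal"},
    "spray": {"dermal", "inhalation"},
}


def route_relevant(hazard_routes, product_type):
    """True if the hazard's routes overlap with the product's exposure routes."""
    return bool(set(hazard_routes) & PRODUCT_ROUTES[product_type])


def systemic_score(base_score, hazard_routes, product_type, discount=0.5):
    """Scale down a hazard whose routes do not match the product (assumed discount)."""
    if route_relevant(hazard_routes, product_type):
        return base_score
    return base_score * discount


print(systemic_score(8, {"inhalation"}, "spray"))
print(systemic_score(8, {"inhalation"}, "topical_cream"))
```
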
Irritation

What it measures

Potential to cause non-allergic skin or eye irritation on contact.

Primary sources

CIR safety assessments (clinical patch test data), Zein Number dissolution data (for surfactants specifically), ECHA CLP (H315 skin irritation, H319 eye irritation).

Scoring approach

For surfactants, the Zein Number provides a quantitative protein denaturation measure that correlates directly with skin irritation potential — this is more predictive than binary classification. CIR clinical data provides human-relevant irritation thresholds. High irritation scores receive a modest composite floor to ensure highly irritating ingredients are flagged even when all other axes are clean.

Endocrine Disruption

What it measures

Potential to interfere with hormonal systems (estrogen, androgen, thyroid, steroidogenesis pathways).

Primary sources

ECHA Endocrine Disruptor assessment list, REACH SVHC (Substances of Very High Concern) identifications with ED concern, EU Community Rolling Action Plan (CoRAP) evaluations.

Scoring approach

Confirmed endocrine disruptors on the ECHA list score highest. Ingredients under active assessment score at moderate severity. SVHC identification with endocrine disruption concern adds weight to the score.

Why multi-axis scoring matters

A single composite score can mask critical safety signals. Two ingredients can land at the same composite value with completely different underlying hazard profiles. Our system always shows both the composite and the individual axis breakdown — so the formulator decides what matters for their product.

Same score, different story

Benzophenone-3 (Oxybenzone)

Composite 1.50 · GREEN · HIGH confidence

Phenoxyethanol

Composite 1.50 · GREEN · HIGH confidence

Benzophenone-3 and Phenoxyethanol both carry a composite score of 1.50 — both in the GREEN band. But the underlying hazard profiles are entirely different.

Benzophenone-3 shows signals across carcinogenicity (2), developmental toxicity (3), and endocrine disruption (4) — a broad, low-level concern pattern driven primarily by its endocrine activity.

Phenoxyethanol concentrates its risk in irritation (7) and endocrine disruption (4), with a minor DART signal (2). For a product applied near the eyes or on compromised skin, this irritation spike matters.

A single composite score treats these as identical. Our 6-axis system shows they are not.
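
The comparison above can be reproduced with a small axis-by-axis diff. The composites and axis values are the ones quoted in the text; treating unmentioned axes as 0 is an assumption, and the dictionary layout is illustrative.

```python
AXES = ["carcinogenicity", "dart", "sensitization",
        "systemic_toxicity", "irritation", "endocrine_disruption"]

# Values quoted above; axes not mentioned in the text are assumed to be 0.
benzophenone_3 = {"composite": 1.50, "carcinogenicity": 2,
                  "dart": 3, "endocrine_disruption": 4}
phenoxyethanol = {"composite": 1.50, "irritation": 7,
                  "endocrine_disruption": 4, "dart": 2}


def axis_differences(a, b):
    """Return {axis: (score_a, score_b)} for every axis where the profiles differ."""
    return {axis: (a.get(axis, 0), b.get(axis, 0))
            for axis in AXES
            if a.get(axis, 0) != b.get(axis, 0)}


diff = axis_differences(benzophenone_3, phenoxyethanol)
print(benzophenone_3["composite"] == phenoxyethanol["composite"])
print(diff)
```

The composites match, yet three axes differ, which is the information a single number discards.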

Green doesn't mean zero concern

Salicylic Acid

Composite 1.90 · GREEN · HIGH confidence

Salicylic Acid scores 1.90 — comfortably in the GREEN band. But the radar chart reveals three elevated axes: irritation at 7 (classified H318 — causes serious eye damage), DART at 4 (developmental/reproductive toxicity signal), and endocrine disruption at 4.

For a general-purpose exfoliant in a rinse-off cleanser, this profile is acceptable. For a leave-on product targeted at pregnant consumers, the DART signal at 4 is information a formulator needs to see, and a composite score of 1.90 on its own would never surface it.

We don't make the decision. We surface the data. The formulator decides.

Data-driven ingredient substitution

Propylene Glycol

Composite 0.80 · GREEN · HIGH confidence

Propanediol

Composite 0.40 · GREEN · HIGH confidence

Propylene Glycol and Propanediol are functionally interchangeable humectants and solvent carriers. Both are GREEN. But Propylene Glycol carries an irritation score of 4 — a moderate signal from ECHA CLP classification — while Propanediol shows no irritation concern.

For sensitive skin formulations, this difference matters. Our system makes it visible so the substitution decision is informed by data, not marketing.
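
A substitution lookup of this kind might rank functionally interchangeable candidates on the axis of concern, as sketched here. The values are those quoted above (Propanediol's "no irritation concern" is taken as 0); the function name and record layout are hypothetical, and deciding which ingredients are interchangeable is left to the formulator.

```python
candidates = [
    {"inci": "Propylene Glycol", "composite": 0.80, "irritation": 4},
    {"inci": "Propanediol", "composite": 0.40, "irritation": 0},
]


def rank_by_axis(options, axis):
    """Sort interchangeable ingredients by one axis, breaking ties on composite."""
    return sorted(options, key=lambda o: (o.get(axis, 0), o["composite"]))


best = rank_by_axis(candidates, "irritation")[0]
print(best["inci"])  # Propanediol
```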

Scores are derived from 11 authoritative sources, including ECHA CLP Annex VI, IARC/NTP carcinogen classifications, CIR safety assessments, ESSCA clinical patch test data, and EPA CompTox predictions. The full methodology and source hierarchy are published on this page.

Five principles that guide our scoring

1. No data gap penalty
If an ingredient has no studies addressing a particular hazard axis, that axis scores 0, not an elevated score. Absence of evidence is not evidence of harm. This is the single largest methodological difference between our system and EWG Skin Deep, which inflates scores for under-studied ingredients. A novel botanical extract with limited toxicology literature should not receive a higher hazard score than a well-studied petrochemical with confirmed clean safety data.

2. Exposure context matters
The same ingredient at the same concentration poses different risks in a leave-on facial serum (hours of skin contact) versus a rinse-off shampoo (30 seconds of contact). Our scoring applies an exposure modifier based on product type. This is standard practice in regulatory toxicology (SCCS and CIR both evaluate safety in the context of exposure) but is absent from consumer-facing databases.

3. Source hierarchy
Not all safety data is equal. Peer-reviewed regulatory assessments (ECHA CLP, SCCS opinions, CIR final reports) carry more weight than preliminary findings or supplier-provided data. Our system enforces a clear hierarchy: regulatory authority classifications first, then peer-reviewed clinical data, then curated literature. Scores from higher-tier sources are never downgraded by lower-tier data.

4. Confidence tiering
Every safety score carries a confidence level: high (direct ECHA, CIR, or IARC classification), medium (extracted from peer-reviewed literature or CIR group assessments), or low (baseline from COSING registration with no hazard evidence found). Only medium and high confidence scores are displayed to users. Low confidence ingredients show "No hazard data available in our sources" rather than a green score that implies safety has been confirmed.

5. Conservative on carcinogenicity
Carcinogenicity is treated asymmetrically. A known carcinogen classified by IARC or NTP cannot have its composite score pulled below a floor by clean scores on other axes. When multiple sources provide different carcinogenicity classifications, the highest (most conservative) score is retained. This reflects the irreversible nature of carcinogenic harm compared to reversible irritation.
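
The no-data-gap, exposure-context, and confidence-tiering principles above can be sketched in a few lines. The function names and the exposure multipliers are hypothetical; the page states that a modifier exists but not its values. Taking the maximum of the available evidence is likewise an assumption for axes other than carcinogenicity.

```python
NO_DATA_MESSAGE = "No hazard data available in our sources"

# Hypothetical multipliers; the actual modifier values are not published here.
EXPOSURE_MODIFIER = {"leave_on": 1.0, "rinse_off": 0.5}


def axis_score(evidence_scores):
    """No data gap penalty: an axis with no evidence scores 0, never an inflated value."""
    return max(evidence_scores, default=0)


def exposure_adjusted(score, product_type):
    """Apply the product-type exposure modifier to an axis score."""
    return score * EXPOSURE_MODIFIER[product_type]


def display_value(composite, confidence):
    """Show a number only for medium or high confidence; otherwise say data is missing."""
    if confidence in ("high", "medium"):
        return f"{composite:.2f}"
    return NO_DATA_MESSAGE


print(axis_score([]))                      # 0
print(exposure_adjusted(4, "rinse_off"))   # 2.0
print(display_value(0.0, "low"))
```

Returning a message instead of a number for low-confidence ingredients keeps "no data found" visibly distinct from "confirmed low hazard".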

Where our data comes from

Source	What we extract	Coverage
ECHA CLP Annex VI	H-statement hazard classifications mapped to 6 axes	3,831 cosmetic-relevant ingredients
CIR (Cosmetic Ingredient Review)	Safety assessment conclusions, clinical patch test data	2,654 assessment reports
IARC Monographs	Carcinogen group classifications (1, 2A, 2B, 3)	811 substances classified
NTP Report on Carcinogens	Known and reasonably anticipated carcinogens	256 substances
NACDG Patch Test Data	Sensitization prevalence rates (% positive reactions)	Top allergens with prevalence data
SCCS Scientific Opinions	EU-specific safety evaluations for cosmetic ingredients	Referenced per ingredient
PubMed / PMC	Hazard signals extracted from peer-reviewed literature	28,800+ papers mined
REACH SVHC List	Substances of very high concern (ED, CMR, PBT flags)	213 SVHC-flagged ingredients
COSING (EU)	INCI identity, Annex II–VI regulatory status	Baseline identity for all EU-listed ingredients
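
As an illustration of the "H-statement hazard classifications mapped to 6 axes" row, the sketch below routes the H-statements named on this page to their axes. The lookup covers only codes this page mentions; the real mapping table is presumably larger, and the function name is hypothetical.

```python
# H-statements named on this page, grouped by the axis they are cited under.
H_STATEMENT_AXIS = {
    "H350": "carcinogenicity", "H351": "carcinogenicity",
    "H360": "dart", "H361": "dart", "H362": "dart",
    "H317": "sensitization",
    "H370": "systemic_toxicity", "H371": "systemic_toxicity",
    "H372": "systemic_toxicity", "H373": "systemic_toxicity",
    "H315": "irritation", "H318": "irritation", "H319": "irritation",
}


def group_by_axis(h_statements):
    """Group an ingredient's H-statements under the hazard axis each one feeds."""
    grouped = {}
    for code in h_statements:
        axis = H_STATEMENT_AXIS.get(code)
        if axis is not None:  # codes outside the table are ignored in this sketch
            grouped.setdefault(axis, []).append(code)
    return grouped


print(group_by_axis(["H317", "H319", "H302"]))
```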

All extraction pipelines run automatically. Regulatory source pages across 16 markets are monitored daily for changes. When a regulatory authority updates an ingredient classification or restriction, the change is detected via content hashing, logged as a structured event, and propagated to affected safety scores without manual intervention.
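
Change detection via content hashing can be sketched as follows, assuming the page text has already been fetched and normalised. The event fields and function names are illustrative, not the production schema; SHA-256 is one reasonable hash choice rather than a confirmed one.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(page_text):
    """Hash the page content so any change produces a different digest."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def detect_change(source_id, page_text, known_hashes):
    """Return a structured change event if the page differs from the last seen hash."""
    new_hash = content_hash(page_text)
    old_hash = known_hashes.get(source_id)
    if new_hash == old_hash:
        return None
    known_hashes[source_id] = new_hash
    return {
        "source_id": source_id,
        "old_hash": old_hash,  # None on the first sighting of a source
        "new_hash": new_hash,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }


hashes = {}
first = detect_change("example_source", "version 1", hashes)
unchanged = detect_change("example_source", "version 1", hashes)
second = detect_change("example_source", "version 2", hashes)
print(first is not None, unchanged is None, second is not None)
```

In this sketch the first sighting of a source is reported as a change; a real pipeline might treat it as a baseline instead.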

Safety scores are only half the picture

An ingredient can score well on all six hazard axes and still be prohibited in your target market. Formaldehyde releasers are a clear example — permitted with concentration limits in some markets, banned outright in others. Safety scoring without market-specific regulatory status is incomplete.

Every ingredient in our system carries per-market regulatory status across 16 markets: EU, US (FDA), China (NMPA IECIC), Japan (MHLW), South Korea (MFDS), India (BIS IS 4707), Canada, Australia (TGA), Brazil (ANVISA), Thailand, Malaysia (NPRA), Singapore (HSA), Indonesia, Vietnam, Philippines, and ASEAN harmonised standards. Status categories include: permitted, restricted (with maximum concentrations, required labelling, and product-type limitations), prohibited, and not listed.
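
A per-market status lookup might look like the sketch below. The four status categories come from the text; the ingredient record, the concentration limit, and the function name are entirely hypothetical, and treating "not listed" as non-blocking is a simplifying assumption.

```python
from enum import Enum


class Status(Enum):
    PERMITTED = "permitted"
    RESTRICTED = "restricted"
    PROHIBITED = "prohibited"
    NOT_LISTED = "not listed"


# Entirely hypothetical ingredient record, for illustration only.
example_status = {
    "EU": {"status": Status.RESTRICTED, "max_concentration_pct": 1.0},
    "China": {"status": Status.PROHIBITED},
    "Japan": {"status": Status.PERMITTED},
}


def blocking_markets(status_by_market, target_markets, concentration_pct):
    """List target markets where the ingredient cannot be used at this concentration."""
    blocked = []
    for market in target_markets:
        entry = status_by_market.get(market, {"status": Status.NOT_LISTED})
        status = entry["status"]
        if status is Status.PROHIBITED:
            blocked.append(market)
        elif status is Status.RESTRICTED:
            limit = entry.get("max_concentration_pct")
            if limit is not None and concentration_pct > limit:
                blocked.append(market)
    return blocked


print(blocking_markets(example_status, ["EU", "China", "Japan"], 2.0))
```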

What we deliberately exclude

No certification claims. We do not recommend or imply COSMOS, Ecocert, Vegan, Halal, or any other certification status. Certifications are granted to finished products by certifying bodies based on the full supply chain — they cannot be determined from an ingredient's INCI name alone. Any platform that tells you "this ingredient is COSMOS-approved" based solely on identity data is misleading you.

No trade names. Every ingredient is identified by its INCI name only. We do not surface supplier trade names, branded ingredient names, or proprietary blend names in any output. This ensures supplier neutrality — our recommendations are based on chemistry, not commercial relationships.

No EWG or SkinSafe data. We do not use, reference, or derive scores from EWG Skin Deep or SkinSafe databases. Our scoring is built entirely from primary regulatory and scientific sources. This is a deliberate architectural decision, not an oversight.

No "clean beauty" judgments. We provide hazard data and regulatory status. We do not label ingredients as "clean," "toxic," "natural," or "synthetic" — these are marketing categories, not scientific ones. Formulators make their own informed decisions based on the data we provide.

This methodology is not static. As regulatory authorities update classifications, as new clinical data enters the peer-reviewed literature, and as our automated pipelines expand coverage, scores are refined. We publish this methodology because transparency is a prerequisite for trust — and trust is what separates a tool formulators rely on from one they dismiss.

Last updated: April 2026