Is the base image in the FROM line of your Dockerfile safe to use? Pulling nginx, python, grafana, or any other image from Docker Hub with the latest tag doesn't mean it's patched, current, or secure. A scan of a standard nginx image returns 300+ CVEs. Most are vulnerabilities in OS utilities unrelated to serving HTTP traffic, but not all of them can be ignored.
Some vulnerabilities are enough to compromise your application even if your code and infrastructure are flawless. CVE-2025-27363, an out-of-bounds write in libfreetype6 that can lead to arbitrary code execution, was present in the latest official nginx images at the time the vulnerability became known, and patched base layers took time to reach Docker Hub. With an EPSS score of 0.65 and confirmed active exploitation in the CISA KEV catalog, this is not a theoretical risk. It arrived silently, with nothing in the nginx release notes and no changes to the image you were already running.
Even well-maintained images lag. nginx:latest has carried OpenSSL vulnerabilities for months at a time: the upstream fix exists, but the Docker Hub layer takes time to catch up. Teams pulling latest and following best practices were still deploying the vulnerability throughout that window.
The interpretation problem
A scanner gives you a long list of vulnerabilities that is hard to interpret and act on:
- Which image is genuinely dangerous, and which is safe to use?
- Which vulnerabilities deserve attention first?
- Of hundreds of CVEs, which actually matter for your deployment?
- How do you justify to an auditor or CTO why you're ignoring 298 out of 300 CVEs?
Solution: a systematic approach
This article covers the first step: reducing scanner noise through automated prioritization based on severity, exploitation probability, and image context, turning a raw CVE list into a manageable set that actually requires engineering attention.
Automated classification based on image metadata sits between two extremes: raw severity filtering, which ignores whether a vulnerable package has any role in the container, and reachability analysis, which knows exactly what code executes and what it touches. The metadata approach gets you most of the way there, and how far depends on how much context you give the model about the image.
The approach prioritizes vulnerabilities based on:
- Severity, CVSS score, exploitation probability (EPSS), and confirmed active exploitation (CISA KEV)
- Relevance to the specific image's function
In follow-up articles, we'll enrich the results with runtime context (which libraries are loaded, which functions are called) and show whether vulnerabilities can be triggered given your specific configuration.
After completing the full workflow, you'll be able to say with confidence:
- ✓ Which CVEs are actually dangerous for your configuration
- ✓ Whether the image needs an immediate update or can wait
- ✓ Whether to switch to an alternative base image
- ✓ Which risks are acceptable and which require immediate action
- ✓ How to justify those decisions to a security auditor or management
The full workflow can be automated and integrated into a CI/CD pipeline. This series covers how to do that.
The Algorithm
Run a scanner against the image to get a complete list of CVEs, including transitive dependencies you might not know are in the image.
For each CVE, add two signals beyond CVSS:
- EPSS: probability of exploitation in the wild within the next 30 days, updated daily by FIRST.org
- CISA KEV: catalog of CVEs with confirmed active exploitation maintained by the US government
CVSS measures theoretical severity. EPSS and KEV tell you whether exploitation is actually happening.
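Both signals come from public feeds. A minimal sketch of the lookups in Python, using the FIRST.org EPSS API and the CISA KEV JSON feed (URLs current at the time of writing; batching and error handling kept minimal):

```python
import requests

EPSS_API = "https://api.first.org/data/v1/epss"
KEV_FEED = ("https://www.cisa.gov/sites/default/files/feeds/"
            "known_exploited_vulnerabilities.json")

def load_kev() -> set[str]:
    """CVE IDs with confirmed active exploitation (CISA KEV catalog)."""
    feed = requests.get(KEV_FEED, timeout=30).json()
    return {v["cveID"] for v in feed["vulnerabilities"]}

def epss_scores(cve_ids: list[str]) -> dict[str, float]:
    """30-day exploitation probability per CVE, from the FIRST.org EPSS API."""
    scores: dict[str, float] = {}
    for i in range(0, len(cve_ids), 100):  # the API accepts comma-separated batches
        batch = ",".join(cve_ids[i:i + 100])
        data = requests.get(EPSS_API, params={"cve": batch}, timeout=30).json()
        scores.update({row["cve"]: float(row["epss"]) for row in data["data"]})
    return scores
```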
Assign each CVE to a bucket based on how the vulnerable package relates to what the container does (the examples below use categories such as Directly Exposed and Not Applicable).
Categorization is done by an LLM using the CVE description, package type, and image purpose. Runtime tracing in later steps validates these decisions against the actual container.
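The exact prompt isn't reproduced here; a minimal sketch shows the shape of the call. The client and model name are placeholders, and the category labels mirror the ones that appear in the tables below:

```python
from openai import OpenAI  # placeholder: any LLM client works the same way

client = OpenAI()

PROMPT = """Image: {image} (purpose: {purpose}, entrypoint: {entrypoint})
CVE: {cve_id} in package {pkg} ({pkg_type})
Description: {description}

Does this vulnerability matter for what this container actually does?
Answer with exactly one category: Directly Exposed, Not Applicable, or Unknown."""

def classify(image_meta: dict, cve: dict) -> str:
    """Ask the model how the vulnerable package relates to the image's function."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(**image_meta, **cve)}],
    )
    return resp.choices[0].message.content.strip()
```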
Setup
We use Trivy as the scanner: the most widely adopted open-source container scanner, with over 34k stars on GitHub. For the experiment we picked three images with different base distributions and workload types:
| Image | Base | Type |
|---|---|---|
| node:20-alpine | Alpine | Runtime |
| nginx:1.25 | Debian | Web server |
| grafana/grafana:10.0.0 | Ubuntu | Application |
These are pinned versions used during testing. The algorithm applies to any image. Pinning versions ensures the results in this article are reproducible.
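Assuming Trivy is installed locally, a small wrapper is enough to get a flat CVE list per image; the nesting (Results[].Vulnerabilities[]) is Trivy's standard JSON report layout:

```python
import json
import subprocess

IMAGES = ["node:20-alpine", "nginx:1.25", "grafana/grafana:10.0.0"]

def scan(image: str) -> list[dict]:
    """Run Trivy on an image and return its findings as a flat list."""
    out = subprocess.run(
        ["trivy", "image", "--format", "json", image],
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(out)
    # Trivy groups findings per target (OS packages, each lockfile, ...);
    # flatten them into one list of vulnerability dicts.
    return [v for result in report.get("Results", [])
            for v in result.get("Vulnerabilities", [])]
```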
Categorization in Practice
Each CVE ends in one of two outcomes, PROCESS or IGNORE; CLASSIFY is the intermediate step that hands the decision to the model. The decision tree:
- KEV listed or EPSS > 0.1 → PROCESS
- Severity LOW → IGNORE
- Severity MEDIUM / HIGH / CRITICAL → CLASSIFY
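A minimal sketch of this triage rule, assuming EPSS scores and the KEV set have been fetched as shown above:

```python
def triage(cve_id: str, severity: str,
           epss: dict[str, float], kev: set[str]) -> str:
    """Route one CVE through the decision tree above."""
    if cve_id in kev or epss.get(cve_id, 0.0) > 0.1:
        return "PROCESS"   # confirmed or likely exploitation: never ignore
    if severity == "LOW":
        return "IGNORE"    # low severity, no exploitation signal
    return "CLASSIFY"      # MEDIUM/HIGH/CRITICAL: hand off to the LLM
```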
How prompt formulation affects results
We ran the same dataset through three prompt versions. The difference between 28% and 45% noise reduction comes entirely from how the rules are formulated (each cell shows the share of CVEs ignored, with the count in parentheses):
| Prompt version | node (15) | nginx (363) | grafana (293) | total (671) |
|---|---|---|---|---|
| v1 (no image context) | 33% (5) | 56% (205) | 23% (68) | 41% (278) |
| v2 (image context only) | 20% (3) | 46% (167) | 6% (18) | 28% (188) |
| v3 (context + heuristics) | 27% (4) | 70% (253) | 15% (44) | 45% (301) |
The model's accuracy is directly tied to the context in the prompt: image purpose, entrypoint, package type. Without that context (as the v1 results show), the model has little to anchor a decision on and defaults to UNKNOWN. More runtime context (loaded libraries, actual call graph) would narrow this further; that's what Stages 2 and 3 cover.
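For illustration only (the actual v3 rules aren't reproduced here), the heuristics added in v3 might look like extra instructions appended to the PROMPT sketched earlier:

```python
# Illustrative only: the article's actual v3 heuristics are not published here.
V3_HEURISTICS = """Heuristics:
- Build-time-only packages (compilers, package managers) in a runtime image
  lean toward Not Applicable.
- Libraries on the network-facing path (TLS, HTTP parsing) of a web server
  image lean toward Directly Exposed.
- If the package's role in this image cannot be determined, answer Unknown
  rather than guessing."""

prompt = PROMPT + "\n" + V3_HEURISTICS  # appended to the classification prompt
```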
Applying the Algorithm
Step 1: Scan
| Image | Critical | High | Medium | Low | Unknown | Total |
|---|---|---|---|---|---|---|
| node:20-alpine | 0 | 12 | 1 | 2 | 0 | 15 |
| nginx:1.25 | 16 | 61 | 125 | 156 | 5 | 363 |
| grafana/grafana:10.0.0 | 15 | 70 | 190 | 18 | 0 | 293 |
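Using the scan() helper sketched in Setup, the table reduces to a severity tally. The counts shown are from the article's scan and will drift as the vulnerability databases update:

```python
from collections import Counter

def severity_counts(vulns: list[dict]) -> Counter:
    """Tally findings by Trivy's Severity field."""
    return Counter(v.get("Severity", "UNKNOWN") for v in vulns)

# At the time of the article's scan:
# severity_counts(scan("nginx:1.25"))
# -> Counter({'LOW': 156, 'MEDIUM': 125, 'HIGH': 61, 'CRITICAL': 16, 'UNKNOWN': 5})
```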
Step 2: Enrich
Across the three images, 3 entries match KEV (2 unique CVEs):
| CVE | Severity | Package | Image |
|---|---|---|---|
| CVE-2025-27363 | HIGH | libfreetype6 | nginx |
| CVE-2023-44487 | HIGH | nghttp2-libs | grafana |
| CVE-2023-44487 | MEDIUM | golang.org/x/net | grafana |
CVE-2023-44487 is the HTTP/2 Rapid Reset Attack, one of the most widely exploited vulnerabilities of 2023. It appears twice in grafana because two separate packages are affected.
Step 3: Categorize
LOW-severity CVEs without a KEV flag are ignored by default (176 CVEs), and another 176 go straight to PROCESS via the KEV/EPSS rule. The remaining 319 CVEs went through LLM classification: 125 were assigned Not Applicable and ignored. Three examples:
| Image | CVE | Sev | EPSS | KEV | Category | Decision |
|---|---|---|---|---|---|---|
| nginx | CVE-2025-27363 | HIGH | 0.65 | ✓ | KEV | PROCESS |
| node | CVE-2024-21538 | HIGH | 0.00 | — | Directly Exposed | PROCESS |
| nginx | CVE-2024-2398 | HIGH | 0.02 | — | Not Applicable | IGNORE |
Results
| Image | Before | Process | Ignore | Noise reduction |
|---|---|---|---|---|
| node:20-alpine | 15 | 11 | 4 | 27% |
| nginx:1.25 | 363 | 110 | 253 | 70% |
| grafana/grafana:10.0.0 | 293 | 249 | 44 | 15% |
Where static context runs out
CVE-2026-23950 is a race condition in the npm tar package. The model flagged it Not Applicable: “the library is not loaded at runtime.” For a production container running node app.js that reasoning holds: npm never runs, tar never loads. But the CVE description offered a stronger argument the model ignored: the vulnerability only triggers on case-insensitive filesystems such as macOS APFS, and Alpine uses ext4. Without that reasoning, the classification breaks for any container where npm install runs at runtime. This points to prompt design: explicit instructions to reason about OS and filesystem context would produce a more reliable result.
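As a hypothetical example, such an instruction could be appended to the classification prompt, with {distro} and {filesystem} filled in from image metadata:

```python
# Hypothetical prompt addition motivated by the tar example above.
PRECONDITION_RULE = """Before classifying, check whether the CVE's trigger
conditions (OS, filesystem, architecture) can hold in this image, which runs
{distro} on {filesystem}. If a precondition cannot hold, answer Not Applicable
and name the precondition that fails."""
```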
Conclusion
Across three images, 671 CVEs were triaged: 301 filtered out without manual review, 370 flagged for attention. Noise reduction ranges from 15% to 70% depending on the image type and how conservatively the prompt is tuned.
The approach works best on images with a known, fixed purpose. For generic base images without application context, Not Applicable is harder to assign and more CVEs default to PROCESS.
The more context the prompt has (image purpose, entrypoint, package type), the more accurate the classifications and the fewer CVEs end up in Unknown. The three prompt versions above show the range: 28% to 45% noise reduction from the same dataset, purely from rule differences. This approach is better than raw severity filtering and worse than reachability analysis: it reasons from metadata, not from what the container actually loads. The tar example above shows where that matters. Treat the results as a first pass: useful for cutting the queue, but the prompt needs tuning for your specific infrastructure, and the calls it makes on ambiguous cases need verification. Stages 2 and 3 provide that.