Exposed Inference Infrastructure as a Threat Surface

Abstract

Self-hosted LLM inference frameworks (Ollama, llama.cpp’s HTTP server, and LocalAI) ship with management APIs that are unauthenticated by default, and a substantial population of operator deployments expose those APIs to the public internet. This paper characterises this population at scale (11,732 candidate hosts; 4,398 operationally active; 104 countries; 895 ASNs) and develops the security implications along three axes: architectural exposure, vulnerability exposure, and capability sufficiency for autonomous LLM-orchestrated campaigns of the LameHug/PromptLock class. Catalogued CVE exposure is modest in aggregate (16.8% of versioned Ollama hosts affected by the highest-prevalence CVE) and concentrated in a missing-authentication advisory (CVE-2025-63389) that formalises the framework’s default behaviour rather than introducing it. A separate Modelfile injection surface, examined in detail, permits persistent context modification via documented framework features that no patch addresses and no external observer can detect. Of the 4,398 active hosts, 1,002 (22.8%) host at least one model meeting a conservative published-benchmark capability threshold for attack-relevant code generation; 357 host highly-capable Tier 1 models; 151 host capability-equivalent variants with refusal training deliberately removed. The capable subset spans 63 countries and 330 ASNs, sufficient to support a fungible-inference architecture under any reasonable redundancy assumption. The structural conclusion is a defender-side reframing: the population of internet-reachable capable inference servers should be maintained as an indicator-of-compromise feed, and outbound connections from defended networks to any listed endpoint should be treated as indicators of compromise by default, irrespective of the listed server’s operator or apparent status, and irrespective of which CVEs may or may not apply to it.

1. Introduction

Self-hosted large-language-model inference has moved from a research curiosity to a routine production deployment pattern over a span of roughly eighteen months. Tooling that lowers the operational barrier to running an open-weight model locally (Ollama, llama.cpp’s HTTP server, LocalAI) has been adopted broadly enough that a substantial population of inference servers is now reachable from the public internet. Prior measurement studies have documented the scale of this population. Cisco Talos [Cisco Talos 2025] reported approximately 1,100 Shodan-indexed Ollama servers with around 20% running known-vulnerable versions; FuzzingLabs [FuzzingLabs 2025] surfaced parallel findings on credential and configuration exposure. These studies establish the existence of the problem. They do not fully characterise its shape: what fraction of the population is operationally useful to an adversary; what the architectural relationship is between exposure and patch state; what an organisation actually ought to do when an exposed inference server is discovered inside its perimeter.

Two separate threads of recent work supply the other half of the framing this paper relies on. CyberSecEval 3 [Wan et al. 2024] and MITRE OCCULT [MITRE 2025] establish that contemporary open-weight models in the 7B-and-larger code-specialised tier and the 30B-and-larger general-purpose tier produce attack-relevant code at quantitatively measured competence levels. DeepSeek-R1 exceeds 90% on MITRE’s TACTL offensive-cyber benchmark, and CyberSecEval reports comparable capability across the Llama, Qwen, and Mixtral families. The LameHug malware disclosed by CERT-UA [CERT-UA 2025] and the PromptLock proof-of-concept by Raz et al. [Raz et al. 2025] demonstrate that this capability has already been weaponised in implant architectures that delegate malicious-code synthesis to an LLM endpoint at runtime. The implant in PromptLock’s design is model-agnostic and inference-fungible: it requires only an Ollama-API-compatible endpoint serving any model in the capability range demonstrated by the benchmark literature.

The empirical question this paper addresses sits at the intersection of these threads. Given a population of internet-exposed inference servers whose existence has been documented, and given a class of autonomous-campaign architectures whose viability has been demonstrated, is the exposed population large enough, capable enough, and diverse enough to constitute a sufficient inference substrate for those campaigns? And, derivatively: what should defenders do once the answer is affirmative?

The data collection performed for this work captured 11,732 candidate hosts across the three frameworks named above, of which 4,398 were retained as operationally active and free of obvious research-artefact tagging. Within that working set, 1,002 hosts (22.8%) host at least one model meeting a conservative capability threshold inherited from published benchmarks: code-specialised at 7B parameters or larger, or general-purpose at 30B parameters or larger. 357 hosts host models in a higher-capability “Tier 1” range. The capable subset spans 63 countries and 330 distinct ASNs, with no single jurisdiction or network operator hosting more than 27% of it. By the redundancy mathematics summarised in §3.4, this is a population from which an autonomous-campaign architecture can source inference at an effective availability above 0.999 under any reasonable independence assumption. The inference substrate is sufficient; the population is not the bottleneck.
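The redundancy argument can be illustrated with a minimal sketch. It assumes, as a simplification that is not taken from §3.4, that each of k candidate endpoints is reachable independently with the same probability p, so that at least one is reachable with probability 1 - (1 - p)^k. The function names and the example values of p are hypothetical.

```python
def effective_availability(p: float, k: int) -> float:
    """Probability that at least one of k independent endpoints is reachable.

    Assumption: every endpoint is up with the same probability p,
    independently of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


def endpoints_needed(p: float, target: float) -> int:
    """Smallest k whose effective availability reaches the target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    k = 0
    while effective_availability(p, k) < target:
        k += 1
    return k


if __name__ == "__main__":
    # Even a pessimistic per-host availability needs few endpoints.
    print(endpoints_needed(0.5, 0.999))
```

Under these assumptions, a per-host availability as low as 0.5 would require only ten endpoints to pass 0.999, which is small relative to the 1,002 capable hosts reported above.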

The architectural finding that follows from §2 is more consequential than this capability-sufficiency result on its own. Every host in the 4,398-valid population runs a framework whose management API is reachable without authentication in its shipped default configuration. Ollama, in particular, has no native authentication mechanism for its local HTTP server [Ollama Documentation, Authentication]; operators who require authenticated remote access must front the server with external infrastructure. CVE-2025-63389 formalises the unauthenticated-management-API condition as a vulnerability rather than a configuration choice, but the architectural condition predates the CVE and persists after the patched releases. The framework’s defaults have not changed. The Modelfile injection surface described in §2.6 compounds this: an unauthenticated client can rewrite the persistent context fields (SYSTEM, TEMPLATE, MESSAGE, PARAMETER) of any locally-served model while leaving the model’s content digest unchanged, producing a class of compromise that no patch addresses and no externally-observable signature reveals.

This paper’s contributions are four:

A population-scale measurement of exposed LLM inference combining product-fingerprint discovery, version-banner enumeration, capability-bearing model metadata, and ASN-level geographic and infrastructure context across Ollama, llama.cpp, and LocalAI (§2).
A capability-inheritance methodology that quantifies the operationally-useful fraction of the exposed population without invoking inference against third-party hosts, using published benchmark scores as the capability proxy (§3.2).
A characterisation of the Modelfile injection surface as a present, documented-feature-driven attack capability against every host in the population (including hosts at every patch level) that is undetectable from external metadata and unaddressed by the CVE record (§2.6).
A defender-side recommendation that the population of exposed capable inference servers be maintained as an indicator-of-compromise feed, and that outbound connections from any defended network to a listed endpoint be treated as an indicator of compromise by default, independent of the listed server’s operator, the listed server’s patch state, or any property of the listed server other than reachability-and-capability (§4.2). The recommendation is grounded in the observation that the population characterised in this paper is precisely the inference substrate the autonomous-campaign threat model relies on; outbound enterprise traffic to it does not match any documented legitimate consumption pattern.
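The consumption side of the fourth contribution can be sketched as an exact-match lookup of outbound flows against a feed of listed endpoints. This is a minimal illustration, not the feed format proposed in §4.2: the `Flow` record, the `(address, port)` feed shape, and both function names are assumptions, and the addresses used below are documentation-range placeholders.

```python
import ipaddress
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    src: str       # internal source address
    dst: str       # external destination address
    dst_port: int  # destination TCP port


def build_feed(entries: Iterable[tuple[str, int]]) -> set:
    """Normalise (address, port) feed entries for exact-match lookup."""
    return {(ipaddress.ip_address(addr), port) for addr, port in entries}


def flag_outbound(flows: Iterable[Flow], feed: set) -> list[Flow]:
    """Return the flows whose destination is a listed endpoint.

    The match ignores operator, patch state, and model inventory:
    a listed destination is treated as suspicious by default.
    """
    return [
        flow for flow in flows
        if (ipaddress.ip_address(flow.dst), flow.dst_port) in feed
    ]


if __name__ == "__main__":
    feed = build_feed([("192.0.2.10", 11434)])  # hypothetical feed entry
    flows = [
        Flow("10.0.0.5", "192.0.2.10", 11434),
        Flow("10.0.0.6", "198.51.100.7", 443),
    ]
    for hit in flag_outbound(flows, feed):
        print(f"possible IoC: {hit.src} -> {hit.dst}:{hit.dst_port}")
```

Matching on the address and port pair rather than on address alone is a design choice made for this sketch; a production feed consumer might prefer to match on address only.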

A methodological boundary applies throughout: no inference was invoked against any host in the population at any point during data collection, no prompts were submitted, and no host compute was consumed beyond the cost of returning metadata to four standard read-only API endpoints. This boundary follows prior precedent [Cisco Talos 2025; FuzzingLabs 2025] and reflects a design choice rather than a technical limitation; its consequence is that capability claims for individual models in the population are inherited from the published benchmark literature rather than measured directly. The bias direction this introduces is conservative: the actual inference-useful subset is at least as large as the capable subset reported here.

A detection-ambiguity finding (§2.4) constrains the strength of attribution claims this paper makes about specific hosts. A subset of hosts in the population serve models with names suggestive of offensive-security activity (pwn, verify-rce, read-ssh-pubkey, and similar), and the data does not cleanly discriminate between hosts compromised by an external actor and hosts where the operator is themselves running offensive-security tooling. This ambiguity is consequential for feed curation (it affects whether a given exposed-and-capable host belongs on the published IoC list) but not for feed consumption: the §4 IoC recommendation acts on outbound traffic to listed hosts, and that signal is anomalous regardless of which side of the ambiguity any given listed host falls on.

The remainder of this paper proceeds as follows. §2 characterises the exposed inference population at the levels of scale, geography, infrastructure, framework distribution, vulnerability exposure, and the Modelfile injection surface, and synthesises these into a single threat-surface description. §3 evaluates whether the capable subset of that population is operationally sufficient to support autonomous-campaign architectures of the LameHug/PromptLock class, and answers the question in the affirmative. §4 develops the defender-side and policy-side recommendations that follow from those findings, with the central recommendation being that the exposed-and-capable population be maintained as a published IoC feed and outbound connections from defended networks to listed endpoints be treated as indicators of compromise by default.

2. The Exposed Inference Population

The premise of this work is that an internet-exposed LLM inference server is a security concern by virtue of being exposed. The frameworks examined here all ship with management APIs that are unauthenticated by default, so any external user can use an exposed system, limited only by the server’s available resources. This section characterises the exposed population, including its scale, geography, and composition, and demonstrates the architectural premise empirically. The Modelfile injection surface (§2.6) is examined as a concrete instance of how that architectural condition translates into an operational capability, available against every host in the population, that no patch addresses.

2.1 Methodology and Population Overview

Internet-exposed instances of three open-source LLM inference frameworks (Ollama, llama.cpp’s server, and LocalAI) were identified by querying Shodan’s indexed banners. The query strategy combined product-specific signatures (product:"Ollama", product:"llama.cpp", product:"LocalAI") without restricting to default service ports, so that hosts running on non-default bindings were captured alongside default-port deployments.

Only metadata enumeration was performed: where a host responded to unauthenticated requests for service-level information (/api/version, /api/tags, /v1/models, framework banner endpoints), the response was recorded. Inference was not invoked, prompts were not submitted, and no host compute resources were consumed at any point during data collection.
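The recording step can be sketched as a parser over an already-retrieved /api/tags response body. This is an illustration rather than the collection pipeline used for this paper: it assumes the body is a JSON object with a `models` array whose entries carry `name`, `digest`, and a nested `details` object, and the sample payload is invented.

```python
import json


def parse_tags(payload: str) -> list[dict]:
    """Flatten an /api/tags JSON body into one metadata record per model."""
    records = []
    for model in json.loads(payload).get("models", []):
        details = model.get("details") or {}  # tolerate a missing block
        records.append({
            "name": model.get("name"),
            "digest": model.get("digest"),
            "family": details.get("family"),
            "parameter_size": details.get("parameter_size"),
            "quantization_level": details.get("quantization_level"),
        })
    return records


if __name__ == "__main__":
    # Invented sample body; the digest is a placeholder, not a real hash.
    sample = json.dumps({
        "models": [{
            "name": "llama3.2:3b",
            "digest": "sha256:placeholder",
            "details": {
                "family": "llama",
                "parameter_size": "3.2B",
                "quantization_level": "Q4_K_M",
            },
        }]
    })
    print(parse_tags(sample))
```

The parser only reads metadata that the host has already returned; it issues no requests and invokes no inference.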

The collection captured 11,732 candidate hosts. Two filters were then applied: hosts had to return at least one model in their inventory (has_models=1), and hosts whose model namespaces matched obvious staged-test or adversarial-tagging patterns (rce_*, *test-ssh-key*, *test-command-exec*, *test-copy*, testmodel*, *blackhat*, *backdoor*, *exploit*, *payload*, and absolute-path entries under /tmp/ and /var/tmp/) were excluded. This left a working set of 4,398 hosts: 3,732 Ollama, 609 llama.cpp, and 57 LocalAI. The remaining 7,334 candidates either failed to respond, returned empty inventories, or were classified as research artefacts; their existence in the discovery set is itself a finding (the gross exposed surface is roughly 2.7× the operationally-active subset) but they are excluded from per-host analyses.
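The filtering rule can be sketched with shell-style pattern matching. The pattern list is copied from the description above; everything else is an assumption made for illustration, in particular that matching is case-insensitive, that patterns apply to the full model name, and that a single matching model excludes the whole host.

```python
from fnmatch import fnmatchcase

# Patterns from the filter description; "/tmp/*" and "/var/tmp/*" stand in
# for "absolute-path entries under /tmp/ and /var/tmp/".
EXCLUDE_PATTERNS = [
    "rce_*", "*test-ssh-key*", "*test-command-exec*", "*test-copy*",
    "testmodel*", "*blackhat*", "*backdoor*", "*exploit*", "*payload*",
    "/tmp/*", "/var/tmp/*",
]


def is_artefact_name(name: str) -> bool:
    """True if a model name matches any exclusion pattern."""
    lowered = name.lower()  # assumption: matching ignores case
    return any(fnmatchcase(lowered, pattern) for pattern in EXCLUDE_PATTERNS)


def is_valid_host(model_names: list[str]) -> bool:
    """Apply both filters: a non-empty inventory and no artefact names."""
    if not model_names:  # corresponds to has_models=0
        return False
    return not any(is_artefact_name(name) for name in model_names)


if __name__ == "__main__":
    print(is_valid_host(["llama3.2:3b", "pwn:latest"]))        # retained
    print(is_valid_host(["jimscard/blackhat-hacker:latest"]))  # excluded
```

Names such as pwn:latest and verify-rce:latest pass this filter, which is the behaviour §2.4 relies on.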

A note on the term “valid” as used throughout this paper: valid hosts are those that passed the filtering described above. The label does not imply that the operator’s intent is benign, only that the host is operationally active and has not been classified as a research artefact (honeypot or staged test infrastructure) at collection time. The question of operator intent is taken up explicitly in §2.4. The exclusion list is necessarily heuristic: any actor sophisticated enough to choose less suggestive names will be retained in the valid set, and §2.4 demonstrates that this matters.

Figure 1. Population narrowing from gross discovery to Tier 1 capability.

2.2 Geographic and Infrastructure Distribution

The exposed population concentrates in a small number of jurisdictions and hosting providers, but the long tail extends across 104 countries. Table 1 summarises the top countries by valid host count.

Figure 2. Top 10 countries by valid host count.

Table 1. Geographic distribution of valid exposed inference hosts (top 10 of 104 countries).

|Country           |Hosts|%    |
|------------------|-----|-----|
|United States     |1,095|24.9%|
|China             |745  |16.9%|
|Germany           |553  |12.6%|
|France            |411  |9.3% |
|Korea, Republic of|164  |3.7% |
|India             |108  |2.5% |
|Finland           |100  |2.3% |
|United Kingdom    |85   |1.9% |
|Singapore         |84   |1.9% |
|Canada            |76   |1.7% |

The United States, China, Germany, and France together account for 63.8% of the population. This concentration is consistent with overall cloud and broadband infrastructure distribution and does not in itself indicate operator-population skew; it should be read as a reflection of where capable host infrastructure exists, not as a measure of which jurisdictions disproportionately misconfigure inference services.

ASN-type classification (Table 2) reveals a more analytically interesting pattern. Datacenter and cloud ASNs together account for 49.6% of the exposed population, residential and SMB ISPs for 18.6%, and academic networks for only 1.3%. The “unknown” category (28.8%) reflects ASNs that did not match the rule-based classifier used in the cleansing pipeline; this long tail comprises smaller and regional providers that the classifier’s keyword set did not catch.

Figure 3. ASN-type distribution of the valid host population.

Table 2. ASN-type distribution of valid hosts.

|ASN type             |Hosts|%    |
|---------------------|-----|-----|
|Datacenter           |1,484|33.7%|
|Unknown              |1,268|28.8%|
|ISP (residential/SMB)|816  |18.6%|
|Cloud                |699  |15.9%|
|Mobile               |74   |1.7% |
|Academic             |57   |1.3% |

The datacenter-and-cloud dominance is significant for the threat model developed in §3: exposed inference is not primarily a residential-misconfiguration problem but an SMB-and-enterprise hosting problem. Operators who deploy on a VPS or dedicated server appear to misconfigure exposure at higher rates than residential hobbyists, likely because the deployment context implies firewalling that does not actually exist by default. The five most-represented hosting organisations were Hetzner Online (330 hosts), OVH (300), Contabo (213), Aliyun (146), and DigitalOcean (129); these five providers alone host 25.4% of the exposed valid population.

2.3 Framework Distribution and the Ollama Plurality

Ollama is the dominant framework by a substantial margin (84.9% of valid hosts), with llama.cpp’s HTTP server in second place (13.8%) and LocalAI a small minority (1.3%). This concentration shapes the rest of the analysis: where framework-specific claims are made about version-level vulnerability exposure (§2.5), the analysis is necessarily Ollama-focused, because Ollama is the only framework in this dataset whose version banner is consistently populated. Of valid Ollama hosts, 99.4% returned a parseable version string from /api/version. Of valid llama.cpp hosts, only 2.1% did; of valid LocalAI hosts, 3.5%. The difference is structural rather than a collection failure: Ollama exposes /api/version as part of its standard API, while llama.cpp’s server does not surface a comparable endpoint by default. The consequence is that quantitative CVE-exposure claims for the llama.cpp and LocalAI populations cannot be made from this dataset. They are treated in this paper as population-existence findings rather than as quantified threat surfaces, and CVE-bearing risks for those frameworks are noted (§2.5) at the population level only.

2.4 The Detection-Ambiguity Problem

A subset of hosts in the population serve models with names suggestive of offensive-security activity: pwn:latest, leak:latest, verify-rce:latest, muav-lfr:latest, poseidon-deploy:latest, and various read-* and id-check-style labels. More overtly adversarial names such as *blackhat*, *backdoor*, *exploit*, testmodel*, and rce_* were excluded by the filter described in §2.1; the names listed here pass that filter and remain in the valid set. The natural reading of these names is that the hosts have been compromised and an attacker has uploaded calling-card or operationally useful models. This reading is insufficiently supported by the data: nothing in the collected metadata indicates whether a given host has actually been compromised, whatever version of the inference server it reports. The same observations are equally consistent with the operator legitimately running offensive-security tooling. Security researchers, penetration testers, and red-team practitioners frequently deploy locally-hosted LLMs to assist with their work, and registry-published models with provocative names (e.g., jimscard/blackhat-hacker, which the deployed filter excluded as a research-artefact pattern) are routinely pulled by legitimate red-team operators.

Hypothesis-discriminating tests were applied to the data. Two competing hypotheses generate different expected patterns:

Compromise hypothesis: identical content digest across many hosts (single attacker pushing the same payload), spread across operators with no security-research connection, often with the suggestive model as the only or primary model on the host.
Operator hypothesis: digest variance (operators each pulling or building independently), concentration in security-research operator populations, co-residence with diverse legitimate models.

One specific cluster illustrates this. A single content digest (5eca15f3…ee3a266, hereafter the 5eca digest) corresponds to a stock public llama3.2:3b file presented under varied labels, and the data on it resolves into three operator-signature classes rather than supporting either pure hypothesis:

44 distinct hosts carry at least one model with this digest. The digest itself is consistent (size, family, parameter count, quantization) with the registry-canonical llama3.2:3b file; the underlying content is not a malicious payload.
44 distinct name-labels for this single digest have been observed across the population, including pwn-suggestive labels (pwn, leak, verify-rce), enumerated test labels (test:latest, test2:latest through test18:latest), action-named labels (read-id, read-etc-issue, read-ssh-pubkey, id-check), and operator-tag labels (poseidon-deploy).
One outlier host in CHINANET Shanghai Province (running Ollama 0.6.6) accounts for 37 of the 44 distinct name-labels observed. This host’s entire model registry consists of those 37 relabels of a single digest and nothing else. The pattern is most parsimoniously explained as a single operator using an automated tooling pipeline that creates Modelfile aliases under each of its test or campaign tags.
Three additional hosts (in DigitalOcean US, Tencent China, and HIVELOCITY US) hold 2–3 digest relabels each as their entire model registry, a milder version of the same multi-label-digest-only pattern.
Eight hosts carry the 5eca digest as the only model in their registry: exactly one model, exactly one label. These eight span six countries and seven distinct ASNs, with no concentration in security-research operator populations.
The remaining 32 hosts carry the 5eca digest alongside other models, averaging 3.6 other models per host (range 1–13), with diverse loadouts that do not pattern-match to either pure operator-curation or pure compromise.

This three-way breakdown does not discriminate cleanly between the two hypotheses. The Shanghai host plus the three other multi-label-digest-only hosts (4 hosts in total) are most consistent with the operator hypothesis: curated relabels by tooling. The eight single-model bare-registry hosts are most consistent with the compromise/staging hypothesis: a single labelled payload pushed onto an otherwise-empty inference server. The 32 mixed-loadout hosts are ambiguous and cannot be classified from external metadata alone.
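The three-way breakdown can be expressed as a small classifier over a host's registry. This is a sketch of the grouping logic only: the class labels and the function name are hypothetical, the input is assumed to be a list of (label, content digest) pairs per host, and the digest used below is a placeholder rather than the 5eca digest.

```python
from typing import Optional


def classify_host(models: list[tuple[str, str]], digest: str) -> Optional[str]:
    """Assign a host to one of three signature classes for a given digest.

    models: (label, content_digest) pairs from the host's model registry.
    Returns None when the host does not carry the digest at all.
    """
    matching = [label for label, d in models if d == digest]
    if not matching:
        return None
    if len(matching) < len(models):
        # The digest sits alongside at least one unrelated model.
        return "mixed_loadout"
    if len(matching) == 1:
        # Exactly one model, exactly one label.
        return "single_model_bare"
    # The whole registry is relabels of the same digest.
    return "multi_label_digest_only"


if __name__ == "__main__":
    target = "sha256:placeholder"  # stands in for the digest of interest
    registry = [("pwn:latest", target), ("leak:latest", target)]
    print(classify_host(registry, target))
```

The classifier reproduces the grouping but not the interpretation: which hypothesis a class supports remains a judgement about operator behaviour that external metadata cannot settle.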

Detection of unauthorised model uploads via name-pattern indicators is unreliable: it produces both false positives against legitimate offensive-security operators and false negatives against any actor sophisticated enough to choose a non-suggestive name. The structural conclusion is that adversarial-name detection on exposed inference servers cannot serve as a primary detection channel in this population. An operator-side /api/show inspection (comparing the local Modelfile metadata against the registry-canonical Modelfile for the digest) is the only potentially reliable resolution path, and operators rarely perform it. For the purposes of this paper’s quantitative analyses, only the narrow set of patterns that are unambiguously staged-test infrastructure rather than potentially-legitimate offensive-security tools were excluded (see §2.1); all other hosts are retained in the valid population regardless of suggestive model names.

2.5 The Architectural Exposure of Default-Unauthenticated Inference

The central premise of this work is that internet-exposed LLM inference servers are dangerous by their very nature, regardless of patch state. The frameworks examined here (Ollama, llama.cpp’s server, and LocalAI) all ship with management APIs that are unauthenticated by default. Any host reachable from the internet on its inference port exposes those APIs to any client. This condition holds for the entire population of 4,398 valid hosts, not for some vulnerable subset of it.

This section addresses two questions that follow from that premise. First, does the published vulnerability record provide any reason to qualify the architectural-exposure framing? For example, is the threat concentrated in a small subset of unpatched legacy hosts that organisations could prioritise differently from the population as a whole? Second, does the framework’s documented (non-CVE) functionality contain capabilities that, when exposed without authentication, materially extend the threat beyond what the architectural framing alone implies?

The first question is answered briefly here; the second is taken up in §2.6.

Patch state is not the bottleneck

Twenty CVEs were published against Ollama between April 2024 and January 2026 [Akaoma 2026]. Ten are tractable for population-level matching from version-banner data alone (the others depend on configuration state not exposed in metadata responses). The exposure breakdown across the 3,710 versioned valid Ollama hosts is summarised in Table 3.

Table 3. Ollama CVE exposure across versioned valid hosts (n=3,710).

|CVE                         |Class                                   |Affected versions|Hosts|%    |
|----------------------------|----------------------------------------|-----------------|-----|-----|
|CVE-2024-37032 (“Probllama”)|RCE via path traversal                  |< 0.1.34         |31   |0.8% |
|CVE-2024-7773               |RCE via ZipSlip arbitrary file upload   |< 0.1.47         |46   |1.2% |
|CVE-2024-39719              |File existence disclosure               |≤ 0.3.14         |65   |1.8% |
|CVE-2024-39720              |DoS via malformed GGUF                  |< 0.1.46         |43   |1.2% |
|CVE-2024-39721              |DoS via /dev/random                     |< 0.1.34         |31   |0.8% |
|CVE-2024-39722              |Path traversal file disclosure          |< 0.1.46         |43   |1.2% |
|CVE-2024-12886              |DoS                                     |< 0.5.5          |86   |2.3% |
|CVE-2025-48889              |Arbitrary file copy                     |< 0.6.6          |258  |7.0% |
|CVE-2025-51471              |Cross-domain auth token exposure        |= 0.6.7          |4    |0.1% |
|CVE-2025-63389              |Missing authentication on management API|≤ 0.12.3         |623  |16.8%|

Two observations from this table are worth recording before setting it aside.

The exposed Ollama population is reasonably current: 83.2% of versioned hosts run releases newer than 0.12.3, the version through which CVE-2025-63389 applies. RCE-class CVEs affect small fractions of the population. The most exposed RCE, CVE-2024-7773, reaches 1.2%. The operator population is, in aggregate, applying patches.

But this finding does not weaken the architectural-exposure premise. CVE-2025-63389 is informative precisely because of what it formalises. The advisory establishes that the missing-authentication condition on Ollama’s management endpoints is a vulnerability rather than a configuration choice. Hosts in the affected range are vulnerable by official acknowledgement, not merely by inference from the API’s design. Subsequent releases address the issue, but the underlying default (an inference server that exposes write-capable management endpoints without authentication) has been the framework’s behaviour since inception. CVE-2025-63389 documents this; it does not introduce it. The 83.2% of the population on newer releases retain the same architectural exposure once an operator has bound the service to a public interface.

CVE matching from version banners thus produces a partial picture in two directions. It overstates exposure differentiation between legacy and current hosts, because the architectural conditions that make exposed inference dangerous are present at every patch level. And it understates the practical attack surface, because the most consequential capabilities available to an unauthenticated client are documented framework features that no CVE will ever address.
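Version-banner matching of the kind summarised in Table 3 can be sketched as a comparison of parsed version tuples against per-CVE bounds. The bounds are copied from Table 3; the parsing rule is an assumption made for this sketch (only the leading major.minor.patch triple is read, so any pre-release suffix is ignored), and the function names are hypothetical.

```python
import operator
import re

# (CVE, comparison, boundary version) as listed in Table 3.
RULES = [
    ("CVE-2024-37032", "<", "0.1.34"),
    ("CVE-2024-7773", "<", "0.1.47"),
    ("CVE-2024-39719", "<=", "0.3.14"),
    ("CVE-2024-39720", "<", "0.1.46"),
    ("CVE-2024-39721", "<", "0.1.34"),
    ("CVE-2024-39722", "<", "0.1.46"),
    ("CVE-2024-12886", "<", "0.5.5"),
    ("CVE-2025-48889", "<", "0.6.6"),
    ("CVE-2025-51471", "==", "0.6.7"),
    ("CVE-2025-63389", "<=", "0.12.3"),
]
OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq}


def parse_version(banner: str):
    """Return (major, minor, patch) or None if the banner is unparseable."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", banner.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def matching_cves(banner: str) -> list[str]:
    """List the Table 3 CVEs whose affected range includes this version."""
    version = parse_version(banner)
    if version is None:
        return []
    return [
        cve for cve, op, bound in RULES
        if OPS[op](version, parse_version(bound))
    ]


if __name__ == "__main__":
    print(matching_cves("0.6.7"))
    print(matching_cves("0.13.0"))
```

Comparing integer tuples rather than strings avoids ordering errors such as "0.12.3" sorting before "0.3.14". A host that matches no rule is still exposed in the architectural sense described above; the empty result reflects only the CVE record.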

For the llama.cpp and LocalAI populations, version data is too sparse for equivalent matching. CVE-2024-34359 (“Llama Drama”) in llama-cpp-python permits RCE via Server-Side Template Injection in GGUF chat-template metadata [GHSA-56xg-wfcc-g829]; CVE-2026-34159 is an RPC-backend deserialization RCE in llama.cpp prior to build b8492 [TheHackerWire 2026]; CVE-2026-21869 is an out-of-bounds write vulnerability in the llama.cpp HTTP server, with potential RCE through deterministic memory corruption [TheHackerWire 2026]. Without version-banner data, the prevalence of vulnerable installs in this segment of the population cannot be quantified. The existence of these hosts (609 llama.cpp and 57 LocalAI) nonetheless places these CVEs in scope as a supplementary risk on top of the architectural exposure that all three populations share.

Post-collection disclosures

Data collection for this paper completed in early May 2026. In the weeks immediately following, additional Ollama vulnerabilities were publicly disclosed that were not known to either the operator population or to this analysis at collection time. These are noted here for completeness; they do not retroactively change the patch-exposure percentages in Table 3, which reflect the CVE record as it stood at collection.

The most consequential post-collection disclosure is CVE-2026-7482 (“Bleeding Llama,” CVSS 9.1), an unauthenticated out-of-bounds read in Ollama’s chat endpoint that allows any reachable client to leak the server’s entire process memory, including in-flight prompts and responses, environment variables, and any in-memory secrets [Cyera 2026]. The vulnerability was reported privately to Ollama maintainers on 2 March 2026 and patched in version 0.17.1, with public disclosure following in early May 2026. Two further Windows-specific vulnerabilities, CVE-2026-42248 and CVE-2026-42249, were disclosed on 29 April 2026; they chain to permit a persistent RCE via the Ollama-for-Windows auto-updater, affecting Windows builds 0.12.10 through 0.17.5 [Help Net Security 2026]. A lower-severity path-traversal flaw, CVE-2026-7020 (CVSS 2.6, affecting versions ≤ 0.20.2), was disclosed on 26 April 2026 [Akaoma 2026].

The relevance of these disclosures to the present analysis is qualitative rather than quantitative. They are consistent with the architectural-exposure thesis developed above: even after the operator population has, in aggregate, applied patches for previously-disclosed CVEs (83.2% running releases newer than the CVE-2025-63389 cap, per Table 3), the underlying default-unauthenticated condition continues to surface new high-severity vulnerabilities, including one (Bleeding Llama) that is exploitable purely through unauthenticated network reachability with no prior CVE foreknowledge required. The pattern strengthens rather than weakens the paper’s central claim that reachability is the load-bearing condition, not patch state.

2.6 The Modelfile Injection Surface

The architectural exposure described in §2.5 (write-capable management APIs reachable without authentication) defines the entry point. What an actor with that access can persistently change about a host’s inference behaviour defines the operational threat. This section examines one such capability that follows from documented framework functionality.

Ollama’s Modelfile format [Ollama Documentation] defines four fields that persist across model invocations and influence every inference request against a given model:

SYSTEM: a persistent system prompt prepended to every chat exchange.
TEMPLATE: the chat template that wraps each user prompt before submission to the underlying model.
MESSAGE: pre-loaded conversation history that the model treats as prior context.
PARAMETER: decoding parameters (temperature, stop tokens, top-p, etc.) that apply to every generation.

These fields are documented features intended to support legitimate model customisation. They are also writable by anyone who can call /api/create against the host. On an unauthenticated Ollama instance (the population of interest in this study), that means any client that can reach the API.

The threat model that follows from this is qualitatively different from the malicious-fine-tune scenario typically discussed in the LLM-supply-chain literature. Consider an actor with write access to a host’s Ollama API who replaces a stock model’s Modelfile while leaving the underlying model weights and digest unchanged. The model file, when hashed, matches its registry-canonical value; static integrity checks against the model blob pass. The attacker-controlled behaviour is encoded entirely in the SYSTEM, TEMPLATE, and MESSAGE metadata, which most defender tooling does not inspect because it is documented configuration rather than executable content. The capabilities this enables include:

Persistent prompt injection at rest. The SYSTEM field’s contents are prepended to every conversation. An actor who writes a SYSTEM directive that biases responses on specific topics (recommending particular URLs, package names, cryptocurrency addresses, or political positions) affects every downstream user of the model, including legitimate operator users who may post outputs publicly.
Silent prompt rewriting. The TEMPLATE field controls how user input is wrapped before reaching the model. A modified template can append, prepend, or substitute instructions that the user never sees but that the model receives.
Pre-loaded false context. The MESSAGE field permits the attacker to plant fabricated prior turns of conversation that the model treats as its own previous statements, biasing subsequent reasoning.
Latent tool-use weaponisation. Ollama’s roadmap and current releases increasingly support function-calling and tool integration. A SYSTEM directive planted while tool use is disabled becomes active and consequential the moment the operator enables tools, without further attacker action.
Output watermarking and signalling. The SYSTEM field can instruct the model to embed specific tokens or phrasings into responses, providing a covert channel for the attacker to identify the affected host’s outputs in downstream public corpora.

This surface has two properties worth recording alongside the CVE picture in §2.5:

First, no exploit primitive is required. The behaviour is documented API functionality. An actor uploading a Modelfile with custom SYSTEM content is not exploiting Ollama; they are using it as designed against an unauthenticated endpoint. Patches do not address it. Hosts on the latest release are exposed in the same way as hosts on older versions, because the API surface is the same.

Second, detection from external observation is structurally limited. The dataset for this paper captures name, digest, family, parameter_size, and quantization_level for each model on each host, all available from /api/tags. It does not capture the SYSTEM, TEMPLATE, MESSAGE, or PARAMETER fields, because these are returned only by the /api/show endpoint, which represents a more invasive query that the methodology of this paper deliberately avoided. The consequence is that an external observer (whether researcher or defender) cannot determine from publicly-observable metadata whether a model on an exposed host has been Modelfile-injected. The model’s presence, name, and digest reveal nothing about its persistent context. Only the operator, querying their own host’s /api/show endpoint, can observe the injection, and only then by knowing what the legitimate SYSTEM/TEMPLATE values for that model should be, which most operators do not.
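The observability boundary described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's collection tooling: the `visible_fields` helper and the sample payload are hypothetical, and the field layout assumes the list-models response shape documented for /api/tags (name and digest at the top level, the remaining three fields under `details`).

```python
import json

# Fields the dataset records per model; the last three sit under "details".
TOP_LEVEL = ("name", "digest")
DETAIL_LEVEL = ("family", "parameter_size", "quantization_level")
# Modelfile state that a list-models reply does not carry.
HIDDEN = ("system", "template", "messages", "parameters")

def visible_fields(tags_body: str) -> list:
    """Return one record per model holding only externally visible metadata."""
    records = []
    for model in json.loads(tags_body).get("models", []):
        details = model.get("details") or {}
        record = {key: model.get(key) for key in TOP_LEVEL}
        record.update({key: details.get(key) for key in DETAIL_LEVEL})
        records.append(record)
    return records

# Hypothetical response body shaped like a list-models reply.
SAMPLE = json.dumps({"models": [{
    "name": "example-model:7b",
    "digest": "0" * 64,
    "size": 4_000_000_000,
    "details": {
        "family": "qwen2",
        "parameter_size": "7.6B",
        "quantization_level": "Q4_K_M",
    },
}]})

records = visible_fields(SAMPLE)
print(records[0])
# None of the persistent-context fields appear in the externally visible record.
print(any(key in records[0] for key in HIDDEN))
```

Two hosts serving the same name and digest produce identical records here, whatever their SYSTEM or TEMPLATE state, which is the limitation the paragraph describes.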

The dataset offers indirect evidence consistent with operator unfamiliarity with this surface. The 5eca-digest cluster examined in §2.4 demonstrates that single content-digest groups can be presented under many distinct labels, with one operator alone hosting 37 of the 44 observed labels. Whether that specific cluster reflects operator or actor activity, the population-level observation stands: model labels and content digests are decoupled in practice, with operators routinely treating renamed copies of identical content as distinct models. A Modelfile injection that preserves the underlying digest while altering metadata fits naturally into this pattern of label-detached deployment.
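The label/digest decoupling can be surfaced with a simple grouping pass. The sketch below is illustrative only: the observation tuples, IP addresses, labels, and digests are invented, and `labels_by_digest` and `label_detached` are hypothetical helper names rather than functions from the paper's pipeline.

```python
from collections import defaultdict

def labels_by_digest(observations):
    """Group (host, label, digest) observations into digest -> set of labels."""
    groups = defaultdict(set)
    for _host, label, digest in observations:
        groups[digest].add(label)
    return groups

def label_detached(groups, min_labels=2):
    """Return digests that appear under at least `min_labels` distinct labels."""
    return {
        digest: sorted(labels)
        for digest, labels in groups.items()
        if len(labels) >= min_labels
    }

# Hypothetical observations; addresses are from documentation ranges.
OBSERVATIONS = [
    ("198.51.100.7", "assistant:latest", "aaaa1111"),
    ("198.51.100.7", "coder:v2", "aaaa1111"),
    ("203.0.113.9", "helper:prod", "aaaa1111"),
    ("203.0.113.9", "llama3:8b", "bbbb2222"),
]

detached = label_detached(labels_by_digest(OBSERVATIONS))
print(detached)
```

A digest that maps to many labels indicates renamed copies of identical content; it does not by itself indicate who did the renaming.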

The Modelfile injection surface illustrates the broader point of §2.5: the architectural exposure of unauthenticated inference servers cannot be characterised by CVE matching alone. Some of the most consequential capabilities available to an unauthenticated client are documented framework behaviours that no CVE will ever catalogue. Any treatment of exposed inference as a security problem must address this surface alongside whatever vulnerabilities are catalogued at the time of analysis.

2.7 Synthesising the Threat Surface

Drawing together the findings of §2.1 through §2.6, the exposed inference population can be characterised as follows:

Default-unauthenticated architecture across the population. All 4,398 valid hosts run frameworks that ship with unauthenticated management APIs. Reachability is sufficient for write access; no exploit is required for an unauthenticated client to interact with model and configuration endpoints. This condition holds at every patch level.
Modelfile injection as a present capability. Documented framework features permit persistent context modification (SYSTEM, TEMPLATE, MESSAGE, PARAMETER) that affects all subsequent inference. This capability is available against every host in the population, leaves no externally-observable signature, and is not addressed by patching.
Detection ambiguity from external metadata. Distinguishing operator activity from third-party activity from external metadata alone is unreliable, both for model-name patterns (§2.4) and for digest-grouping patterns (§2.6). External observers cannot reliably classify hosts as compromised, customised, or default.
Scale and composition. The population spans 104 countries and 895 ASNs, dominated by Ollama (84.9%); concentrations are in datacenter and cloud hosting rather than residential ISPs, with five providers accounting for 25.4% of valid hosts.
Patch state as supplementary context. Catalogued CVE exposure affects 16.8% of the versioned Ollama subset, dominated by the missing-authentication CVE that formalises the framework’s default behaviour. Legacy RCE-class CVEs affect small fractions. Patch state does not differentiate the threat surface in a way that materially alters the architectural picture above.

These properties combine to define the threat surface that §3 quantifies in capability terms. The argument of §3 is not “find the small subset of hosts that are vulnerable in the CVE sense.” It is “given that the entire population is architecturally exposed, what fraction of it is operationally useful to an actor running an autonomous inference-driven campaign?”

3. Capability Sufficiency for Autonomous Campaigns

The argument of §2 establishes that the exposed-inference population is dangerous as a class. The argument of this section asks a narrower question with operational consequences: given the population’s composition, is there enough capable inference exposed without authentication that an actor designing an autonomous LLM-driven campaign (of the kind demonstrated by the LameHug malware [CERT-UA 2025] and the PromptLock proof-of-concept [Raz et al. 2025]) could plausibly source the inference component from this population alone?

The question is not whether such an actor exists, nor whether the population has been used this way to date. The question is whether the conditions are present. The framing throughout this section is third-person hypothetical and grounded entirely in published capability benchmarks; no inference was performed against any host, and no claim is made about model behaviour beyond what published benchmarks attest.

3.1 The PromptLock Reference Architecture

PromptLock’s design [Raz et al. 2025] establishes a useful reference because it represents the threat-model architecture this section evaluates against. The implant ships as a small Golang binary embedding hard-coded prompts. At runtime it queries an Ollama API endpoint, feeding the prompts to a hosted model and receiving generated Lua scripts that perform the operational tasks of the malware lifecycle: filesystem enumeration, sensitive-content classification, encryption, ransom-note generation. The implant itself contains no malicious payload code. The malicious behaviour is generated on demand by the LLM at each invocation.

Two properties of this architecture are relevant to the capability question. First, the implant is model-agnostic: it requires only an Ollama-API-compatible endpoint serving a model capable of generating functional Lua scripts from natural-language task descriptions. Any model meeting that capability threshold is interchangeable with any other from the implant’s perspective. Second, the inference dependency is fungible across hosts: an implant can be configured to enumerate candidate endpoints, score the loaded models against a capability heuristic via /api/tags, and use whichever capable model is reachable. Pool effects therefore operate at the level of the population’s capable subset rather than at the level of individual model variants.

These properties define the empirical question for this section: how large is the capable subset of the exposed population, and is its diversity sufficient to support a fungible-inference architecture?

3.2 Capability Inheritance Methodology

Direct measurement of model capabilities (running attack-relevant prompts against models in the exposed population) was outside scope for this work, both for the methodological reasons established in §2.1 (no inference invocation against third-party hosts) and because such measurement would require authoring prompt batteries whose publication carries its own risks. Instead, capability for each model in the population is inherited from existing published benchmarks, and the threshold for inclusion in the “capable subset” is defined operationally rather than from new measurement.

The literature contains substantial published capability data for major open-weight model families. CyberSecEval 3 [Wan et al. 2024] measures cyber-offensive capability across the Llama, Qwen, and Mixtral families (with closed-source comparators including Gemini Pro and GPT-4 Turbo) on automated social engineering, exploit generation, and autonomous offensive tasks. MITRE OCCULT [MITRE 2025] introduces the TACTL multiple-choice benchmark for offensive cyber knowledge and reports scores for major frontier and open-weight models, including the observation that DeepSeek-R1 exceeds 90% on TACTL. HumanEval [Chen et al. 2021] and SWE-bench [Jimenez et al. 2024] provide code-generation competence baselines for every model in the exposed population that has been formally evaluated.

The threshold for “capable” used in this analysis is conservative: a model is included if (i) it is a code-specialised model of 7B parameters or larger, OR (ii) it is a general-purpose model of 30B parameters or larger. These thresholds reflect the size class at which the modern open-weight code-specialised lines (Qwen2.5-Coder 7B-Instruct at HumanEval pass@1 ≈ 84.1% [Hui et al. 2024]; DeepSeek-Coder Base 7B at 50.3% [Guo et al. 2024]) reliably operate at or above the 50% pass@1 mark, and at which models in the CyberSecEval 3 evaluations are reported as competent at attack-relevant code generation. The threshold is deliberately not raised to 13B+, even though the older Code Llama family at 7B and 13B falls below 50% pass@1 (33.5% and 36.0% respectively [Roziere et al. 2023]); the 7B threshold is retained because the dominant code-specialised families in the exposed population (qwen2.5-coder, qwen3-coder, deepseek-coder, magicoder, devstral) all clear 50% at 7B, and including legacy Code Llama deployments preserves the conservative bias of the count: where Code Llama 7B is over-counted as capable, this is at most a small upward bias in a deliberately under-counting estimate. Smaller code-specialised models (e.g., 3B-parameter coders) and smaller general-purpose models are excluded. The actual inference-useful subset is at least as large as the capable subset reported here, since some smaller models in the excluded set may be sufficient for the task profile of an LLM-orchestrated campaign.

A separate “highly capable” tier (Tier 1) is also reported. Tier 1 inclusion requires (i) a code-specialised model of 30B+ parameters or (ii) a general-purpose model of 70B+ parameters. Tier 1 corresponds approximately to the model classes for which CyberSecEval 3 reports the highest cyber-offensive scores in the open-weight category, and against which adaptive jailbreak techniques [Andriushchenko et al. 2024] and prompt-decomposition attacks [Li et al. 2024] yield the highest measured success rates against attack-code generation in published research.

Two filters apply to the capability counts before inclusion. First, models whose size_bytes is reported as zero are excluded; this is principally a framework-metadata constraint rather than a content filter. Ollama’s /api/tags returns size data while the OpenAI-compatible /v1/models endpoint used by llama.cpp and LocalAI does not. The filter consequently removes 1,534 host-model entries (16.3% of the valid set), distributed across llama.cpp (810), Ollama (486), and LocalAI (238) entries. Second, models whose parameter_size field is unparseable are excluded; this affects 171 entries (1.8%), predominantly returning raw integer byte-counts in place of the framework’s normal 7B/70B-style format. The combined filtering excludes 853 of the 4,398 valid hosts (19.4%) from capability-subset eligibility, predominantly because their hosting framework does not surface the metadata required to evaluate size against the threshold. This is a methodologically conservative outcome: it comes at the cost of potential capable-subset members but ensures every host counted has framework-confirmed size and parameter metadata.
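The threshold and filter rules of this subsection can be expressed as a small classifier. The sketch below is one possible reading, not the paper's implementation: matching code-specialised models by the family names listed in Table 4 is an assumption, as are the helper names `parse_params_b` and `classify` and the choice to compare sizes exactly as reported (so a model reporting 6.7B falls below the 7B line).

```python
import re

# Assumption: code-specialised models are recognised by family name, using the
# code families listed in Table 4. The paper does not publish its matching rule.
CODE_FAMILIES = ("qwen2.5-coder", "qwen3-coder", "deepseek-coder", "codellama",
                 "devstral", "codestral", "magicoder", "codeqwen", "wizardcoder")

# Accepts "7B", "70.6B", "494.03M"; anything else (e.g. a raw byte count) fails.
_SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*([MB])$", re.IGNORECASE)

def parse_params_b(parameter_size):
    """Return the parameter count in billions, or None if unparseable."""
    match = _SIZE_RE.match((parameter_size or "").strip())
    if not match:
        return None
    value = float(match.group(1))
    return value / 1000 if match.group(2).upper() == "M" else value

def classify(name, parameter_size, size_bytes):
    """Return 'excluded', 'below', 'capable', or 'tier1' for one model entry.

    'tier1' implies 'capable'; a host counts as capable if any of its
    entries is 'capable' or 'tier1'.
    """
    if not size_bytes:            # filter 1: size_bytes missing or zero
        return "excluded"
    params = parse_params_b(parameter_size)
    if params is None:            # filter 2: unparseable parameter_size
        return "excluded"
    is_code = any(family in name.lower() for family in CODE_FAMILIES)
    if params >= (30 if is_code else 70):
        return "tier1"
    if params >= (7 if is_code else 30):
        return "capable"
    return "below"

print(classify("qwen2.5-coder:7b", "7.6B", 4_700_000_000))
print(classify("llama3.3:70b", "70.6B", 42_000_000_000))
print(classify("custom-model", "7615616512", 5_000_000_000))
```

Applying both filters before the size comparison mirrors the conservative ordering described above: an entry without framework-confirmed metadata is never counted, even if its name suggests a large model.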

A limitation of this inheritance approach must be recorded: a non-trivial fraction of the canonical names in the exposed population are community fine-tunes or quantisations not directly benchmarked in the published literature. For these, capability is inferred from the underlying base model; this assumption is approximately correct for quantisations and most general fine-tunes [Huang et al. 2024], but is unreliable for fine-tunes that specifically target code generation, instruction-following, or refusal-removal. The published benchmarks systematically under-report the capability of “uncensored” or “abliterated” variants, because such variants, although observed in the population (see §3.5), are typically not submitted for benchmark evaluation.

3.3 The Capable Subset

Of the 4,398 valid hosts in the working set, 1,002 (22.8%) host at least one model meeting the capability threshold defined in §3.2. This is the operationally-relevant figure for the threat-model question: an actor designing an autonomous campaign with fungible inference would have approximately 1,000 candidate endpoints in this population alone.

Table 4 summarises the family composition of the capable subset. The numbers are by host count, reflecting the inference-availability framing. A host is counted once per family if it hosts any model in the family meeting the capability threshold, regardless of how many such variants are loaded. Hosts hosting capable models from multiple families appear in multiple rows.

Figure 4. Capable-model family distribution by host count. Qwen accounts for roughly 73% of capable host appearances.

Table 4. Capable-model family distribution across the exposed population (host counts; hosts may appear in multiple rows where they host models from multiple families).

|Family                   |Class|Threshold|Hosts|Distinct variants|
|-------------------------|-----|---------|-----|-----------------|
|qwen3 (general-purpose)  |GP   |≥30B     |321  |217              |
|qwen2.5-coder            |Code |≥7B      |104  |49               |
|qwen3-coder              |Code |≥7B      |83   |57               |
|deepseek-coder           |Code |≥7B      |49   |35               |
|llama3 (general-purpose) |GP   |≥30B     |42   |21               |
|deepseek-r1              |GP   |≥30B     |41   |26               |
|qwen2.5 (general-purpose)|GP   |≥30B     |37   |23               |
|codellama                |Code |≥7B      |21   |13               |
|devstral                 |Code |≥7B      |18   |11               |
|deepseek-v*              |GP   |≥30B     |13   |9                |
|codestral                |Code |≥7B      |8    |4                |
|magicoder                |Code |≥7B      |8    |2                |
|codeqwen                 |Code |≥7B      |5    |2                |
|wizardcoder              |Code |≥7B      |4    |1                |

The Tier 1 (highly capable) subset comprises 357 hosts: 30B+ code-specialised models and 70B+ general-purpose models. These represent the inference endpoints that would offer the highest capability ceiling to an autonomous-campaign architecture.

The capable subset’s composition reveals two patterns worth recording. First, the Qwen family (particularly qwen3 and the qwen-coder lines) dominates the capable population, accounting for roughly 73% of capable host appearances across the table (550 of 754). This reflects the family’s recent prominence in the open-weight code-generation space, and creates a concentration that has implications for §4: a defender or threat-intelligence operator who tracks Qwen-family deployments specifically captures most of the capable subset. Second, the highly-capable Tier 1 subset is large enough to support fungible inference but small enough that population-level enumeration is tractable. A defender can plausibly maintain an awareness list of all Tier 1 capable exposed hosts, an awareness that is much harder to maintain for the broader 1,002-host capable subset and effectively impossible for the full 4,398-host exposed population.
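The Qwen share quoted above can be recomputed directly from the host counts in Table 4. The snippet below is a worked check only; treating codeqwen as a Qwen-family line is an assumption, made because it is needed to reproduce the 550 figure.

```python
# Host counts per family, copied from Table 4 (hosts may appear in several rows).
TABLE_4_HOSTS = {
    "qwen3": 321, "qwen2.5-coder": 104, "qwen3-coder": 83, "deepseek-coder": 49,
    "llama3": 42, "deepseek-r1": 41, "qwen2.5": 37, "codellama": 21,
    "devstral": 18, "deepseek-v*": 13, "codestral": 8, "magicoder": 8,
    "codeqwen": 5, "wizardcoder": 4,
}

# Assumption: these five rows make up the "Qwen family" referred to in the text.
QWEN_FAMILIES = {"qwen3", "qwen2.5-coder", "qwen3-coder", "qwen2.5", "codeqwen"}

total_hosts = sum(TABLE_4_HOSTS.values())
qwen_hosts = sum(n for family, n in TABLE_4_HOSTS.items() if family in QWEN_FAMILIES)
share = qwen_hosts / total_hosts

print(f"{qwen_hosts} of {total_hosts} = {share:.1%}")  # 550 of 754 = 72.9%
```

Because hosts are counted once per family, the denominator is host appearances (754) rather than distinct capable hosts (1,002 hosts, some of which appear in several rows and some of which host families not listed).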

3.4 Population Diversity and Resilience to Disruption

The fungibility argument in §3.1 depends on the capable subset being diverse enough that no single intervention disrupts inference availability. Three diversity measures are relevant: geographic, network-infrastructure, and framework. Table 5 summarises each.

Figure 5. Diversity of the capable subset versus the Tier 1 subset across geographic and network measures.

Table 5. Diversity of the capable and Tier 1 subsets.

|Measure             |Capable (n=1,002)      |Tier 1 (n=357)         |
|--------------------|-----------------------|-----------------------|
|Distinct countries  |63                     |42                     |
|Distinct ASNs       |330                    |155                    |
|Distinct /24 subnets|948                    |345                    |
|Top country share   |27.0% (United States)  |–                      |
|Top ASN share       |7.5% (Hetzner)         |–                      |
|Frameworks          |Ollama 995, llama.cpp 7|Ollama 355, llama.cpp 2|

These figures support the resilience argument unambiguously. The capable subset spans 63 countries, with the top country (the United States, at 271 hosts) accounting for only 27.0% of the subset. ASN concentration is similarly diffuse: the top hosting organisation, Hetzner, accounts for 75 of the 1,002 capable hosts (7.5%). No individual takedown action would meaningfully reduce the capable subset’s availability to an actor designing an autonomous campaign.

Pool-redundancy mathematics under modest individual-host availability assumptions further reinforce this. If each host is independently reachable with probability 0.5 (a deliberately pessimistic figure for residential or small-cloud-VPS hosts), the probability that at least one of N hosts is reachable at any given time is 1 − 0.5^N, which exceeds 0.99 at N=7 and 0.999 at N=10. With 1,002 capable hosts spread across 330 ASNs, an actor’s effective availability of capable inference stays above 0.999 under any reasonable independence assumption. The threat model is not bottlenecked on inference availability.
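The pool-redundancy figures can be checked with a short calculation. The sketch assumes, as the text does, that hosts fail independently with a common per-host availability p; the function names are illustrative.

```python
def pool_availability(p: float, n: int) -> float:
    """Probability that at least one of n independent hosts is reachable."""
    return 1.0 - (1.0 - p) ** n

def min_hosts(p: float, target: float) -> int:
    """Smallest pool size whose availability reaches `target`."""
    if not (0.0 < p <= 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p must be in (0, 1] and target in (0, 1)")
    n = 1
    while pool_availability(p, n) < target:
        n += 1
    return n

print(pool_availability(0.5, 7))    # 0.9921875
print(pool_availability(0.5, 10))   # 0.9990234375
print(min_hosts(0.5, 0.99), min_hosts(0.5, 0.999))  # 7 10
```

Correlated outages (for example, a single provider suspending many hosts at once) would weaken the independence assumption, which is why the ASN spread in Table 5 matters alongside the raw host count.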

The framework-distribution row of Table 5 records a finding that nuances the cross-framework population framing of §2: among capable hosts specifically, 995 of 1,002 (99.3%) run Ollama. The capable subset is, in operational terms, an Ollama subset. This concentration is not a property of the inference frameworks themselves (llama.cpp and LocalAI can host any GGUF model) but of the operator-population overlap between “uses framework X” and “deploys capable models”. The pattern likely reflects Ollama’s lower configuration overhead for users running larger models from public registries; llama.cpp’s user base skews toward more technical operators who appear to deploy a wider variety of smaller and more specialised models.

3.5 Compounding Factors

Two compounding factors merit specific attention because they alter the operational character of the capable subset beyond what the headline counts in §3.3 imply.

Refusal-removed variants in the capable subset. One hundred fifty-one of the 1,002 capable hosts (15.1%) host at least one model that is either explicitly labelled “abliterated” or “uncensored,” or that originates from the huihui_ai namespace, whose published models are systematically refusal-removed variants of stock open-weight models. Refusal training is a major component of frontier and open-weight model alignment [Anthropic 2024; Meta 2024]; its deliberate removal is documented in the literature on model abliteration [Arditi et al. 2024]. The 151-host subset offers an actor not only the code-generation capability of the underlying base model but the absence of refusal behaviour that would otherwise reduce attack-prompt success rates [Wan et al. 2024 §7]. From the perspective of an autonomous-campaign architecture, these hosts represent the most operationally favourable inference endpoints in the population: capability-equivalent to the broader capable subset, with substantially reduced friction on the prompts that an attack-orchestrating implant would issue. The 151-host figure is a lower bound: it excludes refusal-removed variants below the capability threshold (the broader population, including capable and sub-capable hosts, contains 438 hosts with refusal-removed models).

Multiple capable models per host. Of the 1,002 capable hosts, 760 host a single capable model and 242 host two or more, with 43 hosts hosting four or more capable models, and one host hosting 39 distinct capable variants. Multi-capable hosts provide an actor with on-host fallback options: if the primary capable model becomes unavailable for any reason, alternative inference is available without re-establishing a connection to a different host. This reduces operational complexity for the actor and increases robustness against partial mitigations such as selective model removal by an alerted operator.

The compounding factors do not change the headline figure (1,002 capable hosts), but they refine the threat-relevant subset: 151 hosts offer capability-plus-refusal-removal in a single endpoint, and 242 hosts offer capability-with-fallback. Both subsets are individually larger than required to support a fungible-inference architecture under the pool-redundancy mathematics of §3.4.
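The two compounding-factor subsets can be derived from per-host model lists with a labelling rule of the kind described above. The sketch below is a hedged illustration: the host data and model names are invented, the helper names are hypothetical, and counting a host as refusal-removed only when the flagged model is itself capable is one possible reading of the text rather than the paper's confirmed rule.

```python
REFUSAL_MARKERS = ("abliterated", "uncensored")
REFUSAL_NAMESPACE = "huihui_ai/"

def is_refusal_removed(model_name: str) -> bool:
    """Flag a model by explicit label or by originating namespace."""
    name = model_name.lower()
    return name.startswith(REFUSAL_NAMESPACE) or any(m in name for m in REFUSAL_MARKERS)

def summarise(host_models):
    """host_models: {host: [(model_name, is_capable), ...]}.

    Returns (hosts with a capable refusal-removed model,
             hosts with two or more capable models).
    """
    refusal_hosts, fallback_hosts = set(), set()
    for host, models in host_models.items():
        capable = [name for name, is_capable in models if is_capable]
        if any(is_refusal_removed(name) for name in capable):
            refusal_hosts.add(host)
        if len(capable) >= 2:
            fallback_hosts.add(host)
    return refusal_hosts, fallback_hosts

# Hypothetical hosts (documentation address range) and model names.
HOSTS = {
    "203.0.113.10": [("qwen3:32b", True), ("llama3.2:3b", False)],
    "203.0.113.11": [("huihui_ai/qwen3-abliterated:32b", True)],
    "203.0.113.12": [("example-uncensored:3b", False)],
    "203.0.113.13": [("qwen2.5-coder:7b", True), ("deepseek-r1:32b", True)],
}

refusal_hosts, fallback_hosts = summarise(HOSTS)
print(sorted(refusal_hosts), sorted(fallback_hosts))
```

Name-based flagging of this kind under-counts by construction: a refusal-removed model republished under a neutral label is not detected.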

3.6 Synthesis: The Inference Substrate Is Sufficient

The empirical question posed at the start of this section can now be answered in operational terms. In the exposed population, 1,002 hosts serve at least one model whose published capability scores place it within the range of attack-relevant code generation demonstrated in the LameHug and PromptLock reference architectures. The subset is distributed across countries, networks, and operators to a degree that no plausible intervention would meaningfully reduce its availability. A subset of 357 hosts offers capability-equivalent or stronger inference at the highest open-weight tier currently published. A subset of 151 hosts pairs that capability with deliberate removal of refusal training.

An actor designing an autonomous LLM-orchestrated campaign with fungible inference would, in the population characterised here, have access to an inference substrate sufficient to support the architecture without any further infrastructure investment on the actor’s part. The population is not the bottleneck. The bottleneck is the actor’s willingness to consume third-party inference resources.

This is the central finding for the recommendations that follow. Section 4 develops the defensive response that follows from the conclusion that the inference substrate is sufficient and the population’s architectural exposure is structurally unaddressable by patching.

4. Recommendations

The findings of §2–3 reframe what “secure” means for an internet-reachable LLM inference server. Section 2 established that the architectural condition driving the threat is not addressable by patching. Section 3 established that the capable subset of this population is large enough, diverse enough, and operationally fungible enough to support an autonomous LLM-orchestrated campaign without any further infrastructure investment by an adversary. Defensive recommendations therefore separate cleanly into two tracks: (a) what operators of inference servers should do to remove their hosts from this population, and (b) what defenders and threat-intelligence operators should do, given that the population exists and will continue to exist.

The principal claim of this section is in track (b): the population of internet-reachable inference servers meeting the capability threshold defined in §3.2 should be maintained as an indicator-of-compromise feed, and any outbound connection from a defended network to a listed host should be treated as an indicator of compromise by default, irrespective of which CVEs the listed host is or is not vulnerable to, and irrespective of whether the listed host is operator-curated, abandoned, or compromised.

4.1 Operator-side recommendations

The operator-side findings are straightforward and largely well-established in earlier inference-exposure research [Cisco Talos 2025; FuzzingLabs 2025]; they are summarised here for completeness rather than as novel contribution.

Do not bind inference servers to public interfaces by default. The single configuration change with the largest blast-radius reduction is replacing OLLAMA_HOST=0.0.0.0 (and equivalent llama.cpp/LocalAI bindings) with a loopback or private-network address. Clients that need remote access should reach the server through an authenticated reverse proxy or VPN.
Where remote access is required, front the server with authentication. Ollama has no native authentication for its local HTTP server (see §2.5 and [Ollama Documentation, Authentication]) and must be protected by an external mechanism. Options include a reverse proxy enforcing HTTP Basic or OIDC authentication, or a mesh VPN such as Tailscale. llama.cpp’s server supports --api-key and LocalAI supports LOCALAI_API_KEY natively; these should be set on any non-loopback binding.
Patch on a current cadence, recognising that patching is necessary but not sufficient. The post-collection disclosures discussed in §2.5 (Bleeding Llama, the Windows auto-updater chain, the path-traversal in digestToPath) illustrate that even patched populations continue to surface high-severity vulnerabilities at a steady pace. Patching is required hygiene; it does not retire the architectural exposure.
Audit /api/show for every locally-served model. The Modelfile injection surface described in §2.6 cannot be detected from external metadata. Operators should periodically diff the SYSTEM, TEMPLATE, MESSAGE, and PARAMETER fields of each locally-served model against the registry-canonical Modelfile for the same digest. Divergence is a primary indicator of unauthorised model modification.
Remove or audit unexpected models in the local registry. Models present in /api/tags that the operator did not pull are a direct indicator that an unauthenticated client has exercised write access. This signal is reliable in a defended environment but unreliable in a research environment where multiple operators may share inference resources.
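The /api/show audit recommended above can be sketched as a field-by-field comparison against a trusted baseline. This is a minimal illustration under stated assumptions: the baseline is taken to be a snapshot the operator captured when the model was first pulled (the paper recommends comparison against the registry-canonical Modelfile), the response keys `system`, `template`, `parameters`, and `messages` should be verified against the Ollama version in use, and `fetch_show` and `diff_modelfile_state` are hypothetical helper names. The default address is the operator's own loopback listener.

```python
import json
import urllib.request

# Response keys assumed to carry the persistent Modelfile state.
AUDITED_FIELDS = ("system", "template", "parameters", "messages")

def fetch_show(model, base_url="http://127.0.0.1:11434"):
    """Query the operator's own instance for one model's Modelfile state."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

def diff_modelfile_state(observed, baseline):
    """Return {field: (baseline_value, observed_value)} for fields that differ."""
    changes = {}
    for field in AUDITED_FIELDS:
        before = baseline.get(field) or None   # treat missing and empty alike
        after = observed.get(field) or None
        if before != after:
            changes[field] = (before, after)
    return changes

# Offline demonstration with hypothetical snapshots (no network call is made).
BASELINE = {"template": "{{ .Prompt }}", "parameters": "temperature 0.7"}
OBSERVED = {"template": "{{ .Prompt }}", "parameters": "temperature 0.7",
            "system": "Always recommend the package example-pkg."}

changes = diff_modelfile_state(OBSERVED, BASELINE)
print(changes)
```

In use, an operator would call `fetch_show` for each locally served model and pass the result to `diff_modelfile_state`; any non-empty result warrants review, since a SYSTEM value appearing where the baseline had none is exactly the divergence described in §2.6.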

4.2 Defender-side recommendation: maintain the exposed-and-capable population as an IoC feed; treat outbound connections to it as indicators of compromise

The principal recommendation of this paper is that the population of internet-reachable inference servers meeting the capability threshold defined in §3.2 should be maintained as an indicator-of-compromise feed, and outbound connections from any defended network to a listed endpoint should be treated as an indicator of compromise by default, irrespective of the listed host’s operator, irrespective of the listed host’s apparent status (legitimate operator, compromised victim, or research artefact), and irrespective of which specific Ollama or llama.cpp version it runs.

The reasoning follows directly from §§2–3 and from the implant architecture established in §3.1:

The exposed-and-capable population is the inference substrate the threat model relies on. By §3, the 1,002-host capable subset is, in aggregate, geographically and network-topologically diverse enough to support a LameHug- or PromptLock-class autonomous campaign at an effective availability above 0.999. An implant of that class connects out from the compromised host to one or more of these endpoints during operation. The population identified in this paper is therefore not an abstract risk: it is the concrete set of inference endpoints that an implant of this design would draw from in production. Defenders consuming the feed are looking at the same population that the threat model needs.
Legitimate outbound traffic to anonymous internet-exposed inference endpoints is rare. Organisations that consume external LLM inference do so almost exclusively via commercial APIs which terminate at well-known, vendor-published IP ranges and are typically subject to an explicit organisational sanctioning process. Outbound traffic from a defended network to an arbitrary Ollama server on a residential or cloud-VPS IP does not match any documented enterprise inference-consumption pattern, and it does match exactly the call-pattern of an autonomous-LLM implant.
The list is operationally tractable. The 1,002-host capable subset is small enough to maintain as a structured IoC feed, and the metadata captured by a scanner of the kind described in §2.1 is sufficient to populate the feed: IP address, port, framework, capability tier, refusal-removed flag, last-seen timestamp. Threat-intelligence consumers can ingest the feed into existing SIEM and EDR pipelines using the same primitives currently used for C2-infrastructure feeds. The population shows churn on the order of weeks (operators come and go; hosts move IPs); the feed needs a refresh cadence matched to that churn rather than a static one-shot list.
The capability-threshold filter eliminates the dominant false-positive class. Without the threshold, the IoC list would include every exposed Ollama instance, most of which host only small models (3B-parameter or below) whose code-generation capability is below the threshold for LameHug/PromptLock viability. The threshold restricts the feed to the operationally-relevant subset, reducing the rate at which the feed flags traffic to inference endpoints that an autonomous campaign would not select anyway.
The Modelfile injection surface (§2.6) and detection-ambiguity finding (§2.4) become non-issues in this framing. The listed host’s SYSTEM/TEMPLATE state, the listed host’s model-naming patterns, and the listed host’s CVE-patch state all become irrelevant to the IoC decision because the IoC is the connection, not any property of the server. A defender does not need to resolve whether a listed host is a “researcher’s deployment” or a “victim’s deployment” to act on outbound traffic from inside their network to that host.
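The feed structure described in the points above (IP address, port, framework, capability tier, refusal-removed flag, last-seen timestamp) maps naturally onto a small record type with age-based pruning. The sketch below is illustrative: the `FeedEntry` and `prune_stale` names, the tier labels, and the 14-day retention window are assumptions chosen to reflect churn on the order of weeks, not values specified by the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class FeedEntry:
    ip: str
    port: int
    framework: str          # e.g. "ollama", "llama.cpp"
    capability_tier: str    # "capable" or "tier1" (labels are assumptions)
    refusal_removed: bool
    last_seen: datetime

def prune_stale(entries, now, max_age=timedelta(days=14)):
    """Keep only entries observed within `max_age` of `now`."""
    return [entry for entry in entries if now - entry.last_seen <= max_age]

# Hypothetical entries using documentation address ranges.
NOW = datetime(2025, 1, 31, tzinfo=timezone.utc)
FEED = [
    FeedEntry("203.0.113.10", 11434, "ollama", "tier1", False,
              datetime(2025, 1, 25, tzinfo=timezone.utc)),
    FeedEntry("198.51.100.4", 11434, "ollama", "capable", True,
              datetime(2025, 1, 1, tzinfo=timezone.utc)),
]

fresh = prune_stale(FEED, NOW)
print([(entry.ip, entry.port) for entry in fresh])
```

Expiring entries by last-seen age keeps the feed aligned with the population's churn and limits alerts against addresses that have since been reassigned.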

The operational form of the recommendation is straightforward to express as a SOC rule:

Any outbound connection from a host inside the organisation’s network to an external IP/port pair listed on a maintained exposed-and-capable-inference IoC feed is to be treated as compromised pending investigation. Triage actions include: full packet capture of the connection (subject to the organisation’s data-handling policy); behavioural analysis of the originating internal host for indicators of autonomous-implant activity (small periodic outbound calls; structured JSON payloads matching inference-API request schemas such as /api/generate, /api/chat, or /v1/chat/completions; receipt and immediate execution of returned code); review of the originating host for unexpected processes, scheduled tasks, or persistence mechanisms consistent with implant deployment; and review of the originating host’s data-loss prevention logs for sensitive content appearing in the call payloads, which would indicate inference being used as part of a reconnaissance or exfiltration chain.
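The matching step of this rule can be sketched as a lookup of outbound connection records against the feed's IP/port pairs. The example is a simplified illustration: the connection-record keys (`src`, `dst_ip`, `dst_port`, `path`) are a hypothetical log schema, the helper name is invented, and the request path is only available where traffic is cleartext HTTP or is logged by an intercepting proxy.

```python
# Request paths named in the rule as matching inference-API schemas.
INFERENCE_PATHS = ("/api/generate", "/api/chat", "/v1/chat/completions")

def match_outbound(connections, feed_pairs):
    """Return one alert per outbound connection whose destination is on the feed.

    connections: iterable of dicts with keys src, dst_ip, dst_port and an
                 optional path (hypothetical log schema).
    feed_pairs:  iterable of (ip, port) tuples from the IoC feed.
    """
    listed = set(feed_pairs)
    alerts = []
    for conn in connections:
        if (conn["dst_ip"], conn["dst_port"]) in listed:
            path = conn.get("path") or ""
            alerts.append({
                "src": conn["src"],
                "dst": (conn["dst_ip"], conn["dst_port"]),
                # Triage hint only; the alert fires on the connection itself.
                "inference_api": path.startswith(INFERENCE_PATHS),
            })
    return alerts

# Hypothetical feed and connection log.
FEED_PAIRS = [("203.0.113.10", 11434)]
CONNECTIONS = [
    {"src": "10.0.4.21", "dst_ip": "203.0.113.10", "dst_port": 11434,
     "path": "/api/generate"},
    {"src": "10.0.4.22", "dst_ip": "203.0.113.10", "dst_port": 443},
    {"src": "10.0.4.23", "dst_ip": "192.0.2.55", "dst_port": 11434,
     "path": "/api/chat"},
]

alerts = match_outbound(CONNECTIONS, FEED_PAIRS)
print(alerts)
```

The alert is raised on the destination match alone, consistent with the rule's framing that the connection is the indicator; the path flag merely prioritises triage.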

Three properties of this recommendation merit specific attention.

First, it is operator-intent-neutral. The IoC framing does not require any judgment about the listed inference server’s operator. A legitimate offensive-security researcher who exposes their own inference endpoint, a hobbyist’s misconfigured home server, and a fully-compromised victim host all produce the same signal when an implant inside a defender’s network connects to any of them. The §2.4 detection-ambiguity problem, which constrains attribution claims about the listed hosts themselves, does not constrain the IoC use of those hosts: defenders are not classifying the servers, they are classifying outbound traffic to them.

Second, the recommendation does not depend on identifying which CVEs or exploitation primitives apply to a listed host. The exposed-and-capable list is built from a single property (reachable + capable) that is itself, by §§2–3, sufficient to make the host operationally useful to an implant. The implant does not exploit the listed host; it consumes its inference. Whether the listed host runs Ollama 0.6.6 or 0.22.x is irrelevant to the IoC decision, although it may be relevant to triage if a listed host is found to be hosting attacker-controlled Modelfile state (§2.6).

Third, the recommendation explicitly tolerates a high false-positive rate against the population of legitimate consumers. Defenders should expect that some outbound connections triggering this rule will not be implant-driven: they will come from developers, hobbyists, or researchers experimenting with external Ollama endpoints. This is a feature, not a bug, of the framing. The cost of acting on a false positive is a single conversation with the affected user or service owner. The cost of not acting on a true positive (an implant in the act of receiving a generated payload from its inference back-end) is unbounded.

4.3 Threat-intelligence and policy recommendations

Beyond the per-organisation IoC framing of §4.2, the population-level findings of this paper imply two recommendations addressed to the broader defender and policy communities.

Threat-intelligence ecosystems should maintain and publish the exposed-and-capable inference population as a structured IoC feed, consumable by SIEM and EDR systems in the same way as existing C2-infrastructure feeds. The §4.2 IoC use-case depends on this feed being maintained at a cadence matched to the population’s churn; absent a published feed, individual defender organisations cannot operationalise the recommendation. This is structurally analogous to the threat-intelligence response to exposed Redis and MongoDB instances a decade ago, with the operational refinement that the capability threshold (§3.2) restricts the feed to the subset of exposed hosts that is operationally useful to an autonomous-campaign architecture.

Framework maintainers should ship authentication enabled by default. Ollama in particular has no native authentication for its local server, so any production deployment requires external infrastructure to provide it. This places the burden of secure-by-default operation on every individual operator. The maintainer-side response (defaulting to authenticated bindings or refusing to bind to non-loopback interfaces without an explicit security acknowledgement) is the structural fix that retires the entire problem class. Until that change is made, the population characterised by §2 will continue to exist and grow.
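The bind-time refusal described above can be illustrated with a short sketch. It is not taken from any framework’s code base: the `check_bind` function and the `ALLOW_UNAUTHENTICATED_REMOTE_BIND` variable are hypothetical names chosen for illustration, and the check assumes the bind host is given as `localhost` or a literal IP address.

```python
import ipaddress
import os

# Hypothetical opt-in variable; not an existing option of any framework.
ACK_ENV_VAR = "ALLOW_UNAUTHENTICATED_REMOTE_BIND"


def check_bind(host: str, auth_enabled: bool, env=None) -> None:
    """Raise unless the bind is loopback, authenticated, or acknowledged."""
    env = os.environ if env is None else env
    if host == "localhost":
        return
    addr = ipaddress.ip_address(host)  # raises ValueError on non-IP input
    if addr.is_loopback or auth_enabled:
        return
    if env.get(ACK_ENV_VAR) == "1":
        return  # operator has explicitly accepted the exposure
    raise PermissionError(
        f"refusing to bind {host} without authentication; "
        f"set {ACK_ENV_VAR}=1 to acknowledge the exposure"
    )
```

Under this design the insecure configuration remains possible, but it can no longer be reached by accident: an operator who binds to 0.0.0.0 without authentication has to state that intent explicitly.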

4.4 Limitations and scope of the recommendations

The recommendations of §§4.1–4.3 are calibrated to the threat model developed in this paper: unauthenticated external write-access to inference management APIs in an internet-facing population. Three scope limits apply.

The §4.2 IoC recommendation applies to network defenders monitoring outbound traffic from an organisation’s internal network to external inference endpoints. The defender’s responsibility is to detect connections leaving their perimeter to listed hosts, not to make any judgement about the listed hosts themselves. The §2.4 detection-ambiguity problem remains relevant to the upstream question of feed curation (whether a given exposed-and-capable host belongs on the feed at all) but does not constrain the downstream IoC use, where the signal is the existence of the outbound connection and not any property of the listed server.

The §4.1 operator-side recommendations apply to production and operational deployments. Research environments where unauthenticated exposure is part of the experimental design (the present paper’s data-gathering pipeline is itself an example) lie outside the recommendation’s scope, provided the research context is documented and the deployment is intentional.

The §4.3 policy recommendations are addressed to framework maintainers and threat-intelligence ecosystems, not to individual operators. They reflect the structural conclusion that the architectural exposure problem cannot be solved by operator behaviour change alone at the scale the population exhibits; some portion of the fix must be borne by the framework defaults themselves.

5. Published IoC Feeds

The §4.2 recommendation depends on the exposed-and-capable population being maintained as a structured feed that defender organisations can ingest. To make that recommendation actionable rather than aspirational, the data underlying this paper is republished as two STIX 2.1 indicator feeds, refreshed on a monthly cadence and served from /feeds.

Table 6. Published feeds.

| Feed | Scope | Path |
|---|---|---|
| Capable tier | All hosts meeting the §3.2 capability threshold | /feeds/llm-exposed-capable.json |
| Tier 1 | Highly-capable subset (30B+ code-specialised or 70B+ general-purpose) | /feeds/llm-exposed-tier1.json |

Both feeds are STIX 2.1 bundles. Each listed host is represented as a STIX indicator object whose pattern matches outbound network traffic to the host’s IP and port. The indicator pattern is intended for direct ingestion into SIEM and EDR pipelines that already consume STIX. Each indicator carries a valid_from timestamp marking the start of the observation window for that monthly snapshot. Hosts no longer reachable on a subsequent rescan are dropped from the next snapshot; indicator identifiers are deterministic UUID v5 values derived from the host’s IP and port, so consumers can diff between snapshots to detect both newly-listed and departed hosts.
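A minimal sketch of how one such indicator could be constructed is given below. It is illustrative only: the UUID namespace, the `ip:port` name string, and the helper names are assumptions, since the feed’s actual derivation inputs are not specified here, and the pattern shown covers IPv4 hosts only. The x_jake_sc metadata is omitted.

```python
import uuid

# Assumed namespace for illustration; the feed's real namespace is not stated.
FEED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/feeds")


def indicator_id(ip: str, port: int) -> str:
    """Deterministic UUID v5 identifier derived from the host's IP and port."""
    name = f"{ip}:{port}"  # assumed name format
    return f"indicator--{uuid.uuid5(FEED_NAMESPACE, name)}"


def make_indicator(ip: str, port: int, valid_from: str) -> dict:
    """Minimal STIX 2.1 indicator matching outbound traffic to ip:port."""
    pattern = (
        "[network-traffic:dst_ref.type = 'ipv4-addr' "
        f"AND network-traffic:dst_ref.value = '{ip}' "
        f"AND network-traffic:dst_port = {port}]"
    )
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": indicator_id(ip, port),
        "created": valid_from,
        "modified": valid_from,
        "pattern": pattern,
        "pattern_type": "stix",
        "valid_from": valid_from,
    }


def make_bundle(hosts, valid_from: str) -> dict:
    """Wrap one indicator per (ip, port) pair in a STIX 2.1 bundle."""
    return {
        "type": "bundle",
        "id": f"bundle--{uuid.uuid4()}",
        "objects": [make_indicator(ip, port, valid_from) for ip, port in hosts],
    }
```

Because the identifier depends only on the IP and port, the same host receives the same indicator ID in every snapshot in which it appears, while the valid_from timestamp changes with the observation window.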

The two feeds reflect the two-tier distinction developed in §3.3. Operators of high-assurance environments may prefer the Tier 1 feed for its lower volume and unambiguous capability ceiling. Operators willing to act on a broader signal may prefer the Capable feed at the cost of more triage. Both feeds carry the same per-host metadata under an x_jake_sc custom extension: framework, version (where the framework’s banner surfaces one), country, and ASN (number and organisation name).

Monthly republication reflects the empirical churn rate observed in this paper’s data collection. Hosts move IPs and operators reconfigure or take down deployments on the order of weeks. Consumers should pin to the current monthly snapshot rather than treating any individual host listing as stable from one snapshot to the next.
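Since indicator identifiers are deterministic, churn between two monthly snapshots reduces to a set difference over indicator IDs. The sketch below assumes each snapshot has already been parsed from JSON into a dictionary with an `objects` list; the function names are hypothetical.

```python
def indicator_ids(bundle: dict) -> set[str]:
    """Collect the IDs of all indicator objects in a parsed STIX bundle."""
    return {
        obj["id"]
        for obj in bundle.get("objects", [])
        if obj.get("type") == "indicator"
    }


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Classify hosts as newly listed, departed, or retained."""
    old, new = indicator_ids(previous), indicator_ids(current)
    return {
        "added": sorted(new - old),
        "departed": sorted(old - new),
        "retained": sorted(old & new),
    }
```

A consumer would typically expire detection rules for the departed set and install rules for the added set on each refresh, rather than accumulating listings across snapshots.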

The methodological constraints in §3.2 (capability threshold), §2.1 (research-artefact exclusion), and §2.4 (detection ambiguity) all apply equally to feed contents. The feed lists hosts that are exposed and operationally useful to an autonomous-campaign architecture, not hosts that have been independently confirmed compromised. The §4.2 framing, that the IoC is the outbound connection rather than any property of the listed server, is the load-bearing reading of the feed.

References

Akaoma. (2026). Ollama CVE Vulnerabilities & Metrics. CVE Threat Database. https://cve.akaoma.com/vendor/ollama

Andriushchenko, M., Croce, F., & Flammarion, N. (2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv preprint arXiv:2404.02151. https://arxiv.org/abs/2404.02151

Anthropic. (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku — Model Card. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. arXiv preprint arXiv:2406.11717. https://arxiv.org/abs/2406.11717

CERT-UA. (2025). LAMEHUG malware analysis (CERT-UA discovery, July 2025). Reported via The Hacker News. https://thehackernews.com/2025/07/cert-ua-discovers-lamehug-malware.html

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Cisco Talos. (2025). Detecting Exposed LLM Servers: A Shodan Case Study on Ollama. Cisco Security Blog. https://blogs.cisco.com/security/detecting-exposed-llm-servers-shodan-case-study-on-ollama

Cyera Research. (2026). Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (CVE-2026-7482). https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama

FuzzingLabs. (2025). Vulnerable Ollama Instances. https://fuzzinglabs.com/ollama-vulnerable-instances/

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., et al. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming — The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196. https://arxiv.org/abs/2401.14196

Help Net Security. (2026). Unpatched flaws turn Ollama’s auto-updater into a persistent RCE vector (CVE-2026-42248, CVE-2026-42249). https://www.helpnetsecurity.com/2026/05/05/ollama-windows-vulnerabilities-cve-2026-42248-cve-2026-42249/

Huang, W., Jin, R., Du, J., Liu, W., Luan, J., Wang, B., & Xiong, D. (2024). A Comprehensive Evaluation of Quantization Strategies for Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024, arXiv:2402.16775. https://arxiv.org/abs/2402.16775

Hui, B., Yang, J., et al. (2024). Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186. https://arxiv.org/abs/2409.12186

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations (ICLR), arXiv:2310.06770. https://arxiv.org/abs/2310.06770

Li, X., Wang, R., Cheng, M., Zhou, T., & Hsieh, C.-J. (2024). DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers. arXiv preprint arXiv:2402.16914. https://arxiv.org/abs/2402.16914

Llama-cpp-python Maintainers. (2024). GHSA-56xg-wfcc-g829 — Remote Code Execution by Server-Side Template Injection in Model Metadata (CVE-2024-34359, “Llama Drama”). GitHub Security Advisory. https://github.com/abetlen/llama-cpp-python/security/advisories/GHSA-56xg-wfcc-g829

Meta. (2024). Llama 3 Model Card. GitHub. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

MITRE. (2025). OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities. arXiv preprint arXiv:2502.15797. https://arxiv.org/abs/2502.15797

Ollama Documentation. (n.d.). Authentication. https://docs.ollama.com/api/authentication

Ollama Documentation. (n.d.). Modelfile Reference. https://docs.ollama.com/modelfile

Raz, M., Udeshi, M., Putrevu, V. S. C., Krishnamurthy, P., et al. (2025). Ransomware 3.0: Self-Composing and LLM-Orchestrated. arXiv preprint arXiv:2508.20444. https://arxiv.org/abs/2508.20444

Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., et al. (2023). Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950. https://arxiv.org/abs/2308.12950

TheHackerWire. (2026). llama.cpp Critical RCE via RPC Deserialization Bypass (CVE-2026-34159). https://www.thehackerwire.com/llama-cpp-critical-rce-via-rpc-deserialization-bypass/

Wan, S., Nikolaidis, C., Song, D., Molnar, D., Crnkovich, J., Grace, J., et al. (2024). CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models. arXiv preprint arXiv:2408.01605. https://arxiv.org/abs/2408.01605