GPT-4.1 EU Data Zone Standard: intermittent token corruption in Dutch output — English insertion, gibberish, truncation

Question

Tom Huibers 150 Reputation points

We are experiencing intermittent but severe output quality degradation with GPT-4.1 on Data Zone Standard (EU). We generate Dutch-language structured reports and load-balance across multiple EU Azure OpenAI instances.

Symptom

A subset of production requests produces corrupted output. We have classified the corruption into 4 distinct categories:

1. Random English insertion: valid English words injected into Dutch text. Examples: "luxury", "pipeline", "timeline", "prestige", "cup", "deep"
2. Gibberish / corrupted tokens: non-words or garbled text. Examples: "musteraan", "plumeert", "jeae", "Êr", "resolootje"
3. Truncated / abbreviated words: words cut off mid-way. Examples: "Onbek vac.", "NBermogen", "sup.", "gem", "pn"
4. Word substitution: wrong Dutch words or grammar. Examples: "de has", "based" substituted for correct Dutch
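For anyone who wants to flag suspect responses automatically, here is a minimal sketch of a vocabulary check. It is an illustration only, not what we run in production: `KNOWN_WORDS` is a hypothetical allowlist (in practice a full Dutch word list plus your domain terms), and the sample sentences are made up, with "pipeline" and "musteraan" borrowed from the examples above.

```python
import re

# Hypothetical allowlist. A real one would be a full Dutch word list plus
# domain-specific terms; these few entries exist only for the demo.
KNOWN_WORDS = {"het", "rapport", "is", "klaar"}

def unknown_words(text, vocabulary=KNOWN_WORDS):
    """Return the words in `text` that are missing from `vocabulary`, in order."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in vocabulary]

def looks_corrupted(text, vocabulary=KNOWN_WORDS, max_unknown=0):
    """Flag a response that contains more unknown words than allowed."""
    return len(unknown_words(text, vocabulary)) > max_unknown

clean = "Het rapport is klaar."
suspect = "Het rapport pipeline is musteraan klaar."
flagged = unknown_words(suspect)
```

A check like this can only catch inserted English words, non-words and truncated fragments. Substitutions that are themselves valid Dutch words will pass, and names or legitimate loanwords will be false positives unless they are added to the allowlist.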

Scale

Multiple affected production sessions have been identified. The majority of traffic is handled correctly; the corruption is intermittent but keeps recurring over time.

Hypothesis

The intermittent nature and the variety of corruption types suggest one or more degraded inference nodes in the EU Data Zone pool. The corruption is not prompt-related or input-related; it appears to be infrastructure-side.

Configuration

Model: gpt-4.1
Deployment type: Data Zone Standard (EU)

Priority

This is our highest-priority issue. The corruption is severe enough to fundamentally undermine trust in AI-generated output. Because we are on Data Zone Standard, we have no control over request routing and cannot mitigate this ourselves; the fix must come from Microsoft's side.

We want to ensure this is visible to the Azure OpenAI infrastructure team and is being actively investigated.

Yannic 5 Reputation points

2026-03-12T10:28:57.0966667+00:00

We are currently observing the same issue in production.

Our setup also uses GPT-4.1 on Azure OpenAI (Data Zone Standard EU) and we are seeing identical corruption patterns:

random English words appearing in non-English output

garbled tokens / non-words

truncated words

incorrect word substitutions

The failures appear intermittently, while most requests are processed correctly.

This strongly suggests the problem is not prompt-related, but likely tied to specific inference nodes in the EU Data Zone pool.

For us this is also a critical production issue.

The corrupted responses make the generated output unreliable and require manual validation.

We urgently need clarification from Microsoft/Azure whether:

this issue is already identified internally

specific clusters or nodes are degraded

a mitigation or fix is in progress

If anyone from the Azure OpenAI infrastructure team can confirm investigation status, it would help a lot.
Konstantin Afanasiev 0 Reputation points

2026-03-12T10:33:30.1866667+00:00

We are experiencing exactly the same issue with GPT-4.1.

In our case the model occasionally produces corrupted tokens or strange word fragments in otherwise simple text outputs. Examples include partial words, unexpected English tokens in German sentences, or random artifacts appearing inside templates.

What we noticed is that the problem strongly correlates with temperature:

Any temperature greater than 0 causes these artifacts to appear very frequently.

With temperature = 0 the issue mostly disappears, but we still occasionally observe corrupted tokens.
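If others want to check the same correlation in their own logs, a minimal sketch could look like this. It assumes you already record, per request, the temperature used and whether the output was judged corrupted; the `records` below are invented demo values, not our measurements.

```python
from collections import defaultdict

def corruption_rate_by_temperature(records):
    """Compute the share of corrupted responses per temperature setting.

    `records` is an iterable of (temperature, was_corrupted) pairs,
    e.g. extracted from your own request logs.
    """
    totals = defaultdict(int)
    corrupted = defaultdict(int)
    for temperature, was_corrupted in records:
        totals[temperature] += 1
        if was_corrupted:
            corrupted[temperature] += 1
    return {t: corrupted[t] / totals[t] for t in sorted(totals)}

# Invented demo data, only to show the shape of the input.
records = [
    (0.0, False), (0.0, False), (0.0, True), (0.0, False),
    (0.7, True), (0.7, False),
]
rates = corruption_rate_by_temperature(records)
```

Comparing these rates before and after the date the problem started would show whether the temperature dependence is new or was always there at a lower level.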

This behavior started roughly a few days ago for us. Before that, the exact same prompts and pipeline were working reliably for months.

Our use case is relatively simple (template-based responses), so the artifacts are very noticeable because the output should normally be almost deterministic.

It would be very helpful to know whether this is a known issue with GPT-4.1 in the EU Data Zone or if there were any recent backend changes that might affect token sampling.
Julio Perez 15 Reputation points

2026-03-12T11:31:30.39+00:00

We are currently experiencing the exact same issue with GPT-4.1 in the EU Data Zone, specifically affecting Spanish outputs. Until today, the model had been providing consistently strong and reliable responses using the same prompts. However, starting this morning, we began noticing unusual artifacts appearing in the generated text that clearly do not align with the rest of the output.

Given that this model is used in our production environments, we would appreciate confirmation that the Azure AI team is aware of and actively investigating this issue.
Fabian Mijsters 0 Reputation points

2026-03-12T12:07:20.1066667+00:00

Adding some diagnostic data that may help narrow this down. In one of our corrupted outputs, the sentence "De SUD attempting (Subjective Units of Disturbance) voor de herinnering was bij aanvang 9" should have read "De SUD score...". When inspecting the logprobs for the corrupted token, the top 5 alternatives were: " attempting" (-1.75), " vid" (-2.86), " ram" (-2.99), " flag" (-3.18), " half" (-3.22), none of which are valid Dutch. The correct token "score" doesn't appear in the top 5 at all, which suggests the affected node's token distribution might be fundamentally skewed, not just noisy. This might be consistent with corrupted model weights or misconfigured sampling on specific inference nodes. Additionally, the logprobs of these options are on average at least a full 1.0 lower than those of all other, valid tokens.
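To make those numbers easier to read, here is a small sketch that converts the logprobs above into probabilities and checks whether the expected token is among the alternatives. It assumes the values are natural-log probabilities; the helper names are mine, not part of any API.

```python
import math

# Top-5 alternatives reported above for the corrupted position: (token, logprob).
top_logprobs = [
    (" attempting", -1.75),
    (" vid", -2.86),
    (" ram", -2.99),
    (" flag", -3.18),
    (" half", -3.22),
]

def to_probabilities(logprobs):
    """Convert (token, natural-log probability) pairs to (token, probability)."""
    return [(token, math.exp(lp)) for token, lp in logprobs]

def expected_token_rank(logprobs, expected):
    """Return the 0-based rank of `expected` among the alternatives, or None if absent."""
    ranked = sorted(logprobs, key=lambda pair: pair[1], reverse=True)
    for rank, (token, _) in enumerate(ranked):
        if token.strip() == expected:
            return rank
    return None

probs = to_probabilities(top_logprobs)
top5_mass = sum(p for _, p in probs)        # probability mass covered by the top 5
score_rank = expected_token_rank(top_logprobs, "score")  # None: not in the top 5
```

Under that assumption the sampled token had a probability of only about 17%, and the five alternatives together cover roughly 36% of the mass. So the distribution at that position is flat and spread over unrelated tokens, rather than one wrong token being confidently preferred.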
Tom Huibers 150 Reputation points

2026-03-12T12:20:53.8433333+00:00

Thanks all for the input.

FYI: I opened a support request on this with Azure this morning. After some back and forth, the support engineer is "investigating" it. They first told me that this might just be normal non-deterministic LLM behaviour, but after some further explanation I hope to have made clear that it definitely is not.

I will keep this thread updated if any meaningful results come back.
Yannic 5 Reputation points

2026-03-12T13:01:07.65+00:00

Because the corrupted responses make the generated output unreliable for our use case, we decided to temporarily switch our production traffic to the 4.1-mini model as a short-term mitigation.
The change was made purely to stabilize production until the root cause with GPT-4.1 in the EU Data Zone is resolved.

For our use case this works well so far.

If you are able to, switching to 4.1-mini might also be a possible temporary workaround until the issue with GPT-4.1 is fixed.
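For those who would rather not move all traffic, a variation on this workaround is to fall back only when a response fails validation. The sketch below is not what we deployed (we switched completely). `primary`, `fallback` and `is_acceptable` are hypothetical callables you would wire to your own deployment calls and output checks; the stubs only simulate them.

```python
def generate_with_fallback(prompt, primary, fallback, is_acceptable):
    """Try `primary`; if its output fails `is_acceptable`, use `fallback` instead.

    `primary` and `fallback` are hypothetical callables (prompt -> text) wrapping
    whatever client calls you already make against each deployment.
    Returns (text, source) so the caller can log which deployment answered.
    """
    text = primary(prompt)
    if is_acceptable(text):
        return text, "primary"
    return fallback(prompt), "fallback"

# Demo with stub callables standing in for real deployment calls.
def stub_primary(prompt):
    return "Het rapport pipeline is klaar."  # simulates a corrupted response

def stub_fallback(prompt):
    return "Het rapport is klaar."

def no_english_insertion(text):
    # Placeholder validator; a real check would be much broader.
    return "pipeline" not in text

text, source = generate_with_fallback(
    "demo prompt", stub_primary, stub_fallback, no_english_insertion
)
```

Note that this doubles latency and cost for every rejected response and is only as good as the validator, so a full switch is simpler if the smaller model is accurate enough for your use case.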
Anshika Varshney 7,995 Reputation points Microsoft External Staff Moderator

2026-03-12T13:48:33.7366667+00:00

Hi Tom Huibers,

Thank you for bringing this to our attention. We are currently checking the details and working on resolving the issue. We’ll update you as soon as we have more information.

Thank you!
Tom Albrecht 0 Reputation points

2026-03-12T14:40:09.8166667+00:00

Hey @Anshika Varshney,
we are also experiencing these problems on multiple hosted instances within the EU Data Zone.
We generate German rather than Dutch output but see the same issues.
Are you looking into a general outage or problem with GPT-4.1 in the EU Data Zone, or specifically into Tom Huibers' instance?

Best regards
Tom Huibers 150 Reputation points

2026-03-12T15:09:24.78+00:00

We have also switched our production traffic to another model as a temporary fix. Results there are consistent, so the issue seems to be isolated to the GPT-4.1 model.

That model is, however, less accurate for our use case, so we would like to switch back as soon as possible.

Thanks for the response @Anshika Varshney, looking forward to your update.
Daniel Lönnberg 0 Reputation points

2026-03-12T15:23:07.5366667+00:00

After spending the past few hours running git bisect to find a commit where our system wasn't producing gibberish output, we finally found this thread instead. We're hoping this is highly prioritized by the Microsoft team.
Yoni 0 Reputation points

2026-03-12T15:41:03.1+00:00

We are also experiencing issues with our models, specifically when using GPT-4.1 on Data Zone Standard.
Yuval Peled 0 Reputation points

2026-03-12T15:47:47.9566667+00:00

Confirmed on our end as well. We process billions of tokens a month, and our accuracy has dropped from 98% to 40% due to this.
Ben Verhees 0 Reputation points

2026-03-12T16:24:46.94+00:00

We are experiencing the same issue, with gibberish and even some profanity:

"Je wilt shit graag laten beoordelen en eventueel laten soppen reciprocate."

The words "shit", "soppen" and "reciprocate" are unexpected in this sentence.
Omar Elbaghdadi 0 Reputation points

2026-03-12T17:00:19.3733333+00:00

We are experiencing the same issue for GPT-4.1. It's powering the main functionality of our app, so this completely breaks our functionality. It definitely doesn't look like regular non-deterministic behavior. The output is completely different from the patterns we've seen before.
Charuka Rathnayaka 0 Reputation points

2026-03-12T17:14:00.5066667+00:00

We are experiencing the same issue with GPT-4.1 in the Sweden Central region. The deployed model intermittently outputs gibberish and produces erroneous responses when answering in Norwegian. This is impacting our production workflows, and we hope this is being addressed as a high-priority task so it can be resolved quickly.
Anshika Varshney 7,995 Reputation points Microsoft External Staff Moderator

2026-03-12T18:55:09.35+00:00

Hello All Users,

We’ve raised an incident for this issue, and our team is actively working on it. Please remain patient; we’ll provide an update as soon as we receive further information.

If you have any remaining questions or additional details to share, feel free to let us know. We’ll be glad to provide further clarification or guidance.

Thank you!
Daniel Ekbom 0 Reputation points

2026-03-12T18:57:17.4066667+00:00

We are currently experiencing the exact same issue with GPT-4.1 in the EU Data Zone. This is extremely high priority for us, as it is currently badly affecting the reputation of our product.
The answers are full of strange words that do not exist. Our output is in Swedish.