
Enterprise Offline Voice-to-Text: Why Regulated Industries Need On-Device Speech Recognition

For most consumers, cloud voice dictation is fine. For an engineer at BMW writing a powertrain spec, a banker at Deutsche Bank drafting an internal memo before earnings, a clinician at the Mayo Clinic narrating a patient note, or a researcher at Roche capturing trial observations, it is not. Voice is sensitive data. This guide explains why on-device speech recognition is becoming the default for regulated industries, and what an offline-first deployment looks like in practice.

Why on-device dictation is now an enterprise requirement

Three things changed in the past 18 months. First, open-weight speech models — particularly OpenAI's Whisper family and the multilingual NeMo Parakeet models — became accurate enough on commodity laptops that the historical reason for cloud dictation (better models, more compute) effectively disappeared. Second, regulatory scrutiny on cross-border data transfers escalated: Schrems II and the EU-US Data Privacy Framework, the EU AI Act's biometric categorisation, the Digital Operational Resilience Act (DORA) in financial services, the rolling NIS2 implementation. Third, board-level interest in generative AI surfaced data-leakage concerns that legal and security teams now apply to any tool that transmits raw user content — including voice.

The result is a clear preference inside large enterprise IT and security organisations: where a tool can run locally without losing capability, it should. Voice-to-text is one of the cleanest examples because the on-device option is genuinely as accurate as the cloud option for almost every workload.

Voice as a regulated data class

GDPR — voice may be biometric, and is always personal data

Under the EU General Data Protection Regulation, recorded human voice is personal data because it can be linked to an identified or identifiable individual. Many Data Protection Officers go further and classify voice as biometric data under Article 9 when the audio could be used for speaker identification — which any modern speaker-recognition model can do. That moves voice into the special-category bucket, with its higher consent and lawful-basis bar.

Practical impact: if a German engineering team uses a US-hosted cloud dictation service, the organisation needs a transfer mechanism based on Standard Contractual Clauses (SCCs), a Data Processing Agreement (DPA), a Transfer Impact Assessment (TIA), a clear retention policy on the audio, and likely a Data Protection Impact Assessment (DPIA). None of that is required when the audio is transcribed on the laptop and discarded immediately.

HIPAA — voice notes routinely contain PHI

In US healthcare, audio notes captured by clinicians, scribes, and researchers usually contain Protected Health Information. Cloud dictation requires a Business Associate Agreement with the speech vendor and any sub-processors, plus the standard administrative, physical, and technical safeguards. The on-device alternative removes the BAA chain on the speech path because PHI never reaches a Business Associate.

DORA, MiFID II, and SOX — banking and finance

Financial services firms operate under multiple overlapping regimes. DORA expects them to demonstrate operational resilience, including ICT third-party risk management. MiFID II governs records of relevant conversations. SOX governs internal control over financial reporting. None of these forbids cloud dictation, but each adds friction to any vendor relationship that processes raw voice. On-device transcription leaves a much shorter audit trail.

Sector-specific export and IP rules

Defence, aerospace, semiconductor, and certain energy firms operate under export-control regimes (ITAR, EAR, EU dual-use). Speaking technical content into a third-party speech API is, depending on jurisdiction and audience, an export. On-device transcription avoids the question.

Automotive — where engineering speech is competitive intelligence

Automotive groups — BMW, Mercedes-Benz, Volkswagen, Stellantis, Ford, Toyota, Honda — run heavy engineering organisations whose internal voice traffic includes prototype performance numbers, supplier negotiations, regulatory dossiers, and pre-launch product detail. The default posture in most automotive IT shops today is that anything spoken about a vehicle programme should not transit a consumer-grade cloud service. Engineers who want voice-to-text — and many do, because narrating a requirement is faster than typing it — are told either to turn off cloud dictation or to go without it entirely. An on-device alternative changes that conversation: the productivity gain becomes available without a fresh compliance argument every time.

Tier-1 suppliers (Bosch, Continental, ZF, Magna, Denso) sit in the same world. So do the semiconductor and battery suppliers (Samsung SDI, CATL, Northvolt) whose roadmaps are themselves sensitive. The rule of thumb across the sector is identical: if the audio leaves the device, the audio is a problem.

Banking & finance — MNPI, internal memos, and "no audio off-endpoint"

Investment banks (Goldman Sachs, Morgan Stanley, JPMorgan, Deutsche Bank, BNP Paribas, UBS) and large asset managers (BlackRock, Fidelity, Vanguard) typically operate under explicit policies that prevent Material Non-Public Information from leaving the corporate endpoint via consumer cloud services. Voice notes about an upcoming deal, a draft research note, an internal credit memo, or a pre-earnings analyst preview all fit that category. The dictation tool that survives a bank's information-security review is one whose data flow is "audio is captured, transcribed locally, discarded — text stays in the same controlled application as before".

Healthcare & pharma — PHI, trial data, lab notes

Hospital systems and academic medical centres (Mayo Clinic, Cleveland Clinic, Charité, Karolinska) and large pharma (Pfizer, Roche, Novartis, AstraZeneca, GSK, Merck) have well-established workflows for managing PHI and trial data. Cloud dictation is permitted under BAA with specific vendors, but friction is real: every new project, every new ward, every new investigator site is another approval. On-device dictation collapses that approval surface. Many institutions are now moving routine clinician note-taking to local-only models for exactly this reason.

Public sector & defence

Government and defence (NATO members, Five Eyes partners, ENISA-listed agencies, EU institutions) operate at higher classifications than commercial enterprise. For anything above unclassified, cloud dictation is simply not permitted. On-device transcription is the only model that fits, and even there the typical requirement is air-gap-friendly: the speech model must run without phoning home, without telemetry, without an account check.

Reference architecture for an offline dictation rollout

The shape that has become standard for enterprise on-device dictation in 2026 looks like this (a minimal code sketch follows the list):

  1. Endpoint-only audio. The microphone is captured locally. The audio buffer never leaves the device. No analytics, no waveform sent for "model improvement".
  2. Local Whisper-class model. The model is shipped with the application or downloaded once at install. It runs on CPU or GPU on the endpoint. Inference is sub-second for short utterances on modern silicon.
  3. Text injection at the cursor. The transcribed text is typed into whatever app already has focus, using the OS keyboard injection API. The dictation tool does not need to integrate with the receiving app.
  4. Optional AI rewrite via BYOK. If a "polish this paragraph" step is wanted, the request goes from the endpoint directly to the customer's chosen AI provider using the customer's API key. The dictation vendor is not in the data path.
  5. No required server-side state. Authentication and licence checks may run against the vendor's servers, but the dictation path is independent of them. Air-gap mode is supported.
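A minimal sketch of steps 1 through 3, assuming the open-source faster-whisper and pynput libraries on the endpoint. The library choice, the fixed-length capture, and the model size are illustrative assumptions, not a description of AirTypes internals:

    import sounddevice as sd
    from faster_whisper import WhisperModel
    from pynput.keyboard import Controller

    SAMPLE_RATE = 16000  # Whisper-class models expect 16 kHz mono audio

    # Loaded once at startup; inference runs entirely offline on the endpoint.
    model = WhisperModel("small", device="cpu", compute_type="int8")
    keyboard = Controller()

    def dictate(seconds: float = 5.0) -> None:
        # 1. Endpoint-only audio: capture into a local, in-memory buffer.
        audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        # 2. Local model: the buffer is handed to the in-process model and
        #    never leaves the machine.
        segments, _info = model.transcribe(audio.flatten())
        text = "".join(segment.text for segment in segments)
        # 3. Text injection at the cursor via OS-level keyboard emulation;
        #    the receiving application needs no integration.
        keyboard.type(text)
        # The audio buffer goes out of scope here; nothing is retained.

A real implementation would use push-to-talk or voice-activity detection rather than a fixed five-second window, but the data flow is the point: audio in, text out, nothing off-device.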

This pattern is easy to defend in a security review: there is no third-party processor of voice data, no cross-border transfer of voice data, no retention of voice data anywhere except the customer's endpoint (and there only transiently, in memory).

How AirTypes fits the enterprise pattern

AirTypes was built around the architecture above. Whisper runs locally on the user's machine. Audio is captured, transcribed, and discarded inside the same process. Text is injected into the active application via OS keyboard injection. There is no plugin, no add-in, no Office or Slack integration — so the receiving app's vendor is not part of the deal either.

The optional My Agent AI rewrite step is BYOK by design. The customer plugs in their own OpenAI, Anthropic, Google Gemini, OpenRouter, Groq, or Ollama endpoint. The request goes from the endpoint directly to the customer's chosen provider using the customer's key. AirTypes is not a proxy, not a man-in-the-middle, not a logger.
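For concreteness, here is a sketch of the BYOK call shape, assuming an OpenAI-compatible chat-completions endpoint (OpenAI, OpenRouter, Groq, and Ollama all expose one). The function name, prompt, and endpoint values are illustrative:

    import requests

    def polish(text: str, base_url: str, api_key: str, model: str) -> str:
        # The request goes straight from the endpoint to the customer's chosen
        # provider, authenticated with the customer's own key. No vendor proxy.
        response = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model,
                "messages": [
                    {"role": "system",
                     "content": "Polish this dictated paragraph. Keep the meaning."},
                    {"role": "user", "content": text},
                ],
            },
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    # Example: a customer pointing at their own local Ollama instance, so the
    # rewrite step never leaves the machine either.
    # polished = polish(draft, "http://localhost:11434/v1", "ollama", "llama3.1")

Anthropic and Gemini use their own request shapes, but the property that matters is identical: the customer's key, the customer's endpoint, no intermediary.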

Linux and macOS are available today, with Windows in active development. For deployments standardised on Windows, the same offline-first architecture applies — Windows users can register on the download page for the early-access build.

A short risk-review checklist for procurement

If you are evaluating any voice dictation tool for an enterprise rollout, the following checklist tracks the questions that typically come up in security and procurement reviews:

  • Where is audio captured? (Should be: on the endpoint, never sent.)
  • Where is audio transcribed? (Should be: on the endpoint.)
  • Is there any telemetry, analytics, or "model improvement" pipeline that touches audio? (Should be: no.)
  • Does the tool require an internet connection to transcribe? (Should be: no.)
  • Does the tool support air-gapped or offline-only operation? (Should be: yes.)
  • If an AI rewrite step is included, does it use the customer's own provider key? (Should be: yes — BYOK.)
  • Is the vendor in the data path between the endpoint and the AI provider? (Should be: no.)
  • Is text injection done via the OS or via per-app integrations? (OS-level avoids per-app vendor reviews.)
  • Is the licence/auth path independent from the dictation path so dictation works under network outage? (Should be: yes.)

FAQ

Is voice considered personal data under GDPR?

Yes. Recorded voice that can be linked to an identified or identifiable individual is personal data. Many DPOs further treat voice as biometric special-category data under GDPR Article 9 when it could be used for speaker identification, which raises the lawful-basis bar.

Does on-device transcription eliminate GDPR obligations entirely?

No — the resulting text is still personal data and must be handled according to the organisation's normal data-handling policy. What on-device transcription removes is the cross-border transfer of audio, the third-party speech processor relationship, and the retention question on the audio path.

Are open-weight Whisper models accurate enough for technical enterprise vocabulary?

Yes. Whisper Large-V3 and equivalent models reach near-human accuracy on technical English and most major European and Asian languages. For specialised vocabularies (medical coding, ticker symbols, internal product code names) a small custom vocabulary or post-processing layer closes any remaining gap.
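One common shape for that post-processing layer is a canonicalisation pass over the transcript. The vocabulary below is hypothetical, purely to show the mechanism:

    import re

    # Hypothetical domain vocabulary: frequent mis-hearings mapped to
    # canonical enterprise terms.
    CUSTOM_VOCAB = {
        r"\bmy sequel\b": "MySQL",
        r"\bicd ten\b": "ICD-10",
        r"\beuro stoxx\b": "EURO STOXX",
    }

    def apply_vocab(transcript: str) -> str:
        # Case-insensitive replacement of each known mis-hearing.
        for pattern, canonical in CUSTOM_VOCAB.items():
            transcript = re.sub(pattern, canonical, transcript,
                                flags=re.IGNORECASE)
        return transcript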

Can offline voice tools run on shared or virtual desktops (VDI)?

Yes, with caveats. The microphone has to be passed through to the VDI session, and the model has to run inside the session. Modern VDI platforms (Citrix, Azure Virtual Desktop, VMware Horizon) all support audio redirection, and Whisper-class models run comfortably on the CPU allocations typical of knowledge-worker VDI.

Is AirTypes air-gap compatible?

The dictation path is. Whisper runs locally and does not require an internet connection. Licence activation requires connectivity at activation time, but ongoing dictation continues without network access in the modes designed for that environment.

Evaluate AirTypes for your enterprise dictation rollout

Linux & macOS available now. Windows in development. Talk to us about volume licensing.

See deployment options