
Making a perfect recording button. A simple yet complex thing.

20 May 2026 · Dmitriy Tseyler

Intro

From the start of XSpeak, I wanted it to provide the best possible feel for the user: simple, fast, and responsive. Since it's a recording app, one of its main components is the button the user presses to record a conversation.

Actually, it's a control we're used to in many apps. The simplest example is the standard Voice Memos app on iPhone.

The button looks simple and does two things: starts recording and stops recording. However, behind the scenes, it takes many steps to start the whole pipeline, which is far from simple. In this article, I'm going to cover the technical and usability aspects of a recording button and explain why it's important to make it work perfectly and why that's harder than it looks.

Perfect button

I'd call the recording button perfect if it is:

  1. Functional
  2. Responsive

When I say functional, I mean that it should start and stop recording whenever it's enabled and I press it. When I say responsive, I mean that the button should indicate that the state has switched. And even though the underlying operation usually takes hundreds of milliseconds, that feedback should be immediate.

Let's imagine that feedback is not immediate and there's some progress between states. The imperfect case looks like this:

  1. The user presses the button.
  2. The button is disabled, shows a progress state, or simply doesn't react.
  3. After some time, the button is functional again, and recording has started.

This discomfort might feel minor to some people. However, when the user does it many times a day, it adds up to significant overhead. And that's not what we should expect from a tool that is meant to help.

Two qualities define a perfect recording button: it must be functional and responsive. The press should never wait on the pipeline.

Behind the scenes

Why is starting a recording not that simple? When you press the button, XSpeak does the following:

  1. Checks microphone and system audio permissions.
  2. Shows a disclaimer reminding you to get consent from participants.
  3. Checks if another recording is in progress and asks if you want to interrupt it.
  4. Checks if the AI model and speech assets for the chosen language are available.
  5. Tries to reserve the chosen language locale.
  6. Checks if any meeting recording is playing now, and stops it.
  7. Starts microphone recording.
  8. Waits a bit to prevent races in CoreAudio and starts System Audio recording.
  9. Starts the mixer, which mixes system audio with microphone input to produce a single audio stream for transcription.
  10. Resolves the transcriber audio format.
  11. Creates and starts the transcriber.

All these operations happen asynchronously. This means that we give up control at each one, and by the time we resume, the world could have changed: the user might have pressed the button several more times, previously available resources might have become unavailable, and so on.

Besides that, starting a recording launches side management threads that restart the mixer to prevent drift between the two sources and restart the transcriber to prevent model context overflow.

Quite a start, isn't it? Probably, after that, your perception of this simple button will change, sorry for that :)

Let's see how different apps manage this or similar complexity.

Examples

iPhone Voice Memos

Voice Memos on iOS 26.5: stopped → started → stopped

When I press start, it starts. When I press stop, it stops. Nothing more.

Otter

Otter 1.4.2: stopped → delay → progress → started → delay → stopped

As you can see, the button becomes disabled while it starts recording. This makes me slightly uncomfortable every time I press it. It feels unresponsive and heavy. And I need to wait before I can stop recording.

Talat

Talat 0.11.5: stopped → progress → started → progress → stopped

The button is disabled while recording is starting. The good thing is that the recording start is quite fast here. However, it still leaves a slight feeling of unresponsiveness.

MacWhisper

MacWhisper 13.21.1: stopped → delay → started → stopped

There's a slight delay between the press of the start button and the appearance of the stop button. Also, the button changes its position after I start recording, which requires additional cognitive effort from me to find it.

Fireflies

Fireflies 0.1.30: stopped → progress → started → progress → stopped

The button is locked during start.

XSpeak

XSpeak 3.7: stopped → started → stopped

As you can see, the button reacts instantly to user action. And if you change your mind, it reacts instantly back.

  iPhone Voice Memos: Instant
  Otter: Progress state
  Talat: Progress state
  MacWhisper: Delay
  Fireflies: Progress state
  XSpeak: Instant

Implementation

I won't write a book here about all the approaches I considered and tried. Instead, I'll go from a naive approach to the solution I implemented.

Let's agree that we want instant feedback from the button and will not disable it during our startup chain. Also, let's declare our states:

Sui: What the user sees. Changes instantly on press.
Sreal: What actually happens in audio pipelines. Eventually catches up.

Each can be started or stopped. Our goal: keep them eventually consistent without ever blocking the user.

The naive approach, when the user presses the button, would be to:

  1. Change Sui to started.
  2. Launch startup pipeline.

However, the obvious problem would be a race condition. Imagine the following order of operations:

  1. MainActor: The user presses start recording
  2. MainActor: Sui is changed to started
  3. MainActor: Permissions check
  4. AudioActor: Start microphone recording
  5. MainActor: The user presses stop
  6. AudioActor: Suspended
  7. MainActor: Sui = stopped
  8. AudioActor: Suspended
  9. AudioActor: Stopping all running recordings
  10. AudioActor: Start system audio recording

In the end, we have Sui = stopped and Sreal = started.
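
This interleaving can be replayed deterministically in a minimal Swift sketch. The types AudioPipeline and RecordButton, and the whileSuspended hook that stands in for the stop press landing while start is suspended, are hypothetical names used only for illustration, not XSpeak's actual code.

```swift
enum RecordingState { case stopped, started }

// Stands in for the audio pipelines (Sreal). Hypothetical type.
actor AudioPipeline {
    private(set) var real: RecordingState = .stopped
    func startMicrophone() { real = .started }
    func startSystemAudio() { real = .started }
    func stopAll() { real = .stopped }
}

// Stands in for the button state (Sui). Hypothetical type.
actor RecordButton {
    private(set) var ui: RecordingState = .stopped
    func set(_ state: RecordingState) { ui = state }
}

// Naive start: flip Sui, then run the chain with nothing guarding the awaits.
// `whileSuspended` replays whatever else happens while start is suspended.
func naiveStart(button: RecordButton, audio: AudioPipeline,
                whileSuspended: @Sendable () async -> Void) async {
    await button.set(.started)
    await audio.startMicrophone()
    await whileSuspended()           // the user presses stop here
    await audio.startSystemAudio()   // start resumes, unaware of the stop
}

// Naive stop: flip Sui, then stop everything that is currently running.
func naiveStop(button: RecordButton, audio: AudioPipeline) async {
    await button.set(.stopped)
    await audio.stopAll()
}

// Replays the timeline above and returns the final pair of states.
func replayRace() async -> (ui: RecordingState, real: RecordingState) {
    let button = RecordButton()
    let audio = AudioPipeline()
    await naiveStart(button: button, audio: audio) {
        await naiveStop(button: button, audio: audio)
    }
    let ui = await button.ui
    let real = await audio.real
    return (ui, real)
}
```

Calling replayRace() ends with ui == .stopped and real == .started, the inconsistent pair described above: the stop ran to completion in the middle of a start that then carried on.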

We have to linearize this pipeline to prevent such races. The first thing that would help is to prevent start and stop operations from running simultaneously. We'll use a queue for that:

Op 1 → Op 2 → Op 3

Operations run one at a time, in submission order. No two operations overlap.

We also need to introduce one more state:

Sop: Target state of the operation. Sop is equal to the Sui that was set when the operation was submitted to the queue.

When we want to start or stop recording, we submit an operation to the queue. This way no two operations overlap, and each operation waits its turn. As a result, Sui always equals the Sop of the last submitted operation.
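
One way such a queue might be sketched in Swift is by chaining each submitted operation onto the task before it. SerialQueue, submit, waitUntilIdle, and the Log helper are assumed names for this illustration, and the sleeps only simulate work; the real queue in XSpeak may be built differently.

```swift
// A minimal serial queue: each operation awaits the previous one before it runs.
actor SerialQueue {
    private var tail: Task<Void, Never>?

    // Synchronous inside the actor, so the chain is extended atomically
    // and operations keep their submission order.
    func submit(_ operation: @escaping @Sendable () async -> Void) {
        let previous = tail
        tail = Task {
            _ = await previous?.value   // wait for the operation ahead of us
            await operation()
        }
    }

    // Suspends until the most recently submitted operation has finished.
    func waitUntilIdle() async {
        _ = await tail?.value
    }
}

// Thread-safe log used only by the demo below.
actor Log {
    private(set) var entries: [String] = []
    func append(_ entry: String) { entries.append(entry) }
}

// Submits three operations whose simulated work gets shorter each time.
// Without the queue, the shortest one could finish first.
func runThreeOperations() async -> [String] {
    let queue = SerialQueue()
    let log = Log()
    for (name, milliseconds) in [("Op 1", 30), ("Op 2", 20), ("Op 3", 10)] {
        await queue.submit {
            await log.append("\(name) begin")
            try? await Task.sleep(nanoseconds: UInt64(milliseconds) * 1_000_000)
            await log.append("\(name) end")
        }
    }
    await queue.waitUntilIdle()
    return await log.entries
}
```

In this sketch every "begin" entry is followed directly by its own "end" entry, in submission order, even though the later operations are shorter. Because submit never suspends, a caller only pays for the hop onto the queue, not for the work itself.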

However, this results in delayed work that doesn't start immediately. We still want to give immediate feedback to the user. To achieve that, we'll work with Sui from MainActor and with Sreal from the queue. This means that when the user presses the button, Sui changes immediately, and the work is submitted afterward. This solution raises the following challenges:

  1. When the actual queue operation starts, the world could have changed, and the operation might not be necessary anymore.
  2. If the queue grows, there might be a significant delay. Imagine a situation where the button is pressed 100 times in a row. We'll have 100 operations of 0.5 s each, resulting in 50 seconds of work.

The world could have changed during the time we waited for the operation to start. It means the user could have stopped the recording, started it again, or even in a corner case, done it several times. To determine if the operation still makes sense, we should compare each operation's Sop with the current Sui and Sreal. If Sop is started and Sui is stopped, we shouldn't start anymore, so we just exit. The same is true when Sop is stopped, but Sui is started. Additionally, if Sop already equals Sreal, the work is already done, so we exit as well.

This means that the earliest operation whose Sop equals the current Sui and differs from Sreal will perform the work. This change significantly reduces the delay between submission and the actual start of work.
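
The skip rule can be written as a small pure function. The sketch below is a simplified, synchronous model rather than XSpeak's implementation: it assumes all presses land before the queue starts draining, and the names shouldRun and executedOperations are made up for the example.

```swift
enum RecordingState { case stopped, started }

// Decides whether a queued operation still has work to do.
func shouldRun(op: RecordingState, ui: RecordingState, real: RecordingState) -> Bool {
    if op != ui { return false }     // the user has changed their mind since submission
    if op == real { return false }   // the pipeline is already in the target state
    return true
}

// Replays `presses` rapid button presses and returns how many queued
// operations end up doing real work once the queue drains.
func executedOperations(presses: Int) -> Int {
    var ui = RecordingState.stopped
    var real = RecordingState.stopped
    var queue: [RecordingState] = []   // the Sop of each submitted operation

    for _ in 0..<presses {
        ui = (ui == .stopped) ? .started : .stopped   // Sui flips instantly
        queue.append(ui)                              // Sop = Sui at submission time
    }

    var executed = 0
    for op in queue where shouldRun(op: op, ui: ui, real: real) {
        real = op       // stands in for roughly 0.5 s of pipeline work
        executed += 1
    }
    return executed
}
```

Under these assumptions, 100 presses leave Sui where it began, so executedOperations(presses: 100) is 0 instead of 100 operations and 50 seconds of work. With 101 presses the result is 1: the first start operation does the work and every later one exits immediately.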

There's one more thing we should do to improve performance further. Imagine the following order of operations:

  1. MainActor: The user presses start recording
  2. MainActor: Sui is changed to started
  3. MainActor: Permissions check
  4. MainActor: Suspended
  5. MainActor: The user presses stop
  6. MainActor: Suspended
  7. MainActor: Sui = stopped
  8. AudioActor: Start microphone recording
  9. queue: Waiting for the first operation to finish...
  10. AudioActor: Start system audio recording
  11. queue: Waiting for the first operation to finish...
  12. AudioActor: ...
  13. queue: Waiting for the first operation to finish...
  14. AudioActor: Stops the recording

If the user presses stop while the start operation is already in progress, we have to wait until the start operation finishes. This results in unnecessary delay and extra work.

To resolve this, we'll treat each suspension point where we schedule async work during our operation as a potential interruption point. After every step that awaits, we'll check whether the target Sui is still the same. If it has changed, we'll drop the operation and return.

However, once a step changes real state, such as starting the physical microphone recording, things become more complex because we'd have to revert it. But reverting is exactly what the opposite operation already does. So for consistency, after any step that changes state, we finish the operation and let the opposite operation revert everything. In the end, we get the desired Sreal, which is equal to Sui.

async work → Sui changed? → async work → Sui changed? → complete

At every suspension point we re-check the target state. If it changed, we drop and return.
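
The checkpoint rule, including the "finish once real state has changed" exception, might look roughly like this in Swift. TargetState, Step, Outcome, runStart, and demo are hypothetical names, and the step list is a shortened stand-in for the real startup chain.

```swift
enum RecordingState { case stopped, started }
enum Outcome: Equatable { case completed, dropped(after: String) }

// Stands in for Sui. Hypothetical type.
actor TargetState {
    private(set) var ui: RecordingState
    init(_ ui: RecordingState) { self.ui = ui }
    func set(_ state: RecordingState) { ui = state }
}

struct Step: Sendable {
    let name: String
    let changesState: Bool                 // true if the step touches real audio state
    let work: @Sendable () async -> Void   // the awaited work of this step
}

// Runs the start chain and re-checks Sui after every suspension point.
func runStart(_ steps: [Step], target: TargetState) async -> Outcome {
    var committed = false
    for step in steps {
        await step.work()                            // suspension point
        committed = committed || step.changesState
        if committed { continue }                    // must finish; stop will revert it
        if await target.ui != .started {
            return .dropped(after: step.name)        // nothing to revert, just exit
        }
    }
    return .completed
}

// Simulates the user pressing stop while the named step is suspended.
func demo(stopDuring interrupted: String) async -> Outcome {
    let target = TargetState(.started)
    let plan = [("permissions", false), ("assets", false),
                ("microphone", true), ("system audio", true)]
    var steps: [Step] = []
    for (name, changes) in plan {
        steps.append(Step(name: name, changesState: changes, work: {
            if name == interrupted { await target.set(.stopped) }
        }))
    }
    return await runStart(steps, target: target)
}
```

In this model, a stop that arrives during the permissions check drops the start right after that step, while a stop that arrives during the microphone step lets the start run to completion so that the queued stop operation can revert everything in one place.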

In practice, there are more complexities because sometimes we have non-standard user flows. But this architecture, where every audio manipulation goes through the queue, allows us to maintain a consistent and reliable state and gives us a good foundation for improving the app.

* All product names, logos, and brands are property of their respective owners. Use of these names, logos, and brands does not imply endorsement.