Making a perfect recording button. A simple yet complex thing.

20 May 2026 · Dmitriy Tseyler

Intro

From the start of XSpeak, I wanted it to provide the best possible feel for the user: simple, fast, and responsive. Since it's a recording app, one of its main components is the button the user presses to record a conversation.

Actually, it's a control we're used to in many apps. The simplest example is the standard Voice Memos app on iPhone.

The button looks simple and does two things: starts recording and stops recording. However, behind the scenes, it takes many steps to start the whole pipeline, which is far from simple. In this article, I'm going to cover the technical and usability aspects of a recording button and will try to explain why it's important to make it work perfectly and why it's not as simple as it might look.

Perfect button

I'd call the recording button perfect if it is:

Functional
Responsive

When I say functional, I mean that it should be able to start and stop recording when it's enabled and I press it. When I say responsive, I mean that the button should indicate that the state has switched, and even though the operation behind it usually takes hundreds of milliseconds, that feedback should be immediate.

Let's imagine that feedback is not immediate and there's some progress between states. In this case, the user:

Has to wait to make sure that recording has actually started.
Is out of their natural flow. Instead of focusing on their business, they are monitoring the app's status.
Experiences unnecessary cognitive load as a result: "What just happened? Did it start? How long should I wait?"
Loses control over the interface. They cannot correct the error and press stop.
Feels slight frustration, as the app temporarily prevents them from doing what they want to do.
Misses the instant, tactile feedback expected from a tool.

This discomfort might feel minor for some people. However, when the user does it many times a day, it can add up to significant overhead. And that's not what we should expect from a tool that is meant to help.

Two qualities define a perfect recording button: it must be functional and responsive. The press should never wait on the pipeline.

Imperfect case:

User presses a button.
Button is disabled, shows a progress state or simply doesn't react.
After some time, the button is functional again, and recording is started.

Behind the scenes

Why is starting recording not that simple? When you press the button, the following happens in XSpeak:

Checks microphone and system audio permissions.
Shows a disclaimer reminding you to get consent from participants.
Checks if another recording is in progress and asks if you want to interrupt it.
Checks if the AI model and speech assets for the chosen language are available.
Tries to reserve the chosen language locale.
Checks if any meeting recording is playing now, and stops it.
Starts microphone recording.
Waits a bit to prevent races in CoreAudio and starts System Audio recording.
Starts the mixer, which mixes system audio with microphone input to produce a single audio stream for transcription.
Resolves the transcriber audio format.
Creates and starts the transcriber.

All these operations happen asynchronously. This means that execution is suspended at each step, and by the time we resume, the world could have changed: the user might have pressed the button several more times, previously available resources might have become unavailable, and so on.

Besides that, startup launches side management threads that restart the mixer to prevent drift between the two sources and restart the transcriber to prevent model context overflow.

Quite a start, isn't it? Probably, after that, your perception of this simple button will change, sorry for that :)

Let's see how different apps manage this or similar complexity.

Examples

iPhone Voice Memos

Voice Memos on iOS 26.5

stopped → started → stopped

When I press start, it starts. When I press stop, it stops. Nothing more.

Otter

Otter 1.4.2

stopped → delay → progress → started → delay → stopped

As you can see, the button becomes disabled while it starts recording. This makes me feel slightly uncomfortable every time I press it. The app feels unresponsive and heavy, and I need to wait before I can stop recording.

Talat

Talat 0.11.5

stopped → progress → started → progress → stopped

The button is disabled while recording is starting. The good thing is that the recording start is quite fast here. However, it still produces a slight feeling of unresponsiveness.

MacWhisper

MacWhisper 13.21.1

stopped → delay → started → stopped

There's a slight delay between the press of the start button and the appearance of the stop button. Also, the button changes its position after I start recording, which requires additional cognitive effort from me to find it.

Fireflies

Fireflies 0.1.30

stopped → progress → started → progress → stopped

The button is locked during start.

XSpeak

XSpeak 3.7

stopped → started → stopped

As you can see, the button reacts instantly to user action. And if you change your mind, it reacts instantly back.

iPhone Voice Memos: Instant
Otter: Progress state
Talat: Progress state
MacWhisper: Delay
Fireflies: Progress state
XSpeak: Instant

Implementation

I won't write a book here about all the approaches I considered and tried. Instead, I'll go from a naive approach to the solution I implemented.

Let's agree that we want instant feedback from the button and will not disable it during our startup chain. Also, let's declare our states:

S_ui: What the user sees. Changes instantly on press.
S_real: What actually happens in audio pipelines. Eventually catches up.

Each can be started or stopped. Our goal: keep them eventually consistent without ever blocking the user.

In the naive approach, when the user presses the button, we:

Change S_ui to started.
Launch startup pipeline.

However, the obvious problem would be a race condition. Imagine the following order of operations:

MainActor: The user presses start recording
MainActor: S_ui is changed to started
MainActor: Permissions check
AudioActor: Start microphone recording
MainActor: The user presses stop
AudioActor: Suspended
MainActor: S_ui = stopped
AudioActor: Suspended
AudioActor: Stopping all running recordings
AudioActor: Start system audio recording

In the end, we have S_ui = stopped and S_real = started.
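The interleaving above can be reproduced in a few lines. This is a minimal sketch, not XSpeak's code: `AudioPipeline`, its `start()` and `stop()` methods, and the 100 ms sleep standing in for the real startup steps are all assumptions made for illustration. Because a Swift actor is reentrant, `stop()` can run while `start()` is suspended, and `start()` then resumes and overwrites the result.

```swift
enum RecordingState: Sendable { case stopped, started }

// Hypothetical stand-in for the audio pipeline.
// The sleep models the asynchronous startup steps.
actor AudioPipeline {
    private(set) var real: RecordingState = .stopped

    func start() async {
        // Suspension point: while we sleep, the actor is free to run other calls.
        try? await Task.sleep(nanoseconds: 100_000_000)
        real = .started
    }

    func stop() {
        real = .stopped
    }
}

// Naive handling: flip S_ui and fire the matching pipeline call right away.
func pressStartThenStop() async -> (ui: RecordingState, real: RecordingState) {
    let pipeline = AudioPipeline()
    var ui = RecordingState.stopped

    ui = .started                                   // the user presses start
    let starting = Task { await pipeline.start() }

    ui = .stopped                                   // the user presses stop
    let stopping = Task { await pipeline.stop() }

    await stopping.value
    await starting.value
    let real = await pipeline.real
    return (ui, real)
}
```

Since the stop call finishes long before the simulated startup does, this sketch ends with S_ui = stopped and S_real = started, the same mismatch as in the timeline.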

We have to linearize this pipeline to prevent such races. The first thing that would help is to prevent start and stop operations from running simultaneously. We'll use a queue for that:

Op 1 → Op 2 → Op 3 → …

Operations run one at a time, in submission order. No two operations overlap.
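One possible way to get this behavior in Swift is to chain each operation behind the previous one inside an actor. The sketch below is an assumption about how such a queue could look, not the queue XSpeak uses; `SerialOperationQueue`, `EventLog`, and `runQueueDemo` are hypothetical names.

```swift
// Hypothetical serial queue: every operation is chained behind the previous one.
actor SerialOperationQueue {
    private var tail: Task<Void, Never>?

    // Schedules `operation` to run after everything submitted before it.
    func submit(_ operation: @escaping @Sendable () async -> Void) {
        let previous = tail
        tail = Task {
            _ = await previous?.value   // wait for the preceding operation, if any
            await operation()
        }
    }

    // Suspends until the most recently submitted operation has finished.
    func waitUntilIdle() async {
        _ = await tail?.value
    }
}

// Collects events so the demo can show the execution order.
actor EventLog {
    private(set) var events: [String] = []
    func add(_ event: String) { events.append(event) }
}

func runQueueDemo() async -> [String] {
    let queue = SerialOperationQueue()
    let log = EventLog()

    await queue.submit {
        await log.add("op 1 begins")
        try? await Task.sleep(nanoseconds: 50_000_000)  // a slow operation
        await log.add("op 1 ends")
    }
    await queue.submit {
        await log.add("op 2 begins")
        await log.add("op 2 ends")
    }

    await queue.waitUntilIdle()
    return await log.events
}
```

Even though the first operation sleeps, the second one does not begin until the first has ended, so the log comes back as "op 1 begins", "op 1 ends", "op 2 begins", "op 2 ends".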

We also need to introduce one more state:

S_op: Target state of the operation. S_op is equal to the S_ui that was set when the operation was submitted to the queue.

When we want to start or stop recording, we submit an operation to the queue. This way no two operations overlap and each operation waits for its time. As a result, we always have S_ui equal to the S_op of the last operation.

However, queued work doesn't start immediately, and we still want to give immediate feedback to the user. To achieve that, we'll work with S_ui from MainActor and with S_real from the queue. This means that when the user presses the button, S_ui changes immediately, and the work is submitted afterward. This solution raises the following challenges:

When the actual queue operation starts, the world could have changed, and the operation might not be necessary anymore.
If the queue grows, there might be significant delay. Imagine a situation when the button is pressed 100 times in a row. We'll have 100 operations 0.5s each, resulting in 50 seconds of work.

The world could have changed during the time we waited for the operation to start. It means the user could have stopped the recording, started it again, or even in a corner case, done it several times. To determine if the operation still makes sense, we should compare each operation's S_op with the current S_ui and S_real. If S_op is started and S_ui is stopped, we shouldn't start anymore, so we just exit. The same is true when S_op is stopped, but S_ui is started. Additionally, if S_op already equals S_real, the work is already done, so we exit as well.

This means that the earliest operation whose S_op equals the current S_ui and does not equal S_real will perform the work. This change significantly reduces the delay between submission and the actual start of work.
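The check described above fits in a small pure function. The sketch below is illustrative only: `shouldRun` and `simulateBurst` are hypothetical names, the simulation assumes that the whole burst of presses lands before the queue starts draining, and "doing the work" is modeled as simply assigning S_real.

```swift
enum RecordingState { case stopped, started }

// Decides whether a queued operation should still do its work.
func shouldRun(op: RecordingState, ui: RecordingState, real: RecordingState) -> Bool {
    guard op == ui else { return false }    // the user has changed their mind since submission
    guard op != real else { return false }  // the pipeline is already in the target state
    return true
}

// Replays a burst of presses, then drains the queue.
// Returns how many operations actually did work and the final S_real.
func simulateBurst(presses: Int) -> (workDone: Int, real: RecordingState) {
    var ui = RecordingState.stopped
    var real = RecordingState.stopped
    var queue: [RecordingState] = []

    for _ in 0..<presses {
        ui = (ui == .stopped) ? .started : .stopped  // S_ui flips instantly
        queue.append(ui)                             // S_op = S_ui at submission time
    }

    var workDone = 0
    for op in queue {
        guard shouldRun(op: op, ui: ui, real: real) else { continue }
        real = op        // stand-in for the real start or stop work
        workDone += 1
    }
    return (workDone, real)
}
```

Under these assumptions, 100 presses in a row leave S_ui where it began, so no operation does any work, and 101 presses result in a single start instead of 101 operations.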

There's one more thing we should do to improve performance further. Imagine the following order of operations:

MainActor: The user presses start recording
MainActor: S_ui is changed to started
MainActor: Permissions check
MainActor: Suspended
MainActor: The user presses stop
MainActor: Suspended
MainActor: S_ui = stopped
AudioActor: Start microphone recording
queue: Waiting for the first operation to finish...
AudioActor: Start system audio recording
queue: Waiting for the first operation to finish...
AudioActor: ...
queue: Waiting for the first operation to finish...
AudioActor: Stops the recording

If the user presses stop when the start operation is already in progress, we have to wait until the start operation finishes. It results in unnecessary delay and extra work.

To resolve this, we'll treat each suspension point where we schedule async work during our operation as a potential interruption point. After every step that awaits, we'll check whether the target S_ui is still the same. If it has changed, we'll drop the operation and return.

However, once a step changes real state, such as starting physical microphone recording, things become more complex, because that change has to be reverted. But reverting is exactly what the opposite operation does. So for consistency, after any step that changes state, we must finish the operation, and the opposite operation will then revert everything. In the end, we'll have the desired S_real, which is equal to S_ui.

async work → S_ui changed? → async work → S_ui changed? → complete

At every suspension point we re-check the target state. If it changed, we drop and return.
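The checkpoint logic can be sketched as a loop over steps. Everything here is hypothetical and simplified: `UIState` stands in for wherever S_ui lives, `Step` and its `mutatesState` flag are assumptions used to mark steps that change real state, and `runOperation` is not XSpeak's actual operation code.

```swift
enum RecordingState: Sendable { case stopped, started }

// Hypothetical holder for S_ui that the operation can re-read after each await.
actor UIState {
    private(set) var value: RecordingState
    init(_ initial: RecordingState) { value = initial }
    func set(_ newValue: RecordingState) { value = newValue }
}

// One step of the pipeline. `mutatesState` marks steps that change real state,
// such as starting the microphone.
struct Step: Sendable {
    let name: String
    let mutatesState: Bool
    let run: @Sendable () async -> Void
}

enum Outcome: Equatable, Sendable {
    case completed
    case dropped(afterStep: String)
}

func runOperation(target: RecordingState, ui: UIState, steps: [Step]) async -> Outcome {
    var stateChanged = false
    for step in steps {
        await step.run()
        if step.mutatesState { stateChanged = true }

        // Checkpoint after every suspension: is the target still what the user wants?
        let current = await ui.value
        if current != target && !stateChanged {
            // Nothing real was touched yet, so it is safe to drop and return.
            return .dropped(afterStep: step.name)
        }
        // Once real state has changed, we run to completion and
        // let the opposite operation revert everything.
    }
    return .completed
}
```

In this sketch, if S_ui flips during a step that only checks something, the operation is dropped right after that step. If it flips after the microphone step has already run, the operation completes, and the queued stop operation is expected to undo it.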

In practice, there are more complexities because sometimes we have non-standard user flows. But this architecture, where every audio manipulation goes through the queue, allows us to maintain a consistent and reliable state and gives us a good background to improve the app.

If you liked this article, subscribe to my blog on Substack, Medium, or Dev.to.

* All product names, logos, and brands are property of their respective owners. Use of these names, logos, and brands does not imply endorsement.
