# Real-time mixing of two audio clips while only modifying the data of one

Signal Processing Asked by K. Claesson on December 13, 2020

Problem

I want to mix two audio sources in real-time: game audio and microphone audio. The audio is provided at 44 100 Hz in the form of vectors of 1024 numbers between $$-1$$ and $$1$$ (PCM data). The simplest way to mix the audio is by adding together the game audio vector $$vec{g}$$ and the microphone audio vector $$vec{m}$$ to form the vector of mixed audio $$vec{g} + vec{m}$$. The problem with this approach is that it often results in many entires of the resulting vector being greater than $$1$$ or less than $$-1$$. This is problematic as it causes clipping – a form of audio distortion that is brought about by instructing a speaker to move beyond its physical capability. Normally, clipping is avoided by reducing the gain (loudness) of both audio sources (this corresponds to multiplying the audio vectors by numbers less than 1) prior to adding them together, and then letting the resulting audio pass through a “compressor” that increases its loudness to the desired volume.

I want to mix the game audio and the microphone audio without modifying the game audio. In other words, I need to add together $$vec{g}$$ and $$vec{m}$$ in a way that avoids clipping, keeps both game audio and microphone audio clearly audible, and does not modify the game audio. Mathematically, this could be modeled as finding a value for $$alpha in mathbb{R}$$ that “optimizes” the linear combination inequality

$$begin{bmatrix} -1 \ vdots \ -1 end{bmatrix} leq begin{bmatrix} g_1 \ vdots \ g_n end{bmatrix} + alpha begin{bmatrix} m_1 \ vdots \ m_n end{bmatrix} leq begin{bmatrix} 1 \ vdots \ 1 end{bmatrix}$$

in the sense that it minimizes the number of entries with an absolute value greater than $$1$$ in the linear combination $$vec{g} + alphavec{m}$$. Furthermore, $$alpha$$ must be such that the microphone audio is clearly audible. I find it hard to convert this second condition into a mathematical expression, and this post largely revolves around finding a way to quantify “clearly audible” microphone audio.

Partial Solution

To solve the above problem, I analyzed the game audio and microphone audio data in Mathematica. My analysis yielded the following observations.

The game audio is very consistent. Qualitatively, this takes the form of you not hearing any significant volume fluctuations as you move between different parts of the game. Quantitatively, this is seen in that the standard deviation $$sigma approx 0.167$$ for all game audio, and hence, assuming the mean $$mu approx 0$$ (which is an accurate assumption), the game audio samples within three standard deviations (99.7% of the audio) have values between $$-0.5$$ and $$0.5$$. This means that if all entires of the vector $$alphavec{m}$$ had an absolute value of $$0.5$$, only 0.03% of the samples in mixed audio would give rise to clipping. With this realization I came up with the following model to minimize clipping.

I would let the amplification scalar $$alpha$$ be a function of the microphone audio’s standard deviation $$sigma_m$$ and a tolerance $$tau$$ according to the formula
$$alpha(sigma_m, tau) := frac{tau}{3sigma_m},$$
where the tolerance is defined as $$tau := 1 – 3sigma_g$$ and $$sigma_g$$ is the standard deviation of the game audio.

The $$1$$ in $$1 – 3sigma_m$$ arises from the fact that we do not want the absolute values of the resulting audio samples to exceed 1. You can verify that the given expression for $$tau$$ has the desired effect of preventing clipping by recalling that $$alpha$$ will scale $$vec{m}$$ and then realizing that the above formula for $$alpha$$ guarantees that 99.7% of the audio samples (entires) in $$vec{m}$$ will not exceed the tolerance $$tau$$. By extension, 99.7% of the sums $$|g_i + alpha m_i|, i in {1, 2, … n}$$, will not exceed the maximum tolerated sum 1. (Let me know if a more detailed explanation is needed to follow.)

Computing $$alpha$$ in this way resulted in the game and microphone audio being combined with virtually no clipping. My measurements indicate an average of 30 milliseconds of clipping per second; which, in practice, is completely unnoticeable and thus negligible. Nevertheless, another challenge awaited me.

While I had eliminated clipping, the resulting audio now suffered from tremendous fluctuations in volume. I tried to solve this by implementing a moving average of $$alpha$$ that uses $$130$$ microphone vectors of $$1024$$ entires each (which corresponds to roughly the last 3 seconds of audio being used to determine each amplification scalar). This almost eliminated all volume fluctuation, but the resulting volume was always way too loud. By analyzing the amplification scalars that were fed into the moving average I noticed that my programming language, C#, said that the value of some scalars was INFINITY. When working with 32-bit floating point numbers in C#, INFINITY is represented by the number $$3.14 times 10^{38}$$. This happened because the standard deviation $$sigma_m$$ of some microphone vectors was very close to zero. I solved this problem by introducing an upper limit that the amplification scalar could assume. In mathematical terms, I redefined $$alpha(sigma_, tau)$$ to be

$$alpha(sigma_m, tau) := text{min} Big( frac{tau}{3sigma_m}, c Big),$$
where $$c$$ is a manually chosen constant that represents the maximum allowed value of $$alpha$$.

Now, while there is no clipping and the volume is stable, the volume ends up being either too feint while whispering into the microphone, or, too loud while screaming into it. Which of the two cases manifest depends on the choice of $$c$$. For example, choosing $$c = 5$$ yields good results for audio clips with people shouting; while $$c = 10$$ works well someone whispers into the microphone.

I have tried to find a function $$f$$ that can determine a suitable value for $$c$$ when given some data about the microphone audio in real-time. Specifically, I tried using the average of the absolute values of all audio samples, the average of the absolute differences between all consecutive samples, and the root mean square of all samples as inputs to $$f$$. I did this by computing the above properties and manually determining a good value for $$c$$ for a set of seven audio clips. Then, I plotted them as points and tried to regress an equation that could find a good value for $$c$$ given some of the above data about the audio.

For example, here is a plot of the amplification scalar (on the y-axis) and the root mean square of the audio clip (on the x-axis):

The regressed equation is far from sufficiently good at determining $$c$$.

Question

How can I find a function that will return a suitable values for $$c$$ for real-time microphone audio that may contain whispering, conversational speaking and screaming? Alternatively, what is a different approach that I could use to solve the described problem.

Let me know if anything requires further explanation and I will happily provide it.