Video conference audio mixing algorithm and its implementation

With the rapid development of internet technology, the volume of data flowing across the web has grown significantly, enabling video conferencing systems that support real-time communication and interaction among users. These systems rely heavily on voice transmission, which is a critical factor in overall system performance, so research on audio mixing algorithms for video conferencing is of great importance. One of the main challenges a terminal faces when handling audio is how to mix and play multiple audio streams locally while maintaining synchronization; delays and lip-sync with video further complicate the task. In practice, one of the most common problems is overflow of the sound card buffer after mixing. To address these challenges, an improved mixing algorithm is introduced. Compared to existing methods, it offers better mixing quality, lower stagnation rates, reduced delay, and better scalability. Experimental results show that the algorithm enhances speech clarity, suppresses overflow, and delivers high-quality audio with minimal delay, making it a promising solution for real-world applications.

**1. Analysis of Mixing Algorithm**

Sound is a pressure wave created by the vibration of objects; its three main characteristics are loudness, pitch, and timbre. In the natural world, what we hear is the superposition of sounds arriving from all directions. In a video conferencing system, the corresponding task is to mix audio data from different sources in the time domain.

Voice signals are sampled and quantized on the sound card chip, typically at 16-bit resolution. In many operating systems, including Linux, the sound card buffer holds 16-bit signed integers ranging from -32768 to 32767. After multi-channel mixing, the summed amplitude can exceed this range: two full-scale samples already add to 65534, and in general the sum of N 16-bit streams needs up to 16 + ⌈log2 N⌉ bits. The result is clipping distortion. Several common solutions exist (a sketch of the first two follows the list):

- **Direct Clamping**: After mixing, any amplitude that exceeds the buffer range is clamped to the maximum value. This is simple, but it flattens peaks in the waveform and introduces audible noise.
- **Normalization Mixing**: The mixed signal is divided by the number of channels, i.e., averaged. This prevents overflow, but it lowers the volume of each individual voice, especially when several people speak at once.
- **Alignment Mixing**: Mixing weights are adjusted according to the intensity of the incoming signals. Strong alignment gives more weight to louder signals, which can drown out quieter ones; weak alignment amplifies weaker signals, which can raise the background noise.
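The original text describes these methods without code, so here is a minimal C sketch of the first two, assuming separate (non-interleaved) 16-bit PCM buffers per channel; the function names and signatures are illustrative, not from the original.

```c
#include <stdint.h>
#include <stddef.h>

/* Direct clamping: sum into a wider accumulator, then clamp to the
 * 16-bit range. Overflow shows up as flat-topped (clipped) peaks. */
void mix_clamp(const int16_t **ch, size_t n_ch, size_t n_samples, int16_t *out)
{
    for (size_t i = 0; i < n_samples; i++) {
        int32_t acc = 0;
        for (size_t c = 0; c < n_ch; c++)
            acc += ch[c][i];
        if (acc > INT16_MAX) acc = INT16_MAX;
        if (acc < INT16_MIN) acc = INT16_MIN;
        out[i] = (int16_t)acc;
    }
}

/* Normalization mixing: divide the sum by the channel count. This is
 * overflow-free, but each voice gets quieter as participants join. */
void mix_average(const int16_t **ch, size_t n_ch, size_t n_samples, int16_t *out)
{
    for (size_t i = 0; i < n_samples; i++) {
        int32_t acc = 0;
        for (size_t c = 0; c < n_ch; c++)
            acc += ch[c][i];
        out[i] = (int16_t)(acc / (int32_t)n_ch);
    }
}
```

Widening to a 32-bit accumulator before clamping or averaging is what keeps the intermediate sum itself from wrapping.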
Although these methods are simple, they all have limitations in audio quality and mixing efficiency. To overcome these issues, a new and improved mixing algorithm has been developed.

**2. Improved Mixing Algorithm**

SIP-based video conferencing systems can mix media in different places, such as centrally on a server or at the terminals. A distributed mixing model has been designed in which the server does not process media streams directly but only manages the conference; each terminal receives and decodes the audio data and then performs the mixing itself. This reduces the load on the server and minimizes delay, making it well suited to real-time applications.

The algorithm is particularly suitable for small to medium-sized video conferencing systems, such as those used in schools or small businesses, where the number of participants is limited (typically fewer than five). In such scenarios it is unlikely that many users speak simultaneously, which reduces the risk of overflow. The algorithm processes audio frame by frame, handling both overflow and smoothing. The steps are:

1. Initialize the attenuation factor to 1.
2. Analyze the current audio frame, computing its maximum peak value and zero-crossing rate.
3. If the peak exceeds the overflow threshold, reduce the attenuation factor accordingly.
4. If no overflow occurs, dynamically recover the attenuation factor, guided by the zero-crossing rate.
5. Repeat the process for each subsequent frame.

By processing per frame rather than per sample, the algorithm reduces computational overhead and improves performance. It also normalizes the result and bounds the attenuation factor above and below, ensuring smooth audio output without distortion or overflow. A sketch of this loop is given below.
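The paper states the steps but gives no reference implementation; the following C sketch is one plausible realization. The frame length, overflow threshold, attenuation bounds, and recovery rates are all assumed values, not figures from the text.

```c
#include <stdint.h>
#include <stdlib.h>

#define FRAME_LEN   256       /* samples per frame (assumed) */
#define PEAK_LIMIT  32000.0f  /* overflow threshold, just under INT16_MAX */
#define ATT_MIN     0.25f     /* lower bound on the attenuation factor */
#define ATT_MAX     1.0f      /* upper bound */
#define ZCR_NOISY   0.3f      /* above this ZCR, treat the frame as unvoiced */

/* Zero-crossing rate of a frame: a high ZCR suggests noise or unvoiced
 * sound, where a faster gain recovery is less audible. */
static float zcr(const int32_t *x, size_t n)
{
    size_t z = 0;
    for (size_t i = 1; i < n; i++)
        if ((x[i - 1] >= 0) != (x[i] >= 0))
            z++;
    return (float)z / (float)(n - 1);
}

/* Mix one frame from n_ch decoded streams. *att is the attenuation
 * factor carried across frames, initialized to 1.0 by the caller. */
void mix_frame(const int16_t **ch, size_t n_ch, int16_t *out, float *att)
{
    int32_t sum[FRAME_LEN];
    int32_t peak = 0;

    /* Sum into a 32-bit accumulator and find the frame's peak. */
    for (size_t i = 0; i < FRAME_LEN; i++) {
        int32_t acc = 0;
        for (size_t c = 0; c < n_ch; c++)
            acc += ch[c][i];
        sum[i] = acc;
        if (abs(acc) > peak)
            peak = abs(acc);
    }

    /* Recover the factor, faster on noisy (high-ZCR) frames. */
    *att *= (zcr(sum, FRAME_LEN) > ZCR_NOISY) ? 1.10f : 1.02f;
    if (*att > ATT_MAX) *att = ATT_MAX;

    /* On overflow, shrink the factor so the loudest sample just fits. */
    if ((float)peak * *att > PEAK_LIMIT)
        *att = PEAK_LIMIT / (float)peak;
    if (*att < ATT_MIN) *att = ATT_MIN;

    /* Apply the factor; the final clamp is a safety net for the rare
     * case where the ATT_MIN bound wins over the overflow check. */
    for (size_t i = 0; i < FRAME_LEN; i++) {
        int32_t v = (int32_t)((float)sum[i] * *att);
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        out[i] = (int16_t)v;
    }
}
```

Because the factor changes once per frame rather than once per sample, the per-sample cost stays at one multiply, which matters on a ~300 MHz ARM9-class core like the one used in the next section.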
**3. Embedded Implementation and Result Analysis**

To evaluate the improved algorithm, it was tested in an embedded environment on a TI DM6446-594 processor, running on the ARM side with the ARM9 core clocked at 297 MHz. Three types of audio were used: background noise, a human voice close to the overflow level, and a human voice at a moderate level. All samples were male voices, which are harder to distinguish once mixed. The results showed that the improved algorithm outperformed the earlier methods: it produced smoother audio, reduced noise, and avoided popping sounds and overflow. Compared with fixed-step attenuation, it also used fewer resources and computed more efficiently.

**4. Conclusion**

By analyzing the characteristics of voice signals and applying frame-based attenuation, the improved algorithm effectively solves the audio overflow problem, and its use of short-term energy and zero-crossing rate analysis improves mixing quality. Users can choose among the algorithms according to their network conditions. The implementation on the ARM9 processor demonstrates that the algorithm performs well in real-world applications, offering better performance and a better user experience.