As handheld voice communication devices become increasingly popular, they are increasingly used in noisy environments such as airports, busy roads, and crowded bars. In such environments it is difficult for either party to hear what the other is saying.
In addition, many communication systems use computer-operated speech recognition, command, and/or response systems. These systems are susceptible to background noise: if the noise is too strong, system errors increase significantly. It is therefore necessary to improve the ratio of the speech signal to the background acoustic noise.
This article explains the basic principles of using microphone arrays to eliminate background noise in voice communication systems, citing National Semiconductor's LMV1088 microphone array amplifier as an example.
Microphone arrays
A microphone array is an arrangement of multiple microphones in a specific pattern, working together to produce a composite output signal or multiple sets of signals.
Each microphone is a sensor, a spatial window that receives (spatially samples) the input signal. The overall response of the array is the superposition of the individual responses of the microphones and depends on the algorithm used.
The "array processing" algorithm used for the signals of multiple sets of microphones in the array is determined based on several factors, including the separation distance and arrangement style of the microphones, the number and type of microphones, and the principle of sound propagation.
The basic task of a microphone array is to remove the noise surrounding the speech input signal, thereby improving voice quality in hearing-assistance systems, speech recognition equipment, and telecommunications products. A microphone array can also be used to locate the direction of a sound source and estimate its distance from the array.
The main function of the microphone array in a voice communication system is to provide a high-quality voice signal while reducing noise from the scene and the surrounding environment. Quality here means that the final voice signal sounds natural and real, without artifacts such as clicks and pops, unintended muting, frequency distortion, echo, or unscheduled signal-level changes introduced by the speech-enhancement processing.
For these reasons, signal-to-noise ratio improvement (SNRI) is not the only parameter to weigh when choosing a background noise suppression solution; other issues must be considered as well.
Voice signals
Sound pressure level
Sound pressure level (SPL) decreases as the distance from the sound source increases. Figures 1 and 2 show this reduction in SPL, measured in decibels (dB), as a function of the distance x from the source. For speech, a point about 1 cm from the lips is generally taken as the reference, with the SPL there set to 96 dB. Under these conditions, the SPL formula is:
dB = 96 - 20 log(x/0.01)
or, equivalently,
dB = 96 + 20 log(0.01/x)
Here 0.01 m is the reference distance (1 cm), and x is the distance from the sound source in meters.
Figure 1. SPL versus distance from the sound source (out to 200 cm)
Figure 2. SPL versus distance from the sound source, enlarged view (out to 50 cm)
Each time the distance x doubles, the SPL on both curves drops by 6 dB. Figure 1 extends to 200 cm from the source, while Figure 2 is an enlarged view out to 50 cm. The figures show that sound pressure falls off rapidly with distance from the source, even over very short distances. For example, at 10 cm from the source the SPL has dropped by 20 dB, from 96 dB to about 76 dB.
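As a quick check, the formula can be evaluated at a few distances. Here is a minimal Python sketch using the 96 dB at 1 cm reference defined above:

```python
import math

def spl_db(x_m, ref_db=96.0, ref_m=0.01):
    """SPL at distance x_m (meters), given ref_db at the reference distance ref_m."""
    return ref_db - 20.0 * math.log10(x_m / ref_m)

for x_cm in (1, 2, 10, 100, 200):
    print(f"{x_cm:>4} cm: {spl_db(x_cm / 100.0):.1f} dB")
# 1 cm: 96.0, 2 cm: 90.0 (-6 dB per doubling), 10 cm: 76.0,
# 100 cm: 56.0, 200 cm: 50.0
```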
Near-field versus far-field sound
The near field of a sound source extends to within one wavelength of the lowest frequency of interest. Assuming the lowest speech frequency of interest is 300 Hz, the wavelength λ equals c/f, or 331.1/300 = 1.104 meters, where c is the speed of sound at zero degrees Celsius (331.1 m/s). At 3500 Hz, λ equals c/f, or 331.1/3500 = 0.0946 meters (9.46 cm). The typical near-field range of a speech signal is therefore about 9.5 cm to 1.1 m from the source.
Beyond a distance of about 1 meter, the talker is considered to be in the far field of the array. For closely spaced microphones, a near-field source presents a spherical wavefront, with strong amplitude, pressure-gradient, and frequency-dependent differences that correspond to the distances between the individual microphones in the array and the source.
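The near-field boundary figures above follow directly from λ = c/f; a minimal sketch using the same 331.1 m/s value for c:

```python
c = 331.1  # speed of sound at 0 degrees Celsius, m/s (value used above)

for f_hz in (300, 3500):
    wavelength_m = c / f_hz
    print(f"{f_hz} Hz -> {wavelength_m:.4f} m")
# 300 Hz  -> 1.1037 m (outer edge of the speech near field)
# 3500 Hz -> 0.0946 m (inner edge, about 9.5 cm)
```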
Now suppose the two microphones are 3 cm apart and the microphone closest to the sound source is 5 cm away from it. Figure 2 shows that the first microphone (the one closest to the source) receives an audio signal with an SPL of 82 dB, while the second microphone (8 cm from the source) receives a 78 dB SPL signal. Even though the difference is only 4 dB, it is still quite large relative to the overall signal level.
In terms of spectral content, all near-field speech signals at the microphone array are closely correlated. Compared with the microphone closest to the source, the signal at the farthest microphone is reduced in amplitude and delayed in time. Recovering the voice signal under these conditions is not difficult.
A sound source outside the near-field range of the array is regarded as a far-field source and presents an essentially planar wavefront to the closely spaced microphones. Each microphone in the array then receives nearly the same sound energy, and the signals have essentially random phase, with little correspondence between them unless the microphones are very close together. As the source moves farther from the microphones, the absolute SPL at the microphones drops further.
Here is another example. If the same microphone array is placed 150 cm (1.5 meters) from the source, the SPL at the microphone nearest the source drops to 52.5 dB, while the SPL at the farthest microphone, 153 cm from the source, drops slightly further to 52.3 dB. The difference between the two is only 0.2 dB, yet the overall signal level at the nearest microphone has dropped by 30 dB.
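Both worked examples can be reproduced with the SPL formula from earlier; the point to notice is how the inter-microphone level difference collapses in the far field (the distances here are the 5 cm/8 cm and 150 cm/153 cm cases from the text):

```python
import math

def spl_db(x_m):
    return 96.0 - 20.0 * math.log10(x_m / 0.01)  # 96 dB at 1 cm reference

for near_m in (0.05, 1.50):       # nearest microphone at 5 cm, then 150 cm
    far_m = near_m + 0.03         # second microphone 3 cm farther away
    near_db, far_db = spl_db(near_m), spl_db(far_m)
    print(f"{near_m * 100:3.0f}/{far_m * 100:3.0f} cm: "
          f"{near_db:.1f} dB / {far_db:.1f} dB, diff {near_db - far_db:.1f} dB")
# 5/8 cm:     82.0 / 77.9 dB, diff 4.1 dB (near field)
# 150/153 cm: 52.5 / 52.3 dB, diff 0.2 dB (far field)
```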
After proper processing and filtering, the differences between the microphone outputs can be used to cancel the far-field noise, so that the composite output of the two microphones and the processing circuitry provides a high-clarity voice signal.
Characteristics of acoustic noise
Noise fields can be divided into three types: coherent noise, incoherent noise, and diffuse noise.
Coherent noise reaches the microphone without any reflection, scattering, or attenuation from obstacles in the environment.
Incoherent noise at one position has no relationship to the noise at any other position; it is treated as spatial white noise.
Diffuse noise has roughly equal energy propagating in all directions at the same time. Examples include office noise, airport terminal noise, and traffic noise; in other words, most noisy environments.
Temporally, acoustic noise falls into two types: steady-state noise and non-steady-state noise.
Steady-state noise has relatively stable energy and a known, slowly changing, predictable spectral content. Examples include engine noise, air-conditioning fans, and random or "white" noise. Noise suppression algorithms can suppress this type of noise effectively.
Non-steady-state noise changes in level and content over short periods of time, such as loud talking or shouting, passing cars, or clapping, and its occurrence is unpredictable. Such noises may disappear on their own before they can be identified and suppressed. Non-steady-state noise generally occurs on top of steady-state noise.
The most troublesome situation is when the noise source has the same timing, spectrum, and coherence characteristics as the speech signal. This occurs when the background noise is non-stationary and other people are talking nearby, as in restaurants, bars, stations, and parties.
Part 2
Microphone array solution
Depending on the method chosen, a microphone array can be a very efficient technology for suppressing both steady-state and non-steady-state noise.
With appropriate algorithms, the individual microphone signals in the array are filtered and then combined to achieve beamforming, or spatial filtering. This produces a complex polar response pattern for the array that can be steered toward or away from a given sound position, so the sound from a location can be isolated and enhanced, or suppressed and rejected. Similarly, the correlation between the microphone channels can be used to find the direction of the main signal and locate its source.
Depending on the complexity of the array and the application, it can be controlled by analog circuitry, a digital signal processor, appropriate computer software, or some combination of these methods.
Beamforming
Beamforming techniques fall into two categories: adaptive and directional.
In adaptive beamforming, the direction of the beam is adjusted through data-dependent filtering that changes the time response to the data. Several adaptive beamforming methods have been developed. Although the signal processing is more complicated, the advantage is greater design flexibility in the number, type, and spacing of the microphones. Adaptive beamforming generally requires a digital signal processor or computer software.
In directional beamforming, the steering direction of the beam is optimized for the azimuth of the desired sound source while noise from other directions is rejected. Typically, closely spaced differential end-fire microphone arrays with inherent directivity rely on fixed time delays or other methods to set the direction of the beam. For such applications, the filtering and signal processing must be optimized for the specific mechanical design. Directional beamforming can be implemented with analog circuitry, a digital signal processor, or computer software.
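As an illustration only, here is a minimal sketch of a fixed first-order differential (delay-and-subtract) end-fire beamformer for two microphones; the sample rate, spacing, and equalizer constant are assumptions for the example, not details of any particular product:

```python
import numpy as np

FS = 16000   # sample rate in Hz (assumed)
D = 0.02     # microphone spacing in meters (assumed)
C = 343.0    # speed of sound at room temperature, m/s

def differential_endfire(front: np.ndarray, rear: np.ndarray) -> np.ndarray:
    """Delay the rear microphone by the acoustic travel time across the
    array, then subtract. A plane wave arriving from the rear lines up
    with the delayed rear channel and cancels; sound from the front
    (end-fire direction) passes with a first-order difference response."""
    delay = max(1, round(FS * D / C))              # inter-mic delay, in samples
    rear_delayed = np.concatenate([np.zeros(delay), rear[:-delay]])
    diff = front - rear_delayed
    # The differencing gives a rising 6 dB/octave response; a crude leaky
    # integrator flattens it back toward a natural voice spectrum.
    out = np.zeros_like(diff)
    for n in range(1, len(diff)):
        out[n] = 0.95 * out[n - 1] + diff[n]
    return out
```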
For voice applications, a directional beamforming solution is preferable, especially when speech recognition is involved. Implemented in analog circuitry, such a solution can:
● Respond to noise input in real time
● Be implemented easily, with no algorithm development required
● Provide an acceptable signal-to-noise ratio improvement (SNRI) for both steady-state and non-steady-state noise
● Exhibit extremely low distortion when no voice is present, improving the overall mean opinion score in voice quality testing (ITU-T P.835)
● Offer low computational complexity and low signal delay
● Consume less power than other solutions
Compared with the directional scheme, adaptive beamforming implemented in a digital signal processor or software has these disadvantages:
● Implementing and tuning the suppression algorithm takes time to repeatedly identify and converge on the noise
● Although it can provide a better SNRI value, it usually introduces more problems into the voice output signal, including delays caused by noise convergence time, clicks and pops, unintended muting, frequency distortion, echo, or irregular signal-level changes related to sub-band frequency processing
● It is harder to implement because the algorithms must be developed separately
● It consumes more power
All beamforming solutions use very small arrays and are therefore very sensitive to errors, including those caused by microphone gain and phase mismatch, as well as deviations introduced because the audio signal paths are embedded in the product rather than open to the air. A beamforming solution must therefore include some form of compensation, which can be built into the beamforming system itself or provided externally through suitable microphone and audio signal path matching.
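One simple form of such compensation is a gain trim derived from a stretch of diffuse far-field noise, during which both microphones should see essentially equal energy. A generic sketch, not the internal method of any particular part:

```python
import numpy as np

def gain_match(mic1: np.ndarray, mic2: np.ndarray) -> np.ndarray:
    """Estimate the RMS level ratio of the two channels over a recording of
    diffuse far-field noise and scale channel 2 to match channel 1, removing
    a static gain mismatch between the microphones."""
    rms1 = np.sqrt(np.mean(mic1 ** 2))
    rms2 = np.sqrt(np.mean(mic2 ** 2))
    return mic2 * (rms1 / rms2)
```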
Microphone spacing
The Nyquist spatial sampling rate corresponds to one-half the wavelength of the highest frequency of interest (d = λ/2): to sample one wavelength of that frequency in space, the two sensors (microphones) must be separated by half a wavelength.
However, when the distance between the sensors exceeds half a wavelength (d > λ/2), spatial undersampling occurs: the wavefront travels more than half a wavelength before it reaches the second sensor. Spatial undersampling aliases higher-frequency signals down into the band of interest, corrupting the results. To prevent aliasing, the signal bandwidth must be limited so that it does not extend above the highest frequency of interest.
Many studies have shown that an efficient microphone array can be built by making the distance between the sensors as small as possible, far smaller than the Nyquist minimum requires. As another example, consider a sensor spacing of one-eighth of the wavelength of the sound of interest.
In a speech-only system, the frequency range is 300 Hz to 3500 Hz, with most of the sound energy between 500 Hz and 2500 Hz. Under these conditions, a λ/8 spacing is 1.18 cm at 3500 Hz and 1.65 cm at 2500 Hz.
Because wavelength increases as frequency falls, audio signals below 3500 Hz or 2500 Hz are still oversampled, so a spacing of 1.18 cm or 1.65 cm effectively obtains more signal samples.
Another approach fixes the spacing at two centimeters. At 2500 Hz this spacing, expressed as a fraction of wavelength, is d = λ/(c/(d·f)) = λ/(331.1/(0.02 × 2500)) = λ/6.62.
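The spacing arithmetic in this section is easy to reproduce; a short sketch using the article's 331.1 m/s value:

```python
c = 331.1  # m/s, speed of sound at 0 degrees Celsius, as used above

# Half-wavelength (Nyquist) spacing and lambda/8 spacing at the key frequencies
for f in (3500, 2500):
    lam = c / f
    print(f"{f} Hz: lambda/2 = {100 * lam / 2:.2f} cm, "
          f"lambda/8 = {100 * lam / 8:.3f} cm")
# 3500 Hz: lambda/2 = 4.73 cm, lambda/8 = 1.183 cm
# 2500 Hz: lambda/2 = 6.62 cm, lambda/8 = 1.655 cm

# A fixed 2 cm spacing expressed as a fraction of wavelength at 2500 Hz
d = 0.02
print(f"2 cm at 2500 Hz = lambda/{(c / 2500) / d:.2f}")  # lambda/6.62
```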
If the spatial sampling rate still falls short of the λ/2 requirement at the highest frequency of interest, the microphone spacing must be adjusted to meet the product's application requirements. As the spacing shrinks (and the spatial sampling rate rises), the coherence between the far-field signals at the microphones increases, letting the array achieve better overall background noise suppression at every frequency. Conversely, as the spacing widens, the overall suppression capability of the array falls and it becomes difficult to handle lower-frequency signals.
Once the sensor spacing is determined, the array can be optimized for the frequency requirements. If a directional beamforming scheme is adopted, the response pattern of the array must be fixed at the same time.
As with any product, compromises must be made during design, including the operating frequency range versus the required noise suppression level, the theoretical versus practical microphone spacing, and the overall cost and complexity of the array system.
Examples of microphone array solutions
The following uses National Semiconductor's far-field noise suppression microphone array amplifier, the LMV1088, as an example of a microphone array solution; it provides up to 20 dB of background noise suppression for voice applications. The LMV1088 is an analog directional beamforming solution for differential two-microphone end-fire arrays built with omnidirectional microphones.
The two microphones are mounted about 1.5 cm to 2.5 cm apart, with matched acoustic path distances. The distance between the talker and the microphones of the handset or headset is best kept between 2 cm and 10 cm. The fall-off of the voice signal with distance can be calculated using Figures 1 and 2.
The LMV1088 not only provides initial compensation for differences between the sound at the two channels and between the microphone and amplifier signal paths, but also performs corrective filtering to make the voice output more natural, and provides band-limiting filtering.
Because the gain of the internal amplifiers can be adjusted via I2C commands, microphones of different sensitivities can be used, and the LMV1088's output signal level can be matched to the analog input requirements of a variety of communications processors and devices.
The LMV1088 supports four operating modes, selectable through I2C commands:
● Preset mode: both microphones are used together for noise suppression
● Independent mode: microphone 1 or microphone 2 is used alone (no noise suppression)
● Total mode: the outputs of the two microphones are summed, giving the microphone signal a 6 dB gain (no noise suppression)
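As a sketch of what mode selection over I2C might look like from a host processor: the bus number, device address, register address, and register values below are placeholders for illustration, not the LMV1088's actual register map (consult the datasheet for that):

```python
from smbus2 import SMBus  # pip install smbus2

I2C_BUS = 1         # hypothetical I2C bus number
DEV_ADDR = 0x65     # hypothetical device address
MODE_REG = 0x00     # hypothetical mode register address

MODES = {           # hypothetical register values for the four modes
    "noise_suppression": 0x00,  # both microphones, suppression active
    "mic1_only": 0x01,          # independent mode, microphone 1
    "mic2_only": 0x02,          # independent mode, microphone 2
    "summed": 0x03,             # outputs added (+6 dB), no suppression
}

def set_mode(mode: str) -> None:
    """Write the selected mode value to the device's mode register over I2C."""
    with SMBus(I2C_BUS) as bus:
        bus.write_byte_data(DEV_ADDR, MODE_REG, MODES[mode])

set_mode("noise_suppression")
```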
The analog nature of the LMV1088 provides several characteristics that traditional DSP solutions lack:
● No extra time is needed for noise-convergence calculations to accommodate the background noise level and type, so it responds in real time to both speech and background noise and eliminates annoying short-term speech dropouts;
● Because no sub-band frequency processing algorithm is used, it produces no frequency distortion, clicks, pops, or other artifacts in the output;
● It can reinforce the mono echo cancellation processing in the existing system.
Comparison and testing of different microphone array solutions
To compare and measure the effects of different background noise suppression solutions accurately, all test setups and conditions must be kept consistent in order to obtain credible results.
For this reason, several standard tests were arranged, most of them based on ITU-T Recommendations P.56, P.58, P.64, P.830, and P.835.
ITU-T P.835 is designed specifically for subjective testing and can effectively evaluate the voice output quality of a system, including the effectiveness of its noise suppression. The recommendation defines a method for evaluating the subjective quality of speech in noisy environments and is particularly suitable for evaluating noise suppression algorithms. It uses separate rating scales to divide the test into three independent evaluations: the subjective quality of the speech signal alone, the subjective quality of the background noise alone, and the overall quality of speech in background noise (the mean opinion score).
Figure 3. Noise, far field, speech, optimized speech
As for IEEE standards, IEEE 1209-1994 and IEEE 269-1992 can be used for testing. The former measures the transmission performance of telephone handsets and headsets, while the latter covers the transmission performance of analog and digital telephones. Both documents have since been replaced by IEEE 269-2002.
The above standards can be combined to obtain objective numerical measurements and accurately evaluate the subjective voice quality and electronic speech recognition performance of different background noise suppression solutions.
Generally, a system's noise suppression figures are provided by the manufacturer. They may represent the best the system can achieve, but for applications that demand high voice quality, these preset levels may not meet the requirements.
It is therefore very difficult, and sometimes even misleading, to state a noise suppression value on a solution's data sheet unless all test conditions are clearly specified. Data sheets generally do not provide such detailed data, and even when they do, it is of limited practical use, because the customer's conditions are unlikely to match the data sheet's test conditions exactly.