To upload videos to YouTube with the best possible sound quality, you need to know YouTube's loudness normalization specification.
However, that specification is not published. Some people have already investigated it, but the specific calculation formula is not known.
So I tried to estimate the formula YouTube uses for loudness normalization.
YouTube loudness normalization specification
The following is a summary of the results.
Loudness normalization adjusts the loudness of the source toward the loudness target value as far as possible without letting the peak clip.
YouTube calculates the loudness of the source with its own method, but it can be approximated to within about 1 dB by replacing the weighting curve of the Short-term loudness of EBU TECH 3341 with the following and taking the maximum value of the Short-term loudness.
The sections below examine the overall framework of YouTube's loudness normalization and the details of the loudness calculation.
Overall framework of YouTube's loudness normalization
Judging from the information here, it probably works as follows.
YouTube adjusts the loudness of the source toward the loudness target value as far as possible without letting the peak clip. Written as an expression:
Compensation (dB) = Min(-Peak, Target - Loudness)
Peak is the peak level of the source, Loudness is the loudness of the source, Target is a constant (the loudness target value), and Compensation is the correction gain. The volume of the whole video changes uniformly by the amount of Compensation.
The "content loudness" value shown in the detailed statistics ("Stats for nerds") when you right-click a YouTube video is equal to Loudness - Target.
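As a minimal sketch, the expression above can be written as a small Python function. The function names and the numbers in the example are made up for illustration; in particular, the target of -10 is only a placeholder and not a measured value.

```python
def compensation_db(peak_db: float, loudness_db: float, target_db: float) -> float:
    """Correction gain: move toward the target, but never push the peak above 0 dBFS."""
    return min(-peak_db, target_db - loudness_db)


def content_loudness_db(loudness_db: float, target_db: float) -> float:
    """Value displayed as "content loudness" (Loudness - Target)."""
    return loudness_db - target_db


# Loud source: 4 dB above the (placeholder) target, so it is turned down by 4 dB.
print(compensation_db(peak_db=-1.0, loudness_db=-6.0, target_db=-10.0))   # -4.0

# Quiet source: 10 dB below the target, but the peak only leaves 3 dB of headroom.
print(compensation_db(peak_db=-3.0, loudness_db=-20.0, target_db=-10.0))  # 3.0
```

In the second case the source stays 7 dB below the target after normalization, because the peak limits the gain.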
Loudness calculation formula on YouTube
YouTube seems to use its own loudness calculation formula, so it has to be guessed.
Consider the following model, based on ITU-R BS.1770-3.
Equalizer -> Cut by window -> Convert to LUFS -> Gating -> Aggregation
In the equalizer step, each frequency is weighted.
In earlier experiments, neither the K-weighting adopted in ITU-R BS.1770-3 nor other common weightings matched, so the frequency response is estimated directly.
Cut by window
Cut the waveform into segments with a rectangular window.
The window length and the overlap ratio are parameters.
For reference, the Momentary and Integrated loudness of ITU-R BS.1770-3 and EBU TECH 3341 use a window length of 400 ms and an overlap length of 300 ms (an overlap ratio of 75%, i.e. a step of 100 ms). The Short-term loudness of EBU TECH 3341 uses a window length of 3 seconds and an overlap length of 2.9 seconds or more (an overlap ratio of 96.7% or more).
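The relation between window length, overlap ratio and step can be sketched as follows. The function name `window_starts` and the decision to drop a trailing partial window are assumptions made for this illustration.

```python
def window_starts(n_samples, sample_rate, window_sec, overlap_ratio):
    """Return (window length, step, start indices) for rectangular windows.

    Only windows that fit entirely inside the signal are returned.
    """
    window = round(window_sec * sample_rate)
    # The step (hop) is the part of the window that does not overlap the next one.
    hop = max(1, round(window * (1.0 - overlap_ratio)))
    starts = list(range(0, n_samples - window + 1, hop))
    return window, hop, starts


# One second of audio at 48 kHz, 400 ms windows with 75% overlap.
window, hop, starts = window_starts(48000, 48000, 0.4, 0.75)
print(window, hop, len(starts))  # 19200 4800 7
```

With these parameters the step is 4800 samples, which is 100 ms.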
Convert to LUFS
Calculate the RMS of each extracted segment and convert it to LUFS by taking the logarithm, 20 × log10(RMS).
An offset is also applied so that a full-scale stereo 1000 Hz sine wave reads 0 LUFS. For ITU-R BS.1770-3 this offset is -0.691 dB.
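A rough sketch of this conversion for one stereo block is shown below. It assumes that the two channels are combined by summing their mean squares, as ITU-R BS.1770-3 does; whether YouTube does the same is not known. Note that 10 × log10 of a mean square is the same as 20 × log10 of the RMS. The equalizer is left out here, so the offset needed to make the reference sine read 0 is simply 0; with an equalizer, `offset_db` would have to cancel its gain at 1000 Hz.

```python
import math


def block_loudness(left, right, offset_db=0.0):
    """Loudness of one stereo block from the summed channel mean squares."""
    ms_left = sum(x * x for x in left) / len(left)
    ms_right = sum(x * x for x in right) / len(right)
    power = ms_left + ms_right
    if power <= 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(power) + offset_db


# Full-scale 1000 Hz sine in both channels, 100 ms at 48 kHz (whole cycles only).
sr = 48000
sine = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr // 10)]
print(abs(block_loudness(sine, sine)) < 1e-6)  # True: the block reads about 0
```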
In the gating step, quiet segments are discarded from the values obtained by windowing, so that silent passages do not affect the loudness.
Following ITU-R BS.1770-3 and EBU TECH 3342, both absolute threshold gating and relative threshold gating are applied.
The parameters are the two threshold values. I also try patterns without gating.
For reference, the parameters of ITU-R BS.1770-3 and EBU TECH 3341 are Absolute Threshold -70 LKFS and Relative Threshold -10 dB. Parameters for calculating the Loudness Range of EBU TECH 3342 are Absolute Threshold -70 LKFS and Relative Threshold -20 dB.
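The two gating stages can be sketched like this. The sketch follows the ITU-R BS.1770-3 idea of computing the relative threshold from the power-domain mean of the blocks that passed the absolute gate; the function name and the example values are hypothetical.

```python
import math


def gate(block_lufs, abs_threshold=-70.0, rel_threshold_db=-10.0):
    """Return the block loudness values that survive gating.

    Pass None for a threshold to skip that stage.
    """
    kept = list(block_lufs)
    if abs_threshold is not None:
        kept = [v for v in kept if v > abs_threshold]
    if rel_threshold_db is not None and kept:
        # Average in the power domain, then place the gate below that level.
        mean_power = sum(10.0 ** (v / 10.0) for v in kept) / len(kept)
        if mean_power > 0.0:
            level = 10.0 * math.log10(mean_power) + rel_threshold_db
            kept = [v for v in kept if v > level]
    return kept


blocks = [-80.0, -40.0, -12.0, -10.0, -11.0]
print(gate(blocks))              # [-12.0, -10.0, -11.0]
print(gate(blocks, None, None))  # no gating: all five blocks are kept
```

In the example, -80 is removed by the absolute gate and -40 by the relative gate, which sits at roughly -22.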
In the aggregation step, take the average or the maximum of the values that remain after gating.
ITU-R BS.1770-3 takes the average, but according to this, the maximum of the Short-term loudness may be used instead.
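The two aggregation options might look like the sketch below. Averaging in the power domain (rather than averaging the dB values directly) is what ITU-R BS.1770-3 does and is assumed here; the function name `aggregate` is made up for this example.

```python
import math


def aggregate(block_lufs, mode="mean"):
    """Combine the gated block loudness values into a single number."""
    if not block_lufs:
        return float("-inf")
    if mode == "max":
        return max(block_lufs)
    # Power-domain mean, converted back to dB.
    mean_power = sum(10.0 ** (v / 10.0) for v in block_lufs) / len(block_lufs)
    return 10.0 * math.log10(mean_power)


blocks = [-20.0, -10.0]
print(aggregate(blocks, "max"))             # -10.0
print(round(aggregate(blocks, "mean"), 2))  # -12.6
```

For material with a wide dynamic range the two modes can differ by many dB, which is why the choice matters for the estimate.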
Test video used for parameter estimation
Test videos are needed to estimate the parameters of the loudness calculation model.
According to here, loudness normalization may not be applied until a video has a certain number of views, or until some time has passed since it was posted. Therefore, instead of uploading my own test videos, I selected existing videos that have enough views and were posted long enough ago, and used them as test videos.
A list of test videos is described in the Appendix.
Equalizer parameter estimation
A sine wave test video with constant volume removes every effect on loudness other than the equalizer. Using this, the frequency response of the equalizer is estimated first.
For sine wave sources of various frequencies, measure the content loudness on YouTube and estimate the frequency response by taking the difference from the RMS of the source. The estimation result is below. For detailed data, see the Appendix.
Above 16 kHz the results were unstable; for example, different videos gave different results at the same frequency. The following discussion therefore uses only the data up to 15 kHz. Below 44 Hz and above 15 kHz, the response is extended by linear extrapolation.
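One way to turn the measured points into a continuous frequency response is a piecewise-linear function whose first and last segments are extended outside the measured range. The sketch below interpolates linearly in Hz, which is an assumption (the axis could equally be log frequency), and the data points are invented placeholders, not the measured values from the Appendix.

```python
from bisect import bisect_right


def eq_gain_db(freq_hz, points):
    """Piecewise-linear gain through (frequency, gain_db) points sorted by frequency.

    Outside the measured range, the nearest end segment is extended linearly.
    """
    if len(points) < 2:
        raise ValueError("need at least two measured points")
    freqs = [f for f, _ in points]
    # Left index of the segment, clamped so end segments are reused for extrapolation.
    i = min(max(bisect_right(freqs, freq_hz) - 1, 0), len(points) - 2)
    (f0, g0), (f1, g1) = points[i], points[i + 1]
    return g0 + (g1 - g0) * (freq_hz - f0) / (f1 - f0)


# Hypothetical measurements: (frequency in Hz, gain in dB).
points = [(44.0, -10.0), (1000.0, 0.0), (15000.0, -6.0)]
print(eq_gain_db(8000.0, points))   # -3.0 (interpolated)
print(eq_gain_db(22000.0, points))  # -9.0 (extrapolated above 15 kHz)
```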
Parameter estimation other than equalizer
Next, with the frequency response of the equalizer fixed, the remaining parameters are estimated.
Calculate the loudness of various videos with various parameter combinations, compare the result with the loudness calculated by YouTube (Content Loudness), and look for the combination with the smallest error. The test video list is given in the Appendix.
|Parameter||Candidate values|
|Window length||400 ms, 3 sec|
|Overlap ratio||75%, 96.7%|
|Absolute threshold||None, -70 LKFS|
|Relative threshold||None, -10 dB, -20 dB|
|Parameters||Estimated Target (LUFS)||Error Stddev (dB)||Error Max (dB)|
|abs threshold none, rel threshold none, window 0.4 sec, overlap 75%, mean||-16.15449408||5.51255362||10.73290254|
|abs threshold none, rel threshold none, window 3 sec, overlap 96.7%, mean||-14.97681484||4.908278646||11.91484089|
|abs threshold none, rel threshold -10 dB, window 0.4 sec, overlap 75%, mean||-13.94987923||3.954370989||7.389401665|
|abs threshold none, rel threshold -10 dB, window 3 sec, overlap 96.7%, mean||-13.68684721||3.684007274||7.647167492|
|abs threshold none, rel threshold -20 dB, window 0.4 sec, overlap 75%, mean||-14.49831437||4.531255406||9.145055115|
|abs threshold none, rel threshold -20 dB, window 3 sec, overlap 96.7%, mean||-14.01660691||4.048723057||9.667181199|
|abs threshold -70 LUFS, rel threshold none, window 0.4 sec, overlap 75%, mean||-16.15449408||5.51255362||10.73290254|
|abs threshold -70 LUFS, rel threshold none, window 3 sec, overlap 96.7%, mean||-14.97681484||4.908278646||11.91484089|
|abs threshold -70 LUFS, rel threshold -10 dB, window 0.4 sec, overlap 75%, mean||-13.89217514||3.911543318||7.447105751|
|abs threshold -70 LUFS, rel threshold -10 dB, window 3 sec, overlap 96.7%, mean||-13.66565863||3.666025972||7.668356069|
|abs threshold -70 LUFS, rel threshold -20 dB, window 0.4 sec, overlap 75%, mean||-14.47170654||4.52391958||9.171662946|
|abs threshold -70 LUFS, rel threshold -20 dB, window 3 sec, overlap 96.7%, mean||-14.00512426||4.038389533||9.678663846|
|abs threshold none, rel threshold none, window 0.4 sec, overlap 75%, max||-8.993721502||1.106961021||2.968119771|
|abs threshold none, rel threshold none, window 3 sec, overlap 96.7%, max||-10.31246414||0.90143559||1.746039964|
The parameter combination with the smallest error was a window length of 3 seconds, an overlap ratio of 96.7%, and Max aggregation. The standard deviation of the error was 0.9 dB and the maximum error was 1.7 dB. This corresponds to the maximum value of the Short-term loudness of EBU TECH 3341. The estimated loudness target value is -10.3 LUFS.
With this, YouTube's loudness calculation can be approximated.
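For readers who want to repeat the comparison, the three result columns could be computed roughly as follows. This is only a guess at the procedure: it assumes that Content Loudness = Loudness - Target holds for every video, that the target is estimated as the mean of the per-video differences, and that the standard deviation is the population form. The function name and the sample numbers are hypothetical.

```python
import statistics


def fit_target(model_loudness, content_loudness):
    """Estimate Target and the error statistics for one parameter combination.

    model_loudness:   loudness computed by the model for each test video
    content_loudness: content loudness reported by YouTube for the same videos
    """
    # Content loudness = Loudness - Target, so each video yields one target estimate.
    targets = [m - c for m, c in zip(model_loudness, content_loudness)]
    target = statistics.mean(targets)
    errors = [t - target for t in targets]
    return target, statistics.pstdev(errors), max(abs(e) for e in errors)


# Hypothetical values for three test videos.
model = [-12.0, -8.0, -15.5]
content = [-2.0, 2.5, -5.0]
target, err_std, err_max = fit_target(model, content)
print(round(target, 2), round(err_std, 2), round(err_max, 2))
```

Running this for every parameter combination and picking the one with the smallest error reproduces the kind of comparison shown in the table.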
2018/12/09 Fixed a calculation error (latest version)
I investigated the formula for loudness normalization on YouTube and found an expression that approximates it to within about 1 dB.