Speaker Verification Report on CCP Media Reports on "Tiananmen Self-Immolation"
1. Introduction
On March 12, 2003, the National Taiwan University Speech Processing Laboratory was asked to conduct tests on three episodes of a program entitled "Jiao Dian Fang Tan" broadcast by China Central Television (CCTV). The purpose of the test is to verify whether the two people who appear repeatedly in these three episodes, Liu Baorong and Wang Jindong, are actually the same people each time they appear.
The three videos used in this test are from the CCTV program "Jiao Dian Fang Tan." The videos consist of interviews of Liu and Wang regarding the Tiananmen self-immolation incident that occurred on January 23, 2001. Liu Baorong appears in Videos 1 and 2, and Wang Jindong appears in all three videos.
The recording environment varies across the interviews. In Video 1, Liu Baorong is interviewed indoors in a quiet setting; in Video 2, she is interviewed in her bedroom. Wang Jindong's interview in Video 1 takes place in a hospital room; the first part of his Video 2 interview is in an echoing hallway and the second part in a large, quiet room. Such differing recording conditions pose challenges for speaker verification. Later in this section, we discuss the method adopted in this report to address this issue.
For many years, the National Taiwan University Speech Processing Laboratory has been dedicated to advancing Chinese speech recognition and verification technology, and has accumulated many achievements. This test was conducted on the basis of speaker verification technology researched and developed by Weiren Chung for his Master's thesis of June 2001 [1].
Speaker verification is a technology that verifies a speaker’s identity based on the speaker’s voice. Similar research done around the world can be traced back many years. Popular applications include financial transactions and crime investigation and prevention, among others.
According to Reference [1], popular models for speaker verification include the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), and Eigenvoice. The Gaussian Mixture Model is a simplification of the Hidden Markov Model. Its principle is to separate one speaker's Training Corpus into groups according to the characteristics of the sound, and to describe each group of acoustic characteristics with a Gaussian distribution.
The Hidden Markov Model performs better in speaker verification than the Gaussian Mixture Model, but because it is more complicated and requires a larger Training Corpus, it is not suitable for this test. The Eigenvoice model was not adopted because its performance is not as good as that of the Gaussian Mixture Model.
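As an illustration of the GMM approach, the following is a minimal sketch assuming scikit-learn as a stand-in for the thesis implementation of Reference [1]; the data here is random filler, and the component count is reduced to keep the sketch fast:

```python
# Minimal GMM speaker-model sketch (illustrative only; the report's own
# implementation follows Reference [1], not scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((5000, 39))  # stand-in Training Corpus (39-dim frames)
test_feats = rng.standard_normal((300, 39))    # stand-in test segment

# Each mixture component is one Gaussian describing one group of similar frames
# (the report itself uses 512 mixtures; 16 keeps this sketch fast).
gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(train_feats)

# Average per-frame log-likelihood of the test segment under this speaker's model.
print(gmm.score(test_feats))
```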
As stated at the beginning of this section, different recording conditions create difficulties for speaker verification. They can produce verification results indicating that two speakers are different when, in fact, the two recordings came from the same person, owing to differences in the environment (different microphones, noise, echoes, etc.). This situation is called False Rejection.
False Rejection occurs when the speaker matches the declared identity but is rejected by the system. Conversely, False Acceptance occurs when the speaker differs from the declared identity but is accepted by the system. Usually, False Rejection and False Acceptance cannot be improved at the same time; there is a trade-off between the two. When one of them is lowered (by decreasing or increasing the threshold), the other increases.
To meet the requirement of high credibility, our test is designed to have a minimal probability of False Rejection and a maximal probability of False Acceptance. Because the probability of False Rejection is then very low, if the system still decides to reject, the probability that the rejection is accurate is greatly improved.
This test adopts a threshold as the criterion for acceptance or rejection: the system accepts if the score is higher than the threshold and rejects if it is lower. Thus, by selecting a reasonable but low threshold, we achieve the goal of lower False Rejection and higher False Acceptance.
Observing the three videos, we see that the female reporter conducting the interviews appears multiple times, under many different recording conditions (outdoors, hospital, bedroom, jail, hallway, etc.). If the threshold can be set so that these voice segments with different recording conditions are all verified as the same person, that is, set low enough for all of them to be accepted by the system, then maximum credibility is achieved. (Note: the female reporters in the three videos are not necessarily the same person. This is not a factor, because the system must accept even under the worst conditions.)
2. Theoretical Background
2.1 Speaker Verification Device
The speaker verification device used in this report is a Log-Likelihood Ratio Detector, shown in the diagram below.

[Figure: block diagram of the Log-Likelihood Ratio Detector: front-end processing, Speaker Dependent and Background Speaker models, and score subtraction]
When the test voice passes through the front-end processor, Feature Vectors are extracted. Log-Likelihoods of the Feature Vectors are then calculated separately under the Speaker Dependent model and the Background Speaker model, and the final score is obtained by subtracting the latter from the former. The purpose is to lower Inner-Speaker Variation while retaining Inter-Speaker Variation in the final score.
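A compact sketch of this scoring step, assuming `sd_model` and `bg_model` are two fitted GaussianMixture models as in the sketch of Section 1:

```python
from sklearn.mixture import GaussianMixture  # two fitted models assumed, as above

def llr_score(test_feats, sd_model, bg_model):
    """Log-Likelihood Ratio score for one test segment.

    GaussianMixture.score() already averages over the T frames, so the
    subtraction below cancels environment effects shared by both models
    (lowering Inner-Speaker Variation) while keeping what differs between
    speakers (Inter-Speaker Variation).
    """
    return sd_model.score(test_feats) - bg_model.score(test_feats)
```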
2.2 Background Speaker Model
The Background Speaker model is used to normalize the scores. It lowers Inner-Speaker Variation and retains Inter-Speaker Variation in the scores [1].
In larger-scale applications of speaker verification systems, in order to simplify the complexity of system design, a Speaker Independent model is usually used as every speaker’s Background Speaker model [1].
A Speaker Independent model is obtained by training on the pooled Training Corpus of all speakers.
2.3 Speaker Dependent Model
The purpose of the Speaker Dependent model is to simulate each speaker's acoustic features; the model for each speaker should represent that speaker's voice and acoustic characteristics. The Speaker Dependent model is derived from the Speaker Independent model by Bayesian Adaptation, using that speaker's own corpus as the adaptation corpus.
3. Test Methods and Results
3.1 Recordings
Three videos (zf1.rm, zf2.rm, zf3.rm) were played via RealPlayer while the sound card directly recorded the sound signals (that is, playback and recording took place simultaneously inside the sound card; no external wires were used). The sampling parameters are:

| Sampling Rate | 8 kHz |
| Sample Size | 16-bit |
| Channels | 2 |
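As an aside, here is a minimal sketch of loading such a capture for processing; the file name is hypothetical, and the downmix to mono is our assumption rather than a step the report states:

```python
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("zf1_capture.wav")    # hypothetical capture file
assert rate == 8000 and samples.dtype == np.int16  # 8 kHz, 16-bit, per the table
mono = samples.astype(np.float64).mean(axis=1)     # average the 2 channels to mono
```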
3.2 Audio Segments
The required segments were cut from the recorded audio as follows:
| Title | Speaker | Source | Length (m:s) | Time Distribution |
| Zf1_liubaorong | Liu Baorong | Zf1.rm | 2:36 | 1:34-1:43* |
| Zf2_liubaorong | Liu Baorong | Zf2.rm | 0:32 | 6:40-7:30* |
| Zf1_wangjindong | Wang Jindong | Zf1.rm | 0:06 | 4:30-4:34 |
| Zf2_wangjindong | Wang Jindong | Zf2.rm | 0:30 | 9:06-9:24* |
| Zf2_wangjindong2 | Wang Jindong | Zf2.rm | 4:08 | 10:28-10:40* |
| Zf3_wangjindong | Wang Jindong | Zf3.rm | 0:55 | 9:07-9:22 |
| Zf1_reporter | Reporter | Zf1.rm | 0:05 | 9:11-9:18* |
| Zf1_reporter2 | Reporter | Zf1.rm | 0:09 | 12:36-12:44* |
| Zf1_reporter3 | Reporter | Zf1.rm | 0:07 | 13:07-13:18* |
| Zf1_reporter4 | Reporter | Zf1.rm | 0:15 | 13:44-13:48* |
| Zf1_reporter5 | Reporter | Zf1.rm | 0:05 | 15:22-15:28* |
| Zf2_reporter | Reporter | Zf2.rm | 0:15 | 3:05-3:06 |
| Zf2_reporter2 | Reporter | Zf2.rm | 0:11 | 3:48-3:50 |
| Zf2_reporter3 | Reporter | Zf2.rm | 0:05 | 5:35-5:42* |
| Zf2_reporter4 | Reporter | Zf2.rm | 0:03 | 6:51-6:53 |
| Zf2_reporter5 | Reporter | Zf2.rm | 0:03 | 8:09-8:11 |
| Zf2_reporter6 | Reporter | Zf2.rm | 0:03 | 9:01-9:05* |
| Zf2_reporter7 | Reporter | Zf2.rm | 0:31 | 10:59-12:00* |
| Zf3_reporter | Reporter | Zf3.rm | 0:13 | 2:04-2:13* |
| Zf3_reporter2 | Reporter | Zf3.rm | 0:16 | 4:22-4:25 |
* Other people's voices were eliminated from these segments.
Zf1_liubaorong is 2 minutes 36 seconds long, and Zf2_wangjindong2 is 4 minutes 8 seconds long. Because these two are the longest, they are used as the Training Corpus for Liu Baorong's and Wang Jindong's Speaker Dependent models, respectively.
Because the female reporter's individual segments are all too short to train a model, we assembled her corpus by video as follows:
| Zf1_reporter_all | zf1_reporter + zf1_reporter2 + zf1_reporter3 + zf1_reporter4 + zf1_reporter5 |
| Zf2_reporter_all | zf2_reporter + zf2_reporter2 + zf2_reporter3 + zf2_reporter4 + zf2_reporter5 + zf2_reporter6 + zf2_reporter7 |
| Zf3_reporter_all | zf3_reporter + zf3_reporter2 |
| Reporter-1_2 | Zf1_reporter_all + Zf2_reporter_all |
| Reporter-2_3 | Zf2_reporter_all + Zf3_reporter_all |
| Reporter-1_3 | Zf1_reporter_all + Zf3_reporter_all |
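A minimal sketch of this assembly step, assuming each segment was saved as a WAV file named after its table entry (the file names are hypothetical):

```python
import numpy as np
from scipy.io import wavfile

def concat_segments(names, out_name):
    """Concatenate same-rate WAV segments into one corpus file."""
    rate, pieces = None, []
    for name in names:
        r, samples = wavfile.read(name + ".wav")  # hypothetical per-segment files
        rate = rate or r
        assert r == rate, "all segments must share one sampling rate"
        pieces.append(samples)
    wavfile.write(out_name + ".wav", rate, np.concatenate(pieces))

# e.g. the Video 3 assembly from the table above:
concat_segments(["zf3_reporter", "zf3_reporter2"], "zf3_reporter_all")
```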
Reporter-1_2, Reporter-2_3, and Reporter-1_3 were used to train three different Speaker Dependent models. To determine the threshold, these models were tested against the held-out corpora Zf3_reporter_all, Zf1_reporter_all, and Zf2_reporter_all, respectively.
Finally, there is a corpus to train the Speaker Independent model:
| ZFAll_vocal | All voices in all three videos |
3.3 Obtaining the Feature Vector
The Feature Vector used in this report is a 39-dimensional MFCC (Mel-Frequency Cepstral Coefficient) vector, extracted with the following parameters:
| Pre-emphasis Filter | 1 - 0.97z^(-1) |
| Frame Size | 32 ms |
| Frame Shift | 10 ms |
| Filter Bank | Mel-Scale Triangular Filter Banks |
| Number of Filter Banks | 26 |
| Low Cut-off Frequency | 300 Hz |
| High Cut-off Frequency | 3400 Hz |
| Feature Vector | 12 Mel-Frequency Cepstral Coefficients and one short-time energy, plus their first- and second-order derivatives (39 dimensions) |
The Feature Vectors were obtained with the HCopy tool of HTK 3.0 [2].
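The report's features come from HCopy; as a rough modern equivalent (our substitution, not the author's pipeline), the python_speech_features package can produce comparable 39-dimensional vectors with the parameters from the table above:

```python
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

rate, mono = wavfile.read("zf1_liubaorong.wav")  # hypothetical mono segment file

static = mfcc(mono, samplerate=rate,
              winlen=0.032, winstep=0.01,    # 32 ms frames, 10 ms shift
              numcep=13, nfilt=26,           # 12 cepstra + energy, 26 filter banks
              lowfreq=300, highfreq=3400,
              preemph=0.97, appendEnergy=True)
d1 = delta(static, 2)                        # first-order derivatives
d2 = delta(d1, 2)                            # second-order derivatives
feats = np.hstack([static, d1, d2])          # 39 dimensions per frame
```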
3.4 Training the Speaker Independent Model
The Training Corpus is ZFAll_vocal. The training method first obtains an initial model through Vector Quantization: when the number of clusters is less than 8, Modified K-means is used; when it is greater than 8, Binary Split is used. After the initial model is obtained, Expectation Maximization is performed to obtain the final model [1].
According to [1], in speaker verification using the Gaussian Mixture Model, the error rate is lowest when the Number of Mixtures is 512 or 1024. To reduce the amount of computation, this report adopted the following:

| Number of Mixtures | 512 |
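A compressed sketch of this training recipe, simplified to binary splitting throughout and using scikit-learn for the k-means refinement and EM steps (a stand-in for the thesis implementation; the data is random filler):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def binary_split_codebook(X, n_codes, eps=1e-3):
    """Grow a VQ codebook 1 -> 2 -> 4 -> ... by perturb-and-split plus k-means."""
    codebook = X.mean(axis=0, keepdims=True)
    while len(codebook) < n_codes:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        codebook = KMeans(n_clusters=len(codebook), init=codebook,
                          n_init=1).fit(X).cluster_centers_
    return codebook

X = np.random.default_rng(0).standard_normal((20000, 39))  # stand-in for ZFAll_vocal
init_means = binary_split_codebook(X, 512)                 # VQ initial model
si_model = GaussianMixture(n_components=512, covariance_type="diag",
                           means_init=init_means).fit(X)   # EM refinement
```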
3.5 Speaker Dependent Model
The Speaker Dependent model is derived from the Speaker Independent model of the previous section by Bayesian Adaptation. Only the mean vectors are adapted; the mixture weights and variances are taken unchanged from the Speaker Independent model.
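A minimal sketch of this means-only adaptation in the standard relevance-factor form; the relevance value of 16 is a common convention and our assumption, not necessarily the value used in Reference [1]:

```python
import numpy as np

def adapt_means(si_model, X, relevance=16.0):
    """Means-only Bayesian (MAP) adaptation of a fitted GaussianMixture.

    Mixture weights and variances are left as in the Speaker Independent
    model; only the mean of each mixture moves toward the adaptation data.
    """
    post = si_model.predict_proba(X)              # frame posteriors, shape (T, K)
    n_k = post.sum(axis=0)                        # soft frame count per mixture
    x_bar = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # weighted data means
    alpha = (n_k / (n_k + relevance))[:, None]    # more data -> trust data more
    return alpha * x_bar + (1 - alpha) * si_model.means_
```

The returned means can be placed into a copy of the Speaker Independent model to form each Speaker Dependent model.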
The Speaker Dependent models and their adaptation corpora used in this report are as follows:
| Speaker Dependent Model | Corpus for Adaptation |
| Zf1_liubaorong.sd.modal | Zf1_liubaorong |
| Zf2_wangjindong2.sd.modal | Zf2_wangjindong2 |
| Reporter-1_2.sd.modal | Reporter-1_2 |
| Reporter-2_3.sd.modal | Reporter-2_3 |
| Reporter-1_3.sd.modal | Reporter-1_3 |
3.6 Speaker Verification
Derived from the diagram in Section 2.1, the verification score of each segment of test speech is

    score = (1/T) * sum_{t=1..T} [ log p(x_t | S) - log p(x_t | S') ]

where T is the number of frames in the test speech, x_t is the Feature Vector at time t, and S and S' are the Speaker Dependent model and Speaker Independent model, respectively.
Verification Scores:
| SD Model | Test Corpus | Verification Score |
| Zf1_liubaorong | Zf2_liubaorong | -0.042003 |
| Zf2_wangjindong2 | Zf1_wangjindong | -0.201615 |
| Zf2_wangjindong2 | Zf2_wangjindong | 0.128923 |
| Zf2_wangjindong2 | Zf3_wangjindong | 0.325247 |
| Reporter-1_2 | Zf3_reporter_all | 0.146295 |
| Reporter-2_3 | Zf1_reporter_all | 0.022340 |
| Reporter-1_3 | Zf2_reporter_all | 0.012399 |
The first verification score is the verification of Liu Baorong’s voice in the second video using the first video’s interview of Liu Baorong as the training model.
The second, third, and fourth verification scores use Video 2's second interview of Wang Jindong as the training model to verify the interview of Wang Jindong in Video 1, the first interview in Video 2, and the interview in Video 3.
The fifth verification score uses the reporter's voice in Videos 1 and 2 as the Training Corpus to verify the reporter's voice in Video 3. The sixth and seventh verification scores are the analogous tests for the other two combinations.
As described in Section 1, in order to achieve credibility, the threshold should be set so that all three test corpora of the reporter are accepted by the test. Accordingly, the smallest of test scores 5, 6, and 7 is selected: 0.012399.

| Threshold | 0.012399 |
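Restating the decision rule from Section 1 as a few lines (scores copied from the tables in this section; a score equal to the threshold is accepted):

```python
# The threshold is the smallest reporter score, so all reporter tests pass.
reporter_scores = [0.146295, 0.022340, 0.012399]
threshold = min(reporter_scores)                    # 0.012399

def decide(score):
    return "Acceptance" if score >= threshold else "Rejection"

print(decide(-0.042003))  # Liu Baorong, video 1 vs. video 2 -> Rejection
print(decide(0.325247))   # Wang Jindong, video 2 vs. video 3 -> Acceptance
```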
Verification Results:
| Reference Speaker | Test Speaker | Score | Threshold | Result |
| Liu Baorong in video 1 (Zf1_liubaorong) | Liu Baorong in video 2 (Zf2_liubaorong) | -0.042003 | 0.012399 | Rejection |
| Second interview in video 2 (Zf2_wangjindong2) | Wang Jindong in video 1 (Zf1_wangjindong) | -0.201615 | 0.012399 | Rejection |
| Second interview in video 2 (Zf2_wangjindong2) | Wang Jindong in first interview in video 2 (Zf2_wangjindong) | 0.128923 | 0.012399 | Acceptance |
| Second interview in video 2 (Zf2_wangjindong2) | Wang Jindong in video 3 (Zf3_wangjindong) | 0.325247 | 0.012399 | Acceptance |
| Female reporter in videos 1 and 2 (Reporter-1_2) | Female reporter in video 3 (Zf3_reporter_all) | 0.146295 | 0.012399 | Acceptance |
| Female reporter in videos 2 and 3 (Reporter-2_3) | Female reporter in video 1 (Zf1_reporter_all) | 0.022340 | 0.012399 | Acceptance |
| Female reporter in videos 1 and 3 (Reporter-1_3) | Female reporter in video 2 (Zf2_reporter_all) | 0.012399 | 0.012399 | Acceptance |
In the Result column, Acceptance means the test speaker and the reference speaker (the speaker whose voice trained the model) are determined to be the same person; Rejection means they are determined not to be the same person.
From the table above, under the condition of minimizing the possibility of False Rejection (that is, "try hard not to reject," accepting whenever two voices show a certain similarity), and based on the voices available to this experiment, the following conclusions can be drawn: Liu Baorong in the first video and Liu Baorong in the second video are not the same person; the Wang Jindong of both interviews in the second video and the Wang Jindong in the third video are the same person; and Wang Jindong in the first video is not the same person as the Wang Jindong in the other two videos.
4. Conclusion
By using Gaussian Mixture Model speaker verification technology, this report reaches the conclusion that the Liu Baorong and the Wang Jindong in the first video are not the same people as the Liu Baorong and the Wang Jindong in the second video.
In Section 3.4, this report used a model with 512 mixtures. This report also conducted experiments with 256 and 128 mixtures; apart from slightly different scores, the conclusions reached (Acceptance or Rejection) were exactly the same.
References
[1] Weiren Chung, "An Initial Study on Speaker Recognition and Verification," Master's Thesis, National Taiwan University, June 2001.
[2] Steve Young, Dan Kershaw, Julian Odell, et al., The HTK Book (for HTK Version 3.0), July 2000.
World Organization to Investigate the Persecution of Falun Gong
Tel: 1-347-448-5790; Fax: 1-347-402-1444
Mail Address: P.O. Box 84, New York, NY 10116
Website: http://www.upholdjustice.org/, http://www.zhuichaguoji.org/