This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Zhihang Ren, University of California, Berkeley and these authors contributed equally to this work;
(2) Jefferson Ortega, University of California, Berkeley and these authors contributed equally to this work;
(3) Yifan Wang, University of California, Berkeley and these authors contributed equally to this work;
(4) Zhimin Chen, University of California, Berkeley;
(5) Yunhui Guo, University of Texas at Dallas;
(6) Stella X. Yu, University of California, Berkeley and University of Michigan, Ann Arbor;
(7) David Whitney, University of California, Berkeley.
A benefit of the VEATIC dataset is that it has multiple annotators for each video, with a minimum of 25 and a maximum of 73 annotators per video. Emotion perception is subjective, and observers' judgments can vary across individuals. Many previously published emotion datasets have very few annotators, often only a single-digit number (n < 10). Having so few annotators is problematic because of the increased variance across observers. To show this, we calculated how the average rating for each video in our dataset varied if we randomly sampled, with replacement, five versus all annotators. We repeated this process 1000 times for each video and calculated the standard deviation of the recalculated average rating. Figure 12a shows how the standard deviation of the consensus rating across videos varies when we use either five or all annotators for each video. This analysis shows that having more annotators leads to much smaller standard deviations in the consensus rating, which can yield more accurate representations of the ground truth emotion in the videos.
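The resampling procedure described above is a standard bootstrap. The following is a minimal sketch of it, assuming each video's ratings are stored as a 1D NumPy array of per-annotator values; the variable names, array shapes, and synthetic data are our illustration, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_sd(ratings, n_annotators, n_boot=1000):
    """Bootstrap the consensus (mean) rating for one video.

    ratings: 1D array of per-annotator ratings for a single video
             (valence or arousal), assumed here to lie in [-1, 1].
    n_annotators: how many annotators to resample with replacement.
    Returns the standard deviation of the bootstrapped consensus rating.
    """
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        # Sample annotators with replacement and recompute the average rating.
        sample = rng.choice(ratings, size=n_annotators, replace=True)
        boot_means[b] = sample.mean()
    return boot_means.std()

# Hypothetical example: one video rated by 30 annotators.
ratings = rng.uniform(-1, 1, size=30)
sd_five = consensus_sd(ratings, n_annotators=5)            # few annotators
sd_all = consensus_sd(ratings, n_annotators=len(ratings))  # all annotators
print(f"SD of consensus with 5 annotators: {sd_five:.3f}, with all: {sd_all:.3f}")
```

Because the bootstrapped mean's variability shrinks roughly with the square root of the sample size, the all-annotator condition should show a markedly smaller standard deviation than the five-annotator condition, consistent with Figure 12a.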
Additionally, we investigated how observers' responses varied across videos by calculating the standard deviation across observers for each video. Figure 12b shows the standard deviations across videos. We find that the standard deviations for both the valence and arousal dimensions were small, with valence having an average standard deviation of µ = 0.248 and a median of 0.222, and arousal having an average standard deviation of µ = 0.248 and a median of 0.244, which are comparable to the valence and arousal rating variances of EMOTIC [32].
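For the per-video variability analysis, a minimal sketch under the same assumptions follows; the number of videos and the synthetic ratings are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: one 1D array of ratings per video, with annotator counts
# drawn from the 25-73 range reported for the dataset (high bound exclusive).
video_ratings = [rng.uniform(-1, 1, size=n) for n in rng.integers(25, 74, size=100)]

# Standard deviation across observers for each video, then summary statistics.
per_video_sd = np.array([r.std() for r in video_ratings])
print(f"mean SD: {per_video_sd.mean():.3f}, median SD: {np.median(per_video_sd):.3f}")
```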