VISIOCITY

A New Dataset and Evaluation Framework for Realistic Video Summarization

Automatic video summarization remains an unsolved problem due to several challenges. The lack of a challenging dataset and of a rich automatic evaluation framework are two issues often discussed in the literature. We introduce a new benchmarking video dataset called VISIOCITY (VIdeo SummarIzatiOn based on Continuity, Intent and DiversiTY). While currently available datasets either have very short videos or have a few long videos of only a particular type (Table 1), VISIOCITY is a diverse collection of 67 long videos spanning six different categories with dense concept annotations (Table 2). Owing to its rich annotations, it supports different flavors of video summarization as well as other vision problems such as event localization and action recognition. More details about VISIOCITY can be found in the paper here.

VISIOCITY compared with other datasets

Table 1: Comparison with other datasets

VISIOCITY Key Stats

Table 2: VISIOCITY Stats

VISIOCITY at a Glance

VISIOCITY Framework At a Glance

VISIOCITY is a diverse collection of 67 long videos spanning six different domains: TV shows (Friends), sports (soccer), surveillance, education (tech-talks), birthday videos and wedding videos.

A sample from each category is shown below.

Image from Friends

Figure 1: Clockwise from top left: soccer_18, friends_1, surveillance_8, wedding_5, birthday_10, techtalk_4

Annotator tool

Figure 2: Annotation tool

Supervised automatic video-summarization techniques tend to work better than unsupervised techniques because they learn directly from human summaries. However, since there is no single 'right' answer (for the reasons highlighted in Figure 3), two human summaries can differ considerably in their selections. For example, as shown in Figure 4, for the soccer_18 video the peaks correspond to those seconds that were selected in summaries by many humans, yet there are many seconds that were selected by only one human.




No single right answer

Figure 3: There can be many "right answers" or ideal summaries

(In)consistency in human summaries

Figure 4: Human selections for the soccer_18 video show agreement as well as disagreement; the two peaks (selections where most humans agree) correspond to the two goals in this video

Thus, the more human summaries available, the better the learning. Unfortunately, for long videos, multiple human summaries with diverse characteristics and of different lengths are difficult to obtain. In VISIOCITY we use Pareto optimality to automatically generate multiple reference summaries with different characteristics from the indirect ground truth present in VISIOCITY. For example, maximizing a particular scoring function yields a summary rich in that particular characteristic, but it may fall short on other characteristics (Figure 5). Hence, different weighted combinations of measures (each modeling a certain characteristic such as diversity, continuity or importance) are maximized to arrive at optimal ground truth summaries. We show that these summaries are on par with human summaries (Figures 6, 7, 8).
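To make the idea concrete, here is a minimal sketch (not the released VISIOCITY code) of how a reference summary could be generated by greedily maximizing a weighted combination of scoring functions under a length budget. The scoring functions and snippet representation are hypothetical stand-ins for the measures computed from VISIOCITY's annotations; sweeping over different weight vectors would yield multiple reference summaries with different characteristics.

# Sketch only: hypothetical scoring functions, each mapping a candidate
# summary (a list of snippet ids) to a score for one characteristic,
# are combined with weights and maximized greedily under a budget.

def combined_score(summary, weights, measures):
    # Weighted combination of per-characteristic scores for a candidate summary.
    return sum(w * m(summary) for w, m in zip(weights, measures))

def greedy_reference_summary(snippets, weights, measures, budget):
    # Greedily add the snippet giving the largest gain in the combined score.
    summary, remaining = [], list(snippets)
    while len(summary) < budget and remaining:
        best = max(remaining,
                   key=lambda s: combined_score(summary + [s], weights, measures))
        summary.append(best)
        remaining.remove(best)
    return sorted(summary)

# Example weight choices: (1.0, 0.0, 0.0) emphasizes only importance,
# (0.3, 0.5, 0.2) yields a more balanced summary; each choice produces
# a different automatic reference summary.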

Figure 5: Automatic ground truth summary of soccer_18 produced by (left) maximizing only the importance score, (right) maximizing only the mega-event continuity score

Figure 6: Some human summaries (1, 2, 6, 7) for a 23-minute Friends video (friends_5)

Figure 7: Some automatically generated reference summaries for the same video

Selections by some human summaries of friends_5
Selections by some auto summaries of friends_5

Figure 8: Shot numbers selected by human summaries (left) and by auto summaries (right) for the above summaries of friends_5

A video summary is typically evaluated by comparing it against a human (reference) summary. This has limitations: as noted above, there is no single 'right' answer, so a good candidate summary may be penalized simply for deviating from the particular reference it is compared against. Further, a single measure (say F1) to evaluate a summary, as is the current typical practice, falls short in some ways. One (good) human summary could contain more important but less diverse segments, while another (good) human summary could contain more diverse but less important segments (Figure 9). In VISIOCITY we thus use a suite of measures (Figure 10) to capture various aspects of a summary such as continuity, diversity, redundancy and importance. These are computed using the annotations provided in VISIOCITY (indirect ground truth), as opposed to comparing the candidate summary to ground truth summaries.
Interplay of different measures across different human summaries of friends_5

Figure 9: Interplay of different measures across different human summaries of friends_5

Example of different measures modeling different desired characteristics in a summary

Figure 10: Evaluation measures / scoring functions used in VISIOCITY
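As a rough illustration of how such measures can be computed from annotations rather than from reference summaries, here is a minimal sketch with hypothetical data structures: each snippet is assumed to carry an importance rating and a set of concept labels. The exact definitions used in VISIOCITY are given in the paper; these are simplified stand-ins.

# Sketch only: measures computed directly from (hypothetical) annotations.
# ratings: dict snippet_id -> importance rating
# concepts: dict snippet_id -> set of concept labels

def importance_score(summary, ratings):
    # Average per-snippet importance rating of the selected snippets.
    return sum(ratings[s] for s in summary) / max(len(summary), 1)

def diversity_score(summary, concepts):
    # Fraction of the video's distinct concepts covered by the summary.
    covered = set().union(*(concepts[s] for s in summary)) if summary else set()
    all_concepts = set().union(*concepts.values())
    return len(covered) / max(len(all_concepts), 1)

def redundancy_score(summary, concepts):
    # Fraction of selected snippets whose concepts are already covered earlier.
    seen, repeats = set(), 0
    for s in summary:
        if concepts[s] and concepts[s] <= seen:
            repeats += 1
        seen |= concepts[s]
    return repeats / max(len(summary), 1)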

Code

The code can be downloaded from the following git repositories: Tool, Evaluation Utils

Videos, Annotations and Summaries

The links to download the videos, annotations and human summaries will be made available after filling out the following Google form.
VISIOCITY can serve as a challenging benchmark dataset. We test the performance of a few representative state-of-the-art video summarization techniques on VISIOCITY, assessed using the various measures. We also leverage VISIOCITY to demonstrate that, when multiple ground truth summaries with different characteristics are available, learning from a single oracle combined ground truth summary using a single loss function (as is common practice) is not a good idea. A simple recipe, VISIOCITY-SUM (called "Ours" in Figure 11), which uses a weighted mixture model and learns its weights from the individual ground truth summaries using a combination of losses (each measuring deviation from a different characteristic), outperforms the other techniques.
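The following is a rough sketch, under our own simplifying assumptions rather than the released VISIOCITY-SUM recipe, of the weight-learning idea: the model scores a summary as a weighted mixture of per-characteristic scoring functions, and the mixture weights are fit against the individual reference summaries by minimizing a sum of per-characteristic losses. The score_matrix and target arrays, and the projected-gradient fitting loop, are illustrative choices.

import numpy as np

# Sketch only: score_matrix[i, j] is the j-th characteristic score that the
# model assigns to the i-th reference summary; target[i] is the desired
# overall score for that reference summary.

def mixture_loss(weights, score_matrix, target):
    # Combination of losses: squared deviation summed over reference summaries.
    predicted = score_matrix @ weights
    return float(np.sum((predicted - target) ** 2))

def fit_mixture_weights(score_matrix, target, lr=0.01, steps=1000):
    # Fit non-negative mixture weights by simple projected gradient descent.
    n_measures = score_matrix.shape[1]
    weights = np.ones(n_measures) / n_measures
    for _ in range(steps):
        grad = 2 * score_matrix.T @ (score_matrix @ weights - target)
        weights = np.clip(weights - lr * grad, 0.0, None)  # keep weights >= 0
    return weights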

Figure 11: Results using mixture model on VISIOCITY

Some of the videos were downloaded from YouTube and may be subject to copyright. We do not own the copyright to those videos and provide them for non-commercial research purposes only. The annotation data can be used freely for research purposes. If you use VISIOCITY or refer to it, please cite the following paper:
      
@misc{kaushal2021good,
  title={How Good is a Video Summary? A New Benchmarking Dataset and Evaluation Framework Towards Realistic Video Summarization},
  author={Vishal Kaushal and Suraj Kothawade and Anshul Tomar and Rishabh Iyer and Ganesh Ramakrishnan},
  year={2021},
  eprint={2101.10514},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
      
    

For any communication regarding VISIOCITY, please contact: Vishal Kaushal [vkaushal at cse dot iitb dot ac dot in]