A New Dataset and Evaluation Framework for Realistic Video Summarization
Automatic video summarization remains an unsolved problem due to several challenges. The lack of a challenging dataset and of a rich automatic evaluation framework are two issues frequently raised in the literature. We introduce a new benchmark video dataset called VISIOCITY (VIdeo SummarIzatiOn based on Continuity, Intent and DiversiTY). While currently available datasets either have very short videos or have a few long videos of only a particular type (Table 1), VISIOCITY is a diverse collection of 67 long videos spanning six different categories with dense concept annotations (Table 2). Owing to its rich annotations, it supports different flavors of video summarization as well as other vision problems like event localization and action recognition. More details about VISIOCITY can be found in the paper here.
Table 1: Comparison with other datasets
Table 2: VISIOCITY Stats
VISIOCITY Framework At a Glance
VISIOCITY is a diverse collection of 67 long videos spanning six different domains: TV shows (Friends), sports (soccer), surveillance, education (tech talks), birthday videos and wedding videos.
- The TV shows category contains videos from the popular TV series Friends. They are typically more aesthetic in nature and professionally shot and edited.
- In the sports category, VISIOCITY contains soccer videos. These videos typically have well-defined events of interest, like goals or penalty kicks, and are very similar to each other in terms of visual features.
- Under the surveillance category, VISIOCITY covers diverse settings like indoor, outdoor, classroom, office and lobby. The videos were recorded using our own surveillance cameras. They are in general very long and mostly come from static, continuously recording cameras.
- Under the educational category, VISIOCITY has tech-talk videos with static, inset or dynamic views.
- In the personal videos category, VISIOCITY has birthday and wedding videos. These videos are typically long and unedited.
A sample from each category is shown below.
Figure 1: Clockwise from the top left: soccer_18, friends_1, surveillance_8, wedding_5, birthday_10, techtalk_4
- Indirect ground truth
  - Concepts marked for each shot (a rough sketch of this structure is shown after this list)
  - Allows "generating" different ground truth summaries required for supervised learning
  - Makes the annotation process more objective and easier, compared to asking annotators to directly provide reference summaries, ratings or scores, which becomes very difficult for long videos
- Concepts
  - Carefully selected based on the type of video
  - Organized into categories rather than a flat list
  - Example categories include 'actor', 'entity', 'action', 'scene', 'number-of-people', etc.
  - Categories provide a natural structuring that makes the annotation process easier and also support at least one level of concept hierarchy for concept-driven summarization
- Mega events
  - Mark consecutive shots which together constitute a cohesive event
  - For example, a few shots preceding a goal in a soccer video, the goal shot and a few shots after the goal shot together constitute a 'mega-event'
  - A model trained to learn (only) importance scores would do well to pick the 'goal' snippet. However, such a summary will not be pleasing. The notion of 'mega-events' allows for modeling continuity (Figure 5)
- Protocol
  - 13 professional annotators
  - Audio turned off
  - GUI tool to make the process easy and error-free (Figure 2)
- Gold standard
  - Guidelines and protocols were made as objective as possible
  - Annotators were trained through sample annotation tasks
  - The annotation round was followed by two verification rounds, in which both precision (how accurate the annotations were) and recall (whether all events of interest and continuity information had been captured in the annotations) were verified by another set of annotators
  - Whatever inconsistencies or inaccuracies were found and could be automatically detected were included in our automatic sanity checks, which were run on all annotations
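The indirect ground truth described above can be pictured roughly as follows. This is a hypothetical sketch, not the actual VISIOCITY annotation JSON schema: all field names below are illustrative assumptions.

```python
# Hypothetical sketch of the indirect ground truth for one video.
# Field names ("shots", "concepts", "mega_events", ...) are illustrative
# and do NOT reflect the actual VISIOCITY annotation JSON schema.
annotation = {
    "video": "soccer_18",
    "shots": [
        {
            "id": 42,
            "start_sec": 610,
            "end_sec": 615,
            # Concepts are grouped by category rather than kept as a flat list.
            "concepts": {
                "actor": ["player", "goalkeeper"],
                "action": ["goal"],
                "scene": ["penalty-box"],
                "number-of-people": ["many"],
            },
        },
        # ... one entry per shot ...
    ],
    # Consecutive shots that together form a cohesive event, e.g. the
    # build-up to a goal, the goal shot itself and the celebration after it.
    "mega_events": [
        {"shot_ids": [40, 41, 42, 43]},
    ],
}
```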
Figure 2: Annotation tool
Supervised automatic video summarization techniques tend to work better than unsupervised techniques because they learn directly from human summaries. However, since there is no single 'right' answer (for the reasons highlighted in Figure 3), two human summaries could be quite different in their selections. For example, as shown in Figure 4 for the soccer_18 video, the peaks correspond to the seconds selected in the summary by many humans; yet there are many seconds which have been selected by only one human.
Figure 3: There can be many "right answers" or ideal summaries
Figure 4: Human selections for the soccer_18 video show agreement as well as disagreement; the two peaks (selections where most humans agree) correspond to the two goals in this video
Thus, the more human summaries there are, the better the learning. Unfortunately, for long videos, multiple human summaries with diverse characteristics and of different lengths are difficult to obtain. In VISIOCITY we use Pareto optimality to automatically generate multiple reference summaries with different characteristics from the indirect ground truth present in VISIOCITY. For example, maximizing a particular scoring function would yield a summary rich in that particular characteristic; however, it may fall short on other characteristics (Figure 5). Hence, different weighted combinations of measures (each modeling a certain characteristic like diversity, continuity or importance) are maximized to arrive at optimal ground truth summaries. We show that these summaries are on par with human summaries (Figures 6, 7, 8).
Figure 5: Automatic ground truth summary of soccer_18 produced by (left) maximizing only the importance score, (right) maximizing only the mega-event continuity score
Figure 6: Some human summaries (1, 2, 6, 7) for a 23-minute Friends video (friends_5)
Figure 7: Some automatically generated reference summaries for the same video
Figure 8: Shot numbers selected by human summaries (left) and by automatic summaries (right) for the above summaries of friends_5
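The ground truth generation described above can be sketched roughly as follows: greedily maximize a weighted combination of per-characteristic scoring functions under a length budget, and sweep the weights (the lambdas) to obtain reference summaries with different characteristics. This is a minimal illustration, not the actual GTSummaryGenerator implementation; the scoring functions are placeholders supplied by the caller.

```python
# Minimal sketch: build one reference summary by greedily maximizing a
# weighted combination of scoring functions under a snippet budget.
# The scorers and lambdas are placeholders, not the exact measures or
# implementation used by VISIOCITY's GTSummaryGenerator.

def generate_gt_summary(snippets, scorers, lambdas, budget):
    """Greedily maximize sum_i lambdas[i] * scorers[i](summary)."""
    def combined(candidate):
        return sum(l * s(candidate) for l, s in zip(lambdas, scorers))

    summary, remaining = [], list(snippets)
    while remaining and len(summary) < budget:
        # Pick the snippet whose addition increases the combined score the most.
        best = max(remaining, key=lambda x: combined(summary + [x]))
        if combined(summary + [best]) <= combined(summary):
            break  # no remaining snippet improves the combined score
        summary.append(best)
        remaining.remove(best)
    return sorted(summary)

# Different lambda configurations emphasize different characteristics,
# e.g. importance only vs. importance plus mega-event continuity.
```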
A video summary is typically evaluated by comparing it against a human (reference) summary. This has the following limitations:
- Human summaries are themselves inconsistent with each other
- A workaround is to report the maximum score across many human summaries. This again falls short, especially in the context of long videos: a good candidate may get a low score just because it was not fortunate enough to have a matching human summary
- The typical measure used is F1, which has its own limitations. For example, it is not designed to measure aspects like continuity and diversity
Further, a single measure (say F1) to evaluate a summary, as is the current typical practice, falls short in some ways. One (good) human summary could contain more important but less diverse segments, while another (good) human summary could contain more diverse but less important segments (Figure 9). In VISIOCITY we thus use a suite of measures (Figure 10) to capture various aspects of a summary like continuity, diversity, redundancy and importance. These are computed using the annotations provided in VISIOCITY (the indirect ground truth), rather than by comparing the candidate summary to ground truth summaries.
Figure 9: Interplay of different measures across different human summaries of friends_5
Figure 10: Evaluation measures / scoring functions used in VISIOCITY
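As a simplified illustration of scoring a candidate summary directly from the indirect ground truth (rather than against any single reference summary), the toy functions below compute a stand-in importance score and diversity score from per-shot concept annotations. The actual VISIOCITY measures (Figure 10) are defined in the paper; the functions and data here are assumptions for illustration only.

```python
# Toy stand-ins for two evaluation measures: a candidate summary is scored
# directly from per-shot concept annotations (indirect ground truth)
# instead of being compared to a single reference summary. These are NOT
# the exact VISIOCITY scoring functions.

def importance_score(summary_shots, shot_concepts, concept_weights):
    """Sum of (hypothetical) per-concept importance weights over selected shots."""
    return sum(
        concept_weights.get(c, 0.0)
        for shot in summary_shots
        for c in shot_concepts[shot]
    )

def diversity_score(summary_shots, shot_concepts):
    """Fraction of distinct concepts among all selected concepts (1.0 = no repetition)."""
    selected = [c for shot in summary_shots for c in shot_concepts[shot]]
    return len(set(selected)) / max(len(selected), 1)

# Made-up annotations for illustration:
shot_concepts = {1: {"goal", "player"}, 2: {"player"}, 3: {"crowd"}}
concept_weights = {"goal": 3.0, "player": 1.0, "crowd": 0.5}
candidate = [1, 3]
print(importance_score(candidate, shot_concepts, concept_weights))  # 4.5
print(diversity_score(candidate, shot_concepts))                    # 1.0
```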
Code
The code can be downloaded from this git repository.
Tool
- Pre-requisites: python3 and the following Python packages: tkinter, ffmpeg, opencv, pillow, imagetk, Pmw, bs4
- For annotation: python tool.py soccer.json
- As annotation viewer: python tool.py soccer.json vis
- Summary viewer: GUI tool to view a summary, given its JSON. For example: python3 summaryViewer.py --video ~/data/soccer/soccer_1.mp4 --summary summary.json --annotation ~/data/soccer/soccer_1.json --configfile soccer.json
Evaluation
- GenerateFrameEvalNumbers.cc: code to compute all scores of a summary based on the frames contained in the summary JSON
- GenerateVisContUniformNumbers.cc: code to compute the visual continuity score and uniformity score of a summary, given the summary JSON
- GenerateAllEvalNumbers.cc: code to compute all scores of a summary based on the snippets contained in its summary JSON
- computeScoresOfHuman.py: script to compute all scores of all human summaries given all human summary JSONs. For example: python computeScoresOfHuman.py ~/data/ soccer build/ True 2>&1 | tee soccerHumanScores.log
- computeScoresOfRandom.py: script to compute all scores of all random summaries given all random summary JSONs
- GTSummaryGenerator.cc: code to automatically generate a ground truth summary given a configuration of lambdas
Utils
- hsImageAndHistogram.py: utility to create a video overlaid with all human summary selections for that video by all humans and to produce other statistics of human summaries, given the human summary JSONs
- summaryFramesToVideoGenerator.py: utility to create a human summary video given a human summary JSON
- summarySnippetsToVideoGenerator.py: utility to create a summary video given the snippets information in a summary JSON. For example: python2 summarySnippetsToVideoGenerator.py soccer_18_imp_mega.json soccer_18.json soccer.json soccer_18.mp4 soccer_18_imp_mega.mp4
- generateAllHumanSummaryVideos.py: to generate human summary videos for all videos using all human summary JSONs
- randomSummaryGenerator.py: code to generate random summaries
Videos, Annotations and Summaries
The links to download the videos, annotations and human summaries will be available after filling out the following Google form.
VISIOCITY can serve as a challenging benchmark dataset. We test the performance of a few representative state-of-the-art video summarization techniques on VISIOCITY, assessed using the various measures. We also leverage VISIOCITY to demonstrate that with multiple ground truth summaries possessing different characteristics, learning from a single oracle combined ground truth summary (as is common practice) using a single loss function is not a good idea. A simple recipe, VISIOCITY-SUM (called "Ours" in Figure 11), which uses a weighted mixture model and learns the weights using the individual ground truth summaries and a combination of losses (each measuring deviation from a different characteristic), outperforms the other techniques.
Figure 11: Results using the mixture model on VISIOCITY
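The recipe above can be sketched roughly as a weighted mixture of per-characteristic scoring functions whose weights are fit against the individual ground truth summaries using a combination of losses. The sketch below is a minimal gradient-based illustration under those assumptions, not the actual VISIOCITY-SUM implementation; the loss callables are placeholders.

```python
import numpy as np

# Minimal sketch of learning mixture weights w over per-characteristic
# scoring functions by minimizing a sum of losses, each measuring deviation
# from one ground truth summary on one characteristic. This is an
# illustration, not the actual VISIOCITY-SUM implementation.

def mixture_score(weights, component_scores):
    """Score of a candidate summary as a weighted mixture of individual measures."""
    return float(np.dot(weights, component_scores))

def learn_weights(gt_summaries, losses, num_measures, lr=0.01, steps=200):
    """gt_summaries: list of ground truth summaries (each with its own character).
    losses: callables loss(weights, gt_summary) -> (value, gradient)."""
    w = np.ones(num_measures) / num_measures
    for _ in range(steps):
        grad = np.zeros(num_measures)
        for gt in gt_summaries:
            for loss in losses:
                _, g = loss(w, gt)
                grad += g
        w = np.clip(w - lr * grad, 0.0, None)  # gradient step, keep weights >= 0
        w /= max(w.sum(), 1e-8)                # renormalize to a mixture
    return w
```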
The videos were partially downloaded from YouTube and some may be subject to copyright. We do not own the copyright of those videos and provide them for non-commercial research purposes only.
The annotation data can be used freely for research purposes. If you use VISIOCITY or refer to it, please cite the following paper:
@misc{kaushal2021good,
title={How Good is a Video Summary? A New Benchmarking Dataset and Evaluation Framework Towards Realistic Video Summarization},
author={Vishal Kaushal and Suraj Kothawade and Anshul Tomar and Rishabh Iyer and Ganesh Ramakrishnan},
year={2021},
eprint={2101.10514},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
For any communication regarding VISIOCITY, please contact: Vishal Kaushal [vkaushal at cse dot iitb dot ac dot in]