Video Saliency Prediction Challenge 2024
Participate
Participation closed!
Dataset
You can find the evaluation implementation and dataset description at our GitHub!
You can find the challenge dataset on Google Drive or on Synology [updated: test dataset added]
Videos.zip
- 1500 (1000 Train + 500 Test) .mp4 videos (kind reminder: many videos contain an audio stream, and users watched the videos with the sound turned ON!)
VideoInfo.json
- meta information about each video (e.g. license)
TrainTestSplit.json
- in this JSON we provide the Train/Public Test/Private Test split of all videos
SaliencyTrain.zip
- almost-losslessly compressed (CRF 0, 10-bit, min-max normalized) continuous saliency map videos for the Train subset
FixationsTrain.zip
- contains the following files for the Train subset:
FixationsTrain/Train/video_name/fixations.json
- per-frame fixation coordinates from which the saliency maps were obtained; this JSON will be used for metrics calculation (see the loading sketch below)
FixationsTrain/Train/video_name/fixations/
- binary fixation maps in .png format (since some fixations can share the same pixel, this is a lossy representation and is NOT used either for calculating metrics or for generating Gaussians; we provide them for visualization and frame count checks)
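For convenience, here is a minimal Python sketch of loading fixations.json and rendering a continuous saliency map by Gaussian-blurring one frame's fixation points. The JSON layout ({frame_index: [[x, y], ...]}) and the blur sigma below are assumptions for illustration only; the exact schema and the official map-generation code are in our GitHub repository.

```python
import json
import numpy as np
from scipy.ndimage import gaussian_filter

def load_fixations(json_path):
    """Load per-frame fixation coordinates.

    Assumes a {frame_index: [[x, y], ...]} layout -- verify the real schema
    against the challenge GitHub repository.
    """
    with open(json_path) as f:
        return json.load(f)

def fixations_to_saliency(points, height, width, sigma=30.0):
    """Rasterize one frame's fixation points into a fixation map and blur it
    into a continuous, min-max normalized saliency map."""
    fixation_map = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fixation_map[yi, xi] += 1.0  # several fixations may share a pixel
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())
    return saliency
```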
More details about data collection procedure:
- Continuous mouse movements were collected as the video played, and the sound was turned ON throughout the video (audio captchas were checked for every participant at the beginning and in the middle of each batch of videos). Thus, video playback and the audio stream are perfectly aligned, and each participant viewed the whole video at full playback speed.
- Since different browsers and hardware provide different mousemove event update rates, we resampled each mouse movement track to 100 Hz using linear interpolation. So for each video (at 30 FPS) we have about 50 aligned mouse fixation trajectories (from 50 observers) at 100 Hz. In the per-frame annotations, we project onto each frame all the fixations whose timestamps fall within that frame (i.e., within its 1/30 s duration); see the sketch after this list.
- Since mouse movements naturally lag behind eye movements, we found the optimal shift through cross-validation against an eye-tracking dataset and have already applied this shift (around 300 ms) to the provided dataset to compensate for the lag. To further improve data quality, we also trimmed the first second of all videos and annotations (participants need time to initially navigate to the salient area, and trimming increased consistency with the eye-tracking data).
- We also filtered out views with low mouse-movement frequency (i.e., the mouse barely moved), as well as viewers with low agreement with the eye tracker on the validation videos (viewers did not know which videos inside the batch were validation ones).
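The resampling and frame projection can be sketched as follows (illustrative only; the released annotations already have the 100 Hz resampling, the ~300 ms lag shift, and the first-second trim applied, so none of this needs to be redone):

```python
import numpy as np

def resample_track(timestamps_s, xs, ys, rate_hz=100.0):
    """Linearly resample a raw mousemove track to a fixed rate
    (the challenge data were resampled to 100 Hz this way)."""
    t_uniform = np.arange(timestamps_s[0], timestamps_s[-1], 1.0 / rate_hz)
    x_uniform = np.interp(t_uniform, timestamps_s, xs)
    y_uniform = np.interp(t_uniform, timestamps_s, ys)
    return t_uniform, x_uniform, y_uniform

def assign_to_frames(t_uniform, fps=30.0):
    """Map each resampled sample to the video frame whose 1/fps window
    contains its timestamp."""
    return np.floor(t_uniform * fps).astype(int)
```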
Submitting
To participate in the Challenge, you need to send a link to a zip archive with dense saliency maps (your model's predictions) for all 500 Test videos. During the challenge, the leaderboard shows only the ‘Public Test’ metric results, while the final ranking will be computed on the ‘Private Test’ subset:
submission.zip
- contains the following files for the Test subset:
video_name1.mp4
- video with predicted per-frame saliency maps
video_name2.mp4
…
Your .mp4 files will be extracted into .png frames for metrics computation, so double-check that the number of frames in each of your saliency videos matches the number of frames in the corresponding video (a frame-count check is sketched below).
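A simple way to verify the frame counts, assuming OpenCV and hypothetical file paths:

```python
import cv2  # pip install opencv-python

def frame_count(video_path):
    """Count frames by decoding the whole video (container metadata can
    report an inaccurate frame count for some codecs)."""
    cap = cv2.VideoCapture(video_path)
    n = 0
    while True:
        ok, _ = cap.read()
        if not ok:
            break
        n += 1
    cap.release()
    return n

# Hypothetical paths -- adjust to your local layout.
assert frame_count("submission/video_name1.mp4") == \
       frame_count("Videos/Test/video_name1.mp4"), "frame count mismatch"
```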
The link should allow the archive to be downloaded from Google Drive using gdown or from any other storage using wget. Before submitting, check that the archive is downloadable by everyone who has the link.
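For example, you can verify the link yourself with gdown's Python API before sending it (the URL below is a placeholder):

```python
import gdown  # pip install gdown

# Placeholder sharing URL -- substitute your own file ID.
url = "https://drive.google.com/file/d/<FILE_ID>/view?usp=sharing"
gdown.download(url, "submission.zip", fuzzy=True, quiet=False)
```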
Here is an example of a submission link.
Evaluation
The evaluation compares the saliency maps predicted by the participants' models with ground-truth saliency maps collected from real people via crowdsourcing.
The comparison is carried out according to 4 popular metrics for the saliency prediction task (a reference sketch of these metrics follows the list):
- Area Under the Curve Judd (AUC Judd),
- Linear Correlation Coefficient (CC),
- Similarity (SIM),
- Normalized Scanpath Saliency (NSS).
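The official evaluation code is in our GitHub repository; the sketch below shows the four metrics in their commonly used per-frame formulations (pred is the predicted map, gt the continuous ground-truth map, fixation_map a binary map of fixation locations) and is provided for orientation only:

```python
import numpy as np

EPS = 1e-12

def cc(pred, gt):
    """Linear Correlation Coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return float((p * g).mean())

def sim(pred, gt):
    """Similarity: histogram intersection after normalizing both maps
    to unit mass."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return float(np.minimum(p, g).sum())

def nss(pred, fixation_map):
    """Normalized Scanpath Saliency: mean of the standardized prediction
    at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    return float(p[fixation_map > 0].mean())

def auc_judd(pred, fixation_map):
    """AUC-Judd: thresholds are the predicted values at fixated pixels;
    TPR/FPR are accumulated over those thresholds."""
    p = (pred - pred.min()) / (pred.max() - pred.min() + EPS)
    fix = fixation_map > 0
    n_fix, n_pix = int(fix.sum()), p.size
    tpr, fpr = [0.0], [0.0]
    for thr in np.sort(p[fix])[::-1]:
        above = p >= thr
        tp = int(np.count_nonzero(above & fix))
        fp = int(np.count_nonzero(above)) - tp
        tpr.append(tp / n_fix)
        fpr.append(fp / (n_pix - n_fix))
    tpr.append(1.0)
    fpr.append(1.0)
    tpr, fpr = np.asarray(tpr), np.asarray(fpr)
    # trapezoidal integration of the ROC curve
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```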
The final score for a participant is calculated as the average rank over all four metrics on the test set. During the challenge, participants see scores only on 30% of the test set; the final results will not be shown until the end of the challenge. If different methods obtain equal final scores, the result of the first non-matching metric, in the order of the metrics in the leaderboard, is used. If all metrics are equal, the higher place is awarded to the earlier submission. A simplified sketch of this ranking scheme is given below.
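A simplified sketch of that ranking scheme (the function name, input format, and the handling of ties within a single metric are illustrative assumptions; the leaderboard code is authoritative):

```python
import numpy as np

def final_ranking(scores, submit_times):
    """Order participants by average rank over the four metrics
    (higher metric value = better = rank closer to 1), breaking exact
    ties by the first non-matching metric in leaderboard order and
    then by the earlier submission time."""
    names = list(scores)
    table = np.array([scores[n] for n in names])  # shape: (participants, 4)
    order = (-table).argsort(axis=0)              # best value first, per metric
    ranks = np.empty_like(order)
    for m in range(table.shape[1]):
        ranks[order[:, m], m] = np.arange(1, len(names) + 1)
    avg_rank = ranks.mean(axis=1)
    idx = sorted(
        range(len(names)),
        key=lambda i: (avg_rank[i], *(-table[i]), submit_times[names[i]]),
    )
    return [names[i] for i in idx]

# Example with metric order [AUC-Judd, CC, SIM, NSS] (higher is better):
# both teams tie on average rank, so AUC-Judd decides in favor of team_a.
print(final_ranking(
    {"team_a": [0.91, 0.62, 0.51, 2.10], "team_b": [0.90, 0.65, 0.52, 2.05]},
    {"team_a": 1, "team_b": 2},
))
```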