Options
Transfer Learning for Video Recognition with Scarce Training Data for Deep Convolutional Neural Network
Date Issued
2014-09-15
Author(s)
Abstract
Unconstrained video recognition and Deep Convolution Network (DCN) are two
active topics in computer vision recently. In this work, we apply DCNs as
frame-based recognizers for video recognition. Our preliminary studies,
however, show that video corpora with complete ground truth are usually not
large and diverse enough to learn a robust model. The networks trained directly
on the video data set suffer from significant overfitting and have poor
recognition rate on the test set. The same lack-of-training-sample problem
limits the usage of deep models on a wide range of computer vision problems
where obtaining training data are difficult. To overcome the problem, we
perform transfer learning from images to videos to utilize the knowledge in the
weakly labeled image corpus for video recognition. The image corpus help to
learn important visual patterns for natural images, while these patterns are
ignored by models trained only on the video corpus. Therefore, the resultant
networks have better generalizability and better recognition rate. We show that
by means of transfer learning from image to video, we can learn a frame-based
recognizer with only 4k videos. Because the image corpus is weakly labeled, the
entire learning process requires only 4k annotated instances, which is far less
than the million scale image data sets required by previous works. The same
approach may be applied to other visual recognition tasks where only scarce
training data is available, and it improves the applicability of DCNs in
various computer vision problems. Our experiments also reveal the correlation
between meta-parameters and the performance of DCNs, given the properties of
the target problem and data. These results lead to a heuristic for
meta-parameter selection for future researches, which does not rely on the time
consuming meta-parameter search.
active topics in computer vision recently. In this work, we apply DCNs as
frame-based recognizers for video recognition. Our preliminary studies,
however, show that video corpora with complete ground truth are usually not
large and diverse enough to learn a robust model. The networks trained directly
on the video data set suffer from significant overfitting and have poor
recognition rate on the test set. The same lack-of-training-sample problem
limits the usage of deep models on a wide range of computer vision problems
where obtaining training data are difficult. To overcome the problem, we
perform transfer learning from images to videos to utilize the knowledge in the
weakly labeled image corpus for video recognition. The image corpus help to
learn important visual patterns for natural images, while these patterns are
ignored by models trained only on the video corpus. Therefore, the resultant
networks have better generalizability and better recognition rate. We show that
by means of transfer learning from image to video, we can learn a frame-based
recognizer with only 4k videos. Because the image corpus is weakly labeled, the
entire learning process requires only 4k annotated instances, which is far less
than the million scale image data sets required by previous works. The same
approach may be applied to other visual recognition tasks where only scarce
training data is available, and it improves the applicability of DCNs in
various computer vision problems. Our experiments also reveal the correlation
between meta-parameters and the performance of DCNs, given the properties of
the target problem and data. These results lead to a heuristic for
meta-parameter selection for future researches, which does not rely on the time
consuming meta-parameter search.
Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Type
conference paper
File(s)
Loading...
Name
1409.4127.pdf
Size
7.75 MB
Format
Adobe PDF
Checksum
(MD5):0839c716021c968733665bb259924318