Learning Key Evidence for Detecting Complex Events in Videos
Date Issued
2015
Author(s)
Lai, Kuan-Ting
Abstract
Video event detection is one of the most important, yet most challenging, research topics in computer science. Recognizing complex events, e.g. “birthday party”, “wedding ceremony”, or “attempting a bike trick”, is even more difficult, since a complex event consists of various human interactions with different objects, in diverse environments, over variable time intervals. The most common current approach is to extract features from frames or video clips, then quantize and pool these features into a single vector representation for the entire video. While this method is simple and efficient, the final pooling step may discard temporally local information and include many irrelevant features from noisy backgrounds. To approach the problem differently, we observe that humans need only a small amount of evidence to recognize an event in a video: a “birthday party”, for example, can be identified by spotting a “birthday cake” and “blowing candles”. Inspired by this observation, we propose a novel way to detect complex events: first identify the key evidence that proves the existence of an event, then use that evidence to recognize videos. Under our framework, each video is represented as multiple “instances”, defined as video segments of different temporal intervals. We then apply learning methods to identify evidence (positive instances) and use it to recognize complex video events. This thesis proposes two learning methods. The first, called maximal evidence learning (MEL), is based on a large-margin formulation that treats instance labels as hidden latent variables and infers the instance labels and the instance-level classification model simultaneously. MEL infers optimal solutions by learning as many positive instances as possible from positive videos and as many negative instances as possible from negative videos.
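The thesis's actual MEL formulation is not reproduced in this record, but the alternating "infer instance labels, then refit the instance-level model" loop it describes can be sketched in a few lines. Everything below is illustrative: the toy bag generator, the median-based labeling rule, and the least-squares refit (a stand-in for the large-margin SVM step) are assumptions for the sketch, not the thesis's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each "video" is a bag of segment feature vectors. Positive bags
# contain a few "evidence" segments drawn from a shifted distribution.
def make_bag(positive):
    X = rng.normal(0.0, 1.0, size=(rng.integers(4, 8), 5))
    if positive:
        X[: rng.integers(1, 3)] += np.array([2.0, 2.0, 0.0, 0.0, 0.0])
    return X

bags = [make_bag(i % 2 == 0) for i in range(40)]
labels = np.array([1 if i % 2 == 0 else -1 for i in range(40)])

def feats(X):  # append a bias feature
    return np.hstack([X, np.ones((len(X), 1))])

# Alternating optimization in the latent-variable / multiple-instance spirit:
# (1) score segments with the current model and infer instance labels,
# (2) refit the instance-level model on those inferred labels.
w = rng.normal(size=6)
for _ in range(10):
    Xs, ys = [], []
    for X, y in zip(bags, labels):
        F = feats(X)
        if y == -1:
            inst = -np.ones(len(F))       # negative video: all instances negative
        else:
            s = F @ w
            inst = np.where(s >= np.median(s), 1.0, -1.0)
            inst[np.argmax(s)] = 1.0      # keep at least one evidence instance
        Xs.append(F)
        ys.append(inst)
    # Least-squares refit stands in for the large-margin (SVM) step here.
    w = np.linalg.lstsq(np.vstack(Xs), np.concatenate(ys), rcond=None)[0]

# A video is scored by its best instance (its strongest evidence); the
# decision threshold is the midpoint of the two classes' mean evidence scores.
best = np.array([(feats(X) @ w).max() for X in bags])
thr = 0.5 * (best[labels == 1].mean() + best[labels == -1].mean())
acc = np.mean((best > thr) * 2 - 1 == labels)
```

The max over instance scores is what makes the evidence idea concrete in this sketch: a video is judged by its single strongest segment rather than by a pooled average over the whole clip.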
The second proposed method, called evidence selective ranking (ESR), is based on a static-dynamic instance embedding and employs infinite push ranking to select the most distinctive evidence. Extensive experiments on large-scale video event datasets show significant performance gains from both methods. We also demonstrate that the selected key evidence is meaningful to humans and can be used to locate the video segments that signify an event.
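ESR's static-dynamic embedding is not described here in enough detail to reproduce, but the infinite-push idea, concentrating the ranking loss on negatives at the very top of the ranked list, can be sketched with a simple hinge subgradient. The toy features, the learning rate, and the "push all positives above the single top-ranked negative" surrogate are illustrative assumptions, not the thesis's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance pool: "evidence" segments (shifted) vs background segments.
pos = rng.normal(0.0, 1.0, (30, 4)) + np.array([1.5, 1.5, 0.0, 0.0])
neg = rng.normal(0.0, 1.0, (120, 4))

# Infinite-push-style training: at each step, find the single top-ranked
# negative and take a hinge subgradient step pushing positives above it.
w = np.zeros(4)
lr = 0.1
for _ in range(200):
    sn = neg @ w
    i = np.argmax(sn)                    # worst offender at the top of the list
    viol = pos @ w < sn[i] + 1.0         # margin violations against it
    if viol.any():
        w -= lr * (neg[i] - pos[viol].mean(axis=0))

# Evidence selection: take the instances at the very top of the ranked list.
scores = np.concatenate([pos @ w, neg @ w])
top10 = np.argsort(-scores)[:10]
precision_at_10 = np.mean(top10 < 30)    # indices < 30 are evidence instances
```

Unlike an average pairwise ranking loss, this objective spends all of its effort on the head of the ranking, which is exactly where evidence is selected from.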
Subjects
video event detection
large-margin framework
proportional SVM
infinite push ranking
multiple instance learning
Type
thesis
File(s)
Name
ntu-104-D98921025-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5): df9dc1878ef737e5813e8a16d94d9a6d