Hi all,
CVPR 2015 took place a month ago, and I finally have a little time to write about this top-tier conference of the vision community.
First impression: many papers directly or indirectly apply deep learning (DL), specifically convolutional neural networks (CNNs), in their work. Deep learning is so popular in CV that there was a separate session devoted solely to CNN architectures.
Yann LeCun, a key researcher in the deep learning renaissance, gave a plenary talk on deep learning. The talk was titled "What's Wrong with Deep Learning?", and the slides are available at this link. We already see a plethora of DL approaches at machine learning conferences (e.g., ICML 2015, ICLR 2015, NIPS 2014) and at CV conferences (CVPR 2014, ECCV 2014). On this return to a CV conference, Yann (who previously had a problem with CVPR 2012) used his talk to point out what is lacking in current CNN architectures. A CNN learns feature representations through a hierarchy of multiple layers (convolutional layers and fully connected layers), but it does not capture the structure of the data. Yann argued that we need to find the right model for combining CNNs with structured prediction. CNNs and SVMs have a closely linked history: CNNs rose and then sank as SVMs appeared and matured. Now CNNs and neural networks are blossoming again at many conferences. The godfather of DL, Geoffrey Hinton, and a pioneer of DL, Yann LeCun, have recently indicated that there will be no more winter for AI and neural networks, because we now have fast enough computing resources (GPUs and heterogeneous computing) and enough data to train huge neural networks.
The trend of applying CNNs to CV problems already appeared at CVPR 2014 with some good works such as R-CNN [1], which has gathered more than 500 citations in just over a year. This paper uses object proposals to generate regions and warps these regions to a fixed size before feeding them into a CNN. At test time, the trained CNN is used to extract features from the regions for object detection. The paper also shows that, to avoid overfitting when training a CNN, we should pre-train the network on a large dataset (such as ILSVRC) and fine-tune it on our small domain-specific dataset. The work also notes that it is natural and inevitable to combine CNNs with traditional computer vision tools. However, DL really blossomed at CVPR 2015, which is a good thing because it pushes me to seriously adopt CNNs as a tool in my own computer vision research. In the first year of my PhD, I faced the question of doing research on DL, but I was afraid. Now I have no fear; it is all about scientific work and a better life. This year, CNNs have been applied to many tasks such as action recognition [2,3,4], image classification (Inception [5]), action localization [6], object detection [7] and event detection [8], to name a few.
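To make the region step of R-CNN [1] a bit more concrete, here is a minimal Python sketch of cropping a proposal box and resizing it to a fixed square input. The function name `warp_region`, the nested-list image format and the nearest-neighbour resizing are assumptions for illustration only; the real pipeline works on full colour images and uses a proper image-warping routine before the CNN.

```python
def warp_region(image, box, out_size):
    """Crop a box from a 2-D image and resize it to out_size x out_size.

    image: list of rows, each row a list of pixel values (assumed layout).
    box: (x1, y1, x2, y2) with x2 and y2 exclusive.
    Resizing is nearest-neighbour, chosen only to keep the sketch short.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    if out_size <= 0:
        raise ValueError("out_size must be positive")
    return [
        [
            # Map each output pixel back to a source pixel inside the box.
            image[y1 + (r * height) // out_size][x1 + (c * width) // out_size]
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]
```

Every proposal, whatever its aspect ratio, ends up with the same shape, which is what lets a single CNN with a fixed input size process all regions.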
Second, research now moves really fast, and the community is using arXiv (an e-print archive) more and more. It has become a new channel for registering an idea and delivering research results to a wider audience more quickly.
Recently, I also subscribed to the arXiv listings for Computer Vision and Pattern Recognition and for Learning. For instance, I picked up a paper on using R-CNN for action recognition in images [9], which will probably be published at ICCV 2015 (note that it is still under review at the moment).
Third, action recognition (my area of research) has raised its bar to a new level, with many well-performing works this year. These works come from both traditional computer vision and deep learning methods. Interestingly, Fernando et al. [20] propose a method that models the temporal evolution of a video and uses it as a feature for action recognition. It reports 61.8% accuracy on HMDB51; combined with BoW + FV features it also reaches 61.8%, which is comparable to the BoW + FV approach alone. However, it achieves a new state-of-the-art result on Hollywood2 with 70.0% accuracy. Another simple idea is to concatenate the BoW FV representation with the responses of object classifiers in a video, forming a new video representation that includes both motion and object features [3]. Not surprisingly, the object models come from AlexNet trained on the large ILSVRC dataset. The combined features achieve an impressive 71.3% on HMDB51. However, if we use only the object features, we achieve a moderate 38.9% accuracy. This clearly shows that object features and motion features are complementary. The method also achieves high performance on Hollywood2, with an accuracy of 66.4%, and on UCF101, with an mAP of 88.5%. Another interesting work uses the saliency of each trajectory to weight improved dense trajectories in the FV representation [10]. It achieves 62.2% and 87.7% accuracy in action classification on HMDB51 and UCF101, respectively. In a similar vein, Ni et al. [11] deploy a motion part regularization technique based on sparse group selection. They achieve a new state-of-the-art result on the Olympic dataset with an accuracy of 92.3%, and other high results on HMDB51 with an accuracy of 65.5% and on Hollywood2 with an mAP of 66.7%. Another important paper on action recognition captures variations of actions in the frequency domain [21]. Its representation aims to be invariant to the speed at which different actors execute an action.
To be honest, I do not really understand how they capture this invariance, although they use a multi-scale approach. It is unlike SIFT, which obtains rotation invariance by expressing the histogram relative to the dominant orientation and scale invariance through a multi-scale approach. The approach obtains astonishing results on the tested datasets: 65.1% on HMDB51, 68.0% on Hollywood2, 94.4% on UCF50 and 91.4% on the Olympic dataset.
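One way to picture the multi-scale idea in [21] is to compute the same motion descriptor on copies of the video subsampled at several frame skips and to stack the results. The sketch below is only a toy illustration under that assumption: `frame_differences` is a hypothetical stand-in for a real descriptor such as dense trajectories, and `multi_skip_stack` and `max_skip` are made-up names, not the authors' code.

```python
def frame_differences(frames):
    """Toy motion descriptor: mean absolute difference between neighbours.

    frames: list of frames, each a flat list of pixel values.
    """
    return [
        sum(abs(b - a) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]


def multi_skip_stack(frames, max_skip=3):
    """Stack descriptors computed at frame skips 1..max_skip.

    Returns a list of (skip, descriptor) pairs so that the origin of each
    descriptor stays visible; a real system would encode them jointly.
    """
    stacked = []
    for skip in range(1, max_skip + 1):
        subsampled = frames[::skip]  # keep every skip-th frame
        stacked.extend((skip, d) for d in frame_differences(subsampled))
    return stacked
```

For a video whose frames change by 1 per step, skip 2 yields differences of 2, which is what an action performed twice as fast would produce at skip 1. Stacking several skips therefore exposes the classifier to several execution speeds of the same action.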
There are two important papers at CVPR 2015 that directly apply trained CNNs to extract features for action recognition. The first paper [12], from Google, trains a recurrent neural network (RNN) with LSTM units, which use memory cells to store and capture long-term temporal relationships. The network is compared with a conventional CNN (GoogLeNet [5]) with conv-pooling. Surprisingly, the conv-pooling CNN slightly outperforms the LSTM on Hit@5 accuracy, with 90.8% versus 90.5% on the Sports-1M dataset. The LSTM network also achieves 88.6% accuracy on UCF101, which is better than last year's DL approach [13] (66% on HMDB51 and 82.4% on Sports-1M). Another smart idea for exploiting CNNs is to use the conv layers of a trained CNN as feature extractors at test time and to pool the features along dense trajectories. This approach [14] achieves 65.9% and 91.5% accuracy on HMDB51 and UCF101, respectively.
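A rough sketch of what pooling over time means here: each frame gives one CNN feature vector, and the video descriptor is the element-wise maximum over all frames. This is a simplification of the conv-pooling idea in [12] under stated assumptions (plain Python lists as features, a hypothetical function name `temporal_max_pool`), not the network used in the paper.

```python
def temporal_max_pool(frame_features):
    """Element-wise maximum over per-frame feature vectors.

    frame_features: non-empty list of equal-length lists of floats,
    one list per frame (assumed to come from some CNN layer).
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must have the same feature size")
    # zip(*...) groups the values of each feature dimension across frames.
    return [max(column) for column in zip(*frame_features)]
```

Note that the pooled vector does not depend on the order of the frames, whereas a recurrent model such as an LSTM does see the order. That makes the small gap between the two on Sports-1M all the more interesting.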
Another problem I am interested in is video analysis, specifically action detection, foreground segmentation and video object segmentation. In this category, I like the work of Gkioxari et al. [15] (Malik's group), which trains a two-stream CNN [16] on single frames of a video. At test time, they run the two classifiers on each frame and use the Viterbi algorithm (dynamic programming) to link detections across frames. I am really impressed by the results of this paper on action detection, in spite of the severe variations between video frames. It indicates that, in order to capture large variations, we should employ complex classifiers such as CNNs. Focusing on the actor in video analysis has been considered by many researchers; at this CVPR, however, the problem has been formalized by two papers from Jason Corso's group. Although I feel this is not really a "clean" problem, it is indeed worth trying out. Lu et al. [17] use an MRF model built on a supervoxel representation for human action segmentation. Xu et al. [18] deploy a CRF model on supervoxels to tackle multi-label actor-action recognition on a newly built dataset. In the video object segmentation category, superpixel-based representation is the way to go. Giordano et al. [19], inspired by the work "Fast Object Segmentation in Unconstrained Video" (Papazoglou and Ferrari, ICCV 2013), propose a new method that estimates the initial foreground from the positions of superpixels rather than from optical flow, and then performs local foreground segmentation around these location priors. The advantage of the method is that it bypasses optical flow computation and builds appearance models around local priors in each frame, which is more accurate than a global model. However, the method is sensitive to the accuracy of the initial location priors, and I think these priors can easily fail on videos from action recognition datasets, which contain complex camera motion.
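The linking step mentioned for [15] can be sketched as a small Viterbi-style dynamic program: pick one detection per frame so that the sum of detection scores plus the overlap between consecutive boxes is maximal. The scoring rule (score plus `overlap_weight` times IoU), the data layout and the function names are assumptions for illustration, not the exact formulation of the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_detections(frames, overlap_weight=1.0):
    """Choose one detection index per frame with maximal total score.

    frames: list over time; each entry is a non-empty list of (box, score).
    """
    if not frames or any(not dets for dets in frames):
        raise ValueError("every frame needs at least one detection")
    # best[j]: best total score of any path ending at detection j.
    best = [score for _, score in frames[0]]
    back = []  # back[t-1][j]: best predecessor of detection j in frame t
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for box, score in frames[t]:
            candidates = [
                prev_total + overlap_weight * iou(prev_box, box)
                for prev_total, (prev_box, _) in zip(best, frames[t - 1])
            ]
            i_star = max(range(len(candidates)), key=candidates.__getitem__)
            new_best.append(candidates[i_star] + score)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final detection to recover the whole path.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

With `overlap_weight=0.0` this reduces to taking the top-scoring box in every frame independently; the overlap term is what keeps the chosen boxes on the same actor over time.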
The best way of thinking is to write, the best way of programming is to interactively visualize and the best way of learning is to disseminate to others.
References:
[1] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond Short Snippets: Deep Networks for Video Classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Jain, M., van Gemert, J. C., & Snoek, C. G. M. (2015). What do 15,000 Object Categories Tell Us About Classifying and Localizing Actions? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Wang, L., Qiao, Y., & Tang, X. (2015). Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going Deeper With Convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Gkioxari, G., & Malik, J. (2015). Finding Action Tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yan, J., Yu, Y., Zhu, X., Lei, Z., & Li, S. Z. (2015). Object Detection by Labeling Superpixels. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Xu, Z., Yang, Y., & Hauptmann, A. G. (2015). A Discriminative CNN Video Representation for Event Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Gkioxari, G., Girshick, R., & Malik, J. (2015). Contextual Action Recognition with R*CNN. arXiv preprint.
[10] Feichtenhofer, C., Pinz, A., & Wildes, R. P. (2015). Dynamically Encoded Actions Based on Spacetime Saliency. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Ni, B., Moulin, P., Yang, X., & Yan, S. (2015). Motion Part Regularization: Improving Action Recognition via Trajectory Selection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond Short Snippets: Deep Networks for Video Classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale Video Classification with Convolutional Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1725–1732. doi:10.1109/CVPR.2014.223
[14] Wang, L., Qiao, Y., & Tang, X. (2015). Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Gkioxari, G., & Malik, J. (2015). Finding Action Tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Simonyan, K., & Zisserman, A. (2014). Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv Preprint arXiv:1406.2199, 1–11. Retrieved from http://arxiv.org/abs/1406.2199
[17] Lu, J., Xu, R., & Corso, J. J. (2015). Human Action Segmentation With Hierarchical Supervoxel Consistency. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Xu, C., Hsieh, S.-H., Xiong, C., & Corso, J. J. (2015). Can Humans Fly? Action Understanding With Multiple Classes of Actors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Giordano, D., Murabito, F., Palazzo, S., & Spampinato, C. (2015). Superpixel-Based Video Object Segmentation Using Perceptual Organization and Location Prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Fernando, B., Gavves, E., Oramas, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling Video Evolution for Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Lan, Z., Lin, M., Li, X., Hauptmann, A. G., & Raj, B. (2015). Beyond Gaussian Pyramid: Multi-Skip Feature Stacking for Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).