The goal of video-based person re-identification is to match two input videos, so that the distance between the two videos is small if they contain the same person. A common approach is to first extract image features for all frames in a video and then aggregate these features into a video-level feature; the video-level features of two videos can then be used to compute their distance. In this paper, we propose a temporal attention approach for aggregating frame-level features into a video-level feature vector for re-identification. Our method is motivated by the fact that not all frames in a video are equally informative. We propose a fully convolutional temporal attention model for generating the attention scores. Fully convolutional networks (FCNs) have been widely used in semantic segmentation for generating 2D output maps. We formulate video-based person re-identification as a sequence labeling problem akin to semantic segmentation, establish a connection between the two tasks, and modify the FCN to generate attention scores that represent the importance of each frame. Extensive experiments on three different benchmark datasets (i.e., iLIDS-VID, PRID-2011, and SDU-VID) show that our proposed method outperforms other state-of-the-art approaches.

Is a recurrent network really necessary for learning a good visual representation for video-based person re-identification (VPRe-id)? In this paper, we first show that the common practice of employing recurrent neural networks (RNNs) to aggregate temporal-spatial features may not be optimal. Specifically, with a diagnostic analysis, we show that the recurrent structure may be less effective at learning temporal dependencies than expected and implicitly yields an orderless representation. Based on this observation, we then present a simple yet surprisingly powerful approach for VPRe-id, where we treat VPRe-id as an efficient, orderless ensemble of image-based person re-identification problems. More specifically, we divide videos into individual images and re-identify persons with an ensemble of image-based rankers. Under this assumption, we provide an error bound that sheds light upon how we could improve VPRe-id. Comprehensive experimental evaluations demonstrate that the proposed solution achieves state-of-the-art performance on multiple widely used datasets (iLIDS-VID, PRID-2011, and MARS). Our work also presents a promising way to bridge the gap between video-based and image-based person re-identification.

Video-based person re-identification plays a central role in realistic security and video surveillance. In this paper, we propose a novel Accumulative Motion Context (AMOC) network for addressing this important problem, which effectively exploits long-range motion context for robustly identifying the same person under challenging conditions. Given a video sequence of the same or different persons, the proposed AMOC network jointly learns appearance representations and motion context from a collection of adjacent frames using a two-stream convolutional architecture. AMOC then accumulates clues from the motion context by recurrent aggregation, allowing effective information flow among adjacent frames and capturing the dynamic gist of the persons. The architecture of AMOC is end-to-end trainable, so the motion context can be adapted to complement appearance clues under unfavorable conditions. Extensive experiments are conducted on three public benchmark datasets, i.e., the iLIDS-VID, PRID-2011, and MARS datasets, to investigate the performance of AMOC. The experimental results demonstrate that the proposed AMOC network significantly outperforms state-of-the-art methods for video-based re-identification and confirm the advantage of exploiting long-range motion context, validating our motivation.

Thanks to cross-modal retrieval techniques, visible-infrared (RGB-IR) person re-identification (Re-ID) is achieved by projecting the two modalities into a common space, enabling person Re-ID in 24-hour surveillance systems. However, with respect to probe-to-gallery matching, almost all existing RGB-IR cross-modal person Re-ID methods focus on image-to-image matching, while video-to-video matching, which contains much richer spatial and temporal information, remains under-explored. In this paper, we primarily study video-based cross-modal person Re-ID.
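The temporal-attention aggregation of frame-level features into a video-level feature can be sketched as follows. This is a minimal illustration, not the fully convolutional attention model itself: the linear scorer `score_weights` is a hypothetical stand-in for the learned attention network.

```python
import math

def softmax(scores):
    # Numerically stable softmax over raw per-frame attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, score_weights):
    # Score each frame with a (hypothetical) linear scorer, then take the
    # softmax-weighted average of frame features as the video-level feature.
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [sum(weights[t] * frame_features[t][d] for t in range(len(frame_features)))
            for d in range(dim)]

def euclidean(a, b):
    # Distance between two video-level features; small for the same identity.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With informative frames receiving higher scores, uninformative frames (e.g., heavily occluded ones) contribute less to the pooled video-level feature than a plain average would allow.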
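The orderless-ensemble view of VPRe-id can be sketched in a few lines. This is a simplified sketch under assumptions not stated in the abstract: image features are precomputed, each image-based ranker is a plain Euclidean-distance comparison, and the ensemble aggregates by averaging over all query/gallery frame pairs.

```python
import math

def image_distance(a, b):
    # A single image-based ranker: Euclidean distance between image features.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def video_distance(query_frames, gallery_frames):
    # Orderless ensemble: average image-level distances over all
    # query/gallery frame pairs; frame order never enters the computation.
    dists = [image_distance(q, g) for q in query_frames for g in gallery_frames]
    return sum(dists) / len(dists)

def rank_gallery(query_frames, gallery):
    # gallery: dict mapping person id -> list of frame features.
    # Returns ids sorted from most to least likely match.
    return sorted(gallery, key=lambda pid: video_distance(query_frames, gallery[pid]))
```

Because the aggregation is a set operation, shuffling the frames of either video leaves the ranking unchanged, which is exactly the orderless behavior the diagnostic analysis attributes to RNN aggregation.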
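The accumulate-then-aggregate idea behind AMOC (a motion stream alongside an appearance stream, fused by a recurrent update) can be caricatured as below. Everything here is an illustrative stand-in, not the paper's network: the motion context is a simple frame difference rather than a learned motion stream, and `alpha` is an arbitrary mixing weight rather than a learned parameter.

```python
import math

def motion_context(prev_frame, frame):
    # Stand-in for the learned motion stream: per-dimension frame difference.
    return [b - a for a, b in zip(prev_frame, frame)]

def accumulate(frames, alpha=0.5):
    # Recurrent accumulation of appearance and motion clues: the running
    # state blends the current two-stream input with the previous state,
    # so information flows across adjacent frames.
    state = list(frames[0])
    for prev, cur in zip(frames, frames[1:]):
        motion = motion_context(prev, cur)
        fused = [math.tanh(c + m) for c, m in zip(cur, motion)]
        state = [alpha * s + (1 - alpha) * f for s, f in zip(state, fused)]
    return state
```

The point of the sketch is the data flow: each step consumes an appearance feature plus a motion cue from adjacent frames, and the recurrent state carries accumulated context along the sequence.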
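The common-space projection used for RGB-IR matching can be sketched as two modality-specific projections followed by a similarity in the shared space. The linear projections and untrained weights here are purely illustrative; real methods learn these (typically deep) encoders from data.

```python
import math

def project(feature, weight_matrix):
    # Modality-specific linear projection into the shared space.
    return [sum(w * x for w, x in zip(row, feature)) for row in weight_matrix]

def cosine(a, b):
    # Cosine similarity in the common space.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def cross_modal_similarity(rgb_feat, ir_feat, w_rgb, w_ir):
    # Compare an RGB feature and an IR feature after projecting each with
    # its own (hypothetical) weights into the common space.
    return cosine(project(rgb_feat, w_rgb), project(ir_feat, w_ir))
```

The design point is that each modality gets its own encoder, so modality-specific appearance differences can be absorbed before comparison; extending this from single images to videos (video-to-video matching) is exactly the under-explored setting the abstract highlights.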