A Collection of Papers on Video

Mr.R, 2023/9/19, about 8 minutes

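This post records the titles of video-related papers from top conferences and journals over the past four years (2019 onward), as groundwork for surveying and starting later work on video dialog.
(This list has not been filtered in any way; the titles were gathered by keyword search alone.)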
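The keyword search mentioned above could look roughly like the sketch below. It is only an illustration: the helper name `filter_titles`, the keyword `"video"`, and the sample list (including one made-up non-matching title) are assumptions, not the procedure actually used to build this list.

```python
import re


def filter_titles(titles, keyword="video"):
    """Return the titles that contain `keyword` as a case-insensitive substring."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]


# Two titles taken from the list below, plus one hypothetical non-matching title.
titles = [
    "Video Graph Transformer for Video Question Answering",
    "Few-Shot Video Object Detection",
    "A Hypothetical Title About Image Classification",  # made up for illustration
]

selected = filter_titles(titles)
for title in selected:
    print(title)
```

Because this is a plain substring match, it also keeps titles containing "Videos" or "Text-Video", which is consistent with a list that has had no further filtering.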

2022 ECCV

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos
Video Dialog as Conversation About Objects Living in Space-Time
Actor-Centered Representations for Action Localization in Streaming Videos
AutoTransition: Learning to Recommend Video Transition Effects
Sports Video Analysis on Large-Scale Data
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
Quantized GAN for Complex Music Generation from Dance Videos
Telepresence Video Quality Assessment
GAMa: Cross-View Video Geo-Localization
FAR: Fourier Aerial Video Recognition
Fabric Material Recovery from Video Using Multi-scale Geometric Auto-Encoder
Video Graph Transformer for Video Question Answering
Video Question Answering with Iterative Video-Text Co-tokenization
Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding
Selective Query-Guided Debiasing for Video Corpus Moment Retrieval
Learning Linguistic Association Towards Efficient Text-Video Retrieval
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video
Expanding Language-Image Pretrained Models for General Video Recognition
AdaFocusV3: On Unified Spatial-Temporal Dynamic Video Recognition
Delving into Details: Synopsis-to-Detail Networks for Video Recognition
Scale-Aware Spatio-Temporal Relation Learning for Video Anomaly Detection
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos
Geometric Features Informed Multi-person Human-Object Interaction Recognition in Videos
Neural Capture of Animatable 3D Human from Monocular Video
FAST-VQA: Efficient End-to-End Video Quality Assessment with Fragment Sampling
Real-RawVSR: Real-World Raw Video Super-Resolution with a Benchmark Dataset
Synthesizing Light Field Video from Monocular Video
Video Interpolation by Event-Driven Anisotropic Adjustment of Optical Flow
CelebV-HQ: A Large-Scale Video Facial Attributes Dataset
SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos
RayTran: 3D Pose Estimation and Shape Reconstruction of Multiple Objects from Videos with Ray-Traced Transformers
SALISA: Saliency-Based Input Sampling for Efficient Video Object Detection
Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles
Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions
Contrast-Phys: Unsupervised Video-Based Remote Physiological Measurement via Spatiotemporal Contrast
Hierarchical Contrastive Inconsistency Learning for Deepfake Video Detection
Generative Adversarial Network for Future Hand Segmentation from Egocentric Video
My View is the Best View: Procedure Learning from Egocentric Videos
Self-supervised Sparse Representation for Video Anomaly Detection
Few-Shot Video Object Detection
Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments
Graph Neural Network for Cell Tracking in Microscopy Videos
Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories
Towards Generic 3D Tracking in RGBD Videos: Benchmark and Baseline
Tackling Background Distraction in Video Object Segmentation
Learned Variational Video Color Propagation
Ensemble Learning Priors Driven Deep Unfolding for Scalable Video Snapshot Compressive Imaging
Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection
LocVTP: Video-Text Pre-training for Temporal Localization
Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining
Static and Dynamic Concepts for Self-supervised Video Representation Learning
Neural Video Compression Using GANs for Detail Synthesis and Propagation
Is It Necessary to Transfer Temporal Knowledge for Domain Adaptive Video Semantic Segmentation?
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
PolyphonicFormer: Unified Query Learning for Depth-Aware Video Panoptic Segmentation
Video Restoration Framework and Its Meta-adaptations to Data-Poor Conditions
SeqFormer: Sequential Transformer for Video Instance Segmentation
In Defense of Online Models for Video Instance Segmentation
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
Video Mask Transfiner for High-Quality Video Instance Segmentation
Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding
Waymo Open Dataset: Panoramic Video Panoptic Segmentation
One-Trimap Video Matting
Learning Quality-aware Dynamic Memory for Video Object Segmentation
Instance as Identity: A Generic Online Paradigm for Video Instance Segmentation
BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation
Global Spectral Filter Memory Network for Video Object Segmentation
Video Instance Segmentation via Multi-Scale Spatio-Temporal Split Attention Transformer
Domain Adaptive Video Segmentation via Temporal Pseudo Supervision
GOCA: Guided Online Cluster Assignment for Self-supervised Video Representation Learning
Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
Federated Self-supervised Learning for Video Understanding
NeuMan: Neural Human Radiance Field from a Single Video
Structure and Motion from Casual Videos
The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing
MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Learning Omnidirectional Flow in 360° Video via Siamese Representation
PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection
Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval
Multi-query Video Retrieval
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Learning Audio-Video Modalities from Image Captions
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment
CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification
Relighting4D: Neural Relightable Human from Videos
Real-Time Intermediate Flow Estimation for Video Frame Interpolation
Deep Bayesian Video Frame Interpolation
A Perceptual Quality Metric for Video Frame Interpolation
Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis
Temporally Consistent Semantic Video Editing
Error Compensation Framework for Flow-Guided Video Inpainting
Learning Cross-Video Neural Representations for High-Quality Frame Interpolation
A Style-Based GAN Encoder for High Fidelity Reconstruction of Images and Videos
Harmonizer: Learning to Perform White-Box Image and Video Harmonization
Text2LIVE: Text-Driven Layered Image and Video Editing
CANF-VC: Conditional Augmented Normalizing Flows for Video Compression
Video Extrapolation in Space and Time
Augmentation of rPPG Benchmark Datasets: Learning to Remove and Embed rPPG Signals via Double Cycle Consistent Learning from Unpaired Facial Videos
Layered Controllable Video Generation
Spatio-Temporal Deformable Attention Network for Video Deblurring
Sound-Guided Semantic Video Generation
Controllable Video Generation Through Global and Local Motion Dynamics
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
Combining Internal and External Constraints for Unrolling Shutter in Videos
A Codec Information Assisted Framework for Efficient Compressed Video Super-Resolution
Diverse Generation from a Single Video Made Possible
Learning Shadow Correspondence for Video Shadow Detection
Flow-Guided Transformer for Video Inpainting
Learning Spatio-Temporal Downsampling for Effective Video Upscaling
Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution
Efficient Meta-Tuning for Content-Aware Neural Video Delivery
Towards Interpretable Video Super-Resolution via Alternating Optimization
Event-guided Deblurring of Unknown Exposure Time Videos
Unidirectional Video Denoising by Mimicking Backward Recurrent Modules with Look-Ahead Forward Ones
ERDN: Equivalent Receptive Field Deformable Network for Video Deblurring
RealFlow: EM-Based Realistic Optical Flow Dataset Generation from Videos
Efficient Video Deblurring Guided by Motion Magnitude
TempFormer: Temporally Consistent Transformer for Video Denoising
Rethinking Video Rain Streak Removal: A New Synthesis Model and a Deraining Network with Video Rain Prior
AlphaVC: High-Performance and Efficient Learned Video Compression
Source-Free Video Domain Adaptation by Learning Temporal Consistency for Action Recognition
Towards Open Set Video Anomaly Detection
EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound
Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
Less Than Few: Self-shot Video Instance Segmentation
Real-Time Online Video Detection with Temporal Smoothing Transformers
Mining Relations Among Cross-Frame Affinities for Video Semantic Segmentation
TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation
PAC-Net: Highlight Your Video via History Preference Modeling
How Severe Is Benchmark-Sensitivity in Video Self-supervised Learning?
NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
Video Activity Localisation with Uncertainties in Temporal Boundary
Temporal Saliency Query Network for Efficient Video Recognition
Efficient One-Stage Video Object Detection by Exploiting Temporal Consistency
Spotting Temporally Precise, Fine-Grained Events in Video
Efficient Video Transformers with Spatial-Temporal Token Selection
Long Movie Clip Classification with State-Space Video Models
Prompting Visual-Language Models for Efficient Video Understanding
Asymmetric Relation Consistency Reasoning for Video Relation Grounding
K-centered Patch Sampling for Efficient Video Recognition
GraphVid: It only Takes a Few Nodes to Understand a Video
Delta Distillation for Efficient Video Processing
COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context
TDViT: Temporal Dilated Video Transformer for Dense Video Tasks
Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization
MaCLR: Motion-Aware Contrastive Learning of Representations for Videos
Frozen CLIP Models are Efficient Video Learners
Panoramic Vision Transformer for Saliency Detection in 360° Videos
Bayesian Tracking of Video Graphs Using Joint Kalman Smoothing and Registration
Motion Sensitive Contrastive Learning for Self-supervised Video Representation
Dynamic Temporal Filtering in Video Models
VTC: Improving Video-Text Retrieval with User Comments
Automatic Dense Annotation of Large-Vocabulary Sign Language Videos
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

2020 ECCV

Efficient Spatio-Temporal Recurrent Neural Network for Video Deblurring
CoTeRe-Net: Discovering Collaborative Ternary Relations in Videos
Visual Relation Grounding in Videos
SODA: Story Oriented Dense Video Captioning Evaluation Framework
Optical Flow Distillation: Towards Efficient and Stable Video Style Transfer
Learning Object Depth from Camera Motion and Video Object Segmentation
Localizing the Common Action Among a Few Videos
Two-Branch Recurrent Network for Isolating Deepfakes in Videos
World-Consistent Video-to-Video Synthesis
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Temporal Coherence or Temporal Motion: Which Is More Critical for Video-Based Person Re-identification?
Learning Event-Driven Video Deblurring and Interpolation
VPN: Learning Video-Pose Embedding for Activities of Daily Living
Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos
Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation
RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos
MuCAN: Multi-correspondence Aggregation Network for Video Super-Resolution
Efficient Semantic Video Segmentation with Per-Frame Inference
TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video
Deep Space-Time Video Upsampling Networks
Fast Video Object Segmentation Using the Global Context Module
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos
Procedure Planning in Instructional Videos
Foley Music: Learning to Generate Music from Videos
Online Multi-modal Person Search in Videos
G-LBM: Generative Low-Dimensional Background Model Estimation from Video Sequences
Generating Videos of Zero-Shot Compositions of Actions and Objects
Video Super-Resolution with Recurrent Structure-Detail Network
Shuffle and Attend: Video Domain Adaptation
Flow-edge Guided Video Completion
Towards End-to-End Video-Based Eye-Tracking
Low Light Video Enhancement Using Synthetic Data Produced with an Intermediate Domain Mapping
ScribbleBox: Interactive Annotation Framework for Video Object Segmentation
MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
AutoTrajectory: Label-Free Trajectory Extraction and Prediction from Videos Using Dynamic Points
Motion Guided 3D Pose Estimation from Videos
SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation
BMBC: Bilateral Motion Estimation with Bilateral Cost Volume for Video Interpolation
Video Object Detection via Object-Level Temporal Aggregation
READ: Reciprocal Attention Discriminator for Image-to-Video Re-identification
Multi-level Wavelet-Based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video
Unsupervised Video Object Segmentation with Joint Hotspot Tracking
Memory Selection Network for Video Propagation
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark
Clustering Driven Deep Autoencoder for Video Anomaly Detection
Omni-Sourced Webly-Supervised Learning for Video Recognition
Learning Where to Focus for Efficient Video Object Detection
Learning Object Permanence from Video
Temporal Aggregate Representations for Long-Range Video Understanding
Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Learning Joint Spatial-Temporal Transformations for Video Inpainting
Probabilistic Future Prediction for Video Scene Understanding
Interactive Video Object Segmentation Using Global and Local Transfer Modules
Is Sharing of Egocentric Video Giving Away Your Biometric Signature?
Conditional Entropy Coding for Efficient Video Compression
Self-supervised Video Representation Learning by Pace Prediction
Self-supervised Multi-task Procedure Learning from Instructional Videos
Key Frame Proposal Network for Efficient Pose Estimation in Videos
We Have So Much in Common: Modeling Semantic Relational Set Abstractions in Videos
Self-supervised Learning of Audio-Visual Objects from Video
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
Self-supervised Keypoint Correspondences for Multi-person Pose Estimation and Tracking in Videos
Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior
DVI: Depth Guided Video Inpainting for Autonomous Driving
Adaptive Video Highlight Detection by Learning from User History
Identity-Aware Multi-sentence Video Description
Mining Inter-Video Proposal Relations for Video Object Detection
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Kernelized Memory Network for Video Object Segmentation
Disentangling Multiple Features in Video Sequences Using Gaussian Processes in Variational Autoencoders
Kinematic 3D Object Detection in Monocular Video
Describing Unseen Videos via Multi-modal Cooperative Dialog Agents
DeepLandscape: Adversarial Modeling of Landscape Videos
BIRNAT: Bidirectional Recurrent Neural Networks with Adversarial Training for Video Snapshot Compressive Imaging
Cross-Identity Motion Transfer for Arbitrary Objects Through Pose-Attentive Video Reassembling
Aligning Videos in Space and Time
Proposal-Based Video Completion
Exploiting Temporal Coherence for Self-Supervised One-Shot Video Re-identification
Multi-view Action Recognition Using Cross-View Video Prediction
Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation
VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
Video Representation Learning by Recognizing Temporal Transformations
Measuring the Importance of Temporal Features in Video Saliency
Representation Learning on Visual-Symbolic Graphs for Video Understanding
S3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data
High-Quality Single-Model Deep Video Compression with Frame-Conv3D and Multi-frame Differential Modulation

2021 ICCV