FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing
Overview
FactorizePhys introduces a breakthrough in remote photoplethysmography (rPPG) through the novel Factorized Self-Attention Module (FSAM), which leverages nonnegative matrix factorization to jointly compute multidimensional attention across spatial, temporal, and channel dimensions. This work demonstrates that matrix factorization can effectively serve as a multidimensional attention mechanism, outperforming existing state-of-the-art methods, including transformer-based approaches, for remote physiological sensing.

Figure 1: Our poster at NeurIPS 2024, where we discussed FSAM with fellow researchers from the computer vision and machine learning community.
Novel Attention Mechanism
FSAM jointly computes spatial-temporal-channel attention using matrix factorization, unlike existing methods that compute attention disjointly.
Superior Generalization
Achieves state-of-the-art cross-dataset performance, demonstrating robust generalization across different rPPG datasets.
Computational Efficiency
Approximately 50× fewer parameters than existing methods while maintaining competitive performance.
Exceptional Performance
67% reduction in Mean Absolute Error (MAE) and 15% improvement in Signal-to-Noise Ratio (SNR) compared to state-of-the-art methods.
Technical Details
The Compression-as-Attention Paradigm
Using compression as an attention mechanism isn't new. In the CNN-dominated era, squeeze-and-excitation (SE) attention [2] was among the most popular mechanisms. However, a fundamental limitation emerges when working with multidimensional feature spaces: existing attention mechanisms compute attention disjointly across spatial, temporal, and channel dimensions.
For tasks like rPPG estimation that require joint modeling of these dimensions, squeezing individual dimensions can result in information loss, causing learned attention to miss comprehensive multidimensional feature relationships.
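For contrast, the sketch below shows a minimal SE-style channel-attention block in PyTorch (illustrative only, not code from either paper): the spatial dimensions are squeezed away by global pooling before channel weights are computed, which is exactly the kind of dimension squeezing that discards joint spatial-temporal-channel structure.

```python
import torch
from torch import nn

class SqueezeExcite(nn.Module):
    """Minimal SE-style channel attention [2]: squeeze spatial dims, excite channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, H, W)
        squeezed = x.mean(dim=(2, 3))          # global average pool squeezes H, W away
        weights = self.fc(squeezed)            # per-channel attention weights
        return x * weights[:, :, None, None]   # rescale channels; spatial detail is ignored
```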
FSAM: Factorized Self-Attention Module
FSAM uses Non-negative Matrix Factorization (NMF) [3] to factorize multidimensional feature space into a low-rank approximation, serving as a compressed representation that preserves interdependencies across all dimensions.

Figure 2: Overview of the Factorized Self-Attention Module (FSAM) showing how multidimensional voxel embeddings are transformed into a 2D matrix, factorized using NMF, and reconstructed to provide attention weights.
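As background, here is a tiny self-contained sketch of the classic NMF multiplicative-update rule [3, 4] applied to a random non-negative matrix; FSAM applies the same update rule to a matrix built from the voxel embeddings, as shown in the full module code further below.

```python
import torch

torch.manual_seed(0)
V = torch.rand(64, 256)            # non-negative matrix to factorize
rank = 1
W = torch.rand(64, rank)           # bases
H = torch.rand(rank, 256)          # coefficients
eps = 1e-4

for _ in range(50):
    # Multiplicative updates (Lee & Seung): W and H stay non-negative by construction
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    W = W * (V @ H.T) / (W @ H @ H.T + eps)

print(torch.norm(V - W @ H) / torch.norm(V))   # relative error of the rank-1 approximation
```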
1. Joint Multidimensional Attention
No dimension squeezing required; processes spatial, temporal, and channel dimensions simultaneously
2. Parameter-free Optimization
Uses classic NMF as proposed by Lee & Seung (1999) [3], with the optimization performed via multiplicative updates [4] inside a 'no_grad' block, adding no learnable parameters.
3. Task-specific Design
Tailored for signal extraction tasks with rank-1 factorization

Figure 3: FactorizePhys Architecture. (A) Proposed FactorizePhys with FSAM; (B) FSAM Adapted for EfficientPhys showing 3D-CNN architecture and 2D-CNN adaptation.
Mathematical Formulation
The Critical Transformation
The input spatial-temporal data can be expressed as I ∈ ℝ^(T×C×H×W), where T, C, H, and W denote the number of frames (temporal dimension), the channels per frame (e.g., C = 3 for RGB), and the frame height and width in pixels, respectively. From this input I, we generate voxel embeddings ε ∈ ℝ^(τ×κ×α×β) through 3D feature extraction.
Existing 2D approach (Hamburger module [4]): a per-frame feature map X ∈ ℝ^(C×H×W) is unfolded into a 2D matrix of channels versus flattened spatial positions (C × H·W), so the factorization captures only spatial-channel structure within a frame.
Our 3D spatial-temporal approach: the voxel embeddings ε ∈ ℝ^(τ×κ×α×β) are reshaped into a matrix with τ temporal rows and κ·α·β spatial-channel columns, so a single factorization spans all three dimension types (this is the matrix V_st in the code below).
This transformation is crucial for rPPG estimation because:
- Physiological signal correlation: We need correlations between spatial/channel features and temporal patterns for BVP signal recovery
- Single signal source: Only one underlying BVP signal across facial regions justifies rank-1 factorization (L=1)
- Scale considerations: Temporal and spatial dimensions have vastly different scales
The NMF Attention Mechanism
The factorization process uses iterative multiplicative updates:
import torch
import torch.nn.functional as F

def fsam(E, conv_in, conv_out, inst_norm, rank=1, md_steps=4, eps=1e-4):
    # E: voxel embeddings of shape (batch, channel, temporal, alpha, beta)
    # conv_in, conv_out: 3D convolution layers; inst_norm: an InstanceNorm3d layer
    batch, channel_dim, temporal_dim, alpha, beta = E.shape
    spatial_dim = alpha * beta

    # Preprocessing: shift and rectify to ensure non-negativity for NMF
    x = F.relu(conv_in(E - E.min()))

    # Transform to the factorization matrix: temporal vectors as rows,
    # flattened spatial-channel features as columns
    V_st = x.permute(0, 2, 1, 3, 4).reshape(batch, temporal_dim, channel_dim * spatial_dim)

    # Multiplicative updates run without gradient tracking
    with torch.no_grad():
        # Initialize bases and coefficients
        bases = torch.ones(batch, temporal_dim, rank, device=E.device)
        coef = torch.softmax(V_st.transpose(-2, -1) @ bases, dim=-1)

        # Iterative multiplicative updates (typically 4-8 steps)
        for _ in range(md_steps):
            # Update coefficients
            numerator = V_st.transpose(-2, -1) @ bases
            denominator = coef @ (bases.transpose(-2, -1) @ bases)
            coef = coef * (numerator / (denominator + eps))
            # Update bases
            numerator = V_st @ coef
            denominator = bases @ (coef.transpose(-2, -1) @ coef)
            bases = bases * (numerator / (denominator + eps))

    # Reconstruct the low-rank approximation and restore the voxel shape
    V_hat = bases @ coef.transpose(-2, -1)
    E_hat = V_hat.reshape(batch, temporal_dim, channel_dim, alpha, beta).permute(0, 2, 1, 3, 4)

    # Apply attention with a residual connection
    return E + inst_norm(E * F.relu(conv_out(E_hat)))
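A minimal usage sketch of the function above, with hypothetical embedding dimensions and 1×1×1 convolutions standing in for the projection layers (the released implementation may configure these differently):

```python
import torch
from torch import nn

# Hypothetical dimensions: 2 clips, 16 feature channels, 160 frames, 9x9 spatial grid
batch, k, t, a, b = 2, 16, 160, 9, 9
E = torch.rand(batch, k, t, a, b)           # voxel embeddings from a 3D feature extractor

conv_in = nn.Conv3d(k, k, kernel_size=1)    # projection before factorization
conv_out = nn.Conv3d(k, k, kernel_size=1)   # projection after reconstruction
inst_norm = nn.InstanceNorm3d(k)

attended = fsam(E, conv_in, conv_out, inst_norm, rank=1, md_steps=4)
print(attended.shape)                       # torch.Size([2, 16, 160, 9, 9])
```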
Why Rank-1 Factorization Works
The paper's ablation studies confirm that rank-1 factorization performs optimally for rPPG estimation. This aligns with the physiological assumption that there's a single underlying blood volume pulse signal across different facial regions.

Table 1: Ablation study results showing performance across different factorization ranks. Rank-1 achieves optimal performance, supporting the single signal source assumption for rPPG estimation.
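To give a rough sense of the compression a low-rank approximation provides (with hypothetical embedding dimensions, not values taken from the paper): a rank-L factorization of the τ × (κ·α·β) matrix stores only L·(τ + κ·α·β) values instead of τ·κ·α·β.

```python
# Hypothetical embedding dimensions, for illustration only
tau, kappa, alpha, beta = 160, 16, 9, 9
full = tau * kappa * alpha * beta                      # entries in the full matrix
for rank in (1, 2, 4, 8):
    low_rank = rank * (tau + kappa * alpha * beta)     # entries in bases + coefficients
    print(f"rank {rank}: {low_rank} values, {full / low_rank:.0f}x compression")
```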
Performance Analysis
Why FSAM Outperforms Transformers
1. Task-Specific vs Generic Design
Transformers use generic self-attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V [5], which treats all token positions uniformly regardless of the task.
FSAM is specifically designed for spatial-temporal signal extraction:
- Temporal vectors as the primary dimension (signals evolve over time, directly supervised through loss function)
- Spatial-channel features as descriptors (different facial regions contribute differently)
- Rank-1 constraint enforces single signal source assumption
2. Computational Efficiency
Method | Complexity | Parameters | Update Steps |
---|---|---|---|
FSAM | O(n) | 52K | 4-8 multiplicative updates |
Transformer [5] | O(n²) | 7.38M | Full attention computation |
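A rough back-of-envelope comparison behind the two complexity entries above, treating flattened spatial-channel positions as tokens (hypothetical dimensions, illustration only):

```python
# Hypothetical sizes: n spatial-channel positions, d-dimensional temporal vectors
n, d, steps, rank = 1296, 160, 4, 1

attention_ops = n * n * d                 # pairwise self-attention scores: O(n^2 * d)
nmf_ops = steps * 2 * n * d * rank        # two dominant (n x d)(d x rank) products per update: O(n * d)
print(attention_ops, nmf_ops, attention_ops // nmf_ops)
```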
Efficiency Highlight
138× fewer parameters compared to PhysFormer [6] while achieving superior performance across all evaluation metrics.
Cross-Dataset Generalization Performance
FactorizePhys demonstrates superior performance across all evaluation metrics when tested on unseen datasets:

Table 2: Cross-dataset Performance Evaluation. Comprehensive comparison showing FactorizePhys outperforming SOTA methods across all datasets with MAE, RMSE, MAPE, Correlation, SNR, and MACC metrics.
Performance vs. Latency Analysis

Figure 4: Performance vs. Latency. Cumulative cross-dataset performance (MAE) vs. latency plot with model parameters as sphere size, showing FactorizePhys achieving best performance with fewer parameters.
Attention Visualization
In this work, we introduced an intuitive visualization of learned attention features by computing the cosine similarity between temporal embeddings and the ground-truth PPG signal. In the figure below, higher correlation scores and better spatial selectivity can be observed for FactorizePhys trained with FSAM.

Figure 5: Attention visualization comparing baseline model (left) and FactorizePhys with FSAM (right). Higher cosine similarity scores (brighter regions) indicate better spatial selectivity for rPPG-rich facial regions.
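A minimal sketch of how such a map can be computed (the function and variable names here are illustrative, not from the released code): each spatial location's temporal embedding is compared against the ground-truth PPG waveform via cosine similarity.

```python
import torch
import torch.nn.functional as F

def attention_map(voxel_embeddings, ppg):
    # voxel_embeddings: (channels, temporal, alpha, beta); ppg: (temporal,) ground-truth signal
    k, t, a, b = voxel_embeddings.shape
    emb = voxel_embeddings.permute(0, 2, 3, 1).reshape(-1, t)     # one temporal vector per (channel, location)
    sim = F.cosine_similarity(emb, ppg.unsqueeze(0), dim=-1)      # similarity to the ground-truth PPG
    return sim.reshape(k, a, b).mean(dim=0)                       # average over channels -> (alpha, beta) map
```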
The Inference-Time Advantage
A remarkable finding: FactorizePhys trained with FSAM retains its performance even when FSAM is dropped at inference. This suggests that FSAM enhances the saliency of relevant features during training, guiding the network to learn salient feature representations that persist without the attention module.
# Training: FSAM influences 3D convolutional kernels
factorized_embeddings = fsam(voxel_embeddings)
loss = compute_loss(head(factorized_embeddings), ground_truth)
# Inference: Can drop FSAM without performance loss
output = head(voxel_embeddings) # No FSAM needed!
This dramatically reduces inference latency while maintaining accuracy, making it ideal for real-time applications.
Key Contributions and Broader Implications
For the rPPG Community:
- First matrix factorization-based attention specifically designed for physiological signal extraction
- State-of-the-art cross-dataset generalization with dramatically fewer parameters
- Real-time deployment capability without performance degradation
For the Broader AI Community:
- Novel attention paradigm demonstrating that domain-specific designs can outperform generic mechanisms
- Efficiency breakthrough: 138× parameter reduction compared to transformers with superior performance
- New perspective on attention: Factorization as compression can be more effective than dimension squeezing
Future Research Directions
FSAM's success opens several promising avenues:
- Extended Applications: Video understanding, action recognition, medical time-series analysis
- Enhanced NMF Variants: Incorporating temporal smoothness or frequency domain constraints
- Multi-physiological Signals: Check out our subsequent work (MMRPhys) that explores this direction for robust estimation of multiple physiological signals
- Hybrid Architectures: Combining factorization-based attention with other mechanisms for different modalities
- Theoretical Analysis: Understanding why rank-1 factorization generalizes so well across datasets
Conclusion
FSAM demonstrates that deeper understanding of the problem domain can lead to more effective solutions than applying generic, computationally expensive methods. In an era of ever-growing model sizes, this work shows that thoughtful design trumps brute-force scaling.
Code & Resources
GitHub Repository
Complete implementation with training scripts, evaluation metrics, and pre-trained models.
rPPG-Toolbox Integration
Built on the comprehensive rPPG-Toolbox framework for fair comparison and reproducible results.
Datasets
Evaluated on iBVP, PURE, UBFC-rPPG, and SCAMPS datasets with standardized preprocessing.
Code and Data Availability
Resource | Description | Link |
---|---|---|
Paper | FactorizePhys: Matrix Factorization for Multidimensional Attention | NeurIPS 2024 |
Code | Complete implementation and training scripts | GitHub Repository |
SOTA Methods and Datasets | rPPG-Toolbox: Deep Remote PPG Toolbox (NeurIPS 2023) | rPPG Toolbox |
Citation
@inproceedings{joshi2024factorizephys,
  author    = {Joshi, Jitesh and Agaian, Sos S. and Cho, Youngjun},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages     = {96607--96639},
  publisher = {Curran Associates, Inc.},
  title     = {FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/af1c61e4dd59596f033d826419870602-Paper-Conference.pdf},
  volume    = {37},
  year      = {2024}
}
References
1. Joshi, J., Agaian, S. S., & Cho, Y. (2024). FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing. In Advances in Neural Information Processing Systems (NeurIPS 2024).
2. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132-7141).
3. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.
4. Geng, Z., Guo, M. H., Chen, H., Li, X., Wei, K., & Lin, Z. (2021). Is attention better than matrix decomposition? In International Conference on Learning Representations.
5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
6. Yu, Z., Shen, Y., Shi, J., Zhao, H., Torr, P. H., & Zhao, G. (2022). PhysFormer: Facial video-based physiological measurement with temporal difference transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4186-4196).
7. Liu, X., Hill, B., Jiang, Z., Patel, S., & McDuff, D. (2023). EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 5008-5017).
8. Stricker, R., Müller, S., & Gross, H. M. (2014). Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication (pp. 1056-1062).
9. McDuff, D., Wander, M., Liu, X., Hill, B., Hernandez, J., Lester, J., & Baltrusaitis, T. (2022). SCAMPS: Synthetics for camera measurement of physiological signals. Advances in Neural Information Processing Systems, 35, 3744-3757.
10. Bobbia, S., Macwan, R., Benezeth, Y., Mansouri, A., & Dubois, J. (2019). Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recognition Letters, 124, 82-90.
11. Joshi, J., & Cho, Y. (2024). iBVP Dataset: RGB-Thermal rPPG dataset with high resolution signal quality labels. Electronics, 13(7), 1334.