FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing

Jitesh Joshi1, Sos S. Agaian2, Youngjun Cho1
1University College London, UK • 2City University of New York, USA
NeurIPS 2024

Overview

FactorizePhys introduces a breakthrough in remote photoplethysmography (rPPG) through the novel Factorized Self-Attention Module (FSAM), which leverages nonnegative matrix factorization to jointly compute multidimensional attention across spatial, temporal, and channel dimensions. This work demonstrates that matrix factorization can effectively serve as a multidimensional attention mechanism, outperforming existing state-of-the-art methods, including transformer-based approaches, for remote physiological sensing.

FactorizePhys Poster at NeurIPS 2024

Figure 1: Our poster at NeurIPS 2024, where we discussed FSAM with fellow researchers from the computer vision and machine learning community.

Novel Attention Mechanism

FSAM jointly computes spatial-temporal-channel attention using matrix factorization, unlike existing methods that compute attention disjointly.

Superior Generalization

Achieves state-of-the-art cross-dataset performance, demonstrating robust generalization across different rPPG datasets.

Computational Efficiency

Approximately 50× fewer parameters than existing methods while maintaining competitive performance.

Exceptional Performance

67% reduction in Mean Absolute Error (MAE) and 15% improvement in Signal-to-Noise Ratio (SNR) compared to state-of-the-art methods.

  • MAE reduction: 67%
  • SNR improvement: 15%
  • Parameters: 51K
  • HR correlation: 0.998

Technical Details

The Compression-as-Attention Paradigm

Using compression as an attention mechanism isn't new. In the CNN-dominated era, squeeze-and-excitation (SE) attention [2] was among the most popular mechanisms. However, a fundamental limitation emerges when working with multidimensional feature spaces: existing attention mechanisms compute attention disjointly across spatial, temporal, and channel dimensions.

For tasks like rPPG estimation that require joint modeling of these dimensions, squeezing individual dimensions can result in information loss, causing learned attention to miss comprehensive multidimensional feature relationships.
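
For contrast, here is a minimal sketch (ours, not the original SE implementation) of SE-style channel attention applied to a 3D feature volume; the pooling step is precisely the dimension squeezing discussed above:

import torch.nn as nn

class ChannelSqueezeAttention(nn.Module):
    # Simplified squeeze-and-excitation-style attention for (T, H, W) feature volumes
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, T, H, W)
        b, c = x.shape[:2]
        # "Squeeze": average pooling collapses T, H, and W into one value per channel,
        # discarding how features co-vary over space and time
        s = x.mean(dim=(2, 3, 4))            # (batch, channels)
        # "Excitation": per-channel attention weights from the squeezed descriptor
        w = self.fc(s).view(b, c, 1, 1, 1)
        return x * w

Attention is computed per channel only after the spatial and temporal structure has been collapsed to a single value, whereas FSAM keeps all dimensions in the factorization.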

FSAM: Factorized Self-Attention Module

FSAM uses Non-negative Matrix Factorization (NMF) [3] to factorize multidimensional feature space into a low-rank approximation, serving as a compressed representation that preserves interdependencies across all dimensions.

FSAM Overview

Figure 2: Overview of the Factorized Self-Attention Module (FSAM) showing how multidimensional voxel embeddings are transformed into a 2D matrix, factorized using NMF, and reconstructed to provide attention weights.

1. Joint Multidimensional Attention

No dimension squeezing required; processes spatial, temporal, and channel dimensions simultaneously

2. Parameter-free Optimization

Uses classic NMF as proposed by Lee & Seung [3], optimized with multiplicative update rules run inside a no_grad block, following the matrix-decomposition attention of the Hamburger module [4] (a toy sketch follows this list).

3. Task-specific Design

Tailored for signal extraction tasks with rank-1 factorization
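
To make the parameter-free optimization concrete, here is a toy sketch (ours, not the paper's code) of rank-1 NMF with multiplicative updates on a plain non-negative matrix, run without gradient tracking:

import torch

def rank1_nmf(V, steps=8, eps=1e-4):
    # Approximate a non-negative matrix V (M x N) as b @ c.T, with b (M x 1) and c (N x 1)
    M, N = V.shape
    b = torch.ones(M, 1)
    c = torch.ones(N, 1)
    with torch.no_grad():  # the updates are plain arithmetic: no learned parameters, no gradients
        for _ in range(steps):
            c = c * (V.T @ b) / (c @ (b.T @ b) + eps)
            b = b * (V @ c) / (b @ (c.T @ c) + eps)
    return b, c

V = torch.rand(160, 1296)   # illustrative sizes: temporal x (channel * spatial)
b, c = rank1_nmf(V)
V_hat = b @ c.T             # rank-1 approximation of V

The same update rules appear inside FSAM below, applied to the matrix built from voxel embeddings.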

FactorizePhys Architecture

Figure 3: FactorizePhys Architecture. (A) Proposed FactorizePhys with FSAM; (B) FSAM Adapted for EfficientPhys showing 3D-CNN architecture and 2D-CNN adaptation.

Mathematical Formulation

The Critical Transformation

Input spatial-temporal data can be expressed as I ∈ ℝT×C×H×W, where T is the number of frames (temporal dimension), C is the number of channels per frame (e.g., C = 3 for RGB), and H and W are the frame height and width in pixels. For this input I, we generate voxel embeddings ε ∈ ℝτ×κ×α×β through 3D feature extraction.

Existing 2D approach (Hamburger module [4]):

$$V^{s} \in \mathbb{R}^{M \times N} = \Gamma^{\kappa\alpha\beta \mapsto MN}(\xi_{pre}(\varepsilon \in \mathbb{R}^{\kappa \times \alpha \times \beta})) \ni \kappa \mapsto M,\alpha \times \beta \mapsto N$$

Our 3D spatial-temporal approach:

$$V^{st} \in \mathbb{R}^{M \times N} = \Gamma^{\tau\kappa\alpha\beta \mapsto MN}(\xi_{pre}(\varepsilon \in \mathbb{R}^{\tau \times \kappa \times \alpha \times \beta})) \ni \tau \mapsto M, \kappa \times \alpha \times \beta \mapsto N$$

This transformation is crucial for rPPG estimation because:

  • Physiological signal correlation: We need correlations between spatial/channel features and temporal patterns for BVP signal recovery
  • Single signal source: Only one underlying BVP signal across facial regions justifies rank-1 factorization (L=1)
  • Scale considerations: Temporal and spatial dimensions have vastly different scales
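
As a concrete illustration of this reshaping, the sketch below (ours, with illustrative dimension sizes and the preprocessing ξpre omitted) flattens voxel embeddings ε of shape (τ, κ, α, β) into the two factorization matrices:

import torch

tau, kappa, alpha, beta = 160, 16, 9, 9             # illustrative embedding dimensions
embeddings = torch.rand(tau, kappa, alpha, beta)     # non-negative voxel embeddings (placeholder)

# Existing 2D approach: per-frame factorization, kappa -> M, alpha*beta -> N
frame = embeddings[0]                                # a single time step, shape (kappa, alpha, beta)
V_s = frame.reshape(kappa, alpha * beta)             # (16, 81)

# Spatial-temporal approach: tau -> M, kappa*alpha*beta -> N
V_st = embeddings.reshape(tau, kappa * alpha * beta) # (160, 1296)

Each row of V_st is one time step and each column one channel-spatial location, so a rank-1 factorization pairs a single temporal basis with per-location weights.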

The NMF Attention Mechanism

The factorization process uses iterative multiplicative updates. The following is a simplified PyTorch sketch of FSAM (tensor layout, hyperparameter defaults, and gradient handling here are illustrative; the released implementation in the repository differs in details):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FSAM(nn.Module):
    # Factorized Self-Attention Module: a simplified sketch following the paper's description
    def __init__(self, channels, rank=1, md_steps=4, eps=1e-4):
        super().__init__()
        self.rank = rank          # rank-1 factorization: one underlying BVP signal
        self.md_steps = md_steps  # number of multiplicative-update steps (4-8 in the paper)
        self.eps = eps
        self.conv_pre = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv_post = nn.Conv3d(channels, channels, kernel_size=1)
        self.norm = nn.InstanceNorm3d(channels)

    def forward(self, E):
        # E: voxel embeddings of shape (batch, channels, temporal, height, width)
        batch, channel_dim, temporal_dim, alpha, beta = E.shape
        spatial_dim = alpha * beta

        # Preprocessing: shift and rectify so the matrix to be factorized is non-negative
        x = F.relu(self.conv_pre(E - E.min()))

        # Transform to the factorization matrix V_st:
        # rows = temporal dim (M), columns = channel x spatial dims (N)
        V_st = x.permute(0, 2, 1, 3, 4).reshape(batch, temporal_dim, channel_dim * spatial_dim)

        # Initialize bases (temporal) and coefficients (channel-spatial)
        bases = torch.ones(batch, temporal_dim, self.rank, device=E.device)
        coef = F.softmax(V_st.transpose(-2, -1) @ bases, dim=-1)

        # Iterative multiplicative updates (parameter-free; the paper runs these
        # inside a no_grad block, kept differentiable here for simplicity)
        for _ in range(self.md_steps):
            # Update coefficients
            numerator = V_st.transpose(-2, -1) @ bases
            denominator = coef @ (bases.transpose(-2, -1) @ bases)
            coef = coef * (numerator / (denominator + self.eps))
            # Update bases
            numerator = V_st @ coef
            denominator = bases @ (coef.transpose(-2, -1) @ coef)
            bases = bases * (numerator / (denominator + self.eps))

        # Low-rank reconstruction serves as the attention map
        V_hat = bases @ coef.transpose(-2, -1)
        E_hat = V_hat.reshape(batch, temporal_dim, channel_dim, alpha, beta).permute(0, 2, 1, 3, 4)

        # Apply attention with a residual connection
        return E + self.norm(E * F.relu(self.conv_post(E_hat)))
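
A quick shape check of the sketch above (sizes are illustrative; the real embedding dimensions depend on the feature extractor configuration):

fsam = FSAM(channels=16)
E = torch.rand(2, 16, 160, 9, 9)   # (batch, channels, temporal, height, width)
out = fsam(E)
print(out.shape)                   # torch.Size([2, 16, 160, 9, 9]), same shape as the input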

Why Rank-1 Factorization Works

Our ablation studies confirm that rank-1 factorization performs best for rPPG estimation. This aligns with the physiological assumption that there is a single underlying blood volume pulse signal across different facial regions.
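
In matrix terms (our restatement of the rank-1 case), the reconstructed attention map is a single outer product:

$$\hat{V}^{st} = b\,c^{\top}, \qquad b \in \mathbb{R}^{M \times 1}, \; c \in \mathbb{R}^{N \times 1}$$

where the basis b acts as a shared temporal, BVP-like signature and each entry of c weights how strongly a given channel-spatial location expresses it. A higher rank would allow multiple competing temporal sources, which the ablation below shows is unnecessary for rPPG.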

Rank Ablation Study

Table 1: Ablation study results showing performance across different factorization ranks. Rank-1 achieves optimal performance, supporting the single signal source assumption for rPPG estimation.

Performance Analysis

Why FSAM Outperforms Transformers

1. Task-Specific vs Generic Design

Transformers rely on generic self-attention that imposes no task-specific structure on the feature space:

$$Attention(Q,K,V) = softmax(QK^T/\sqrt{d_k})V$$

FSAM, by contrast, is designed specifically for spatial-temporal signal extraction: its rank-1 factorization encodes the prior that a single blood volume pulse signal underlies all facial regions.

2. Computational Efficiency

Method            Complexity   Parameters   Update Steps
FSAM              O(n)         52K          4-8 multiplicative updates
Transformer [5]   O(n²)        7.38M        Full attention computation

Efficiency Highlight

138× fewer parameters compared to PhysFormer [6] while achieving superior performance across all evaluation metrics.

Cross-Dataset Generalization Performance

FactorizePhys demonstrates superior performance across all evaluation metrics when tested on unseen datasets:

Cross-dataset Performance Evaluation

Table 2: Cross-dataset Performance Evaluation. Comprehensive comparison showing FactorizePhys outperforming SOTA methods across all datasets with MAE, RMSE, MAPE, Correlation, SNR, and MACC metrics.

Performance vs. Latency Analysis

Performance vs. Latency

Figure 4: Performance vs. Latency. Cumulative cross-dataset performance (MAE) vs. latency plot with model parameters as sphere size, showing FactorizePhys achieving best performance with fewer parameters.

Attention Visualization

In this work, we introduced an intuitive visualization of learned attention features by computing the cosine similarity between temporal embeddings and the ground-truth PPG signal. In the figure below, higher similarity scores and better spatial selectivity can be observed for FactorizePhys trained with FSAM.
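
A hedged sketch of how such a saliency map can be computed (our illustration; input names are placeholders, and the paper's exact procedure may differ in details such as normalization or channel aggregation):

import torch
import torch.nn.functional as F

def attention_saliency(embeddings, ppg_gt):
    # embeddings: (C, T, H, W) voxel embeddings from the trained network (placeholder input)
    # ppg_gt:     (T,) ground-truth PPG signal aligned with the embedding frames
    c, t, h, w = embeddings.shape
    traces = embeddings.reshape(c, t, h * w)            # one length-T trace per channel and location
    target = ppg_gt.view(1, t, 1).expand(c, t, h * w)   # broadcast the reference signal
    sim = F.cosine_similarity(traces, target, dim=1)    # (C, H*W) similarity scores
    return sim.max(dim=0).values.reshape(h, w)          # strongest channel match per spatial location

Locations whose temporal embeddings track the pulse waveform appear brighter, which is how Figure 5 exposes the spatial selectivity learned with FSAM.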

Attention Visualization

Figure 5: Attention visualization comparing baseline model (left) and FactorizePhys with FSAM (right). Higher cosine similarity scores (brighter regions) indicate better spatial selectivity for rPPG-rich facial regions.

The Inference-Time Advantage

A remarkable finding: FactorizePhys trained with FSAM retains its performance even when FSAM is dropped at inference. This suggests that FSAM enhances the saliency of relevant features during training, guiding the network toward salient feature representations that persist without the attention module.

# Training: FSAM influences 3D convolutional kernels
factorized_embeddings = fsam(voxel_embeddings)
loss = compute_loss(head(factorized_embeddings), ground_truth)

# Inference: Can drop FSAM without performance loss
output = head(voxel_embeddings)  # No FSAM needed!

This dramatically reduces inference latency while maintaining accuracy, making the approach well suited to real-time applications.

Key Contributions and Broader Implications

For the rPPG Community:

  • A compact architecture of roughly 50K parameters with state-of-the-art cross-dataset generalization, whose attention module can be dropped at inference, bringing accurate, real-time rPPG estimation closer to practical deployment.

For the Broader AI Community:

  • Evidence that nonnegative matrix factorization can serve as an effective, parameter-free multidimensional attention mechanism, offering a task-aware alternative to transformer-style self-attention for spatial-temporal signal extraction.

Future Research Directions

FSAM's success opens several promising avenues:

  1. Extended Applications: Video understanding, action recognition, medical time-series analysis
  2. Enhanced NMF Variants: Incorporating temporal smoothness or frequency domain constraints
  3. Multi-physiological Signals: Check out our subsequent work (MMRPhys) that explores this direction for robust estimation of multiple physiological signals
  4. Hybrid Architectures: Combining factorization-based attention with other mechanisms for different modalities
  5. Theoretical Analysis: Understanding why rank-1 factorization generalizes so well across datasets

Conclusion

FSAM demonstrates that deeper understanding of the problem domain can lead to more effective solutions than applying generic, computationally expensive methods. In an era of ever-growing model sizes, this work shows that thoughtful design trumps brute-force scaling.

Code & Resources

GitHub Repository

Complete implementation with training scripts, evaluation metrics, and pre-trained models.


rPPG-Toolbox Integration

Built on the comprehensive rPPG-Toolbox framework for fair comparison and reproducible results.

Datasets

Evaluated on iBVP, PURE, UBFC-rPPG, and SCAMPS datasets with standardized preprocessing.

Code and Data Availability

Resource                    Description                                                           Link
Paper                       FactorizePhys: Matrix Factorization for Multidimensional Attention   NeurIPS 2024
Code                        Complete implementation and training scripts                          GitHub Repository
SOTA Methods and Datasets   rPPG-Toolbox: Deep Remote PPG Toolbox (NeurIPS 2023)                  rPPG Toolbox

Citation

@inproceedings{joshi2024factorizephys,
  author = {Joshi, Jitesh and Agaian, Sos S. and Cho, Youngjun},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
  pages = {96607--96639},
  publisher = {Curran Associates, Inc.},
  title = {FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing},
  url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/af1c61e4dd59596f033d826419870602-Paper-Conference.pdf},
  volume = {37},
  year = {2024}
}

References

  1. Joshi, J., Agaian, S. S., & Cho, Y. (2024). FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing. In Advances in Neural Information Processing Systems (NeurIPS 2024).
  2. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132-7141).
  3. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.
  4. Geng, Z., Guo, M. H., Chen, H., Li, X., Wei, K., & Lin, Z. (2021). Is attention better than matrix decomposition? In International Conference on Learning Representations.
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
  6. Yu, Z., Shen, Y., Shi, J., Zhao, H., Torr, P. H., & Zhao, G. (2022). PhysFormer: Facial video-based physiological measurement with temporal difference transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4186-4196).
  7. Liu, X., Hill, B., Jiang, Z., Patel, S., & McDuff, D. (2023). EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 5008-5017).
  8. Stricker, R., Müller, S., & Gross, H. M. (2014). Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication (pp. 1056-1062).
  9. McDuff, D., Wander, M., Liu, X., Hill, B., Hernandez, J., Lester, J., & Baltrusaitis, T. (2022). SCAMPS: Synthetics for camera measurement of physiological signals. Advances in Neural Information Processing Systems, 35, 3744-3757.
  10. Bobbia, S., Macwan, R., Benezeth, Y., Mansouri, A., & Dubois, J. (2019). Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recognition Letters, 124, 82-90.
  11. Joshi, J., & Cho, Y. (2024). iBVP Dataset: RGB-Thermal rPPG dataset with high resolution signal quality labels. Electronics, 13(7), 1334.