About

I am a second-year Computer Science Ph.D. student at the University of California, Santa Barbara (UCSB), advised by Prof. Zheng Zhang.

Before joining UCSB, I received my B.S. and M.S. from EIC@Huazhong University of Science and Technology, where I worked with Prof. Jun Sun, Prof. Xinggang Wang, and Prof. Yingzhuang Liu on computer vision and the interpretability of deep learning.

Research Interests

My research focuses on efficient pre-training for LLMs/VLMs, AI for science, computer vision, and the mathematical & physical principles behind them. Recently I have been working on

  • Low-rank pre-training of foundation models.
  • Low-precision training.
  • Sparse attention.

News

[09/2025] One of our recent works has been accepted by NeurIPS 2025. Congratulations to all collaborators!

[08/2025] One of our recent works has been accepted by EMNLP 2025 (oral). Congratulations to all collaborators!

[09/2024] Joined the Computer Science Ph.D. program at UCSB.

[09/2024] One of our recent works has been accepted by NeurIPS 2024. Congratulations to all collaborators!

Publications

  • Preprint Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
    Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu
    arXiv 2025 (Submitted to ICLR 2026)
    pdf code
    @misc{zhang2025renyisharpnessnovelsharpness,
        title={R\'enyi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization}, 
        author={Qiaozhe Zhang and Jun Sun and Ruijie Zhang and Yingzhuang Liu},
        year={2025},
        eprint={2510.07758},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2510.07758}, 
    }
    Sharpness (of the loss minima) is a common measure used to investigate the generalization of neural networks. Intuitively, the flatter the landscape near a minimum, the better the generalization is expected to be. Unfortunately, the correlation between many existing sharpness measures and generalization is usually not strong, sometimes even weak. To close the gap between this intuition and reality, we propose a novel sharpness measure, Rényi sharpness, defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we observe that uniform (identical) eigenvalues of the loss Hessian are most desirable for good generalization (while keeping their sum constant); 2) we employ the Rényi entropy to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, taking advantage of the reparametrization-invariance property of Rényi sharpness as well as the trick of translating the data discrepancy into a weight perturbation. Furthermore, extensive experiments verify the strong correlation (specifically, Kendall rank correlation) between Rényi sharpness and generalization. Moreover, we propose using a variant of Rényi sharpness as a regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of the proposed RSAM method can be as high as nearly 2.5% compared with the classical SAM method.
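    The defining quantity above can be sketched in a few lines of NumPy. This is only a hedged illustration of the definition as stated in the abstract (the negative Rényi entropy of the sum-normalized eigenvalue spectrum of the loss Hessian); the order alpha, the normalization, and the handling of non-positive eigenvalues in the paper may differ, and the helper name renyi_sharpness is illustrative.
    # Hedged sketch: Rényi sharpness as the negative Rényi entropy of the
    # (sum-normalized) eigenvalue spectrum of the loss Hessian, following the
    # description in the abstract; details may differ from the paper.
    import numpy as np

    def renyi_sharpness(hessian_eigenvalues, alpha=2.0, eps=1e-12):
        lam = np.clip(np.asarray(hessian_eigenvalues, dtype=float), eps, None)
        p = lam / lam.sum()                # treat the spectrum as a probability distribution
        if np.isclose(alpha, 1.0):         # alpha -> 1 recovers the Shannon entropy
            entropy = -np.sum(p * np.log(p))
        else:
            entropy = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
        return -entropy                    # more concentrated spectrum -> larger value

    # A uniform spectrum gives the smallest value; a spectrum dominated by a few
    # large eigenvalues gives a larger one, matching the intuition in the abstract.
    print(renyi_sharpness([1, 1, 1, 1]))       # -1.386 (most uniform)
    print(renyi_sharpness([100, 1, 1, 1]))     # -0.059 (concentrated)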
  • NeurIPS 2025 LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing
    Ruijie Zhang*, Ziyue Liu*, Zhengyang Wang, Zheng Zhang (* Equal contributions)
    The 39th Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
    pdf code
    @article{zhang2025lax,
    title={LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing},
    author={Zhang, Ruijie and Liu, Ziyue and Wang, Zhengyang and Zhang, Zheng},
    journal={arXiv preprint arXiv:2505.21732},
    year={2025}
    }
    Training foundation models such as ViTs and LLMs incurs tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often degrades performance due to the restricted parameter space. In this work, we introduce Latent Crossing (LaX) -- a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3× fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7/13B, LaX consistently improves performance on arithmetic and common-sense reasoning tasks at negligible cost.
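    The abstract does not spell out the exact form of the crossing operator, so the following is only a hedged sketch of where a plug-and-play module acting across low-rank subspaces could sit inside a factorized linear layer; the class name, the linear mixing map, and the argument names are illustrative assumptions rather than the paper's actual LaX formulation. Stacking such layers and threading the exposed latent from one layer or branch into the next is one plausible reading of "information flow across low-rank subspaces".
    # Hedged sketch: a low-rank (factorized) linear layer plus a placeholder
    # "crossing" step that mixes the current latent code with a latent carried
    # over from another low-rank branch or layer.
    import torch
    import torch.nn as nn

    class LowRankLinearWithCrossing(nn.Module):
        def __init__(self, d_in, d_out, rank):
            super().__init__()
            self.down = nn.Linear(d_in, rank, bias=False)   # project into the rank-r subspace
            self.up = nn.Linear(rank, d_out, bias=False)    # project back to full width
            self.mix = nn.Linear(rank, rank, bias=False)    # hypothetical crossing map (placeholder)

        def forward(self, x, latent_from_other_branch=None):
            z = self.down(x)
            if latent_from_other_branch is not None:
                # "Crossing": let information from another low-rank subspace flow in.
                z = z + self.mix(latent_from_other_branch)
            return self.up(z), z                            # expose the latent for downstream layers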
  • EMNLP 2025 CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation (oral)
    Ziyue Liu*, Ruijie Zhang*, Zhengyang Wang*, Mingsong Yan, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Sui Tang, Zheng Zhang (* Equal contributions)
    The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
    pdf code
    @article{liu2025cola,
    title={CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation},
    author={Liu, Ziyue and Zhang, Ruijie and Wang, Zhengyang and Yan, Mingsong and Yang, Zi and Hovland, Paul and Nicolae, Bogdan and Cappello, Franck and Tang, Sui and Zhang, Zheng},
    journal={arXiv preprint arXiv:2502.10940},
    year={2025}
    }
    The full-size MLPs and the attention projection layers account for most of the parameters of large language models (LLMs), consuming extensive computational resources in pre-training. We empirically observe that the activations of pre-trained LLMs exhibit a low-rank property. Motivated by this observation, we propose CoLA and its memory-efficient implementation, CoLA-M, to replace these full-size layers with compute-efficient auto-encoders that naturally enforce low-rank activations throughout training. This fundamental architectural change eliminates activation redundancy and significantly boosts model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by 2× and improves training throughput by 1.86× while maintaining full-rank-level performance. CoLA-M further reduces memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The resulting LLMs are also 2× smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
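    As a minimal sketch of the architectural idea stated in the abstract (a bottlenecked "auto-encoder" layer whose nonlinearity lives in a rank-r latent space, so every activation it produces has rank at most r), consider the snippet below; the class name, the choice of activation, and other details are assumptions rather than the exact CoLA layer, which is specified in the paper and its code release.
    # Hedged sketch: replace a full-size linear layer W (d_out x d_in) with a
    # down-projection, a nonlinearity applied inside the rank-r bottleneck, and
    # an up-projection, so low-rank activations are enforced by construction.
    import torch
    import torch.nn as nn

    class LowRankActivationLayer(nn.Module):
        def __init__(self, d_in, d_out, rank):
            super().__init__()
            self.encode = nn.Linear(d_in, rank, bias=False)   # d_in -> r
            self.act = nn.SiLU()                              # nonlinearity in the bottleneck (assumed)
            self.decode = nn.Linear(rank, d_out, bias=False)  # r -> d_out

        def forward(self, x):
            return self.decode(self.act(self.encode(x)))

    # Parameter and FLOP counts drop from d_in * d_out to rank * (d_in + d_out):
    # e.g. rank = 256 with d_in = d_out = 4096 is about 8x fewer multiply-adds per token.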
  • NeurIPS 2024 How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective
    Qiaozhe Zhang, Ruijie Zhang, Jun Sun, Yingzhuang Liu
    The 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
    pdf code NeurIPS
    @article{zhang2024sparse,
        title={How sparse can we prune a deep network: A fundamental limit perspective},
        author={Zhang, Qiaozhe and Zhang, Ruijie and Sun, Jun and Liu, Yingzhuang},
        journal={Advances in Neural Information Processing Systems},
        volume={37},
        pages={91337--91372},
        year={2024}
    }
    Network pruning is a commonly used technique to alleviate the storage and computational burden of deep neural networks. However, a characterization of the fundamental limit of network pruning is still lacking. To close this gap, in this work we take a first-principles approach: we directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension from convex geometry, which enables us to characterize the sharp phase transition point that can be regarded as the fundamental limit of the pruning ratio. Through this limit, we identify two key factors that determine the pruning ratio limit, namely weight magnitude and network sharpness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller the pruning ratio. Moreover, we provide efficient countermeasures to address the challenges in computing the pruning limit, which mainly involve accurate spectrum estimation of a large-scale, non-positive-definite Hessian matrix. In addition, through the lens of the pruning ratio threshold, we provide rigorous interpretations of several heuristics in existing pruning algorithms. Extensive experiments demonstrate that our theoretical pruning ratio threshold coincides very well with the experimental results.
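    For readers unfamiliar with the convex-geometry tool mentioned above, the standard definition of the statistical dimension is reproduced below as background; this is the textbook notion, written in LaTeX, not the paper's specific derivation or notation.
    % Background: the statistical dimension of a closed convex cone C in R^n is
    % the expected squared Euclidean norm of the projection of a standard
    % Gaussian vector onto C; it measures the "size" of C and governs the
    % location of phase transitions in convex problems with cone constraints.
    \delta(C) = \mathbb{E}_{g \sim \mathcal{N}(0, I_n)}\!\left[ \lVert \Pi_C(g) \rVert_2^2 \right],
    \qquad \Pi_C(g) = \arg\min_{y \in C} \lVert y - g \rVert_2 .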
  • Preprint Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
    Ruijie Zhang, Qiaozhe Zhang, Yingzhuang Liu, Hao Xin, Yan Liu, Xinggang Wang
    arXiv 2023
    pdf code
    @article{zhang2023multi,
          title={Multi-level multiple instance learning with transformer for whole slide image classification},
          author={Zhang, Ruijie and Zhang, Qiaozhe and Liu, Yingzhuang and Xin, Hao and Liu, Yan and Wang, Xinggang},
          journal={arXiv preprint arXiv:2306.05029},
          year={2023}
    }
    Whole slide image (WSI) refers to a type of high-resolution scanned tissue image that is extensively employed in computer-assisted diagnosis (CAD). The extremely high resolution and the limited availability of region-level annotations make it challenging to apply deep learning methods to WSI-based digital diagnosis. Recently, integrating multiple instance learning (MIL) and Transformers for WSI analysis has shown very promising results. However, designing effective Transformers for this weakly supervised, high-resolution image analysis is an underexplored yet important problem. In this paper, we propose a Multi-level MIL (MMIL) scheme that introduces a hierarchical structure to MIL, enabling efficient handling of MIL tasks involving a large number of instances. Based on MMIL, we instantiate MMIL-Transformer, an efficient Transformer model with windowed exact self-attention for large-scale MIL tasks. To validate its effectiveness, we conduct a set of experiments on WSI classification tasks, where MMIL-Transformer demonstrates superior performance compared to existing state-of-the-art methods, achieving 96.80% test AUC and 97.67% test accuracy on the CAMELYON16 dataset, and 99.04% test AUC and 94.37% test accuracy on the TCGA-NSCLC dataset.
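    As a hedged illustration of the windowed-attention idea described above (not the paper's exact MMIL-Transformer), the sketch below splits a large bag of patch embeddings into fixed-size windows, applies exact self-attention within each window, pools each window into a single token, and then attends across the pooled tokens for a slide-level prediction; all names, dimensions, and the mean-pooling choice are assumptions.
    # Hedged sketch: two-level (windowed then global) attention over a bag of
    # instance embeddings extracted from one whole slide image.
    import torch
    import torch.nn as nn

    class WindowedMILEncoder(nn.Module):
        def __init__(self, dim=512, window=256, heads=8, num_classes=2):
            super().__init__()
            self.window = window
            self.local_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.global_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, bag):                                        # bag: (num_instances, dim)
            n, d = bag.shape
            pad = (-n) % self.window
            bag = torch.cat([bag, bag.new_zeros(pad, d)])              # pad to a whole number of windows
            windows = bag.view(-1, self.window, d)                     # (num_windows, window, dim)
            pooled = self.local_attn(windows).mean(dim=1)              # one token per window
            slide = self.global_attn(pooled.unsqueeze(0)).mean(dim=1)  # attend across windows
            return self.head(slide)                                    # slide-level logits

    logits = WindowedMILEncoder()(torch.randn(10_000, 512))            # e.g. 10k patch embeddings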