Cao Zhenxiao
- Location: Xi’an, Shaanxi, P.R. China
- Email: realalanc@qq.com / realalanc029@gmail.com / alancao@stu.xjtu.edu.cn
News about me!
- “MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction” has been accepted to ICLR 2025!
- Resubmitted our paper “Post-translational modification (PTM) prediction via deep learning” to ICLR 2025
- The earlier project ‘Gene expression prediction in long sequence NLP algorithms’ has been cancelled.
- Started a new project, ‘Protein dynamics generation model’, with Westlake University
- Started a new project, ‘3D protein image structural embedding’, with UNC
- Started a new project, ‘Rare cell identification via LLM’
Education
Bachelor of Science in Computer Science
Xi’an Jiaotong University, 2025
Grade: 85.54/100
- Major courses: Advanced Mathematics, Linear Algebra and Geometry, University Physics, Foundation of Life Science, Thinking of Big Data and Innovation, Data Structure and Algorithms, Discrete Mathematical Structures, Object-Oriented Programming (OOP), Introduction of Computer Systems (ICS), Probability Theory and Stochastic Processes, Bioinformatics, Artificial Intelligence, Computer Organization, Principles of Computer Networks, The Principle of Operating System
Publications
Conference Papers
- “MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction” has been accepted to ICLR 2025
Papers Under Review
- “How Effective is In-Context Learning with Large Language Models for Rare Cell Identification in Single-Cell Expression Data?” is under review at ECML-PKDD 2025
Research Interests
AI for Biology
- LLMs for biological molecules
- Prediction of the features and structures of biological molecules
- Explainable AI in biology
- AIGC in biology (e.g., protein design, small-molecule design)
- Biological image analysis and CV for biology
Machine Learning
- DL network structure design
- Explainable AI
- Causal inference
Feel free to contact me about shared research interests or collaboration.
Research Experience
Internship at SenseTime
Year: 2023
Status: Finished
- Developed a smart sensor for monitoring Alzheimer’s patients using a CNN on an embedded device, which involved quantizing a YOLOv5-based inference network. Tasks included face identification, landmark detection, and heart-rate signal processing.
First-author-level works
- MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction
Westlake University, with Cheng Tan and Stan Z. Li, Year: 2024
Status: Accepted to ICLR 2025
- Developed a deep learning framework for PTM prediction, aiming to achieve state-of-the-art results in multi-class prediction using sequence and structural data.
- Our model is based on VQ-VAE and constructs graphs by capturing both sequence neighbors and structural neighbors. The workflow is divided into two stages: pre-training the VQ-VAE, then extracting intermediate embeddings for downstream tasks.
- Protein dynamics generation model
Westlake University, with Cheng Tan and Stan Z. Li, Year: 2024
Status: In progress
- The objective is to develop a model that learns from the molecular structure and chemical properties of protein-ligand complexes, together with short-term molecular dynamics trajectories (the first few frames), and predicts longer-term molecular dynamics trajectories of those complexes.
- Data refinement for protein prediction databases
Westlake University, with Cheng Tan and Stan Z. Li, Year: 2024
Status: In progress
- Protein prediction data is often biased relative to the ground truth. Our task is to build an algorithm that refines and enhances the dataset pair by pair.
- 3D protein image structural embedding
University of North Carolina at Chapel Hill, with Wenhao Zheng and Huaxiu Yao, Year: 2024
Status: In progress
- Training an embedding vector with VAE-like models and providing it to downstream tasks such as atom prediction.
- How Effective is In-Context Learning with Large Language Models for Rare Cell Identification in Single-Cell Expression Data?
Year: 2024
Status: Under Review
- Uses an LLM as an extractor to identify rare cells in single-cell expression data, an approach that can be applied across different biological domains. We found that cross-query-based in-context learning provides stable performance independent of data size and achieves state-of-the-art results on some datasets.
- Causal prediction in microsatellite instability (MSI) and gene networks
Xi’an Jiaotong University, with Xiaofei Yang and Kai Ye, Year: 2022
Status: Paused
- Developed a toolbox for causal prediction between gene networks and MSI, involving correlation filtering, causal graph generation, and correctness checking via causal inference algorithms.
Other works
- Breast Cancer Medical Image Dataset for Rare Classes
Peking University, Year: 2024
Status: In progress
- Developing a medical image dataset that covers different types of medical images across different types of breast cancer. With a focus on rare cases, the dataset aims to provide a competitive resource and benchmarks for future downstream tasks.
- Uncertainty Guided Generalised Federated Learning
Rice University, Year: 2024
Status: In progress, expected to be submitted soon
- This study employs Gaussian noise in a federated learning model for domain generalization, enhancing the model’s adaptability and robustness across diverse domains while ensuring data privacy.
- Gene expression prediction in Mamba
University of North Carolina at Chapel Hill, Year: 2024
Status: Cancelled
- Attempted to predict cis-acting elements using Mamba, which is capable of handling long sequences.
- Monomer inference in the fruit fly centromere
Xi’an Jiaotong University, Year: 2023
Status: In progress, expected to be submitted soon
- Analyzed the structure of the centromere in Drosophila melanogaster, focusing on monomer analysis from long-read gap-free sequence data.
Awards
- China National Biology Olympiad (CNBO) Silver
Year: 2020
- Awarded a Silver Medal in CNBO for proficiency in biology.
- International Genetically Engineered Machine (iGEM) Competition Gold
Year: 2022-2023
- Awarded Gold as a member of the iGEM team for innovative work on cellular automata modeling.
- The project focused on building a light-induced autolysis system in engineered bacteria to release genocides, with the cell lysis cycle controlling the amount released.
- China International College Students’ Innovation Campaign Honorable Mention
Year: 2022-2023
- Received Honorable Mention for research on causal prediction in microsatellite instability and gene networks.
Other Contributions
- Building py-Cicero as a free open-source developer
- Developing a Python version of the Cicero software for calculating single-cell chromatin co-accessibility.
- Pull requests and bug fixes for PDBminer
- Contributed bug fixes and enhancements to PDBminer, an open-source software for retrieving structural data via UniProt IDs.
Professional Skills
Language
- Chinese (native)
- English (TOEFL 103)
Programming
- Familiar with basic Linux commands
- Proficient in Python; competent in R, C, C++, and Java
Biology
- Proficient in biochemistry, molecular biology, cellular biology, genetics, and other related fields
- Experienced in wet lab techniques such as PCR and electrophoresis.
Other Interesting Things About Me
Model UN
- I participated in Model UN throughout middle school and high school and won some prizes.
Language Learning
- I enjoy learning languages: I am currently learning Russian and plan to learn Japanese next.
Leadership Training
- I attended a leadership training program held by ITCILO, where I learned some skills and facts about life planning.
- I am interested in and willing to join or develop an AI4Bio community.