Daoan Zhang

Email  /  CV  /  GScholar  /  Github

I am currently a first-year PhD student at the University of Rochester, advised by Prof. Jiebo Luo (Albert Arendt Hopeman Professor). I was a graduate student (thesis-based master's) at Southern University of Science and Technology, advised by Prof. Jianguo Zhang. Before that, I received my bachelor's degree from East China University of Science and Technology.

I was a research intern at Tencent AI Lab in 2023, advised by Dr. Jianhua Yao and Chenchen Qin, and a research intern at Ping An Technology in 2022, advised by Dr. Lingyun Huang. I also joined the CCVL (Computational Cognition, Vision, and Learning) research group at Johns Hopkins University as a remote intern, advised by Prof. Alan L. Yuille.

Research

My current research interests lie in the field of Generative Models. Most of my current work focuses on Multi-Modality Agents and AI for Science.

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo
ECCV 2024 
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the compositionality of VLMs in aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate model performance on this new task, we propose a new evaluation metric named ITM-IoU, which our experiments show correlates highly with human evaluation.
Domain-Regressive Continual Test-Time Adaptation
Anonymous Author
Under Review 
Currently under review.
Emo-Avatar: Efficient Monocular Video Style Avatar through Texture Rendering
Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu
Under Review 
We propose the Efficient Monocular Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we use few-shot PTI initialization to adapt the StyleGAN generator with several extreme poses sampled from the video, capturing a consistent representation of aligned faces from the target portrait. In the second stage, we propose a Laplacian pyramid for high-frequency texture sampling from UV maps deformed by the dynamic flow of expression, integrating a motion-aware texture prior that provides torso features and enhances StyleGAN's ability to generate the complete upper body for portrait video rendering.
Cocot: Contrastive chain-of-thought prompting for large multimodal models with multiple image inputs
Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, Jiebo Luo
ICPR 2024 
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach for multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences.
Ntire 2024 challenge on image super-resolution
Zheng Chen, ... , Daoan Zhang, ... , Niki Martinel
CVPRW 2024 
Our team won first place in this CVPR challenge.
Video understanding with large language models: A survey
Yunlong Tang, ... , Daoan Zhang, ... , Chenliang Xu
Arxiv 
This survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. Furthermore, this survey presents a comprehensive study of the tasks, datasets, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this repo.
GPT-4V(ision) as A Social Media Analysis Engine
Hanjia Lyu*, Jinfa Huang*, Daoan Zhang*, Yongsheng Yu*, Xinyi Mou*, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo
Arxiv 
We explore GPT-4V(ision)'s capabilities for social multimedia analysis including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection.
DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks
Daoan Zhang, Weitong Wang, Bing He, Jianguo Zhang, Chenchen Qin, Jianhua Yao
Under Review by Nature Methods 
We propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data.
Cross Contrastive Feature Perturbation for Domain Generalization
Chenming Li, Daoan Zhang, Wenjian Huang, Jianguo Zhang
ICCV 2023 
In this paper, we propose an online one-stage Cross Contrasting Feature Perturbation (CCFP) framework to simulate domain shift by generating perturbed features in the latent space while regularizing the model prediction against domain shift.
Rethinking Alignment and Uniformity in Unsupervised Image Semantic Segmentation
Daoan Zhang, Chenming Li, Haoquan Li, Wenjian Huang, Lingyun Huang, Jianguo Zhang
AAAI 2023 (Oral) 
In this paper, we address the critical properties of UISS models from the perspectives of feature alignment and feature uniformity. We also compare UISS with image-wise representation learning, and we propose a robust network called the Semantic Attention Network (SAN).
Feature Alignment and Uniformity for Test Time Adaptation
Shuai Wang, Daoan Zhang, Zipei Yan, Jianguo Zhang, Rui Li
CVPR 2023 
We first address TTA as a feature revision problem due to the domain gap between source domains and target domains. We then follow the two measurements, alignment and uniformity, to discuss test-time feature revision. For test-time feature uniformity, we propose a test-time self-distillation strategy to guarantee the consistency of uniformity between representations of the current batch and all previous batches. For test-time feature alignment, we propose a memorized spatial local clustering strategy to align the representations among the neighborhood samples for the upcoming batch.
Prototype Knowledge Distillation for Medical Segmentation with Missing Modality
Shuai Wang, Zipei Yan, Daoan Zhang, Haining Wei, Zhongsen Li, Rui Li
ICASSP 2023 
In this paper, we propose a prototype knowledge distillation (ProtoKD) method to tackle this challenging problem, especially the toughest scenario where only single-modality data can be accessed. Specifically, ProtoKD not only distills the pixel-wise knowledge of multi-modality data to single-modality data but also transfers intra-class and inter-class feature variations, so that the student model can learn more robust feature representations from the teacher model and run inference with only a single modality.
TransVLAD: Focusing on Locally Aggregated Descriptors for Few-Shot Learning
Haoquan Li, Laoming Zhang, Daoan Zhang, Lang Fu, Peng Yang, Jianguo Zhang
ECCV 2022 
This paper presents a transformer framework for few-shot learning, termed TransVLAD, with one focus showing the power of locally aggregated descriptors for few-shot learning.
Aggregation of Disentanglement: Reconsidering Domain Variations in Domain Generalization
Daoan Zhang, Mingkai Chen, Chenming Li, Lingyun Huang, Jianguo Zhang
In Submission to IJCV 
We propose a new perspective that utilizes class-aware domain-variant features during training; at inference, our model effectively maps target domains into the latent space where the known domains lie. We also design a contrastive-learning-based paradigm to calculate the weights for unseen domains.
Internship
May 2024 - Present Tencent Americas
Feb 2023 - Aug 2023 Tencent (Shenzhen)
Jun 2021 - Nov 2022 Ping An Technology
Services
Review Service: ICCV 2023, AISTATS 2024, CVPR 2024, EACL 2024, TCSVT, Pattern Recognition




