Research
My research interests lie at the intersection of data mining, natural language processing, and machine learning, with a strong focus on democratizing AI for broader accessibility. In particular, my research spans the following directions:
Parameter-efficient Learning: Large-scale deep learning models have achieved success in numerous applications, but their computational complexity and storage demands hinder model development and deployment, particularly on edge devices and in latency-sensitive applications. It is therefore crucial to explore parameter-efficient representations that reduce compute and storage costs, thereby making such models easier to build and deploy. (A toy sketch of the low-rank adaptation idea appears after the publication list below.)
- RoseLoRA: Row and Column-wise Sparse Low-rank Adaptation of Pre-trained Language Model for Knowledge Editing and Fine-tuning. EMNLP’24
- LightLT: a Lightweight Representation Quantization Framework for Long-tail Data. ICDE’24
- HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference. EMNLP’23
- LightToken: a Task and Model-agnostic Lightweight Token Embedding Framework for Pre-trained Language Models. KDD’23
- A Lightweight Knowledge Graph Embedding Framework for Efficient Inference and Storage. CIKM’21
- xLightFM: Extremely Memory-Efficient Factorization Machine. SIGIR’21
- LightRec: a Memory and Search-Efficient Recommender System. WWW’20
- Binarized Collaborative Filtering with Distilling Graph Convolutional Network. IJCAI’19
- Adversarial Binary Collaborative Filtering For Implicit Feedback. AAAI’19
- Discrete Ranking-based Matrix Factorization with Self-Paced Learning. KDD’18
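To make the idea of parameter-efficient adaptation concrete, here is a minimal, illustrative PyTorch sketch of a generic low-rank adapter in the spirit of LoRA-style methods. It is not a reproduction of any paper above (RoseLoRA, for instance, adds row- and column-wise sparsity on top of this idea); the layer sizes, rank, and scaling are arbitrary choices for illustration.

```python
# A minimal sketch of low-rank adaptation: freeze a pre-trained linear layer and
# learn only a small low-rank update W + (alpha / r) * B @ A. Illustrative only.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual update."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base_linear.in_features, base_linear.out_features
        # The low-rank factors A and B are the only trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap one projection layer and count how few parameters need training.
layer = nn.Linear(768, 768)
adapted = LowRankAdapter(layer, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable params: {trainable} vs. full layer: {768 * 768 + 768}")
```

Only the two small factors are updated during fine-tuning (roughly 12K parameters here versus about 590K for the full layer), which is what makes adaptation cheap in both compute and storage.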
Data-efficient Deep Learning: Large-scale deep learning models deliver exceptional performance when training data is plentiful. In many real-world settings, however, data is scarce: multilingual and cross-lingual applications, particularly those involving low-resource languages, often lack sufficient labeled examples, and models trained on such limited data tend to underperform. It is therefore essential to investigate strategies for training effective models with minimal data, for example by sharing knowledge across languages or augmenting models with retrieved external knowledge. (A toy sketch of the retrieve-then-augment pattern appears after the publication list below.)
- BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering. EMNLP’24
- Macedon: Minimizing Representation Coding Rate Reduction for Cross-Lingual Natural Language Understanding. EMNLP’23
- Macular: a Multi-Task Adversarial Framework for Cross-Lingual Natural Language Understanding. KDD’23
- FedKC: Federated Knowledge Composition for Multilingual Natural Language Understanding. WWW’22
- Multi-modal Emergent Fake News Detection via Meta Neural Process Networks. KDD’21
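As one concrete example of compensating for limited in-domain data, the sketch below shows the generic retrieve-then-augment pattern behind retrieval-augmented generation: fetch relevant passages from an external corpus and prepend them to the query before it reaches a language model. The tiny corpus, the token-overlap scorer, and the prompt format are illustrative stand-ins, not the BlendFilter method itself.

```python
# A toy sketch of retrieve-then-augment: score passages against the query,
# keep the top-k, and build an augmented prompt. Illustrative only.
from collections import Counter

CORPUS = [
    "Mount Kilimanjaro is the highest mountain in Africa.",
    "The Nile is generally regarded as the longest river in the world.",
    "Swahili is widely spoken across East Africa.",
]

def lexical_score(query: str, passage: str) -> int:
    """Count overlapping lowercase tokens (a crude stand-in for a real retriever)."""
    q_tokens = Counter(query.lower().split())
    p_tokens = Counter(passage.lower().split())
    return sum((q_tokens & p_tokens).values())

def build_augmented_prompt(query: str, k: int = 2) -> str:
    """Retrieve the top-k passages and prepend them to the query as context."""
    ranked = sorted(CORPUS, key=lambda p: lexical_score(query, p), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_augmented_prompt("Which mountain is the highest in Africa?"))
```

The augmented prompt carries knowledge the model was never fine-tuned on, which is the basic mechanism that lets retrieval-augmented systems work well even when task-specific training data is scarce.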
Miscellaneous: I also explore interpretable models for medical data mining, model fairness, and knowledge-enhanced language models.
- LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models. ICML’24
- Towards Poisoning Fair Representations. ICLR’24
- SimFair: A Unified Framework for Fairness-Aware Multi-Label Classification. AAAI’23
- InterHG: an Interpretable and Accurate Model for Hypothesis Generation. BIBM’21
- Knowledge-Guided Paraphrase Identification. EMNLP’21
- Fair Classification Under Strict Unawareness. SDM’21