The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), January 2026
Cheng-Chang Tsai, Kevin Cheng, and Chun-Shien Lu
Federated learning (FL) has shown success in collaboratively
training a model among decentralized data resources without
directly sharing privacy-sensitive training data. Despite
recent advances, non-IID (non-independent and identically
distributed) data poses an inevitable challenge that hinders
the use of FL. In this work, we address the issue of non-IID
histopathological images with feature distribution shifts from
an intuitive perspective that has only received limited attention.
Specifically, we address this issue from the perspective
of data distribution by solely adjusting the data distributions
of all clients. Building on the success of diffusion models
in fitting data distributions and leveraging stain separation
to extract the pivotal features that are closely related to the
non-IID properties of histopathological images, we propose
a Federated Stain Distribution Alignment (FedSDA) method.
FedSDA aligns the stain distribution of each client with a target
distribution in an FL framework to mitigate distribution
shifts among clients. Furthermore, considering that training
diffusion models on raw data in FL has been shown to be susceptible
to privacy leakage risks, we circumvent this problem
while still effectively achieving alignment. Extensive experimental
results show that FedSDA is not only effective in improving
baselines that focus on mitigating disparities across
clients’ model updates but also outperforms baselines that address
the non-IID data issues from the perspective of data distribution.
We show that FedSDA provides valuable and practical
insights for the computational pathology community.
GigaScience, November 2025
Yu-Hsin Chen, Chien-Fu Liu, Jun-Yi Leu*, and Huai-Kuang Tsai*
Co-fractionation coupled with mass spectrometry (CF-MS) is a powerful strategy for mapping protein-protein interactions (PPIs) under near-physiological conditions. Despite recent progress, existing analysis pipelines remain constrained by reliance on handcrafted features, sensitivity to experimental noise, and an inherent focus on pairwise interactions, which limit their scalability and generalizability. To address these difficulties, we introduce FREEPII (Feature Representation Enhancement End-to-End Protein Interaction Inference), a unified deep learning framework that integrates CF-MS data with sequence-derived features to learn biologically meaningful protein-level representations for accurate and efficient inference of PPIs and protein complexes. FREEPII employs a convolutional neural network (CNN) architecture to learn protein-level representations directly from raw data, enabling feature sharing across interaction pairs and reducing computational complexity. To enhance robustness against CF-MS noise, protein sequences are introduced as auxiliary input to enrich the feature space with complementary biological cues. The supervised protein embeddings further encode network-level context derived from complex annotations, allowing the model to capture higher-order interactions and enhance the expressive power of protein representations. Extensive benchmarking demonstrates that FREEPII consistently outperforms state-of-the-art CF-MS analysis tools, capturing more biologically coherent and discriminative protein features. Cross-dataset evaluations further reveal that integrating multi-modal data from diverse experimental contexts substantially improves the generalization and sensitivity of data-driven models, offering a scalable, cross-species strategy for reliable protein interaction inference.
IEEE Transactions on Information Forensics and Security, November 2025
Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, and Isao Echizen
Deep neural networks are highly vulnerable to
adversarial examples, which are inputs with small, carefully
crafted perturbations that cause misclassification—making
adversarial attacks a critical tool for evaluating robustness.
Existing black-box methods typically entail a trade-off between
precision and flexibility: pixel-sparse attacks (e.g., single- or few-pixel
attacks) provide fine-grained control but lack adaptability,
whereas patch- or frequency-based attacks improve efficiency or
transferability, but at the cost of producing larger and less precise
perturbations. We present GreedyPixel, a fine-grained black-box
attack method that performs brute-force-style, per-pixel greedy
optimization guided by a surrogate-derived priority map and
refined by means of query feedback. It evaluates each coordinate
directly without any gradient information, guaranteeing
monotonic loss reduction and convergence to a coordinate-wise
optimum, while also yielding near white-box-level precision
together with pixel-wise sparsity and perceptual quality. On the CIFAR-10
and ImageNet datasets, spanning convolutional neural networks
(CNNs) and Transformer models, GreedyPixel achieved state-of-the-art
success rates with visually imperceptible perturbations,
effectively bridging the gap between black-box practicality and
white-box performance. The implementation is available at
https://github.com/azrealwang/greedypixel
IEEE Transactions on Human-Machine Systems, December 2025
Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, and Hsin-Min Wang
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that can exploit both pieces of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pretraining for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL pretrained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multiscale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025
Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu
Human-like agents have long been one of the goals in pursuing artificial intelligence.
Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025
Cheng-Yao Hong, Li-Heng Wang, and Tyng-Luh Liu
Accurate identification and localization of objects in 3-D scenes are essential for advancing comprehensive 3-D scene understanding. Although diffusion models have demonstrated impressive capabilities across a broad spectrum of computer vision tasks, their potential in both 2-D and 3-D object detection remains underexplored.
Existing approaches typically formulate detection as a "noise-to-box" process, but they rely heavily on direct coordinate regression, which limits adaptability for more advanced tasks such as grounding-based object detection. To overcome these challenges, we propose a promptable 3-D object recognition framework, which introduces a diffusion-based paradigm for flexible and conditionally guided 3-D object detection. Our approach encodes bounding boxes into latent representations and employs latent diffusion models to realize a "promptable noise-to-box" transformation. This formulation enables the refinement of standard 3-D object detection using textual prompts, such as class labels. Moreover, it naturally extends to grounding object detection through conditioning on natural language descriptions, and generalizes effectively to few-shot learning by incorporating annotated exemplars as visual prompts. We conduct thorough evaluations on three key 3-D object recognition tasks: general 3-D object detection, few-shot detection, and grounding-based detection. Experimental results demonstrate that our framework achieves competitive performance relative to state-of-the-art methods, validating its effectiveness, versatility, and broad applicability in 3-D computer vision.
Scientific Data, November 2025
Po-Cheng Hsu, Chung-Yen Lin, Ping-Heng Hsieh, Wei-Hsuan Chuang, Mei-Yeh Lu, Chaolun Allen Chen, Shu-Hwa Chen
The Japanese cutlassfish (Trichiurus japonicus) is a commercially important marine species across Asia. Here, we present a high-quality, chromosome-level genome assembly generated using PacBio HiFi, Hi-C, and Nanopore ONT reads. The nuclear genome comprised 24 chromosomes with 160 scaffolds totaling 1,138 Mb, with a scaffold N50 of 47.10 Mb and an average scaffold length of 6.18 Mb.
A complete mitochondrial genome of 16,796 bp was also assembled, comprising 13 protein-coding and 23 non-coding RNA (ncRNA) genes, with 99.32% sequence identity to the reference in the NCBI database. The nuclear genome encodes 26,541 protein-coding genes (median length: 7,391 base pairs) and 16,383 non-coding RNA (ncRNA) genes. The ncRNA genes account for approximately 0.1694% of the genome's total length. BUSCO analysis indicated 99.4% and 99.2% completeness against the Actinopterygii ortholog set for the genome and proteome. Functional annotation covered 98.15% of genes. Recognized repeat elements and ncRNA regions accounted for 61.10% of the nuclear genome. With high mapping rates from external datasets, this assembly offers a valuable foundation for future sequencing-based studies.
The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (Spotlight), December 2025
Hsi-Ling Chen, Chun-Shien Lu, and Pau-Choo Chung
The effectiveness of domain translation in addressing image-based problems of Unsupervised Domain Adaptation (UDA) depends on the quality of the translated images and the preservation of crucial discriminative features. However, achieving high-quality and stable translations typically requires paired data, which poses a challenge in scenarios with limited annotations in the target domain. To address this issue, this paper proposes a novel method termed Stain-Guided Cycle Diffusion (SGCD), employing a dual diffusion model with bidirectional generative constraints to synthesize highly realistic data for downstream task fine-tuning. The bidirectional generative constraints ensure that the translated images retain the features critical to the downstream model in properly controlling the generation process.
Additionally, a stain-guided consistency loss is introduced to enhance the denoising capability of the dual diffusion model, thereby improving the quality of images translated between different domains using latents from one domain and a diffusion model trained on another. Experiments conducted on four public datasets demonstrate that SGCD can effectively enhance the performance of downstream task models on the target domain.
The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, and Chu-Song Chen
Large Language Models (LLMs) are increasingly adopted in high-stakes scenarios, yet their safety mechanisms often remain fragile. Simple jailbreak prompts or even benign fine-tuning can bypass internal safeguards, underscoring the need to understand the failure modes of current safety strategies.
Recent findings suggest that vulnerabilities emerge when alignment is confined to only the initial output tokens. To address this, we introduce the notion of safety depth, a designated output position where the model refuses to generate harmful content. While deeper alignment appears promising, identifying the optimal safety depth remains an open and underexplored challenge.
We leverage the equivalence between autoregressive language models and Markov chains to derive the first theoretical result on identifying the optimal safety depth. To reach this safety depth effectively, we propose a cyclic group augmentation strategy that improves safety scores across six LLMs. In addition, we uncover a critical interaction between safety depth and ensemble width, demonstrating that larger ensembles can offset shallower alignments. These results suggest that test-time computation, often overlooked in safety alignment, can play a key role. Our approach provides actionable insights for building safer LLMs.
Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025
Scott Cheng, Meng-Yu Tsai, Ding-Yong Hong, Mahmut Kandemir
AlphaZero has achieved remarkable success in complex decision-making problems through self-play and neural network training. However, its self-play process remains inefficient due to limited exploration of high-uncertainty positions, the overlooked runner-up decisions in Monte Carlo Tree Search (MCTS), and high variance in value labels. To address these challenges, we propose and evaluate uncertainty-guided exploration by branching from high-uncertainty positions using our proposed Label Change Rate (LCR) metric, which is further refined by a Bayesian inference framework. Our proposed approach leverages runner-up MCTS decisions to create multiple variations, and ensembles value labels across these variations to reduce variance. We investigate three key design parameters for our branching strategy: where to branch, how many variations to branch, and which move to play in the new branch. Our empirical findings indicate that branching with 10 variations per game provides the best performance-exploration balance. Overall, our end-to-end results show an improved sample efficiency over the baseline by 58.5% on 9x9 Go in the early stage of training and by 47.3% on 19x19 Go in the late stage of training.
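The variance-reduction claim in the AlphaZero abstract above can be illustrated with a toy simulation. This is only a sketch of the general statistical effect, not the authors' method: it assumes value labels are independent win/loss outcomes (+1 or -1) and that ensembling means a plain average, neither of which the abstract states. The names `sample_outcome`, `ensembled_label`, and the win probability of 0.6 are hypothetical; the count of 10 variations is borrowed from the abstract.

```python
import random
import statistics


def sample_outcome(win_prob: float, rng: random.Random) -> float:
    """One noisy value label: +1.0 for a win, -1.0 for a loss (assumed encoding)."""
    return 1.0 if rng.random() < win_prob else -1.0


def ensembled_label(win_prob: float, num_variations: int, rng: random.Random) -> float:
    """Average the labels of several variations (assumed to be a plain mean)."""
    labels = [sample_outcome(win_prob, rng) for _ in range(num_variations)]
    return statistics.fmean(labels)


rng = random.Random(0)   # fixed seed so the run is repeatable
win_prob = 0.6           # hypothetical true win probability of the position
trials = 2000

# Labels taken from a single game versus labels averaged over 10 variations.
single = [sample_outcome(win_prob, rng) for _ in range(trials)]
ensembled = [ensembled_label(win_prob, 10, rng) for _ in range(trials)]

var_single = statistics.pvariance(single)
var_ensembled = statistics.pvariance(ensembled)

print(f"variance of single labels:    {var_single:.3f}")
print(f"variance of ensembled labels: {var_ensembled:.3f}")
```

Under the independence assumption, averaging n labels divides the variance by n, so the ensembled variance should come out near one tenth of the single-label variance. Variations that branch from the same position are likely correlated in practice, so the real reduction would be smaller.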
IEEE International Conference on Computers, Software, and Applications (COMPSAC), July 2025
Cai-Feng Lin, Ding-Yong Hong, Tzu-Hsien Tsai, Pangfeng Liu, Jan-Jan Wu
Graph Neural Network (GNN) is an important tool in deep learning to handle structured data, where graphs with nodes and edges represent entities and their relationships. Various challenges arise when a GNN is tree-shaped, with irregular connectivity patterns and varying depth. It is difficult to distribute and process the dynamic structure for parallel execution on multiple GPUs. In addition, tree data dependency demands the processing of parent nodes before their children, severely limiting execution parallelism.
This research aims to improve the training speed of tree-shaped GNNs on multi-GPU systems. First, we introduce a cost model that estimates the running time of the training across multiple GPUs. Then, we demonstrate that finding an optimal way to distribute tree-structured data across GPUs is an NP-complete problem on this cost model. We then propose a practical heuristic method for distributing data that improves efficiency while maintaining training quality. The heuristic method first assigns data to batches based on our cost model and then assigns data in each batch to the devices. We also show that our device-assignment algorithm is a 4-approximation algorithm. That is, it guarantees that its cost is at most four times the optimal running time in each training batch, ensuring that it performs effectively in practice.
We implement the algorithm and conduct experiments. The results show that our algorithm achieves a significant reduction in training time. The speedup is up to 1.86 for two GPUs, 3.43 for four GPUs, and 7.25 for eight GPUs.
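To put the reported speedups in context, the short calculation below converts them into parallel efficiency, that is, speedup divided by the number of GPUs. The speedup figures are the ones quoted in the abstract; the efficiency metric and the helper name `parallel_efficiency` are additions for illustration. Because the abstract reports these as "up to" values, the resulting efficiencies are best read as upper bounds.

```python
# Best-case speedups quoted in the abstract, keyed by GPU count.
reported_speedup = {2: 1.86, 4: 3.43, 8: 7.25}


def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Speedup per device; 1.0 would mean ideal linear scaling."""
    return speedup / num_gpus


efficiency = {n: parallel_efficiency(s, n) for n, s in reported_speedup.items()}

for num_gpus, eff in efficiency.items():
    print(f"{num_gpus} GPUs: speedup {reported_speedup[num_gpus]:.2f}, "
          f"efficiency {eff:.4f}")
```

All three configurations stay above 0.85 of ideal linear scaling, with the eight-GPU case at roughly 0.91.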
IEEE Transactions on Audio, Speech and Language Processing, February 2025
Dyah A. M. G. Wisnu, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, and Yu Tsao
This paper introduces HAAQI-Net, a non-intrusive music audio quality assessment model for hearing aid users. Unlike traditional methods such as Hearing Aid Audio Quality Index (HAAQI), which requires intrusive reference signal comparisons, HAAQI-Net offers a more accessible and computationally efficient alternative. Leveraging a bidirectional long short-term memory architecture with attention mechanisms and features extracted from a pre-trained BEATs model, it can predict HAAQI scores directly from music audio clips and hearing loss patterns. The experimental results demonstrate that, compared to the traditional HAAQI as the reference, HAAQI-Net achieves a linear correlation coefficient (LCC) of 0.9368, a Spearman's rank correlation coefficient (SRCC) of 0.9486, and a mean squared error (MSE) of 0.0064, while significantly reducing the inference time from 62.52 seconds to 2.54 seconds. Furthermore, a knowledge distillation strategy was applied, reducing the parameters by 75.85% and inference time by 96.46%, while maintaining strong performance (LCC: 0.9071, SRCC: 0.9307, MSE: 0.0091). To expand its capabilities, HAAQI-Net was adapted to predict subjective human scores, mean opinion score (MOS), by fine-tuning. This adaptation significantly improved the prediction accuracy. Furthermore, the robustness of HAAQI-Net was evaluated under varying sound pressure level (SPL) conditions, revealing optimal performance at a reference SPL of 65 dB, with the accuracy gradually decreasing as SPL deviated from this point.
The advancements in subjective score prediction, SPL robustness, and computational efficiency position HAAQI-Net as a reliable solution for music audio quality assessment, significantly contributing to the development of efficient and accurate models in audio signal processing and hearing aid technology.
FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification
Complete end-to-end learning from protein feature representation to protein interactome inference
GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization
Promptable 3-D Object Localization with Latent Diffusion Models
Chromosome-Level Genome Assembly and Annotation of the Japanese Cutlassfish (Trichiurus japonicus): A High-Quality Genomic Resource Featuring Nuclear and Mitochondrial Completeness for Future Studies
SGCD: Stain-Guided CycleDiffusion for Unsupervised Domain Adaptation of Histopathology Image Classification
Safety Alignment Depth in Large Language Models: A Markov Chain Perspective
Uncertainty-Guided Exploration for Efficient AlphaZero Training
A Grouping Algorithm for Training Tree-Shaped Models on Multiple GPUs with High Efficiency
HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids