Institute of Information Science, Academia Sinica

Research

Recent Research Results

Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model

IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 2024), April 2024

Ryandhimas Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, and Yu Tsao

Hsin-Min Wang Yu Tsao

Abstract

This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is utilized to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning is then employed to train MTQ-Net by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), where the Huber loss is employed as the loss function. Experimental results first demonstrate the advantages of MPL compared to training a model from scratch and using a direct knowledge transfer mechanism. Second, the benefit of the Huber loss for improving the predictive ability of MTQ-Net is verified. Finally, the MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
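The combination of a supervised loss and a pseudo-label loss described in this abstract can be illustrated with a minimal sketch. It is not the MTQ-Net training code: the function names, the dictionary representation of scores, and the `weight` and `delta` hyperparameters are assumptions made only for illustration.

```python
def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mpl_loss(pred, truth, pseudo, weight=1.0, delta=1.0):
    """Supervised Huber term plus a weighted pseudo-label Huber term.

    pred, truth, pseudo map a score name (e.g. "S-MOS", "PESQ") to a float.
    `weight` and `delta` are hypothetical hyperparameters.
    """
    # Error against ground-truth labels (supervised part).
    supervised = sum(huber(pred[k] - truth[k], delta) for k in truth)
    # Error against pseudo labels from a pretrained model (semi-supervised part).
    semi = sum(huber(pred[k] - pseudo[k], delta) for k in pseudo)
    return supervised + weight * semi
```

The Huber loss grows only linearly for large errors, so a badly wrong pseudo label pulls on the model less than it would under a squared-error loss.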

Is Explanation the Cure? Misinformation Mitigation in the Short-term and Long-term

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Yi-Li Hsu, Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku

Lun-Wei Ku

Abstract

With advancements in natural language processing (NLP) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms in addition to adding warning labels to identified fake news. While many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. In this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by GPT-4 in debunking misinformation. In a two-wave, online human-subject study, participants (N = 215) were randomly assigned to a control group in which false contents were shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false contents were accompanied by GPT-4 generated explanations. Our results show that both interventions significantly decrease participants’ self-reported belief in fake claims in an equivalent manner for the short-term and long-term. We discuss the implications of our findings and directions for future NLP-based misinformation debunking strategies.

LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku

Lun-Wei Ku

Abstract

Thematic analysis (TA) has been widely used for analyzing qualitative data in many disciplines and fields. To ensure reliable analysis, the same piece of data is typically assigned to at least two human coders. Moreover, to produce meaningful and useful analysis, human coders develop and deepen their data interpretation and coding over multiple iterations, making TA labor-intensive and time-consuming. Recently, the emerging field of large language model (LLM) research has shown that LLMs have the potential to replicate human-like behavior in various tasks: in particular, LLMs outperform crowd workers on text-annotation tasks, suggesting an opportunity to leverage LLMs on TA. We propose a human–LLM collaboration framework (i.e., LLM-in-the-loop) to conduct TA with in-context learning (ICL). This framework provides the prompt to frame discussions with an LLM (e.g., GPT-3.5) to generate the final codebook for TA. We demonstrate the utility of this framework using survey datasets on the aspects of the music listening experience and the usage of a password manager. Results of the two case studies show that the proposed framework yields similar coding quality to that of human coders but reduces TA’s labor and time demands.

Location-Aware Visual Question Generation with Lightweight Models

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Nicholas Collin Suwono, Justin Chen, Tun Min Hung, Ting-Hao Kenneth Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun

Ting-Hao Huang Lun-Wei Ku

Abstract

This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method which can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines regarding human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.

A formal treatment of bidirectional typing

33rd European Symposium on Programming (ESOP 2024), April 2024

Chen, Liang-Ting and Ko, Hsiang-Shang

Liang-Ting Chen Hsiang-Shang Ko

Abstract

There has been much progress in designing bidirectional type systems and associated type synthesis algorithms, but mainly on a case-by-case basis. To remedy the situation, this paper develops a general and formal theory of bidirectional typing, and, as a by-product of our formalism, provides a verified generator of proof-relevant type synthesisers for simply typed languages: for every signature that specifies a mode-correct bidirectionally typed language, there exists a proof-relevant type synthesiser that for an input abstract syntax tree constructs a typing derivation if any, gives its refutation if not, or reports that the input does not have enough type annotations. Soundness, completeness, and mode-correctness are studied universally for all signatures, which are sufficient conditions for deriving a type synthesiser. We propose a preprocessing step called mode decoration, which helps the user to deal with missing type annotations in a given abstract syntax tree. The entire development is formalised in Agda and can be further integrated with other language-formalisation frameworks.
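The three possible outcomes of type synthesis named in this abstract (a type is found, the term is refuted, or annotations are missing) can be illustrated with a toy bidirectional checker for a simply typed lambda calculus. This is only a sketch under assumptions: the paper's development is in Agda and is proof-relevant, whereas the tuple encoding of terms, the exception names, and the `synth`/`check` functions below are hypothetical and return plain types with no derivations.

```python
class MissingAnnotation(Exception):
    """The term is in synthesis position but carries no type information."""


class TypeMismatch(Exception):
    """The term is ill-typed."""


# Types: "int" or ("fun", A, B).
# Terms: ("lit", n), ("var", x), ("lam", x, body), ("app", f, a), ("ann", t, T).

def synth(ctx, term):
    """Synthesis mode: compute the type of `term` from the context."""
    tag = term[0]
    if tag == "lit":
        return "int"
    if tag == "var":
        if term[1] not in ctx:
            raise TypeMismatch(f"unbound variable {term[1]}")
        return ctx[term[1]]
    if tag == "ann":
        _, body, ty = term
        check(ctx, body, ty)  # an annotation switches to checking mode
        return ty
    if tag == "app":
        _, fn, arg = term
        fn_ty = synth(ctx, fn)
        if not (isinstance(fn_ty, tuple) and fn_ty[0] == "fun"):
            raise TypeMismatch("applying a non-function")
        check(ctx, arg, fn_ty[1])
        return fn_ty[2]
    if tag == "lam":
        raise MissingAnnotation("a lambda needs a type annotation here")
    raise ValueError(f"unknown term {tag}")


def check(ctx, term, ty):
    """Checking mode: verify `term` against the expected type `ty`."""
    if term[0] == "lam":
        if not (isinstance(ty, tuple) and ty[0] == "fun"):
            raise TypeMismatch("lambda checked against a non-function type")
        _, name, body = term
        check({**ctx, name: ty[1]}, body, ty[2])
        return
    actual = synth(ctx, term)  # fall back to synthesis and compare
    if actual != ty:
        raise TypeMismatch(f"expected {ty}, got {actual}")
```

For example, an identity function synthesises a type only when it is annotated; without the annotation the checker reports missing information instead of rejecting the term.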

Game Solving with Online Fine-Tuning

The Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), December 2023

Ti-Rong Wu, Hung Guei, Ting Han Wei, Chung-Chin Shih, Jui-Te Chin, I-Chen Wu

Ti-Rong Wu

Abstract

Game solving is a similar, yet more difficult task than mastering a game. Solving a game typically means finding the game-theoretic value (outcome given optimal play), and optionally a full strategy to follow in order to achieve that outcome. The AlphaZero algorithm has demonstrated super-human level play, and its powerful policy and value predictions have also served as heuristics in game solving. However, to solve a game and obtain a full strategy, a winning response must be found for all possible moves by the losing player. This includes very poor lines of play from the losing side, which the AlphaZero self-play process will not encounter. AlphaZero-based heuristics can be highly inaccurate when evaluating these out-of-distribution positions, which occur throughout the entire search. To address this issue, this paper investigates applying online fine-tuning while searching and proposes two methods to learn tailor-designed heuristics for game solving. Our experiments show that using online fine-tuning can solve a series of challenging 7x7 Killall-Go problems, using only 23.54% of computation time compared to the baseline without online fine-tuning. Results suggest that the savings scale with problem size. Our method can further be extended to any tree search algorithm for problem solving. Our code is available at https://rlg.iis.sinica.edu.tw/papers/neurips2023-online-fine-tuning-solver.

Exploiting Fine-Grained Structured Pruning for Efficient Inference on CNN Model

IEEE International Conference on Parallel and Distributed Systems, December 2023

Cheng-Hung Wu, Ding-Yong Hong, Pangfeng Liu and Jan-Jan Wu

Ding-Yong Hong Jan-Jan Wu

Abstract

Convolutional neural network (CNN) is a deep learning technique that has revolutionized the field of computer vision. In modern CNN models, convolution typically accounts for the majority of the computation time. Model compression is a method used in deep learning to reduce the size of a neural network while preserving its accuracy. Weight pruning removes redundant or unimportant weights from the network. These methods can help reduce the size and computational cost of neural networks while preserving their accuracy. In this work, we propose a dynamic programming algorithm to find a good sparsity ratio for every layer individually under a total time budget based on the execution times and L1 norm of layers. After deciding the sparsity ratio for every layer, we modify TVM to generate code that uses a mask to indicate the data to load for processing. Furthermore, we propose the CHWN layout, where we move the dimension of the batch of data (N) to the innermost dimension to get rid of the varying size in the innermost dimension and make the memory access pattern contiguous. The experimental results show that our scheme achieves a 0.35% accuracy improvement and a 1.55x speedup over the dense model on VGG-16 with the ImageNet dataset.
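Choosing one sparsity ratio per layer under a total time budget can be framed as a multiple-choice knapsack problem, sketched below. The abstract does not give the authors' exact formulation, so this is an assumed one: each layer offers hypothetical candidate options `(sparsity, time, loss)`, where `time` is an integer execution-time estimate and `loss` stands in for an importance penalty such as the L1 norm of the removed weights.

```python
def choose_sparsity(layers, budget):
    """Pick one (sparsity, time, loss) option per layer.

    Minimises the summed loss subject to summed time <= budget.
    Times are non-negative integers. Returns (total_loss, [sparsity, ...]),
    or None when no assignment fits the budget.
    """
    # best[t] = (loss, picks): lowest loss for the layers seen so far
    # among assignments that use exactly t time units.
    best = {0: (0.0, [])}
    for options in layers:
        nxt = {}
        for t, (loss, picks) in best.items():
            for sparsity, time, cost in options:
                nt = t + time
                if nt > budget:
                    continue  # this option would exceed the time budget
                cand = (loss + cost, picks + [sparsity])
                if nt not in nxt or cand[0] < nxt[nt][0]:
                    nxt[nt] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])
```

With two layers and a budget of 8 time units, the sketch keeps the first layer dense and prunes the second, because that is the cheapest combination that fits.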

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2023), December 2023

Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, and Yu Tsao

Chu-Song Chen Hsin-Min Wang Yu Tsao

Abstract

The performance of speaker verification (SV) models may drop dramatically in noisy environments. A speech enhancement (SE) module can be used as a front-end strategy. However, existing SE methods may fail to bring performance improvements to downstream SV systems due to artifacts in the predicted signals of SE models. To compensate for artifacts, we propose a generic denoising framework named LC4SV, which can serve as a pre-processor for various unknown downstream SV models. In LC4SV, we employ a learning-based interpolation agent to automatically generate the appropriate coefficients between the enhanced signal and its noisy input to improve SV performance in noisy environments. Our experimental results demonstrate that LC4SV consistently improves the performance of various unseen SV systems. To the best of our knowledge, this work is the first attempt to develop a learning-based interpolation scheme aiming at improving SV performance in noisy environments.
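The interpolation between the enhanced signal and its noisy input can be written as a convex combination, sketched below. In LC4SV the coefficients come from a learned agent; here `alpha` is a fixed, hypothetical input and the function name is an assumption, so the sketch shows only the mixing step.

```python
def interpolate(enhanced, noisy, alpha):
    """Mix two equal-length signals: alpha * enhanced + (1 - alpha) * noisy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(enhanced) != len(noisy):
        raise ValueError("signals must have the same length")
    return [alpha * e + (1.0 - alpha) * n for e, n in zip(enhanced, noisy)]
```

Setting `alpha` to 1 passes the enhanced signal through unchanged, while smaller values blend the original noisy input back in, which can mask artifacts introduced by the enhancement model.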

The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2023), December 2023

Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, and Junichi Yamagishi

Yu Tsao Hsin-Min Wang

Abstract

We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios.  Ten teams from industry and academia in seven different countries participated.  Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected.  Use of diverse datasets and listener information during training appeared to be successful approaches.

Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

Interspeech 2023, August 2023

Yung-Lun Chien, Hsin-Hao Chen, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao, and Tai-Shih Chi

Hsin-Min Wang Yu Tsao

Abstract

Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) for converting EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC.

A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech

Interspeech 2023, August 2023

Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang

Yu Tsao Hsin-Min Wang

Abstract

The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a training and inference strategy that additionally uses enhanced speech as a target by improving the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing 1) the teacher model’s estimated speech and noise for enhanced-target training or 2) raw noisy speech and the teacher model’s estimated noise for noisy-target training. Experimental results show that our proposed method outperforms several baselines, especially with the teacher/student inference, where predicted clean speech is derived successively through the teacher and final student models.

Function Clustering to Optimize Resource Utilization on Container Platform

IEEE International Conference on Parallel and Distributed Systems, December 2023

Chao-Yu Lee, Ding-Yong Hong, Pangfeng Liu and Jan-Jan Wu

Ding-Yong Hong Jan-Jan Wu

Abstract

In recent years, container technology has gained significant attention in the software industry, with many businesses opting for it because of its elasticity and cost-effectiveness. However, it is not without challenges. The "cold-start" problem is the most critical issue in deploying containers. The cold-start time is the delay from a container being provisioned on a physical server to its being ready to run the application. End users must endure this delay when they run the application just after the container has started. These delays cause a negative user experience and may hurt the business's profitability. The most common way to ensure a seamless user experience is to keep a substantial number of containers active throughout the day, which causes resource over-provisioning. Conversely, closing the container right after handling the requests can reduce memory consumption but causes a cold start whenever a request arrives. This trade-off between cold start occurrences and resource usage presents a significant challenge on the container platform. To address this challenge, we observe that serving consecutive requests with the same container can notably decrease the number of cold starts. We propose TAC, a Temporal Adjacency Function Clustering algorithm, to meet the challenge. TAC selects, from historical data, the functions whose requests are temporally adjacent and packs them into a cluster to reduce cold starts and enable efficient resource utilization. The experimental results show that, on real-world traces, TAC reduces cold start occurrences by 8% and memory usage by 53% compared to state-of-the-art methods, e.g., Defuse and the hybrid histogram policy.
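The observation that serving consecutive requests with the same container reduces cold starts can be illustrated with a small counting sketch. This is not the TAC algorithm: the keep-alive model, the `count_cold_starts` function, and the function-to-container mapping are assumptions chosen only to show the effect.

```python
def count_cold_starts(requests, container_of, keep_alive):
    """Count cold starts for a request trace.

    requests: list of (timestamp, function_name), sorted by timestamp.
    container_of: maps each function name to the container that serves it.
    keep_alive: how long a container stays warm after its last request.
    """
    last_used = {}
    cold = 0
    for ts, fn in requests:
        cid = container_of[fn]
        # A container is cold if it was never used or has been idle too long.
        if cid not in last_used or ts - last_used[cid] > keep_alive:
            cold += 1
        last_used[cid] = ts
    return cold
```

When two functions whose requests arrive close together share one container, the second request of each pair finds the container already warm, so the cold start count drops compared with one container per function.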

Enabling Highly-Efficient DNA Sequence Mapping via ReRAM-based TCAM

ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 2023

Yu-Shao Lai, Shuo-Han Chen, and Yuan-Hao Chang

Shuo-Han Chen Yuan-Hao Chang

Abstract

In the post-pandemic era, third-generation DNA sequencing (TGS) has received increasing attention from both academia and industry. As TGS technologies have become a requisite for extracting DNA sequences, DNA sequence mapping, which is the most basic bioinformatics application and the core of polymerase chain reaction (PCR) tests, faces great challenges due to the large size and noisy nature of TGS data. In addition, the ever-increasing data volume of DNA sequences also induces the memory wall issue when large datasets are moved between the memory and the computing units. However, much less effort has been devoted to DNA sequence mapping acceleration while considering both the memory wall issue and the challenges of TGS technologies. To enable highly-efficient DNA sequence mapping, this study proposes a novel resistive random-access memory (ReRAM)-based ternary content-addressable memory (TCAM) and exploits the intrinsic parallelism of the ReRAM crossbar for efficient mapping acceleration. Promising results have been demonstrated through a series of experiments with different scales of datasets.

Sky-NN: Enabling Efficient Neural Network Data Processing with Skyrmion Racetrack Memory

ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 2023

Yong-Cheng Liao, Shuo-Han Chen, Yuan-Hao Chang, and Yu-Pei Liang

Shuo-Han Chen Yuan-Hao Chang

Abstract

The thriving of artificial intelligence has brought numerous efforts to build strengthened and sophisticated neural network models to resolve almost all kinds of problems in different academic fields. Owing to the growing complexity and size of neural networks, nonvolatile random access memory (NVRAM) has been utilized to avoid excessive data movements between volatile memory and persistent storage. Among various NVRAM alternatives, skyrmion racetrack memory (SK-RM) is regarded as a promising candidate owing to its high memory density and efficient reads and writes. Nevertheless, due to the distinct shift operation of SK-RM, directly applying existing neural network data processing methods to SK-RM hinders the benefits and performance of both SK-RM and neural networks. To resolve this issue, this paper proposes Sky-NN to enable efficient NN data processing methods on SK-RM by utilizing the distinct shift capability and re-assemblability of skyrmions. A series of experiments were conducted to demonstrate the capability of Sky-NN.

Improving quantitation accuracy in isobaric-labeling mass spectrometry experiments with spectral library searching and feature-based peptide-spectrum match filter

Scientific Reports, August 2023

Tzu-Yun Kuo, Jen-Hung Wang, Yung-Wen Huang, Ting-Yi Sung* and Ching-Tai Chen*

Jen-Hung Wang Ting-Yi Sung Ching-Tai Chen

Abstract

Isobaric labeling relative quantitation is one of the dominant proteomic quantitation technologies. Traditional quantitation pipelines for isobaric-labeled mass spectrometry data are based on sequence database searching. In this study, we present a novel quantitation pipeline that integrates sequence database searching, spectral library searching, and a feature-based peptide-spectrum-match (PSM) filter using various spectral features for filtering. The combined database and spectral library searching results in larger quantitation coverage, and the filter removes PSMs with larger quantitation errors, retaining those with higher quantitation accuracy. Quantitation results show that the proposed pipeline can improve the overall quantitation accuracy at the PSM and protein levels. To our knowledge, this is the first study that utilizes spectral library searching to improve isobaric labeling-based quantitation. So that users can conveniently run the proposed pipeline, we have implemented the feature-based filter as an executable for both Windows and Linux platforms; its executable files, user manual, and sample data sets are freely available at https://ms.iis.sinica.edu.tw/comics/Software_FPF.html. Furthermore, with the developed filter, the proposed pipeline is fully compatible with the Trans-Proteomic Pipeline.