Publications
(†) denotes equal contribution
(*) denotes correspondance
denotes journal
denotes conference
2024
- ac4C-AFL: A high-precision identification of human mRNA N4-acetylcytidine sites based on adaptive feature representation learningNhat Truong Pham, Annie Terrina Terrance, Young-Jun Jeon, Rajan Rakkiyappan, and Balachandran ManavalanMolecular Therapy-Nucleic Acids, 2024
RNA N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in controlling mRNA stability, processing, and translation. Consequently, accurate identification of ac4C sites across the genome is critical for understanding gene expression regulation mechanisms. In this study, we have developed ac4C-AFL, a bioinformatics tool that precisely identifies ac4C sites from primary RNA sequences. In ac4C-AFL, we identified the optimal sequence length for model building and implemented an adaptive feature representation strategy that is capable of extracting the most representative features from RNA. To identify the most relevant features, we proposed a novel ensemble feature importance scoring strategy to rank features effectively. We then used this information to conduct the sequential forward search, which individually determine the optimal feature set from the 16 sequence-derived feature descriptors. Utilizing these optimal feature descriptors, we constructed 176 baseline models using 11 popular classifiers. The most efficient baseline models were identified using the two-step feature selection approach, whose predicted scores were integrated and trained with the appropriate classifier to develop the final prediction model. Our rigorous cross-validations and independent tests demonstrate that ac4C-AFL surpasses contemporary tools in predicting ac4C sites. Moreover, we have developed a publicly accessible web server at https://balalab-skku.org/ac4C-AFL/.
- H2Opred: a robust and efficient hybrid deep learning model for predicting 2’-O-methylation sites in human RNABriefings in Bioinformatics, 2024
2’-O-methylation (2OM) is the most common post-transcriptional modification of RNA. It plays a crucial role in RNA splicing, RNA stability and innate immunity. Despite advances in high-throughput detection, the chemical stability of 2OM makes it difficult to detect and map in messenger RNA. Therefore, bioinformatics tools have been developed using machine learning (ML) algorithms to identify 2OM sites. These tools have made significant progress, but their performances remain unsatisfactory and need further improvement. In this study, we introduced H2Opred, a novel hybrid deep learning (HDL) model for accurately identifying 2OM sites in human RNA. Notably, this is the first application of HDL in developing four nucleotide-specific models [adenine (A2OM), cytosine (C2OM), guanine (G2OM) and uracil (U2OM)] as well as a generic model (N2OM). H2Opred incorporated both stacked 1D convolutional neural network (1D-CNN) blocks and stacked attention-based bidirectional gated recurrent unit (Bi-GRU-Att) blocks. 1D-CNN blocks learned effective feature representations from 14 conventional descriptors, while Bi-GRU-Att blocks learned feature representations from five natural language processing-based embeddings extracted from RNA sequences. H2Opred integrated these feature representations to make the final prediction. Rigorous cross-validation analysis demonstrated that H2Opred consistently outperforms conventional ML-based single-feature models on five different datasets. Moreover, the generic model of H2Opred demonstrated a remarkable performance on both training and testing datasets, significantly outperforming the existing predictor and other four nucleotide-specific H2Opred models. To enhance accessibility and usability, we have deployed a user-friendly web server for H2Opred, accessible at https://balalab-skku.org/H2Opred/. This platform will serve as an invaluable tool for accurately predicting 2OM sites within human RNA, thereby facilitating broader applications in relevant research endeavors.
- Enhanced sliding mode controller design via meta-heuristic algorithm for robust and stable load frequency control in multi-area power systemsAnh-Tuan Tran, Minh Phuc Duong, Nhat Truong Pham, and Jae Woong ShimIET Generation, Transmission & Distribution, 2024
This article introduces a novel approach named HBA-dHoSMO, which combines a continuous decentralized higher-order sliding mode controller-based observer (dHoSMO) with the honey badger algorithm (HBA), specifically designed for load frequency control (LFC) in multi-area power systems (MAPSs). Traditional sliding mode controllers (SMCs) employed in LFC of MAPSs often face challenges related to chattering and oscillations, leading to decreased robustness and stability. Additionally, tuning the parameters for these SMC designs to achieve optimal performance in MAPSs can be challenging. The HBA-dHoSMO is proposed to address the issues of chattering and oscillations, while the optimal parameters for SMC design are obtained using HBA. The stability analysis of the entire system is conducted using linear matrix inequality and the Lyapunov stability theory, affirming the reliability and feasibility of the approach. A comprehensive set of case studies is performed under various configurations and conditions. Additionally, particle swarm optimization and tuna swarm optimization, in conjunction with SMC-based and proportional-integral-derivative controllers, are examined for performance comparison. Simulation results demonstrate the superior performance of the proposed controller across all case studies. This is evidenced by the lowest integral time absolute error values recorded as 0.0133, 0.0006, and 0.0167 for single-, two-, and three-area power systems, respectively.
- Advancing the accuracy of SARS-CoV-2 phosphorylation site detection via meta-learning approachNhat Truong Pham(†), Le Thi Phan(†), Jimin Seo, Yeonwoo Kim, Minkyung Song, Sukchan Lee, Young-Jun Jeon, and Balachandran ManavalanBriefings in Bioinformatics, 2024
The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing coronavirus disease 2019 (COVID-19) crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.
2023
- Comparative analysis of multi-loss functions for enhanced multi-modal speech emotion recognitionPhuong-Nam Tran, Thuy-Duong Thi Vu, Nhat Truong Pham, Hanh Dang-Ngoc, and Duc Ngoc Minh DangIn 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), 2023
In recent years, multi-modal analysis has gained significant prominence across domains such as audio/speech processing, natural language processing, and affective computing, with a particular focus on speech emotion recognition (SER). The integration of data from diverse sources, encompassing text, audio, and images, in conjunction with classifier algorithms has led to the realization of enhanced performance in SER tasks. Traditionally, the cross-entropy loss function has been employed for the classification problem. However, it is challenging to discriminate the feature representations among classes for multi-modal classification tasks. In this study, we focus on the impact of the loss functions on multi-modal SER rather than designing the model architecture. Mainly, we evaluate the performance of multi-modal SER with different loss functions, such as cross-entropy loss, center loss, contrastive-center loss, and their combinations. Based on extensive comparative analysis, it is proven that the combination of cross-entropy loss and contrastive-center loss achieves the best performance for multi-modal SER. This combination reaches the highest accuracy of 80.27% and the highest balanced accuracy of 81.44% on the IEMOCAP dataset.
- SER-Fuse: An Emotion Recognition Application Utilizing Multi-Modal, Multi-Lingual, and Multi-Feature FusionNhat Truong Pham(*), Le Thi Phan, Duc Ngoc Minh Dang, and Balachandran ManavalanIn Proceedings of the 12th International Symposium on Information and Communication Technology (SOICT), 2023
Speech emotion recognition (SER) is a crucial aspect of affective computing and human-computer interaction, yet effectively identifying emotions in different speakers and languages remains challenging. This paper introduces SER-Fuse, a multi-modal SER application that is designed to address the complexities of multiple speakers and languages. Our approach leverages diverse audio/speech embeddings and text embeddings to extract optimal features for multi-modal SER. We subsequently employ multi-feature fusion to integrate embedding features across modalities and languages. Experimental results archived on the English-Chinese emotional speech (ECES) dataset reveal that SER-Fuse attains competitive performance in the multi-lingual approach compared to the single-lingual approaches. Furthermore, we provide the implementation of SER-Fuse for download at https://github.com/nhattruongpham/SER-Fuse to support reproducibility and local deployment.
- Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head AttentionPhuong-Nam Tran, Thuy-Duong Thi Vu, Duc Ngoc Minh Dang, Nhat Truong Pham, and Anh-Khoa TranIn International Conference on Industrial Networks and Intelligent Systems (INISCOM), 2023
Recent research has shown that multi-modal learning is a successful method for enhancing classification performance by mixing several forms of input, notably in speech-emotion recognition (SER) tasks. However, the difference between the modalities may affect SER performance. To overcome this problem, a novel approach for multi-modal SER called 3M-SER is proposed in this paper. The 3M-SER leverages multi-head attention to fuse information from multiple feature embeddings, including audio and text features. The 3M-SER approach is based on the SERVER approach but includes an additional fusion module that improves the integration of text and audio features, leading to improved classification performance. To further enhance the correlation between the modalities, a LayerNorm is applied to audio features prior to fusion. Our approach achieved an unweighted accuracy (UA) and weighted accuracy (WA) of 79.96% and 80.66%, respectively, on the IEMOCAP benchmark dataset. This indicates that the proposed approach is better than SERVER and recent methods with similar approaches. In addition, it highlights the effectiveness of incorporating an extra fusion module in multi-modal learning.
- ADP-Fuse: A novel dual layer machine learning predictor to identify antidiabetic peptides and diabetes types using multiview informationShaherin Basith, Nhat Truong Pham, Minkyung Song, Gwang Lee, and Balachandran ManavalanComputers in Biology and Medicine, 2023
Diabetes mellitus has become a major public health concern associated with high mortality and reduced life expectancy and can cause blindness, heart attacks, kidney failure, lower limb amputations, and strokes. A new generation of antidiabetic peptides (ADPs) that act on β-cells or T-cells to regulate insulin production is being developed to alleviate the effects of diabetes. However, the lack of effective peptide-mining tools has hampered the discovery of these promising drugs. Hence, novel computational tools need to be developed urgently. In this study, we present ADP-Fuse, a novel two-layer prediction framework capable of accurately identifying ADPs or non-ADPs and categorizing them into type 1 and type 2 ADPs. First, we comprehensively evaluated 22 peptide sequence-derived features coupled with eight notable machine learning algorithms. Subsequently, the most suitable feature descriptors and classifiers for both layers were identified. The output of these single-feature models, embedded with multiview information, was trained with an appropriate classifier to provide the final prediction. Comprehensive cross-validation and independent tests substantiate that ADP-Fuse surpasses single-feature models and the feature fusion approach for the prediction of ADPs and their types. In addition, the SHapley Additive exPlanation method was used to elucidate the contributions of individual features to the prediction of ADPs and their types. Finally, a user-friendly web server for ADP-Fuse was developed and made publicly accessible (https://balalab-skku.org/ADP-Fuse), enabling the swift screening and identification of novel ADPs and their types. This framework is expected to contribute significantly to antidiabetic peptide identification.
- SERVER: Multi-modal Speech Emotion Recognition using TransformeR-based and Vision-based EmbeddingsNhat Truong Pham, Duc Ngoc Minh Dang, Bich Hong Ngoc Pham, and Sy Dzung Nguyen1In 2023 8th International Conference on Intelligent Information Technology (ICIIT), 2023
This paper proposes a multi-modal approach for speech emotion recognition (SER) using both text and audio inputs. The audio embedding is extracted by using a vision-based architecture, namely VGGish, while the text embedding is extracted by using a transformer-based architecture, namely BERT. Then, these embeddings are fused using concatenation to recognize emotional states. To evaluate the effectiveness of the proposed method, the benchmark dataset, namely IEMOCAP, is employed in this study. Experimental results indicate that the proposed method is very competitive and better than most of the latest and state-of-the-art methods using multi-modal analysis for SER. The proposed method achieves 63.00% unweighted accuracy (UA) and 63.10% weighted accuracy (WA) on the IEMOCAP dataset. In the future, an extension of multi-task learning and multi-lingual approaches will be investigated to improve the performance and robustness of multi-modal SER. For reproducibility purposes, our code is publicly available.
- Uplink registration-based MAC protocol for IEEE 802.11ah networksDuc Ngoc Minh Dang, and Nhat Truong PhamIn 2023 8th International Conference on Intelligent Information Technology (ICIIT), 2023
IEEE 802.11ah (Wi-Fi HaLow) operates in license-exempt ISM bands below 1 GHz and provides longer-range connectivity. The main advantage of the IEEE 802.11ah is it provides long range connection with low power consumption. RAW (Restricted Access Window) in IEEE 802.11ah helps to reduce the collision probability and enhance the network throughput when many stations contend the channel. Since stations are assigned to uplink RAW slots based on their Association Identifications (AID), the number of stations that have uplink data packets in each RAW slot is a big difference. It results in low fairness among stations. The paper proposes an uplink registration-based MAC protocol for IEEE 802.11ah networks (UR-MAC). In UR-MAC protocol, stations with uplink data will register with the AP by attaching the uplink registration to the data packet during downlink communications. The AP will allocate RAW slots based on the uplink registered station list. The UR-MAC protocol tries to use up the resources of the RAW slots as well as balance the number of stations with uplink data among the RAW slots. Through the evaluation and comparison analysis, the UR-MAC protocol significantly improves the fairness index compared to the IEEE 802.11ah protocol while still ensuring the probability of successful transmission, the average number of successfully transmitted packets, and power efficiency compared to the IEEE 802.11ah protocol.
- DrugormerDTI: Drug Graphormer for drug–target interaction predictionJiayue Hu, Wang Yu, Chao Pang, Junru Jin, Nhat Truong Pham, Balachandran Manavalan, and Leyi WeiComputers in Biology and Medicine, 2023
Drug-target interactions (DTI) prediction is a crucial task in drug discovery. Existing computational methods accelerate the drug discovery in this respect. However, most of them suffer from low feature representation ability, significantly affecting the predictive performance. To address the problem, we propose a novel neural network architecture named DrugormerDTI, which uses Graph Transformer to learn both sequential and topological information through the input molecule graph and Resudual2vec to learn the underlying relation between residues from proteins. By conducting ablation experiments, we verify the importance of each part of the DrugormerDTI. We also demonstrate the good feature extraction and expression capabilities of our model via comparing the mapping results of the attention layer and molecular docking results. Experimental results show that our proposed model performs better than baseline methods on four benchmarks. We demonstrate that the introduction of Graph Transformer and the design of residue are appropriate for drug-target prediction.
- AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state networkMustaqeem Khan, Abdulmotaleb El Saddik, Fahd Saleh Alotaibi, and Nhat Truong PhamKnowledge-Based Systems, 2023
Speech signals are the most convenient way of communication between human beings and the eventual method of Human-Computer Interaction (HCI) to exchange emotions and information. Recognizing emotions from speech signals is a challenging task due to the sparse nature of emotional data and features. In this article, we proposed a Deep Echo-State-Network (DeepESN) system for emotion recognition with a dilated convolution neural network and multi-headed attention mechanism. To reduce the model complexity, we incorporate a DeepESN that combines reservoir computing for higher-dimensional mapping. We also used fine-tuned Sparse Random Projection (SRP) to reduce dimensionality and adopted an early fusion strategy to fuse the extracted cues and passed the joint feature vector via a classification layer to recognize emotions. Our proposed model is evaluated on two public speech corpora, EMO-DB and RAVDESS, and tested for subject/speaker-dependent/independent performance. The results show that our proposed system achieves a high recognition rate, 91.14, 85.57 for EMO-DB, and 82.01, 77.02 for RAVDESS, using speaker-dependent and independent experiments, respectively. Our proposed system outperforms the State-of-The-Art (SOTA) while requiring less computational time.
- Towards an efficient machine learning model for financial time series forecastingArun Kumar, Tanya Chauhan, Srinivasan Natesan, Nhat Truong Pham, Ngoc Duy Nguyen2, and Chee Peng LimSoft Computing, 2023
Financial time series forecasting is a challenging problem owing to the high degree of randomness and absence of residuals in time series data. Existing machine learning solutions normally do not perform well on such data. In this study, we propose an efficient machine learning model for financial time series forecasting through carefully designed feature extraction, elimination, and selection strategies. We leverage a binary particle swarm optimization algorithm to select the appropriate features and propose new evaluation metrics, i.e. mean weighted square error and mean weighted square ratio, for better performance assessment in handling financial time series data. Both indicators ascertain that our proposed model is effective, which outperforms several existing methods in benchmark studies.
- Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognitionNhat Truong Pham, Duc Ngoc Minh Dang, Ngoc Duy Nguyen2, Thanh Thi Nguyen3, Hai Nguyen4, Balachandran Manavalan, Chee Peng Lim, and Sy Dzung Nguyen1Expert Systems with Applications, 2023
Recently, speech emotion recognition (SER) has become an active research area in speech processing, particularly with the advent of deep learning (DL). Numerous DL-based methods have been proposed for SER. However, most of the existing DL-based models are complex and require a large amounts of data to achieve a good performance. In this study, a new framework of deep attention-based dilated convolutional-recurrent neural networks coupled with a hybrid data augmentation method was proposed for addressing SER tasks. The hybrid data augmentation method constitutes an upsampling technique for generating more speech data samples based on the traditional and generative adversarial network approaches. By leveraging both convolutional and recurrent neural networks in a dilated form along with an attention mechanism, the proposed DL framework can extract high-level representations from three-dimensional log Mel spectrogram features. Dilated convolutional neural networks acquire larger receptive fields, whereas dilated recurrent neural networks overcome complex dependencies as well as the vanishing and exploding gradient issues. Furthermore, the loss functions are reconfigured by combining the SoftMax loss and the center-based losses to classify various emotional states. The proposed framework was implemented using the Python programming language and the TensorFlow deep learning library. To validate the proposed framework, the EmoDB and ERC benchmark datasets, which are imbalanced and/or small datasets, were employed. The experimental results indicate that the proposed framework outperforms other related state-of-the-art methods, yielding the highest unweighted recall rates of 88.03 ± 1.39 (%) and 66.56 ± 0.67 (%) for the EmoDB and ERC datasets, respectively.
- Pretoria: An effective computational approach for accurate and high-throughput identification of CD8+ t-cell epitopes of eukaryotic pathogensPhasit Charoenkwan, Nalini Schaduangrat, Nhat Truong Pham, Balachandran Manavalan, and Watshara ShoombuatongInternational Journal of Biological Macromolecules, 2023
T-cells recognize antigenic epitopes present on major histocompatibility complex (MHC) molecules, triggering an adaptive immune response in the host. T-cell epitope (TCE) identification is challenging because of the extensive number of undetermined proteins found in eukaryotic pathogens, as well as MHC polymorphisms. In addition, conventional experimental approaches for TCE identification are time-consuming and expensive. Thus, computational approaches that can accurately and rapidly identify CD8+ T-cell epitopes (TCEs) of eukaryotic pathogens based solely on sequence information may facilitate the discovery of novel CD8+ TCEs in a cost-effective manner. Here, Pretoria (Predictor of CD8+ TCEs of eukaryotic pathogens) is proposed as the first stack-based approach for accurate and large-scale identification of CD8+ TCEs of eukaryotic pathogens. In particular, Pretoria enabled the extraction and exploration of crucial information embedded in CD8+ TCEs by employing a comprehensive set of 12 well-known feature descriptors extracted from multiple groups, including physicochemical properties, composition-transition-distribution, pseudo-amino acid composition, and amino acid composition. These feature descriptors were then utilized to construct a pool of 144 different machine learning (ML)-based classifiers based on 12 popular ML algorithms. Finally, the feature selection method was used to effectively determine the important ML classifiers for the construction of our stacked model. The experimental results indicated that Pretoria is an accurate and effective computational approach for CD8+ TCE prediction; it was superior to several conventional ML classifiers and the existing method in terms of the independent test, with an accuracy of 0.866, MCC of 0.732, and AUC of 0.921. Additionally, to maximize user convenience for high-throughput identification of CD8+ TCEs of eukaryotic pathogens, a user-friendly web server of Pretoria (http://pmlabstack.pythonanywhere.com/Pretoria) was developed and made freely available.
- An exploratory simulation study and prediction model on human brain behavior and activity using an integration of deep neural network and biosensor Rabi antennaNhat Truong Pham, Montree Bunruangses, Phichai Youplao, Anita Garhwal, Kanad Ray, Arup Roy, Sarawoot Boonkirdram, Preecha Yupapin, Muhammad Arif Jalil, Jalil Ali, Shamim Kaiser, Mufti Mahmud, Saurav Mallik, and Zhongming ZhaoHeliyon, 2023
The plasmonic antenna probe is constructed using a silver rod embedded in a modified Mach-Zehnder interferometer (MZI) ad-drop filter. Rabi antennas are formed when space-time control reaches two levels of system oscillation and can be used as human brain sensor probes. Photonic neural networks are designed using brain-Rabi antenna communication, and transmissions are connected via neurons. Communication signals are carried by electron spin (up and down) and adjustable Rabi frequency. Hidden variables and deep brain signals can be obtained by external detection. A Rabi antenna has been developed by simulation using computer simulation technology (CST) software. Additionally, a communication device has been developed that uses the Optiwave program with Finite-Difference Time-Domain (OptiFDTD). The output signal is plotted using the MATLAB program with the parameters of the OptiFDTD simulation results. The proposed antenna oscillates in the frequency range of 192 THz to 202 THz with a maximum gain of 22.4 dBi. The sensitivity of the sensor is calculated along with the result of electron spin and applied to form a human brain connection. Moreover, intelligent machine learning algorithms are proposed to identify high-quality transmissions and predict the behavior of transmissions in the near future. During the process, a root mean square error (RMSE) of 2.3332 (±0.2338) was obtained. Finally, it can be said that our proposed model can efficiently predict human mind, thoughts, behavior as well as action/reaction, which can be greatly helpful in the diagnosis of various neuro-degenerative/psychological diseases (such as Alzheimer’s, dementia, etc.) and for security purposes.
- Speech emotion recognition using overlapping sliding window and Shapley additive explainable deep neural networkNhat Truong Pham, Sy Dzung Nguyen1, Vu Song Thuy Nguyen, Bich Ngoc Hong Pham, and Duc Ngoc Minh DangJournal of Information and Telecommunication, 2023
Speech emotion recognition (SER) has several applications, such as e-learning, human-computer interaction, customer service, and healthcare systems. Although researchers have investigated lots of techniques to improve the accuracy of SER, it has been challenging with feature extraction, classifier schemes, and computational costs. To address the aforementioned problems, we propose a new set of 1D features extracted by using an overlapping sliding window (OSW) technique for SER in this study. In addition, a deep neural network-based classifier scheme called the deep Pattern Recognition Network (PRN) is designed to categorize emotional states from the new set of 1D features. We evaluate the proposed method on the Emo-DB and the AESSD datasets that contain several different emotional states. The experimental results show that the proposed method achieves an accuracy of 98.5% and 87.1% on the Emo-DB and AESSD datasets, respectively. It is also more comparable with accuracy to and better than the state-of-the-art and current approaches that use 1D features on the same datasets for SER. Furthermore, the SHAP (SHapley Additive exPlanations) analysis is employed for interpreting the prediction model to assist system developers in selecting the optimal features to integrate into the desired system.
- Fruit-CoV: An efficient vision-based framework for speedy detection and diagnosis of SARS-CoV-2 infections through recorded cough soundsLong H Nguyen(†), Nhat Truong Pham(†)(*), Van Huong Do, Liu Tai Nguyen, Thanh Tin Nguyen6, Hai Nguyen4, Ngoc Duy Nguyen2, Thanh Thi Nguyen3, Sy Dzung Nguyen1, Asim Bhatti, and Chee Peng LimExpert Systems with Applications, 2023
COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This deadly virus has spread worldwide, leading to a global pandemic since March 2020. A recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths globally. Therefore, developing an efficient self-testing service for SARS-CoV-2 at home is vital. In this study, a two-stage vision-based framework, namely Fruit-CoV, is introduced for detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, audio signals are converted into Log-Mel spectrograms, and the EfficientNet-V2 network is used to extract their visual features in the first stage. In the second stage, 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN are employed to aggregate feature representations of the Log-Mel spectrograms and the waveform. Finally, the combined features are used to train a binary classifier. In this study, a dataset provided by the AICovidVN 115M Challenge is employed for evaluation. It includes 7,371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results indicate that the proposed model achieves an Area Under the Receiver Operating Characteristic Curve (AUC) score of 92.8% and ranks first on the final leaderboard of the AICovidVN 115M Challenge. Our code is publicly available.
- Towards designing a generic and comprehensive deep reinforcement learning frameworkNgoc Duy Nguyen2, Thanh Thi Nguyen3, Nhat Truong Pham, Hai Nguyen4, Dang Tu Nguyen, Thanh Dang Nguyen, Chee Peng Lim, Michael Johnstone, Asim Bhatti, Douglas Creighton, and Saeid NahavandiApplied Intelligence, 2023
Reinforcement learning (RL) has emerged as an effective approach for building an intelligent system, which involves multiple self-operated agents to collectively accomplish a designated task. More importantly, there has been a renewed focus on RL since the introduction of deep learning that essentially makes RL feasible to operate in high-dimensional environments. However, there are many diversified research directions in the current literature, such as multi-agent and multi-objective learning, and human-machine interactions. Therefore, in this paper, we propose a comprehensive software architecture that not only plays a vital role in designing a connect-the-dots deep RL architecture but also provides a guideline to develop a realistic RL application in a short time span. By inheriting the proposed architecture, software managers can foresee any challenges when designing a deep RL-based system. As a result, they can expedite the design process and actively control every stage of software development, which is especially critical in agile development environments. For this reason, we design a deep RL-based framework that strictly ensures flexibility, robustness, and scalability. To enforce generalization, the proposed architecture also does not depend on a specific RL algorithm, a network configuration, the number of agents, or the type of agents.
2022
- Speech emotion recognition: A brief review of multi-modal multi-task learning approachesNhat Truong Pham, Anh-Tuan Tran, Bich Ngoc Hong Pham, Hanh Dang-Ngoc, Sy Dzung Nguyen1, and Duc Ngoc Minh DangIn International Conference on Advanced Engineering Theory and Applications, 2022
Speech emotion recognition (SER) has become an attention-grabbing topic in recent years thanks to the development of deep learning in the field of speech processing. However, it is difficult to recognize an accurate emotional state from only speech signals. Therefore, researchers have investigated multi-modalities such as speech, visual, and text inputs to improve the emotional recognition rate of speech. In addition, to enhance the generalized deep learning models for SER, multi-task learning (MTL) strategies have also been applied in the past decade. In this paper, a brief and comprehensive review of multi-modal multi-task learning (3MTL) approaches for recognizing emotional states from speech signals is presented, including multi-modal SER, multi-task learning SER, and multi-modal multi-task learning SER. This paper also discusses about some problems that still need to be solved in 3MTL SER and gives some suggestions for the future.
- Priority-Based Uplink Raw Slot Utilization in the IEEE 802.11 ah NetworksDuc Ngoc Minh Dang, Anh Khoa Tran, and Nhat Truong PhamIn International Conference on Advanced Engineering Theory and Applications, 2022
The IEEE 802.11ah standard allows an Access Point (AP) to connect up to 8192 stations at a transmission range of up to 1 km. The goal of the IEEE 802.11ah standard is to maintain wide connectivity and energy efficiency. Some stations are allocated to RAW slots, but they do not have uplink data packets to transmit results in low channel efficiency. The new MAC protocol (PUT-MAC) allows stations to use adjacent RAW slots in a priority manner to improve channel access usage efficiency. The stations use different Arbitration Inter Frame Space and contention window values to contend channel in the same RAW slot. The paper conducts simulations to compare the network performance of the PUT-MAC protocol with the IEEE 802.11ah.
- vieCap4H Challenge 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTMThanh Tin Nguyen6, Long H Nguyen, Nhat Truong Pham(*), Liu Tai Nguyen, Van Huong Do, Hai Nguyen4, and Ngoc Duy Nguyen2VNU Journal of Science: Computer Science and Communication Engineering, 2022
This study presents our approach on the automatic Vietnamese image captioning for healthcare domain in text processing tasks of Vietnamese Language and Speech Processing (VLSP) Challenge 2021, as shown in Figure 1. In recent years, image captioning often employs a convolutional neural network-based architecture as an encoder and a long short-term memory (LSTM) as a decoder to generate sentences. These models perform remarkably well in different datasets. Our proposed model also has an encoder and a decoder, but we instead use a Swin Transformer in the encoder, and a LSTM combined with an attention module in the decoder. The study presents our training experiments and techniques used during the competition. Our model achieves a BLEU4 score of 0.293 on the vietCap4H dataset, and the score is ranked the 3rd place on the private leaderboard. Our code can be found at https://github.com/ngthanhtin/VLSP_ImageCaptioning for reproducible purposes.
- Space-Frequency Diversity based MAC protocol for IEEE 802.11 ah networksDuc Ngoc Minh Dang, Van Thau Tran, Hoang Lam Nguyen, Nhat Truong Pham, Anh Khoa Tran, and Ngoc-Hanh DangIn 2022 International Conference on Advanced Technologies for Communications (ATC), 2022
IEEE 802.11ah is a sub-GHz communication technology to offer longer range and low power connectivity for the Internet of Things (IoT) applications. A Restricted Access Window (RAW) is specified to decrease the collision probability. Stations are divided into groups and stations from each group attempt to access the channel by employing the Distributed Coordination Function during their assigned RAW slots. However, the network throughput is limited by a single channel MAC protocol. In this paper, Space-Frequency Diversity-based MAC protocol for the IEEE 802.11ah network (SF-MAC protocol) is proposed to allow stations of different sectors to transmit packets on different channels with the help of Forwarders. The proposed SF-MAC protocol improves the packet delivery ratio and aggregate throughput of the network.
- Key Information Extraction from Mobile-Captured Vietnamese Receipt Images using Graph Neural Networks ApproachVan Dung Pham, Le Quan Nguyen, Nhat Truong Pham, Bao Hung Nguyen, Duc Ngoc Minh Dang, and Sy Dzung Nguyen1In 2022 6th International Conference on Green Technology and Sustainable Development (GTSD), 2022
Information extraction and retrieval are growing fields that have a significant role in document parser and analysis systems. Researches and applications developed in recent years show the numerous difficulties and obstacles in extracting key information from documents. Thanks to the raising of graph theory and deep learning, graph representation and graph learning have been widely applied in information extraction to obtain more exact results. In this paper, we propose a solution upon graph neural networks (GNN) for key information extraction (KIE) that aims to extract the key information from mobile-captured Vietnamese receipt images. Firstly, the images are pre-processed using U2-Net, and then a CRAFT model is used to detect texts from the pre-processed images. Next, the implemented TransformerOCR model is employed for text recognition. Finally, a GNN-based model is designed to extract the key information based on the recognized texts. For validating the effectiveness of the proposed solution, the publicly available dataset released from the Mobile-Captured Receipt Recognition (MC-OCR) Challenge 2021 is used to train and evaluate. The experimental results indicate that our proposed solution achieves a character error rate (CER) score of 0.25 on the private test set, which is more comparable with all reported solutions in the MC-OCR Challenge 2021 as mentioned in the literature. For reproducing and knowledge-sharing purposes, our implementation of the proposed solution is publicly available at https://github.com/ThorPham/Key_infomation_extraction.
- Vietnamese Scene Text Detection and Recognition using Deep Learning: An Empirical StudyNhat Truong Pham, Van Dung Pham(†), Qui Nguyen-Van, Bao Hung Nguyen, Duc Ngoc Minh Dang, and Sy Dzung Nguyen1In 2022 6th International Conference on Green Technology and Sustainable Development (GTSD), 2022
Scene text detection and recognition are vital challenging tasks in computer vision, which are to detect and recognize sequences of texts in natural scenes. Recently, researchers have investigated a lot of state-of-the-art methods to improve the accuracy and efficiency of text detection and recognition. However, there has been little research on text detection and recognition in natural scenes in Vietnam. In this paper, a deep learning-based empirical investigation of Vietnamese scene text detection and recognition is presented. Firstly, four detection models including differentiable binarization network (DBN), pyramid mask text detector (PMTD), pixel aggregation network (PAN), and Fourier contour embedding network (FCEN), are employed to detect text regions from the images. Then, four text recognition models including convolutional recurrent neural network (CRNN), self-attention text recognition network (SATRN), no-recurrence sequence-to-sequence text recognizer (NRTR), and RobustScanner (RS) are also investigated to recognize the texts. Moreover, data augmentation methods are also applied to enrich data for improving the accuracy and enhancing the performance of scene text detection and recognition. To validate the effectiveness of scene text detection and recognition models, the VinText dataset is employed for evaluation. Empirical results show that PMTD and SATRN achieve the highest scores among the others for text detection and recognition, respectively. For knowledge-sharing, our implementation is publicly available at https://github.com/ThorPham/VN_scene_text_detection_recognition.
- Safety Message Broadcast Reliability Enhancement MAC protocol in VANETsDuc Ngoc Minh Dang, Anh Khoa Tran, Nhat Truong Pham, Khanh Duong Tran, and Hanh Ngoc DangIn 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), 2022
Recently, Vehicular Ad-hoc Networks (VANETs) have been considered as an important part of the Intelligent Transportation System. Data transmission in VANET can be safety message and non-safety message transmissions. While the safety message transmission typically requires bounded delay and a high packet delivery ratio, the non-safety message transmission demands sufficiently high throughput. In this paper, a MAC protocol for Safety message broadcast Reliability Enhancement in VANETs, named SRE-MAC protocol, is proposed to ensure both the reliability of safety message transmission and the high throughput for non-safety data transmission. In particular, the proposed SRE-MAC employs a time slot allocation of TDMA and a random-access technique of CSMA schemes for accessing the control channel. To evaluate our proposed SRE-MAC protocol, some extensive simulations are conducted. The simulation results show that the proposed SRE-MAC protocol achieves higher performance in terms of safety packet delivery ratio and throughput of non-safety packets, as compared to the IEEE 1609.4 and the VER-MAC protocol.
- A deep learning approach for detecting drill bit failures from a small sound datasetThanh Tran, Nhat Truong Pham, and Jan LundgrenScientific Reports, 2022
Monitoring the conditions of machines is vital in the manufacturing industry. Early detection of faulty components in machines for stopping and repairing the failed components can minimize the downtime of the machine. In this article, we present a method for detecting failures in drill machines using drill sounds in Valmet AB, a company in Sundsvall, Sweden that supplies equipment and processes for the production of pulp, paper, and biofuels. The drill dataset includes two classes: anomalous sounds and normal sounds. Detecting drill failure effectively remains a challenge due to the following reasons. The waveform of drill sound is complex and short for detection. Furthermore, in realistic soundscapes, both sounds and noise exist simultaneously. Besides, the balanced dataset is small to apply state-of-the-art deep learning techniques. Due to these aforementioned difficulties, sound augmentation methods were applied to increase the number of sounds in the dataset. In this study, a convolutional neural network (CNN) was combined with a long-short-term memory (LSTM) to extract features from log-Mel spectrograms and to learn global representations of two classes. A leaky rectified linear unit (Leaky ReLU) was utilized as the activation function for the proposed CNN instead of the ReLU. Moreover, an attention mechanism was deployed at the frame level after the LSTM layer to pay attention to the anomaly in sounds. As a result, the proposed method reached an overall accuracy of 92.62% to classify two classes of machine sounds on Valmet’s dataset. In addition, an extensive experiment on another drilling dataset with short sounds yielded 97.47% accuracy. With multiple classes and long-duration sounds, an experiment utilizing the publicly available UrbanSound8K dataset obtains 91.45%. Extensive experiments on our dataset as well as publicly available datasets confirm the efficacy and robustness of our proposed method. For reproducing and deploying the proposed system, an open-source repository is publicly available at https://github.com/thanhtran1965/DrillFailureDetection_SciRep2022.
- Improving ligand-ranking of AutoDock Vina by changing the empirical parametersT Ngoc Han Pham, Trung Hai Nguyen5, Nguyen Minh Tam, Thien Y. Vu, Nhat Truong Pham, Nguyen Truong Huy, Binh Khanh Mai, Nguyen Thanh Tung, Minh Quan Pham, Van V. Vu, and Son Tung NgoJournal of Computational Chemistry, 2022
AutoDock Vina (Vina) achieved a very high docking-success rate, p̂, but give a rather low correlation coefficient, R, for binding affinity with respect to experiments. This low correlation can be an obstacle for ranking of ligand-binding affinity, which is the main objective of docking simulations. In this context, we evaluated the dependence of Vina R coefficient upon its empirical parameters. R is affected more by changing the gauss2 and rotation than other terms. The docking-success rate p̂ is sensitive to the alterations of the gauss1, gauss2, repulsion, and hydrogen bond parameters. Based on our benchmarks, the parameter set1 has been suggested to be the most optimal. The testing study over 800 complexes indicated that the modified Vina provided higher correlation with experiment Rset1=0.556±0.025 compared with RDefault=0.493±0.028 obtained by the original Vina and RVina 1.2=0.503±0.029 by Vina version 1.2. Besides, the modified Vina can be also applied more widely, giving R ≥ 0.500 for 32/48 targets, compared with the default package, giving R ≥ 0.500 for 31/48 targets. In addition, validation calculations for 1036 complexes obtained from version 2019 of PDBbind refined structures showed that the set1 of parameters gave higher correlation coefficient (Rset1=0.617±0.017) than the default package (RDefault=0.543±0.020) and Vina version 1.2 (RVina 1.2=0.540±0.020). The version of Vina with set1 of parameters can be downloaded at https://github.com/sontungngo/mvina. The outcomes would enhance the ranking of ligand-binding affinity using Autodock Vina.
- Determination of the optimal number of clusters: a fuzzy-set based methodSy Dzung Nguyen1, Vu Song Thuy Nguyen, and Nhat Truong PhamIEEE Transactions on Fuzzy Systems, 2022
The optimal number of clusters (Copt) is one of the determinants of clustering efficiency. In this article, we present a new method of quantifying Copt for centroid-based clustering. First, we propose a new clustering validity index named fRisk(C) based on the fuzzy set theory. It takes the role of normalization and accumulation of local risks coming from each action either splitting data from a cluster or merging data into a cluster. fRisk(C) exploits the local distribution information of the database to catch the global information of the clustering process in the form of the risk degree. Based on the monotonous reduction property of fRisk(C), which is proved theoretically, we present a fRisk-based new algorithm named fRisk4-bA for determining Copt. In the algorithm, the well-known L-method is employed as a supplemented tool to catch Copt on the graph of the fRisk(C). Along with the stable convergence trend of the method to be proved theoretically, numerical surveys are also carried out. The surveys show that the high reliability and stability, as well as the sensitivity in separating/merging clusters in high-density areas, even if the presence of noise in the databases, are the strong points of the proposed method.
- HCILab at Memotion 2.0 2022: Analysis of sentiment, emotion and intensity of emotion classes from meme images using single and multi modalitiesIn Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection (De-Factify@AAAI 2022), CEUR, 2022
Nowadays, memes found on internet are overwhelming. Although they are innocuous and sometimes entertaining, there exist memes that contain sarcasm, offensive, or motivational feelings. In this study, several approaches are proposed to solve the multiple modality problem in analysing the given meme dataset. The imbalance issue has been addressed by using a new Auto Augmentation method and the uncorrelation issue has been mitigated by adopting deep Canonical Correlation Analysis to find the most correlated projections of visual and textual feature embedding. In addition, both stacked attention and multi-hop attention network are employed to efficiently generate aggregated features. As a result, our team, i.e. HCILab, achieved a weighted F1 score of 0.4995 for sentiment analysis, 0.7414 for emotion classification, and 0.5301 for scale/intensity of emotion classes on the leaderboard. This results are obtained by using concatenation between image and text model and our code can be found at https://github.com/ngthanhtin/Memotion2_AAAI_WS_2022.
2021
- Separate sound into STFT frames to eliminate sound noise frames in sound classificationThanh Tran, Kien Bui Huy, Nhat Truong Pham, Marco Carratù, Consolatina Liguori, and Jan LundgrenIn 2021 IEEE Symposium Series on Computational Intelligence (SSCI), 2021
Sounds always contain acoustic noise and background noise that affects the accuracy of the sound classification system. Hence, suppression of noise in the sound can improve the robustness of the sound classification model. This paper investigated a sound separation technique that separates the input sound into many overlapped-content Short-Time Fourier Transform (STFT) frames. Our approach is different from the traditional STFT conversion method, which converts each sound into a single STFT image. Contradictory, separating the sound into many STFT frames improves model prediction accuracy by increasing variability in the data and therefore learning from that variability. These separated frames are saved as images and then labeled manually as clean and noisy frames which are then fed into transfer learning convolutional neural networks (CNNs) for the classification task. The pre-trained CNN architectures that learn from these frames become robust against the noise. The experimental results show that the proposed approach is robust against noise and achieves 94.14% in terms of classifying 21 classes including 20 classes of sound events and a noisy class. An open-source repository of the proposed method and results is available at https://github.com/nhattruongpham/soundSepsound.
2020
- A method upon deep learning for speech emotion recognitionNhat Truong Pham, Duc Ngoc Minh Dang, and Sy Dzung Nguyen1Journal of Advanced Engineering and Computation, 2020
Feature extraction and emotional classification are significant roles in speech emotion recognition. It is hard to extract and select the optimal features, researchers can not be sure what the features should be. With deep learning approaches, features could be extracted by using hierarchical abstraction layers, but it requires high computational resources and a large number of data. In this article, we choose static, differential, and acceleration coefficients of log Mel-spectrogram as inputs for the deep learning model. To avoid performance degradation, we also add a skip connection with dilated convolution network integration. All representatives are fed into a self-attention mechanism with bidirectional recurrent neural networks to learn long term global features and exploit context for each time step. Finally, we investigate contrastive center loss with softmax loss as loss function to improve the accuracy of emotion recognition. For validating robustness and effectiveness, we tested the proposed method on the Emo-DB and ERC2019 datasets. Experimental results show that the performance of the proposed method is strongly comparable with the existing state-of-the-art methods on the Emo-DB and ERC2019 with 88% and 67%, respectively.