Introduction to Speech and Music Processing

About the Course

This course introduces the fundamental technologies employed in Sound and Music Computing (SMC), focusing on speech and music. It covers the concept of sound and its representations in the analog and digital domains, as well as in the time and frequency domains. It also provides hands-on experience with relevant Machine Learning (ML) tools and an in-depth review of related technologies in sound data analytics, including Automatic Speech Recognition (ASR) and Automatic Music Transcription (AMT). Topics in sound synthesis and automatic music generation will be covered for breadth. Prospective students are expected to have some exposure to ML, as they will build, adapt, and/or modify common ML pipelines for speech or music processing as part of their group projects.
In summary, the course covers two aspects: the understanding and generation of speech and music. In addition to giving students hands-on experience with the Discrete Fourier Transform (DFT), Automatic Speech Recognition (ASR), and Automatic Music Transcription (AMT), we will seek to strengthen students' ability to communicate effectively in the context of a group project.

Learning Objectives:
- Understand the Discrete Fourier Transform (DFT) in the context of audio analysis and synthesis (see the DFT sketch after this list).
- Understand the building blocks of automatic speech recognition (ASR) systems, and be able to implement and evaluate an ASR system with a state-of-the-art (SOTA) toolkit such as SpeechBrain in a group project (see the evaluation sketch after this list).
- Understand approaches to automatic music transcription (AMT), and be able to implement and evaluate them in a group project.
- Work in groups, present solutions in both oral and written formats, and discuss speech and music processing with other students.
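
To give a flavor of the DFT objective above, the following minimal sketch (not part of the official course materials) computes the magnitude spectrum of a single windowed audio frame with NumPy. The signal, sample rate, and frame length are illustrative assumptions.

```python
import numpy as np

# One frame of a synthetic "audio" signal: a 440 Hz sine sampled at 16 kHz.
# (Signal, sample rate, and frame length are illustrative choices only.)
sr = 16000          # sample rate in Hz
n = 1024            # frame length in samples
t = np.arange(n) / sr
frame = 0.5 * np.sin(2 * np.pi * 440 * t)

# Apply a Hann window to reduce spectral leakage, then take the DFT (via the FFT).
windowed = frame * np.hanning(n)
spectrum = np.fft.rfft(windowed)

# Magnitude spectrum and the corresponding frequency axis.
magnitude = np.abs(spectrum)
freqs = np.fft.rfftfreq(n, d=1 / sr)

# The spectral peak should sit near 440 Hz (limited by the DFT bin spacing).
print(f"Peak frequency: {freqs[np.argmax(magnitude)]:.1f} Hz")
```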
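For the ASR objective, system output is typically scored with word error rate (WER). The sketch below is a plain-Python illustration of WER computed via word-level edit distance; in the project, a toolkit such as SpeechBrain or a dedicated evaluation library would normally provide this, and the example transcripts are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with dynamic programming (Levenshtein distance over words)."""
    ref = reference.split()
    hyp = hypothesis.split()

    # d[i][j] = edit distance between the first i reference words and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Hypothetical example transcripts, for illustration only.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167 (one deletion)
```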

About the Lecturer

Wang Ye

Department of Computer Science, School of Computing, NUS

Dr. Ye Wang is a tenured associate professor in the School of Computing at the National University of Singapore. He obtained his BSc in EE from South China University of Technology in 1983, his MSc in EE from Braunschweig University of Technology in Germany in 1993, and his PhD in Information Technology from Tampere University of Technology in Finland in 2002. He worked as a research engineer and senior research engineer at Nokia Research Center, Finland (1994–2002) before joining NUS in 2002.
Dr. Wang’s research has evolved from error-robust audio streaming and low-power media processing for portable devices to Sound and Music Computing for Human Health and Potential (SMC4HHP), with a focus on applications in rehabilitation and language learning. His research has been consistently published in top venues including NeurIPS, ICLR, ACM MM, and ISMIR. He has served as an editorial board member of the Journal of New Music Research, IEEE Transactions on Multimedia (TMM), and ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), and regularly serves on the program committees of top international multimedia conferences. He has co-authored papers that received Best Paper awards at ACM MM, ACM CHI, ISMIR, and IEEE ISM. He was invited to give a keynote address at ISMIR 2014 on “Sound and Music Computing for Exercise and Rehabilitation”, where he presented interdisciplinary research projects he initiated and pioneered in collaboration with clinicians at Harvard Medical School in the USA, Singapore General Hospital, and Huashan Hospital in China. He was also invited to speak at the Symposium on Data Science and Music Information Retrieval at Stanford’s Center for Computer Research in Music and Acoustics (CCRMA) and at the Music and the Brain Conference at the Stanford School of Medicine in May 2016, as well as at the Stanford Graduate School of Education in 2021.