Expressive speech synthesis aims to model speaking style, emotion, and intonation at a fine granularity to enhance the naturalness and expressiveness of synthesized speech. It has broad application prospects in areas such as audiobooks, AI news anchors, and human-machine conversational interaction. However, it also faces a series of challenges: how to efficiently extract and represent multi-scale style features; how to achieve style control and transfer without sacrificing naturalness and intelligibility; and how to generate expressive and diverse speech prosody. This report introduces work by the Human-Computer Speech Interaction (HCSI) Laboratory at Tsinghua University on these problems, including: a multi-scale style modeling method for expressive speech synthesis that improves the controllability, expressiveness, and flexibility of synthesis; a block-based multi-scale cross-speaker style transfer method that transfers speaking style across speakers; and a prosody predictor based on a denoising diffusion probabilistic model that generates expressive and diverse speech prosody. Experimental results validate the effectiveness and advantages of the proposed methods for expressive speech synthesis.
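To make the denoising diffusion idea concrete, the sketch below shows the generic DDPM machinery such a prosody predictor builds on: a noise schedule, the closed-form forward process q(x_t | x_0), and one reverse denoising step. This is a minimal illustration only, not the laboratory's model; the step count, the linear beta schedule, and the toy "prosody contour" (a stand-in for, e.g., a log-F0 trajectory) are all assumptions, and the true noise is used in place of a trained denoising network's prediction.

```python
import numpy as np

# Illustrative DDPM sketch (NOT the authors' model): in practice a neural
# denoiser conditioned on text predicts the noise eps_hat at each step.

T = 100                                 # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def p_step(xt, t, eps_hat, rng):
    """One reverse step: posterior mean given the predicted noise eps_hat."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                            # add sampling noise except at t = 0
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, np.pi, 50))   # toy "prosody contour" (assumed)
noise = rng.standard_normal(x0.shape)
xT = q_sample(x0, T - 1, noise)          # noised contour at the final step
# Use the true noise as a stand-in for the network's prediction eps_hat:
x_prev = p_step(xT, T - 1, noise, rng)
```

At synthesis time the full reverse chain runs from t = T-1 down to 0, so the predictor can draw different prosody contours for the same text by sampling different noise, which is what yields diverse prosody.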