Millions of posts are shared on social media platforms each day, carrying rich information across multiple modalities: image, text, video, and audio. To understand such content holistically, AI models must learn a unified representation of multimodal data that effectively captures information from all of the modalities present. There are two important aspects of multimodal representation learning: first, designing deep learning architectures that effectively integrate information from each modality; and second, designing training objectives that require a good understanding of all the modalities to solve the task. In this talk, we will discuss several approaches to multimodal representation learning.