Existing studies on user profiling fail to fully exploit multimodal information. This paper presents a cross-modal joint representation learning network and develops a multimodal fusion model. First, a stacking method is adopted to learn a joint representation network that fuses cross-modal information. Then, an attention mechanism is introduced to automatically learn the contribution of each modality to the prediction task. The proposed model has a well-defined loss function and network structure, which makes it possible to combine related features from different models by learning joint representations after feature-level and decision-level fusion. Extensive experiments on real-world data sets show that the proposed model outperforms the baselines.
Key words:
user profiling,
model combination,
stacking,
cross-modal joint representation learning,
multi-layer and multi-level model fusion
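
The abstract does not give implementation details; as a rough illustration of an attention mechanism that learns each modality's contribution to the prediction, the sketch below shows one common way such weighting can be realized. The AttentionFusion module, its layer sizes, and the use of PyTorch are our assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical sketch: attention-weighted fusion of per-modality embeddings.

    Each modality (e.g. text, image, behavior logs) is assumed to be encoded
    elsewhere into a fixed-size vector; a small scoring network assigns each
    modality a weight, and the fused representation is their weighted sum.
    """

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Scoring network producing one scalar attention score per modality embedding.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, modality_embeds: torch.Tensor):
        # modality_embeds: (batch, num_modalities, embed_dim)
        scores = self.score(modality_embeds)             # (batch, M, 1)
        weights = torch.softmax(scores, dim=1)           # contribution of each modality
        fused = (weights * modality_embeds).sum(dim=1)   # (batch, embed_dim)
        return fused, weights.squeeze(-1)


if __name__ == "__main__":
    batch, num_modalities, embed_dim = 4, 3, 128
    embeds = torch.randn(batch, num_modalities, embed_dim)  # stand-in for encoder outputs
    fused, weights = AttentionFusion(embed_dim)(embeds)
    print(fused.shape, weights.shape)  # torch.Size([4, 128]) torch.Size([4, 3])
```

Because the weights come from a softmax over modalities, they sum to one per sample and can be read directly as the learned contribution of each modality, which matches the interpretability goal stated in the abstract.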