Video anomaly detection aims to detect abnormal events. Motivated by the power of transformers recently shown in vision tasks, we propose a novel transformer-based network for video anomaly detection. To capture long-range information in video, we employ a multi-scale transformer as an encoder. A convolutional decoder is utilized to predict the future frame from the extracted multi-scale feature maps. The proposed method is evaluated on three benchmark datasets: USCD Ped2, CUHK Avenue, and ShanghaiTech. The results show that the proposed method achieves better performance compared to recent methods.