Transformer models have been applied across many domains, and spatio-temporal data are often treated as video-like sequences because of the success of generative video prediction. However, this paper argues that transformers are not always optimal for spatio-temporal data with long forecast horizons and strong periodicity. Focusing on metocean forecasting, specifically sea ice, ocean, and atmospheric data, the study evaluates transformer-based models against convolutional neural networks (CNNs). For long-term sea ice forecasting in the Arctic, transformers such as TimeSformer and SwinLSTM failed to capture annual dynamics, including summer melt. In contrast, a lightweight CNN baseline outperformed existing state-of-the-art numerical and data-driven forecasts, improving error metrics by up to 30%. Similarly, in atmospheric bias correction, CNNs proved superior, reducing errors in Global Forecast System fields by 20% relative to transformers. The picture changes for ocean forecasting, where transformer models enhanced by contrastive pre-training were superior across the board: they significantly reduced errors for all ocean variables, including a 40% reduction for mixed layer depth. Together, these three case studies demonstrate that the limitations of transformers are real but conditional rather than absolute, while CNNs remain the appropriate choice when data are limited or fine spatial structure is critical.