Accurate identification of protein-ligand binding sites is an essential step in structure-based drug discovery. Herein, we present SwinSite, a deep learning framework that combines 3D convolutional neural networks with hierarchical vision transformer modules to predict ligand binding sites from the 3D structure of a target protein. SwinSite encodes spatial information by voxelizing the protein structure into 3D grids centered on surface residues, yielding a detailed spatial representation of the protein's surface environment. By combining local feature extraction with hierarchical self-attention via shifted windows, SwinSite captures both fine-grained geometric features and long-range dependencies. Evaluations on multiple benchmark data sets demonstrate that SwinSite consistently outperforms existing CNN- and GNN-based ligand binding site detection methods, highlighting its robustness and generalization ability.
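To make the two ideas named above concrete, the following is a minimal NumPy sketch of (i) voxelizing atoms into a cubic grid centered on a chosen point and (ii) partitioning that grid into optionally shifted 3D windows. It is an illustration under stated assumptions, not the SwinSite implementation: the function names `voxelize` and `shifted_windows`, the per-atom channel assignment, the grid size, the resolution, and the window size are all hypothetical choices, and the attention computation and its masking are omitted.

```python
import numpy as np

def voxelize(coords, channels, center, n_channels, grid_size=16, resolution=1.0):
    """Bin atoms into a cubic occupancy grid centered on `center`.

    coords:   (N, 3) atom positions in angstroms
    channels: (N,) integer feature channel per atom (assumed scheme)
    Returns an array of shape (n_channels, grid_size, grid_size, grid_size).
    """
    coords = np.asarray(coords, dtype=float)
    channels = np.asarray(channels, dtype=int)
    half = grid_size * resolution / 2.0
    # Shift so the cube's lower corner is the origin, then convert to voxel indices.
    idx = np.floor((coords - np.asarray(center, dtype=float) + half) / resolution).astype(int)
    # Discard atoms that fall outside the cube.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    grid = np.zeros((n_channels, grid_size, grid_size, grid_size), dtype=np.float32)
    # np.add.at accumulates correctly when several atoms share a voxel.
    np.add.at(grid, (channels[inside], idx[inside, 0], idx[inside, 1], idx[inside, 2]), 1.0)
    return grid

def shifted_windows(grid, window=4, shift=0):
    """Split a (C, D, H, W) grid into non-overlapping window**3 blocks.

    A non-zero `shift` cyclically rolls the grid first, so that successive
    layers group voxels across the previous layer's window boundaries.
    Returns an array of shape (num_windows, window**3, C).
    """
    c, d, h, w = grid.shape
    if d % window or h % window or w % window:
        raise ValueError("grid dimensions must be divisible by the window size")
    if shift:
        grid = np.roll(grid, shift=(-shift, -shift, -shift), axis=(1, 2, 3))
    x = grid.reshape(c, d // window, window, h // window, window, w // window, window)
    # Reorder to (windows_d, windows_h, windows_w, window, window, window, C).
    x = x.transpose(1, 3, 5, 2, 4, 6, 0)
    return x.reshape(-1, window ** 3, c)

# Toy example: three atoms, two channels, one atom outside the cube.
example_grid = voxelize(
    coords=[[0.0, 0.0, 0.0], [0.2, 0.1, 0.3], [100.0, 0.0, 0.0]],
    channels=[0, 1, 0],
    center=(0.0, 0.0, 0.0),
    n_channels=2,
    grid_size=4,
)
regular = shifted_windows(example_grid, window=2, shift=0)
shifted = shifted_windows(example_grid, window=2, shift=1)
```

In this sketch each window becomes a sequence of `window**3` tokens over which self-attention could be computed; alternating `shift=0` and `shift=window // 2` between layers is the usual shifted-window pattern, which lets information propagate between neighboring windows without attending over the full grid.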