We propose a human action clustering method based on a 3D volumetric representation of the body. Features representing body postures are extracted directly from the 3D data, making the system inherently insensitive to viewpoint changes, motion ambiguities, and self-occlusions. An Invariant Shape Descriptor of the human body is computed so as to capture only posture-dependent characteristics, regardless of differences in translation, orientation, scale, and body size. The frame-by-frame descriptors generated from a gesture sequence are collected into action matrices. These matrices are then clustered, and by using Dynamic Time Warping to compute the distance metric, we gain independence from nonlinear temporal distortions among different instances of the same gesture.
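As a minimal sketch of the distance computation described above, the following Python code compares two gesture sequences (each a matrix of per-frame descriptor vectors) with standard Dynamic Time Warping. The Euclidean per-frame cost and the function name are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def dtw_distance(A, B):
    """DTW distance between two sequences of per-frame descriptors.

    A: (n, d) array, B: (m, d) array. The per-frame cost is assumed
    to be the Euclidean distance between descriptor vectors.
    """
    n, m = len(A), len(B)
    # Accumulated-cost matrix with an extra border row/column.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            # Allowed warping steps: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-stretched copy of a sequence has zero DTW distance to it,
# which is the temporal-distortion invariance exploited for clustering.
seq = np.array([[0.0], [1.0], [2.0]])
stretched = np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])
print(dtw_distance(seq, stretched))
```

In the clustering step, this distance would replace a plain Euclidean metric between action matrices, so that two executions of the same gesture at different speeds are still grouped together.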