From playing basketball to ordering at a food counter, we frequently and effortlessly coordinate our attention with others towards a common focus: we look at the ball, or point at a piece of cake. This non-verbal coordination of attention plays a fundamental role in our social lives: it ensures that we refer to the same object, develop a shared language, understand each other’s mental states, and coordinate our actions. Models of joint attention generally attribute this accomplishment to gaze coordination. But are visual attentional mechanisms sufficient to achieve joint attention in all cases? Beyond cases where visual information is missing, we show that combining vision with other senses can be helpful, and even necessary, for certain uses of joint attention. We explain two ways in which non-visual cues contribute to joint attention: either as enhancers, when they complement gaze and pointing gestures in coordinating joint attention on visible objects, or as modality pointers, when joint attention needs to be shifted away from the whole object to one of its properties, say weight or texture. This multisensory approach to joint attention has important implications for social robotics, clinical diagnostics, pedagogy and theoretical debates on the construction of a shared world.