With the rapid and widespread adoption of voice assistants, continuous efforts have been devoted to language understanding (LU) for conversational queries. In this paper, we present a first study of the LU problem in the spatial domain, which is critical for voice assistants to provide location-based services but has not been investigated in depth in existing studies. Spatial domain queries have several unique properties that make them more challenging for language understanding than common conversational queries, including lexically similar but diverse intents and highly ambiguous words. A specially tailored LU framework for spatial domain queries is therefore necessary. To this end, we extracted and annotated a dataset, SMQ, from real-life queries issued to a voice assistant service. We then proposed a new multi-task framework, MELIP, that jointly learns intent detection and entity linking on SMQ, with a novel hierarchical intent detection method and a triple-scoring mechanism for entity linking. A specially designed spatial GCN is also employed to model spatial context information among entities. Extensive experimental evaluations against state-of-the-art entity linking and intent detection methods demonstrate that MELIP outperforms all baselines by a significant margin.
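
To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of such a multi-task setup: a shared query encoder feeding a two-level (hierarchical) intent classifier, one spatial GCN layer that propagates context among candidate entities, and a bilinear triple-scoring head for entity linking. All module names, dimensions, and loss choices below are illustrative assumptions, not the actual MELIP implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGCNLayer(nn.Module):
    # One graph-convolution step over a normalized spatial adjacency matrix.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, entity_feats, adj):
        # entity_feats: (n_ent, dim); adj: (n_ent, n_ent)
        return F.relu(self.linear(adj @ entity_feats))

class MultiTaskLUModel(nn.Module):
    def __init__(self, vocab_size, dim, n_coarse, n_fine, ent_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        # Hierarchical intent detection: predict a coarse intent first,
        # then condition the fine-grained intent on it.
        self.coarse_head = nn.Linear(dim, n_coarse)
        self.fine_head = nn.Linear(dim + n_coarse, n_fine)
        # Entity linking: score (query, mention, candidate-entity) triples
        # after one round of spatial message passing among candidates.
        self.gcn = SpatialGCNLayer(ent_dim)
        self.triple_scorer = nn.Bilinear(dim, ent_dim, 1)

    def forward(self, token_ids, entity_feats, adj):
        # token_ids: (seq_len,) for a single query.
        emb = self.embed(token_ids).unsqueeze(0)        # (1, seq_len, dim)
        _, h = self.encoder(emb)                        # h: (1, 1, dim)
        q = h[-1, 0]                                    # query vector: (dim,)
        coarse_logits = self.coarse_head(q)
        fine_logits = self.fine_head(torch.cat([q, coarse_logits]))
        ents = self.gcn(entity_feats, adj)              # (n_ent, ent_dim)
        q_tiled = q.unsqueeze(0).expand(ents.size(0), -1).contiguous()
        link_scores = self.triple_scorer(q_tiled, ents).squeeze(-1)
        return coarse_logits, fine_logits, link_scores

# Joint training would sum (possibly weighted) per-task losses, e.g.
# cross-entropy on both intent heads plus a ranking/BCE loss on link_scores.
model = MultiTaskLUModel(vocab_size=1000, dim=64, n_coarse=5, n_fine=20, ent_dim=32)
out = model(torch.randint(0, 1000, (8,)), torch.randn(4, 32), torch.eye(4))

Under this reading, joint optimization lets the intent and entity-linking heads share one query representation, which is the usual motivation for a multi-task LU framework of this kind.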