In this paper, we propose a method by which a robot selects, from among several objects, the one specified by human speech, based on generic object recognition. Although object selection methods based on specific object recognition have been proposed, generic object recognition is more useful for selection in a real environment. In the proposed method, an object is selected by integrating speech recognition results with generic object recognition results. We investigated how the method of narrowing down candidates based on the speech and image recognition results relates to object selection accuracy.
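
As an illustration of what integrating the two recognition results with candidate narrowing could look like, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes that the speech recognizer returns scored word hypotheses and that the generic object recognizer returns scored category labels for each object in view. The function names (`top_k`, `select_object`), the parameters `k_speech` and `k_image`, and the product rule used to combine the scores are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Tuple


def top_k(scores: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k highest-scoring labels (the candidate narrowing step)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def select_object(
    speech_scores: Dict[str, float],
    image_scores: List[Dict[str, float]],
    k_speech: int = 3,
    k_image: int = 3,
) -> Optional[Tuple[int, str, float]]:
    """Return (object index, label, combined score) for the best match.

    speech_scores: label -> score from the speech recognizer (assumed format).
    image_scores:  one label -> score dict per object in view (assumed format).
    Combining by product is an assumption made for this sketch.
    Returns None when no label survives narrowing on both sides.
    """
    speech_candidates = top_k(speech_scores, k_speech)
    best: Optional[Tuple[int, str, float]] = None
    for index, scores in enumerate(image_scores):
        image_candidates = top_k(scores, k_image)
        for label, image_score in image_candidates.items():
            speech_score = speech_candidates.get(label)
            if speech_score is None:
                continue  # label was not among the speech candidates
            combined = speech_score * image_score
            if best is None or combined > best[2]:
                best = (index, label, combined)
    return best


if __name__ == "__main__":
    # Made-up scores for illustration only: the utterance is ambiguous
    # between "cup" and "cap", and three objects are in view.
    speech = {"cup": 0.5, "cap": 0.4, "cat": 0.1}
    objects = [
        {"cap": 0.6, "hat": 0.3, "cup": 0.1},
        {"cup": 0.7, "bowl": 0.2, "cap": 0.1},
        {"bottle": 0.8, "cup": 0.2},
    ]
    print(select_object(speech, objects, k_speech=2, k_image=2))
```

In this sketch, varying `k_speech` and `k_image` changes how aggressively each recognizer's candidates are narrowed before integration, which is the kind of setting whose effect on selection accuracy is examined here.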