Interaction with its environment is a key requirement in the design of a humanoid robot; in particular, the ability to recognize and manipulate unknown objects is crucial for working successfully in natural environments. However, visual object recognition remains a challenging problem. To enable the robot to identify the geometric shapes and colors of objects, this paper proposes a new approach using neuro Zernike moments. Furthermore, the paper proposes a natural language understanding system through which the robot can communicate effectively with humans via a dialogue developed in the Arabic language. The developed dialogue and a dynamic object model are used for learning semantic categories and object descriptions, and for acquiring new words for object learning. A robot is developed that interacts with users by performing specified actions. Moreover, the integration of the proposed vision and natural language understanding systems is presented. Finally, a hardware circuit is designed, and a Q-learning technique is presented that assists the robot in tracking and gripping objects. Extensive indoor experiments have been conducted to assess the validity of the complete system, and a qualitative comparison among different techniques is carried out. The results show that the proposed system outperforms the compared techniques in terms of accuracy and response time.