State-of-the-art methods in image-based robotic grasping use deep convolutional neural networks to determine the robot parameters that maximize the probability of a stable grasp, given an image of an object. Despite the high accuracy of these models, they are not yet applied in industrial order-picking tasks. One reason is that generating training data for these models is expensive. While this could be addressed by generating training data in a physics simulation, a more important reason is that the features underlying a model's predictions are not human-readable. This lack of interpretability is the crucial factor why deep networks are not found in critical industrial applications. In this study we propose reformulating robotic grasping as three tasks that are easy to assess from human experience. For each of the three steps we discuss accuracy and interpretability. We outline how the proposed three-step model can be extended to depth images, and we discuss how interpretable machine learning models can be chosen for the three steps so that they can be applied in a real-world industrial environment.