Background: Freezing of gait (FOG) is an episodic and highly disabling symptom of Parkinson's Disease (PD). Traditionally, FOG assessment relies on time-consuming visual inspection of camera footage. Therefore, previous studies have proposed portable and automated solutions to annotate FOG. However, automated FOG assessment is challenging due to gait variability caused by medication effects and varying FOG-provoking tasks. Moreover, whether automated approaches can differentiate FOG from typical everyday movements, such as volitional stops, remains to be determined. To address these questions, we evaluated an automated FOG assessment model with deep learning (DL) based on inertial measurement units (IMUs). We assessed its performance trained on all standardized FOG-provoking tasks and medication states, as well as on specific tasks and medication states. Furthermore, we examined the effect of adding stopping periods on FOG detection performance. Methods: Twelve PD patients with self-reported FOG (mean age 69.33 +/- 6.28 years) completed a FOG-provoking protocol, including timed-up-and-go and 360-degree turning-in-place tasks in On/Off dopaminergic medication states with/without volitional stopping. IMUs were attached to the pelvis and both sides of the tibia and talus. A multi-stage temporal convolutional network was developed to detect FOG episodes. FOG severity was quantified by the percentage of time frozen (%TF) and the number of freezing episodes (#FOG). The agreement between the model-generated outcomes and the gold standard experts' video annotation was assessed by the intra-class correlation coefficient (ICC). Results: For FOG assessment in trials without stopping, the agreement of our model was strong (ICC(%TF) = 0.92 [0.68, 0.98]; ICC(#FOG) = 0.95 [0.72, 0.99]). Models trained on a specific FOG-provoking task could not generalize to unseen tasks, while models trained on a specific medication state could generalize to unseen states. For assessment in trials with stopping, the model trained on stopping trials made fewer false positives than the model trained without stopping (ICC(%TF) = 0.95 [0.73, 0.99]; ICC(#FOG) = 0.79 [0.46, 0.94]). Conclusion: A DL model trained on IMU signals allows valid FOG assessment in trials with/without stops containing different medication states and FOG-provoking tasks. These results are encouraging and enable future work investigating automated FOG assessment during everyday life.