LU decomposition is an important computational step in many engineering and scientific computing problems. In many critical applications, a large number of small-scale problems must be solved rather than a few large linear systems. However, for small and medium-sized matrices, existing batched LU decomposition algorithms are bottlenecked by global memory access latency and perform poorly. We implement a series of specialized, optimized batched GPU LU decomposition algorithms for this regime and, after systematic testing, select the two best-performing variants. Both achieve a speedup of more than 3x over cuBLAS, and more than 10x in some cases.
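To make the problem setting concrete, the following is a minimal CPU-side NumPy sketch of what "batched LU" means: many independent small factorizations performed in one call, vectorized over the batch dimension. This is an illustration only, not the authors' GPU kernels or the cuBLAS implementation; the function name `batched_lu` and all details are our own.

```python
import numpy as np

def batched_lu(A):
    """LU with partial pivoting applied to every matrix in a batch.

    A: array of shape (batch, n, n).
    Returns stacks (P, L, U) such that P @ A == L @ U for each matrix.
    This is a didactic sketch; a real batched GPU kernel would assign
    matrices (or tiles of them) to thread blocks instead.
    """
    A = np.array(A, dtype=np.float64)
    b, n, _ = A.shape
    piv = np.tile(np.arange(n), (b, 1))
    idx = np.arange(b)
    for k in range(n - 1):
        # Partial pivoting: pick the largest entry in column k, per matrix.
        p = k + np.argmax(np.abs(A[:, k:, k]), axis=1)
        for arr in (A, piv):
            tmp = arr[idx, k].copy()
            arr[idx, k] = arr[idx, p]
            arr[idx, p] = tmp
        # One Gaussian elimination step, vectorized over the whole batch.
        A[:, k+1:, k] /= A[:, k, k][:, None]
        A[:, k+1:, k+1:] -= A[:, k+1:, k][:, :, None] * A[:, k, k+1:][:, None, :]
    L = np.tril(A, -1) + np.eye(n)   # unit lower-triangular factors
    U = np.triu(A)                   # upper-triangular factors
    P = np.eye(n)[piv]               # (batch, n, n) permutation matrices
    return P, L, U
```

For many small matrices (e.g. a batch of 10,000 matrices of size 8x8), the dominant cost on a GPU is memory traffic rather than arithmetic, which is the bottleneck the abstract refers to.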