The implementation of a matrix-free GMRES+LU-SGS scheme for unstructured grids on shared-memory, cache-based parallel machines is described. Parallelization relies on a special grid renumbering technique rather than the traditional method of partitioning the computational domain. The renumbering helps to avoid inter-processor data dependencies, cache misses, and cache-line overwrites while still allowing pipelining. The resulting source code can be used with maximum efficiency and without modification on traditional (scalar) computers, vector supercomputers, and shared-memory parallel systems. Special attention has been paid to developing an optimally parallelized preconditioner for the GMRES scheme.
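
As an illustration only: one common way to remove data dependencies from edge-based loops on unstructured grids is to reorder the edges into groups in which no two edges share a node. The sketch below assumes this kind of greedy grouping; it is not necessarily the renumbering technique used in the scheme described here, and the names `group_edges` and `renumber_edges` are hypothetical.

```python
def group_edges(edges):
    """Greedily assign each edge to the first group in which neither
    of its two nodes is already touched by another edge.

    edges: list of (i, j) node-index pairs.
    Returns a list of groups, each a list of original edge indices.
    """
    groups = []   # groups[g] = edge indices belonging to group g
    touched = []  # touched[g] = set of nodes already used by group g
    for e, (i, j) in enumerate(edges):
        for g, nodes in enumerate(touched):
            if i not in nodes and j not in nodes:
                groups[g].append(e)
                nodes.update((i, j))
                break
        else:
            # No existing group can take this edge: open a new one.
            groups.append([e])
            touched.append({i, j})
    return groups


def renumber_edges(edges, groups):
    """Store the edges group by group and return the reordered edge
    list together with the start offset of each group."""
    new_edges = [edges[e] for group in groups for e in group]
    offsets = [0]
    for group in groups:
        offsets.append(offsets[-1] + len(group))
    return new_edges, offsets


# Small example: a quadrilateral with one diagonal.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
groups = group_edges(edges)
new_edges, offsets = renumber_edges(edges, groups)
```

Under these assumptions, the edges in the range `offsets[k]` to `offsets[k + 1]` touch disjoint nodes, so a scatter-add from edges to nodes over that range could be split across processors or vectorized without two iterations writing to the same node.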