<div>
<div>
<div>
<p> In the modern world, it is hard to imagine a day without some
form of interaction with digital data. Real-world data originating from
signal-generating transducers or communication channels are often
recorded as streams of data samples separated by time stamps, sample
counters, or simple record delimiters such as a newline (\n) or a comma (,).
Sampling is the basis of statistical estimation from any data source
containing signal records. The process of random sampling has been
in practice since time immemorial. However, with data generation processes
scaling rapidly in tandem with computing infrastructures, the volume of
data is becoming unmanageably large in nearly every discipline of science.
On the other hand, the mere volume of data
is of no consequence if we cannot extract effective intelligence out of
it on an “on demand” basis. Of particular interest is the case where
data are stored in a file as records separated by a newline (or any other
delimiter) character. When the number of records in the file exceeds
a certain threshold, random sampling becomes a formidable task. Loading the
entire file into computer memory is often impractical, and even when it is
theoretically possible, the time it takes to load the data in
its entirety from natural data sources can be overwhelmingly long, and
loading all of it is often unnecessary. We can strategically bypass these problems by
carefully designing a data interface tool such that any part of a given
file can be instantly accessed for random sampling or other kinds of
processing tasks by loading only the necessary parts of the data. With
this goal, we created a novel, portable and highly efficient rapid data
access tool named GSFRS (Giant Signal File Random Sampler), written
in modern C++, to enable near real-time access to any part of
an arbitrarily large data file, with access times that are almost independent
of the file size in all practical scenarios. Moreover, big-data processing would become
relatively commonplace and cost-effective even on commodity hardware
once the indices are made available through its indexing protocol. This
capability would potentially revolutionize the way we gather intelligence
from files containing large samples. With the adoption of GSFRS at the source
level of various data generators, the processing times and energy footprints
of various computations could be dramatically reduced.
</p>
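<p> The general idea of index-based record access described above can be
illustrated with a minimal C++ sketch. It is not the GSFRS implementation
or its indexing protocol: the function names build_offset_index,
read_record and sample_records are hypothetical, the index is kept in
memory rather than stored on disk, and records are assumed to be
newline-delimited. One sequential pass records the byte offset of every
record; afterwards any record can be fetched with a single seek, so only
the sampled records are ever loaded.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not the GSFRS API): scan the file once and store the
// byte offset at which each newline-delimited record starts.
std::vector<std::uint64_t> build_offset_index(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::vector<std::uint64_t> offsets;
    std::uint64_t pos = 0;
    bool at_record_start = true;
    char c;
    while (in.get(c)) {
        if (at_record_start) {
            offsets.push_back(pos);
            at_record_start = false;
        }
        if (c == '\n') at_record_start = true;  // next byte begins a new record
        ++pos;
    }
    return offsets;
}

// Fetch record k by seeking straight to its offset; nothing else is read.
// The file is opened in binary mode, so a "\r\n" file keeps its '\r'.
std::string read_record(std::ifstream& in,
                        const std::vector<std::uint64_t>& offsets,
                        std::size_t k) {
    in.clear();  // reset EOF state left over from an earlier read
    in.seekg(static_cast<std::streamoff>(offsets.at(k)));
    std::string record;
    std::getline(in, record);
    return record;
}

// Draw n records uniformly at random, with replacement.
std::vector<std::string> sample_records(const std::string& path,
                                        const std::vector<std::uint64_t>& offsets,
                                        std::size_t n,
                                        std::mt19937_64& rng) {
    assert(!offsets.empty() && "index must contain at least one record");
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::uniform_int_distribution<std::size_t> pick(0, offsets.size() - 1);
    std::vector<std::string> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(read_record(in, offsets, pick(rng)));
    }
    return out;
}
```

<p> In this sketch the cost of fetching a record depends on the seek and on
the record length, not on the total file size; the one-time indexing pass is
the only step that reads the whole file. A practical tool would presumably
persist the index so that this pass is not repeated.
</p>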
</div>
</div>
</div>