Near-duplicate data not only increase the cost of processing information in big data systems but also lengthen decision time. Detecting and eliminating nearly identical information is therefore vital to improving business decisions. The shingling algorithm is widely used to identify near-duplicates in large-scale text data. It is based on the occurrence of contiguous subsequences of tokens (shingles) in two or more sets of information, such as documents. Because shingles must match exactly, even a slight variation among documents lowers the algorithm's overall performance. To improve the efficiency and accuracy of the shingling algorithm, we propose a hybrid approach that combines the Jaro distance with word-usage frequency statistics to repair ill-formed data. On a real text dataset, the proposed hybrid approach improved the shingling algorithm's accuracy by 27% on average and achieved more than 90% common shingles.
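To make the mechanism concrete, the following is a minimal Python sketch of k-shingling with Jaccard similarity, together with a Jaro similarity helper of the kind the hybrid approach could use to reconcile near-matching tokens before shingles are formed. The function names and the pre-matching step are illustrative assumptions, not the authors' implementation.

```python
def shingles(tokens, k=3):
    """Set of contiguous k-token subsequences of a tokenized document."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Fraction of shingles two documents share (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def jaro(s1, s2):
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):                  # find matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, j = 0, 0                     # count transposed matches
    for i in range(len1):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

# One misspelled token destroys several shingles, yet Jaro still flags
# the token pair as a likely match that could be repaired before shingling.
doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brwon fox jumps over the lazy dog".split()
print(jaccard(shingles(doc1), shingles(doc2)))   # 0.4 despite near-identity
print(jaro("brown", "brwon"))                    # ~0.93: candidate for repair
```

The example shows why exact shingle matching is brittle: a single transposed character in one token lowers the Jaccard score sharply, while a character-level measure such as Jaro can still recognize the tokens as the same word.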
Existing comparison studies of random access memory (RAM) acquisition tools are either limited in their metrics or cover tools designed for older operating systems. This study therefore evaluates seven widely used shareware or freeware/open-source RAM acquisition forensic tools that are compatible with the latest 64-bit Windows operating systems. The tools were compared on user interface capabilities, platform limitations, reporting capabilities, total execution time, shared and proprietary DLLs, modified registry keys, and files invoked during processing. We observed that Windows Memory Reader and Belkasoft's Live RAM Capturer leave the fewest fingerprints in memory when loaded. In contrast, ProDiscover and FTK Imager perform poorly in memory usage, processing time, DLL usage, and unwanted artifacts introduced into the system. While Belkasoft's Live RAM Capturer is the fastest at imaging memory, ProDiscover takes the longest to do the same job.
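As an illustration of how two of these metrics might be collected, the sketch below times a tool run and samples the DLLs mapped into its process using the third-party psutil package. The tool path and arguments are placeholders, and the paper does not state that its measurements were taken this way.

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Placeholder command line for a hypothetical acquisition tool.
TOOL_CMD = [r"C:\tools\ram_imager.exe", r"C:\evidence\ram.raw"]

start = time.perf_counter()
proc = subprocess.Popen(TOOL_CMD)
ps = psutil.Process(proc.pid)

dlls = set()
while proc.poll() is None:                       # sample while the tool runs
    try:
        dlls.update(m.path for m in ps.memory_maps()
                    if m.path.lower().endswith(".dll"))
    except psutil.NoSuchProcess:                 # tool exited between polls
        break
    time.sleep(0.5)

print(f"total execution time: {time.perf_counter() - start:.1f} s")
print(f"distinct DLLs observed: {len(dlls)}")
for path in sorted(dlls):
    print(" ", path)
```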
Researchers confront major problems when searching for data in large, imprecise databases: entries are often not spelled correctly or in the way they were expected to be spelled, so the words being sought cannot be found. Relying on the pronunciation of words has long been considered an effective way to address this problem. The technique of retrieving words based on their sound is known as "phonetic matching". Soundex was the first algorithm proposed, and others such as Metaphone, Caverphone, DMetaphone, and Phonex have also been used for information retrieval in different environments. This paper analyzes and evaluates different phonetic matching algorithms on several datasets comprising street names of North Carolina and English dictionary words. The analysis shows that no single technique is best in general: Metaphone performs best on English dictionary words, while NYSIIS performs better on the street-name datasets. Although Soundex corrects misspelled words with higher accuracy than the other algorithms, it has lower precision because it admits more noise in the considered arena. The experimental results paved the way for suggestions that would help make databases more robust and achieve higher data quality.
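Since Soundex anchors this comparison, a brief sketch of the classic algorithm may help. This is the textbook four-character encoding, not necessarily the exact variant used in the paper's experiments.

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result = word[0]                   # keep the first letter verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":                 # H and W do not separate equal codes
            continue
        code = codes.get(ch, "")       # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]        # pad or truncate to four characters

# Phonetically equal spellings collide on the same key:
print(soundex("Smith"), soundex("Smyth"))   # S530 S530
```

Retrieval then reduces to comparing keys rather than raw spellings, which is also why the coarse buckets can cost Soundex precision: many distinct words share one key.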
Translation has become very important in our generation, as people with completely different cultures and languages are networked together through the Internet. Nowadays anyone can easily communicate with anyone else in the world using Google Translate or other translation applications. Humans can already recognize languages they have previously been exposed to: even if they cannot translate, they can form a good idea of what the spoken language is. This paper demonstrates how different neural network models can be trained to recognize languages such as French, English, Spanish, and German. Voice samples for the training dataset were chosen from Shtooka, VoxForge, and YouTube. For testing, data from these websites as well as personally recorded voices were used. Finally, this research reports the accuracy and confidence level of multiple neural network architectures, a Support Vector Machine, and a Hidden Markov Model, with the Hidden Markov Model yielding the best results, reaching almost 70 percent accuracy across all languages.
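As an illustration of one of the compared classifiers, the sketch below trains a Support Vector Machine on averaged MFCC features using librosa and scikit-learn. The feature choice, file names, and parameters are assumptions made for illustration; the paper does not specify its exact pipeline.

```python
import numpy as np
import librosa                          # third-party audio analysis
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_vector(path, n_mfcc=13):
    """Reduce a voice clip to a fixed-length mean-MFCC feature vector."""
    audio, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Placeholder (clip, label) pairs standing in for the Shtooka/VoxForge/
# YouTube samples described above.
samples = [("fr_0001.wav", "French"), ("en_0001.wav", "English"),
           ("es_0001.wav", "Spanish"), ("de_0001.wav", "German")]  # ...

X = np.array([mfcc_vector(path) for path, _ in samples])
y = [label for _, label in samples]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", probability=True)  # probability=True -> confidences
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("confidence:", clf.predict_proba(X_test[:1]))
```

Enabling `probability=True` is one way to obtain the per-language confidence scores the study reports alongside raw accuracy.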
As the Internet and the World Wide Web have rapidly evolved and revolutionized everyday applications, keeping up with emerging technologies for forensic analysis is a demanding challenge for investigators. Investigating web browser usage for criminal activity, known as web browser forensics, is a significant part of digital forensics, since crucial browsing information about a suspect can be discovered. In particular, this study examines an emerging web storage technology called IndexedDB. The characteristics of IndexedDB in five major web browsers under three major operating systems are scrutinized, and the top 15 US websites ranked by Alexa are investigated for the data they store in IndexedDB. User screen names, IDs, records of conversations, permissions, and image locations are among the data found in IndexedDB. Furthermore, BrowStEx, a proof-of-concept tool developed previously, is extended and cultivated into BrowStExPlus, with which the aggregation of IndexedDB artifacts is demonstrated.
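As a hint of what such an examination involves, the sketch below locates Chrome's per-origin IndexedDB directories on Windows, following Chrome's well-documented on-disk layout. Other browsers and operating systems use different locations, and parsing the underlying LevelDB stores, as BrowStExPlus does, is beyond this illustration.

```python
import os
from pathlib import Path

# Chrome on Windows keeps one LevelDB directory per origin under the
# default profile, named like "https_www.example.com_0.indexeddb.leveldb".
chrome_idb = (Path(os.environ["LOCALAPPDATA"]) / "Google" / "Chrome"
              / "User Data" / "Default" / "IndexedDB")

for origin_dir in sorted(chrome_idb.glob("*.indexeddb.leveldb")):
    size = sum(f.stat().st_size for f in origin_dir.rglob("*") if f.is_file())
    print(f"{origin_dir.name}: {size / 1024:.1f} KiB")
```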