Near-duplicate data not only increase the cost of processing information in big data systems but also lengthen decision time. Detecting and eliminating nearly identical information is therefore vital to improving business decisions. The shingling algorithm is widely used to identify near-duplicates in large-scale text data. It is based on the occurrence of contiguous subsequences of tokens (shingles) in two or more sets of information, such as documents. Because shingles must match exactly, even a slight variation among documents lowers the algorithm's overall performance. To improve the efficiency and accuracy of the shingling algorithm, we propose a hybrid approach that combines the Jaro distance with word-usage frequency statistics to repair ill-formed data. On a real text dataset, the proposed hybrid approach improved the shingling algorithm's accuracy by 27% on average and achieved more than 90% common shingles.
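To make the mechanism concrete, the following is a minimal Python sketch of k-shingling with Jaccard similarity, together with a Jaro similarity helper of the kind the hybrid approach could use to reconcile near-matching tokens before shingles are formed. The function names and the pre-matching step are illustrative assumptions, not the authors' implementation.

```python
def shingles(tokens, k=3):
    """Set of contiguous k-token subsequences of a tokenized document."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Fraction of shingles two documents share (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def jaro(s1, s2):
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):                  # find matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, j = 0, 0                     # count transposed matches
    for i in range(len1):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

# One misspelled token destroys several shingles, yet Jaro still flags
# the token pair as a likely match that could be repaired before shingling.
doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brwon fox jumps over the lazy dog".split()
print(jaccard(shingles(doc1), shingles(doc2)))   # 0.4 despite near-identity
print(jaro("brown", "brwon"))                    # ~0.93: candidate for repair
```

The example shows why exact shingle matching is brittle: a single transposed character in one token lowers the Jaccard score sharply, while a character-level measure such as Jaro can still recognize the tokens as the same word.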
Existing comparison studies of random access memory (RAM) acquisition tools are either limited in their metrics or cover tools designed for older operating systems. This study therefore evaluates seven widely used shareware or freeware/open-source RAM acquisition forensic tools that are compatible with the latest 64-bit Windows operating systems. The tools were compared on user interface capabilities, platform limitations, reporting capabilities, total execution time, shared and proprietary DLLs, modified registry keys, and files invoked during processing. We observed that Windows Memory Reader and Belkasoft's Live RAM Capturer leave the fewest fingerprints in memory when loaded. In contrast, ProDiscover and FTK Imager perform poorly in memory usage, processing time, DLL usage, and unwanted artifacts introduced into the system. While Belkasoft's Live RAM Capturer is the fastest at imaging memory, ProDiscover takes the longest to do the same job.
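As an illustration of how two of these metrics might be collected, the sketch below times a tool run and samples the DLLs mapped into its process using the third-party psutil package. The tool path and arguments are placeholders, and the paper does not state that its measurements were taken this way.

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Placeholder command line for a hypothetical acquisition tool.
TOOL_CMD = [r"C:\tools\ram_imager.exe", r"C:\evidence\ram.raw"]

start = time.perf_counter()
proc = subprocess.Popen(TOOL_CMD)
ps = psutil.Process(proc.pid)

dlls = set()
while proc.poll() is None:                       # sample while the tool runs
    try:
        dlls.update(m.path for m in ps.memory_maps()
                    if m.path.lower().endswith(".dll"))
    except psutil.NoSuchProcess:                 # tool exited between polls
        break
    time.sleep(0.5)

print(f"total execution time: {time.perf_counter() - start:.1f} s")
print(f"distinct DLLs observed: {len(dlls)}")
for path in sorted(dlls):
    print(" ", path)
```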
Researchers confront major problems when searching for data in large, imprecise databases: entries are often not spelled correctly or in the way they were expected to be spelled, so the words being sought cannot be found. Relying on the pronunciation of words has long been considered an effective way to address this problem. The technique of retrieving words based on their sound is known as "phonetic matching". Soundex was the first algorithm proposed, and others such as Metaphone, Caverphone, DMetaphone, and Phonex have also been used for information retrieval in different environments. This paper analyzes and evaluates different phonetic matching algorithms on several datasets comprising street names of North Carolina and English dictionary words. The analysis shows that no single technique is best in general: Metaphone performs best on English dictionary words, while NYSIIS performs better on the street-name datasets. Although Soundex corrects misspelled words with higher accuracy than the other algorithms, it has lower precision because it admits more noise in the considered arena. The experimental results paved the way for suggestions that would help make databases more robust and achieve higher data quality.
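Since Soundex anchors this comparison, a brief sketch of the classic algorithm may help. This is the textbook four-character encoding, not necessarily the exact variant used in the paper's experiments.

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result = word[0]                   # keep the first letter verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":                 # H and W do not separate equal codes
            continue
        code = codes.get(ch, "")       # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]        # pad or truncate to four characters

# Phonetically equal spellings collide on the same key:
print(soundex("Smith"), soundex("Smyth"))   # S530 S530
```

Retrieval then reduces to comparing keys rather than raw spellings, which is also why the coarse buckets can cost Soundex precision: many distinct words share one key.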
Translation has become very important in our generation, as people with completely different cultures and languages are networked together through the Internet. Nowadays anyone can easily communicate with anyone else in the world using Google Translate or other translation applications. Humans can already recognize languages they have previously been exposed to: even if they cannot translate, they can form a good idea of what the spoken language is. This paper demonstrates how different neural network models can be trained to recognize languages such as French, English, Spanish, and German. Voice samples for the training dataset were chosen from Shtooka, VoxForge, and YouTube. For testing, data from these websites as well as personally recorded voices were used. Finally, this research reports the accuracy and confidence level of multiple neural network architectures, a Support Vector Machine, and a Hidden Markov Model, with the Hidden Markov Model yielding the best results, reaching almost 70 percent accuracy across all languages.
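As an illustration of one of the compared classifiers, the sketch below trains a Support Vector Machine on averaged MFCC features using librosa and scikit-learn. The feature choice, file names, and parameters are assumptions made for illustration; the paper does not specify its exact pipeline.

```python
import numpy as np
import librosa                          # third-party audio analysis
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_vector(path, n_mfcc=13):
    """Reduce a voice clip to a fixed-length mean-MFCC feature vector."""
    audio, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Placeholder (clip, label) pairs standing in for the Shtooka/VoxForge/
# YouTube samples described above.
samples = [("fr_0001.wav", "French"), ("en_0001.wav", "English"),
           ("es_0001.wav", "Spanish"), ("de_0001.wav", "German")]  # ...

X = np.array([mfcc_vector(path) for path, _ in samples])
y = [label for _, label in samples]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", probability=True)  # probability=True -> confidences
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("confidence:", clf.predict_proba(X_test[:1]))
```

Enabling `probability=True` is one way to obtain the per-language confidence scores the study reports alongside raw accuracy.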
As the Internet and the World Wide Web have rapidly evolved and revolutionized everyday applications, keeping up with emerging technologies for forensic analysis is a demanding challenge for investigators. Investigating web browser usage for criminal activity, known as web browser forensics, is a significant part of digital forensics, since crucial browsing information about a suspect can be discovered. In particular, this study examines an emerging web storage technology called IndexedDB. The characteristics of IndexedDB in five major web browsers under three major operating systems are scrutinized, and the top 15 US websites ranked by Alexa are investigated for the data they store in IndexedDB. User screen names, IDs, records of conversations, permissions, and image locations are among the data found in IndexedDB. Furthermore, BrowStEx, a proof-of-concept tool developed previously, is extended and cultivated into BrowStExPlus, with which the aggregation of IndexedDB artifacts is demonstrated.
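As a hint of what such an examination involves, the sketch below locates Chrome's per-origin IndexedDB directories on Windows, following Chrome's well-documented on-disk layout. Other browsers and operating systems use different locations, and parsing the underlying LevelDB stores, as BrowStExPlus does, is beyond this illustration.

```python
import os
from pathlib import Path

# Chrome on Windows keeps one LevelDB directory per origin under the
# default profile, named like "https_www.example.com_0.indexeddb.leveldb".
chrome_idb = (Path(os.environ["LOCALAPPDATA"]) / "Google" / "Chrome"
              / "User Data" / "Default" / "IndexedDB")

for origin_dir in sorted(chrome_idb.glob("*.indexeddb.leveldb")):
    size = sum(f.stat().st_size for f in origin_dir.rglob("*") if f.is_file())
    print(f"{origin_dir.name}: {size / 1024:.1f} KiB")
```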