Data Wrangling in Database Systems: Purging of Dirty Data

Azeroual, Otmane

doi:10.3390/data5020050

Cited by 22 publications

(10 citation statements)

References 12 publications

(13 reference statements)


“…This results in a "tall" format with potentially many rows for each item. Dealing with dirty or ill-defined data introduces additional challenges of cleaning (making data types consistent, ensuring appropriate types), validation (checking for bad data) and removing or replacing anomalous values [5,48]. This may require decisions about densification or imputation [46,52] or about what to ignore [48].…”

Section: Table Techniques In Visualization Research (mentioning)

confidence: 99%

Untidy Data: The Unreasonable Effectiveness of Tables

Bartram

Correll

Tory

2022

IEEE Trans. Visual. Comput. Graphics


Fig. 1: An example spreadsheet (shared with permission of Cornerstone Architects) showing various "rich" table features that our participants employed, including (1) A Master Table of base data that is often left untouched, with manipulations happening in a copy or other area separate from the base data; (2) Marginalia such as comments or derived rows or columns in the periphery of the base table, often taking the form of freeform natural language comments; (3) Annotations such as highlighting or characters with specific meaning (e.g., a dash denotes missing values) to flag particular cells as anomalous or requiring action; and (4) Multi-cell features such as labels or even data that span multiple rows or columns of the sheet.


“…In Computer Science, many papers have discussed methods for the data quality management in information systems from different domains, such as the first-time-right principle, the closed-loop principle, data catalogue, data profiling (Azeroual et al., 2018b), data cleansing (Azeroual et al., 2018a), data wrangling (Azeroual, 2020), data monitoring, data lakes (Mathis, 2017), data text mining (Azeroual, 2019), and machine learning (Duka and Hribar, 2010; Maali et al., 2010); these papers have also shown how the methods can be used in practice to ensure data quality. The methods of data cleaning and monitoring range from fully automated to mostly manual operations, which is closely related to the amount of knowledge required for each operation.…”

Section: How To Improve Data Quality (mentioning)

confidence: 99%

Trustworthy or not? Research data on COVID-19 in data repositories

Azeroual

Schöpfel

2021

Libraries, Digital Information, and COVID

Self Cite


“…A study has been done to gain insight into the Exploratory Data analysis techniques regarding cyber events. Data Analysis techniques were used in similar tasks such as for Unified Host and Network data set (Beazley et al., 2019), A Cyber Threat intelligence Perspective (Al-Mohannadi et al., 2020), Analysis of Cyber Defence Exercise using exploratory sequential analysis (Andersson et al., 2011), Intrusion Detection Technique based on proposed Statistical Flow Features for Protecting Network Traffic of Internet of Things (Moustafa et al., 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Exploratory data analysis for cybersecurity

Miranda-Calle

Reddy

Dhawan

et al.

2021

WJE


Purpose: The impact of cyberattacks all over the world has been increasing at a constant rate every year. Performing exploratory analysis helps organizations to identify, manage and safeguard the information that could be vulnerable to cyberattacks. It encourages the creation of a plan for security controls that can help to protect data, keep constant tabs on threats and monitor the organization's networks for any breaches.

Design/methodology/approach: The purpose of this experimental study is to demonstrate the use of data science in analyzing data and to provide a more detailed view of the most common cybersecurity attacks, the most accessed logical ports, visible patterns, and the trends and occurrence of attacks. The data to be processed has been obtained by aggregating data provided by a company's technology department, which includes network flow data produced by nine different types of attacks within everyday user activities. This could help many companies measure the damage caused by these breaches; it also gives a foundation for future comparisons and serves as a basis for proactive measures within industry and organizations.

Findings: The most common cybersecurity attacks, the most accessed logical ports and their visible patterns were found in the acquired data set. The strategies that attackers have used with respect to time, type of attack, specific ports, IP addresses and their relationships have been determined. A statistical hypothesis test was also performed to check whether attackers carried out random attacks or targeted specific machines with some pattern.

Originality/value: Policies can be suggested so that an attack on a specific machine can be prevented by identifying the machine, ports and duration of the attacks that the attacker is targeting, and so that the organization can formulate policies to follow in tackling such targeted attacks in the future.
