The Curious Case of Curation

Context & Problem. Although Data Preparation and Curation typically takes up more than 70% of the Data Science project life cycle, it is somewhat of a mystery why it is so often not explicitly planned for, and it ends up seriously denting the success of Data Science projects. Sometimes referred to as Data Wrangling, this process collects the appropriate data, cleans it, removes redundancies and inconsistencies, resolves gaps that prevent processing, and categorizes or tags the data. You soon realize that the raw data you have painstakingly obtained is not ready for processing: you cannot use it in your analysis or for your models without some manual or semi-automated curation.

Python notebooks, along with libraries such as Pandas, can be used to wrangle, prepare, or curate your data.
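As a minimal sketch of what that first pass in a notebook might look like (the file name and columns are hypothetical):

```python
import pandas as pd

# Load raw data from a hypothetical source file into a DataFrame.
df = pd.read_csv("raw_survey_data.csv")

# Inspect structure, data types, and missing values before any cleaning.
print(df.info())
print(df.describe(include="all"))

# Remove exact duplicate rows, a common first redundancy check.
df = df.drop_duplicates()
```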

Visualizing data, or building automations such as ETL (Extract-Transform-Load) pipelines, categorization, clustering, or regression, is not feasible when the data is not amenable to processing.

Create consistency. Data will not be uniform in format, codes, or data types, so filtering the data for consistency and homogeneity is an important first step (see the sketch below). Also try to note the lineage of the data: where it came from, who manipulated it before you, how reliable it is, whether it is current, and whether it is acceptable to the stakeholders who will see the outputs and visualizations built on it.
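One way to enforce that consistency with Pandas is sketched below; the column names ("country", "signup_date", "age") and the lineage tag are illustrative assumptions, not part of any particular dataset:

```python
import pandas as pd

df = pd.read_csv("raw_survey_data.csv")

# Normalize inconsistent text codes (e.g., "US", "us ", "U.S.") to one form.
df["country"] = (
    df["country"].str.strip().str.upper().str.replace(".", "", regex=False)
)

# Parse mixed-format date strings into a single datetime dtype;
# unparseable values become NaT rather than silently remaining text.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Enforce a numeric dtype; non-numeric entries become NaN for later handling.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Record simple lineage metadata alongside the data.
df["source"] = "vendor_export_2023"
```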

One of the first consistency problems is missing data, or null values. Sometimes you also need to map data onto a consistent range, e.g., 1-10 instead of, say, 1-100.
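A short sketch of both steps, assuming hypothetical "customer_id", "age", and "score" columns where "score" originally sits on a 1-100 scale:

```python
import pandas as pd

df = pd.read_csv("raw_survey_data.csv")

# Handle missing values: drop rows missing the key identifier,
# and fill gaps in a numeric column with its median.
df = df.dropna(subset=["customer_id"])
df["age"] = df["age"].fillna(df["age"].median())

# Rescale a 1-100 score onto a 1-10 range so it lines up with other features.
df["score_1_to_10"] = 1 + (df["score"] - 1) * (9 / 99)
```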

Clustering and grouping are another important element of Data Curation and Wrangling.
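For illustration, grouping can be as simple as a Pandas groupby summary, and clustering can be a quick scikit-learn pass; the file, columns, and choice of k=3 below are assumptions for the sketch, not a prescribed recipe:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("curated_data.csv")

# Grouping: summarize records per category to spot gaps and outliers.
summary = df.groupby("country")["score_1_to_10"].agg(["count", "mean"])
print(summary)

# Clustering: assign each complete record to one of three groups
# based on two numeric features (k=3 is an arbitrary illustration).
features = df[["age", "score_1_to_10"]].dropna()
df.loc[features.index, "cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(features)
```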

Formatting and externalizing the data for consumption by external parties is the final step.
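A minimal export sketch, assuming the curated DataFrame from earlier and hypothetical output file names:

```python
import pandas as pd

df = pd.read_csv("curated_data.csv")

# Export the curated dataset in formats external consumers commonly expect.
df.to_csv("curated_export.csv", index=False)
df.to_json("curated_export.json", orient="records", indent=2)
df.to_parquet("curated_export.parquet")  # requires pyarrow or fastparquet
```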
