Data Management for Data Science
Garbage in, garbage out
Data preparation applies to every analytics measure, data science prediction, statistical forecast or machine learning model. Using poorly formed or inaccurate data reduces value by misinforming decisions, and it can damage your company's reputation or even cause a regulatory breach. Often these are not one-off cases; they repeat at every batch run, marketing campaign and so on. Good data management is a process that encourages sharing, collaboration and reuse. Most of all, it builds a solid foundation for data-driven businesses to trust analytics and data science.
Data Management tasks we all do
We access databases and data lakes, perhaps stream data from source, and almost all of us seem to have data in Excel spreadsheets somewhere. For data science we usually need data from several sources. We join them, ensuring consistent granularity, into a single (denormalised) table with all the information needed for analysis or modelling.
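As a rough illustration of what "consistent granularity" means in practice, the sketch below brings an order-level source up to customer level before joining it to a customer-level source. The table contents, the column names and the `build_analysis_table` function are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical source 1: one row per customer.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
    {"customer_id": 3, "region": "North"},
]

# Hypothetical source 2: many rows per customer (one row per order).
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]


def build_analysis_table(customers, orders):
    """Return one denormalised row per customer."""
    # Step 1: aggregate orders to customer granularity before joining,
    # so the join cannot duplicate customer rows.
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in orders:
        summary = totals[order["customer_id"]]
        summary["order_count"] += 1
        summary["total_amount"] += order["amount"]

    # Step 2: left join, so customers without any orders are kept.
    empty = {"order_count": 0, "total_amount": 0.0}
    table = []
    for customer in customers:
        summary = totals.get(customer["customer_id"], empty)
        table.append({**customer, **summary})
    return table


analysis_table = build_analysis_table(customers, orders)
```

Aggregating before the join is one design choice among several. Joining first and aggregating afterwards can give the same result, but it makes accidental row duplication harder to spot.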
There are simple tasks such as selecting columns, filtering rows, standardising names, calculating new variables and joining well-defined, structured tables of data. Such tasks lend themselves to visual data preparation software, which makes data management accessible to everyone, with or without programming skills. That supports the curiosity data scientists need when exploring data for relationships and patterns.
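For readers who prefer code to a visual tool, the simple tasks listed above might look something like the following minimal sketch. The sample rows, the column names, the "complete" status rule and the `prepare` function are all assumptions made for illustration.

```python
# Hypothetical raw extract with inconsistent column names.
raw_rows = [
    {"Customer ID": 1, "Unit Price": 2.5, "Quantity": 4, "Status": "complete", "Notes": "n/a"},
    {"Customer ID": 2, "Unit Price": 10.0, "Quantity": 1, "Status": "cancelled", "Notes": ""},
    {"Customer ID": 3, "Unit Price": 4.0, "Quantity": 3, "Status": "complete", "Notes": "gift"},
]


def standardise_name(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def prepare(rows, keep=("customer_id", "unit_price", "quantity")):
    """Standardise names, filter rows, select columns and derive a variable."""
    prepared = []
    for row in rows:
        # Standardise names: "Unit Price" becomes "unit_price".
        clean = {standardise_name(key): value for key, value in row.items()}
        # Filter rows: keep completed orders only (an assumed rule).
        if clean["status"] != "complete":
            continue
        # Select columns: drop everything not needed for analysis.
        selected = {name: clean[name] for name in keep}
        # Calculate a new variable from existing ones.
        selected["line_total"] = selected["unit_price"] * selected["quantity"]
        prepared.append(selected)
    return prepared


prepared_rows = prepare(raw_rows)
```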
Disposable or reusable analytics?
Data preparation tools are increasingly using artificial intelligence to suggest derivations, standardisations and so on. AI makes each one of us more productive, but there are tasks which require detailed knowledge of a business process. For example, calculating a new measure from a table in which many rows represent a single event and must be combined in a specific way. We write ETL logic in SQL, SAS or whatever open source language is flavour of the month. We collapse the appropriate information into a single meaningful row for analysis. Such work tends to be reusable; that is, it can be rerun on new data to update the output. It is therefore shared as a task or job and automated, meaning it is scheduled to run on a given event, on a time trigger or even in the data stream.
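To make the "many rows, one event" case concrete, here is a small sketch that uses SQL through Python's built-in `sqlite3` module. The `payment_lines` table, its columns and the rule that refunds are subtracted from charges are hypothetical stand-ins for real business logic.

```python
import sqlite3

# In-memory database holding a hypothetical table in which several
# rows describe a single payment event.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_lines (event_id INTEGER, line_type TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payment_lines VALUES (?, ?, ?)",
    [
        (101, "charge", 50.0),
        (101, "refund", 20.0),
        (101, "charge", 5.0),
        (102, "charge", 30.0),
    ],
)

# Assumed business rule: charges add to the net amount, refunds subtract.
COLLAPSE_SQL = """
SELECT event_id,
       SUM(CASE line_type WHEN 'charge' THEN amount
                          WHEN 'refund' THEN -amount
                          ELSE 0 END) AS net_amount,
       COUNT(*) AS line_count
FROM payment_lines
GROUP BY event_id
ORDER BY event_id
"""


def collapse_events(connection):
    """Collapse the line-level rows into one row per event."""
    return connection.execute(COLLAPSE_SQL).fetchall()


event_rows = collapse_events(conn)
```

Because the logic lives in a function with no hard-coded data, the same job can be rerun whenever new rows arrive, which is what makes it a candidate for sharing and scheduling.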
In practice many so-called disposable data jobs come to be depended on. We build intricate flows of jobs that move data from source to neatly organised data marts for analysis. Your business should encourage a process of sharing, collaborating and automating data management work with the aim of minimising repetition and improving data quality.
A business's data management processes should extend to data science. Adhering to the security and terminology defined in data dictionaries or master data management repositories encourages integration of analytics and data science outcomes with processes and systems. It enables us to easily demonstrate compliance with regulations such as GDPR from source to output. Collectively, this supports trust and longevity in our work.