Exhaust Data vs. Captured Data

1/10/2020

As a recent graduate of the Flatiron School's Data Science Program seeking my first real job as a data scientist, I'm full of questions about what day 1 on the job will look like. I've been a teacher for the past 8 years and I know what day 1 of school always looks like. But as a career changer actively applying to jobs, I'm entering new territory. Yes, I've worked on a lot of great projects, yes I'm confident I can make models, clean data, create engaging visualizations and present my findings to stakeholders. But, like, how do I actually get the data I need to work with on day 1? You may be just like me and wondering the same thing. Chances are very high your boss or manager at your new data science job won't send you a Kaggle link and tell you to download a nice, clean and structured .csv file. But, I have a feeling you already knew that. Here, I'll break down all the ways in which businesses and organizations gather data and those words you'll need to know when your given new responsibilities at your new job.

Gathered Raw Data
Data is the most important aspect of a data science project. That statement seems a little redundant doesn't it? But, people often forget the quality of a project or found insight is only as good as the data it comes from. That will often depend on where the data comes from. Here are the two ways in which data can be gathered.

Captured data
Gathered from direct measurement or observation, captured data is commonly found through surveying or experimentation. This is typically data that has intentionally been collected. Healthcare data such as blood-glucose levels can be accumulated through captured data from medical reports or readers for a specific study. But the data doesn't stop there, think of all the other data points that would be collected if we were intentionally capturing blood glucose levels.

Exhaust data
The extra data that is given off when intentionally capturing certain data points is known as exhaust data. These 'extra emissions' have been found, at times, to be more informative than the captured data. An example of exhaust data in our blood glucose levels data would be the age, the diet, the socioeconomic standing or smoker/non-smoker standing of the patient. A key to unveiling interesting findings is in the researchers ability to find the important exhaust data and put it to use.

Practice
Can you think of an example of exhaust data in the following places?

- social media
- bank statements
- electric scooters

Knowing the importance of exhaust data is critical to drawing informative insights from captured data.

Structured vs. Unstructured Data
Now, that you know what kind of data you have and how it was acquired, it's time to clean it all up and turn it into a workable format. In a perfect world, we would all be working with structured data everyday. It would clean, labeled, sorted into nice rows and columns and would be found in relational databases or spreadsheets that would just need a SQL query or a pandas .merge() to prepare it. But, typically data comes in unstructured and sloppy. This would be like if a company were collecting loads of exhaust data and were unsure how they would use it, but collected it anyways. This brings up the question of how that data is then accessed by you, the data scientist, or stored by the company.

Data Lake vs. Data Warehouse
Raw unstructured data goes into a lake when it is being collected for no specific reason. This would commonly involve the collection of exhaust data. A data warehouse has more structure, more reasoning behind storing and organizing data.

Source: Grazziti

Supervised vs. Unsupervised Learning
Finally, we get to talk a bit about modeling. Many data scientists would tell you that modeling is about 10% of their overall work. Let's put the details of this post all together with two examples.

Example 1:
You've downloaded a nice, clean .csv type and want to model with it.

Q#1. This data is most likely what type:

a.) structured data or b.) unstructured data?

Q#2. This data is most likely held in a:

a.) data lake or b.) data warehouse

Source: SEMrush

Answers:
a.) structured data
b.) data warehouse

Nice work. This data has been collected and stored in a certain area for a certain reason and will most likely be structured data accessed through a relational database, perhaps using SQL querying.

So, now all you will be able to model using supervised learning. You may created a training data set and a testing data set and you can created target variables where your ML model can learn from the training and apply what it has learned from the training to a test set.

Example 2:
This one is a little more difficult. You have someone at work come to you and ask you a specific question to analyze. Now, you have to figure out what data you will work with, where to get it and how to clean it and structure it for analysis or modeling.

You will now diving down into the data lake and pulling out unstructured data, maybe exhaust data. So, how do you model with unstructured data? Instead of classification or regression, you now are trying to learn about the input variables and the structure in which the data flows in. For example, you could be utilizing K-means clustering or association learning. Once you learn more about your dataset and how it could be structured, you can move onto drawing insights.

Hopefully this run-down of data types and structures will make you feel more confident going into the office on day 1. Don't be afraid to ask questions, don't be afraid to open up your notes and if your colleagues use a bunch of acronyms and words you don't know, write them down without them looking and google it. You'll be just fine.

References:

Structured vs. Unstructured data
https://learn.g2.com/structured-vs-unstructured-data

Supervised vs. Unsupervised
https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Data Lakes
https://en.wikipedia.org/wiki/Data_lake
https://searchaws.techtarget.com/definition/data-lake

0 Comments

Data Science Blog

Exhaust Data vs. Captured Data

Leave a Reply.

Archives

Categories