Is Being a Data Scientist Really That Sexy? Part 2

I received a large number of comments on my previous blog, along the lines of: “Come on! The ‘RACE, Learn, Play’ cannot be that easy? There must be a part of the job that is not that sexy?” Sure, everything in life comes at a cost.

So here I will cover some not so “sexy” parts of my job:

1) I can’t always find the silver bullet. I am not a “unicorn”, not a fairy godmother and I don’t do magic. Here are a few descriptions of my Data Scientist role I found from various sources:

“A New Breed”;

“Part analyst, part artist”;

“Unicorn of data”;

“The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight” and so on.

It almost sounds like magic and too often, magic is expected. Combining multiple data sources and looking for hidden insights is very much what I enjoy and part of what I do, but it’s not always true that amazing insights are there hiding away. Realising that fact can be valuable, but it’s not so sexy!

The Challenge: “Here, I’ve got all this data (TB of it) and I have this problem, I need a solution (yesterday), find it, please.” Sometimes I quickly realise that the data is not fit for purpose or does not contain the answer that is expected and then I face the hard part.

The Response: “Re. those silver bullets you asked for: well, there is not even a silver paperweight!”

2) Sometimes it feels like I am dredging a swamp, not fishing in a lake. The expectation that, just because you have a “data lake”, hundreds of terabytes of data and “a unicorn”, magic will happen and all the problems will disappear in an instant is hard to manage. I don’t do magic, I do data!

I like to get to the granular level of data, look at it in many different ways and test different modelling techniques against it. Some mistake data lake principles for “let’s just store all the data in whatever format so we can do some data science on it later”.

They then refer to the idea of ‘Schema on read’ - an idea that’s increasingly popular, but it shouldn’t be the only method! The truth is that someone at some stage has to make some sense of all this data. Is it useful? What do I do with it? What format is it in? How should I analyse it? What problems can be solved with it?

If data is sourced from different channels, including documents, the web and different databases, it arrives in different formats. Most algorithms require their input in a single, consistent format, so be prepared for someone (“the unicorn”) to spend time cleaning up, shaping and making sense of the data before any analysis can be done on it. That’s not always so sexy.
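To give a flavour of that clean-up work, here is a minimal sketch in Python. It assumes two made-up sources, a CSV export and a JSON feed, that describe the same kind of record under different field names. The field names (`id`, `amount`, `customerId`, `total`), the target schema and the helper functions are all hypothetical, invented purely for illustration; real sources are rarely this tidy.

```python
import csv
import io
import json

# Hypothetical target schema: every record becomes
# {"customer_id": str, "amount": float}, whatever the source looked like.


def from_csv(text):
    """Parse CSV text with columns 'id' and 'amount' (assumed names)."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"customer_id": row["id"].strip(), "amount": float(row["amount"])}
        for row in rows
    ]


def from_json(text):
    """Parse a JSON array of objects with 'customerId' and 'total' (assumed names)."""
    return [
        {"customer_id": str(obj["customerId"]), "amount": float(obj["total"])}
        for obj in json.loads(text)
    ]


# Two tiny, made-up inputs: stray whitespace in one, numbers stored as text in the other.
csv_source = "id,amount\n 42 ,19.99\n7,5\n"
json_source = '[{"customerId": 42, "total": "3.50"}]'

# Only after both sources share one shape can they be analysed together.
records = from_csv(csv_source) + from_json(json_source)
```

Even in this toy case, someone had to decide that `id` and `customerId` mean the same thing and that amounts should be numbers. Multiply that by every source in the lake and you can see where the time goes.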

Don’t hoard. Just because there is a massive amount of searchable data does not mean that it will be useful to anyone.

There is no free lunch: understand what it is that you are storing and for what purpose.

3) Not all tools are created equal. There’s a cost and sometimes that cost is my time. More often than not, the data I’m asked to go fishing in is in HDFS. There’s no doubt that Hadoop has great value and I can get great results with the right problem. But despite numerous studies showing its sweet spots, many still believe it can do everything and do it all cheaply.

Hadoop has almost become synonymous with ‘big data’ and there are countless ‘size’ based pitches on how some companies are successfully storing petabytes of data and delivering insights from it. What is usually lacking from these pitches is how, and at what cost. And I am not talking about $ per petabyte of storage; it is more $ per petabyte of analysis. Hey, this stuff is powerful but hard! You really need to know what you are doing.

Back in 2013 David Menninger, the head of business development and strategy for EMC’s Pivotal, was quoted as saying: “Hadoop is like a puppy: you get one for free, but there are the hidden costs like taking it to the vet and feeding it”. Two years later, Fortune was still reporting that companies are not adopting Hadoop lightly. The main reason is the shortage of skills. “Hadoop can handle huge data sets and make them useable, the capabilities needed to set up and run Hadoop remain scarce and expensive”.

Storing the data is one thing; analysing it is another. I am not talking about simple means and averages, or data quality reporting. Doing complex machine learning on Hadoop requires skills in many different disciplines: machine learning, distributed algorithms, HDFS, Hive, Java, Python, MapReduce, etc.

The industry is closing this gap, but finding “the unicorns”, training them and keeping them will take some time and effort. So the cost is not in the storage but in the usage of the data.

Tatiana Bokareva is a Data Scientist for Teradata Aster, the market leader in big data analytics. Tatiana is responsible for data mining, analytics and ultra-fast analysis of unstructured and semi-structured data using the Teradata Aster advanced analytics platform.