August 28th, 2015

An Introduction to Data Science: Meetup Summary Guest Post by Zen Kishimoto

Originally posted on Tek-Tips forums by Zen here
I went to two meetups at H2O, which provides an open source predictive analytics platform. The second meetup was full of participants because its theme was an introduction to data science.
Data science is a new buzzword, and these days it seems that everyone claims to be a data scientist or something related. But apart from real data scientists, very few people really understand what a data scientist is or does. Bits and pieces of information are available, but it takes a basic understanding of the subject to make use of such fragmented information. Once you have that grounding, a series of blog posts by Ricky Ho is very informative.
But first things first. There were three speakers at that meetup, but I’ll only elaborate on the first one, who described data science for laymen. The speaker was Dr. Erin LeDell, whose presentation title was Data Science for Non-Data Scientists.


Dr. Erin LeDell

In the following summary of her points, I stay at a bare-bones level so that a total amateur can grasp what data science is all about. Once you get it, I think you can consult other materials for more details. Her presentation and the others are available here. The presentation was also video-recorded, and the recording is available here.
At the beginning, she introduced three Stanford professors who work closely with H2O:

The first two professors have published many books, and LeDell mentioned that two ebooks on very relevant subjects are available free of charge. You can download the books below:

Data science process
LeDell gave a high-level view of the data science process:


A simple explanation of each step is given here, in my words.

Problem formulation

  • A data scientist studies and researches the problem domain and identifies factors contributing to the analysis.

Collect & process data

  • Relevant data about the identified factors are collected and processed. Data processing includes cleansing data, which means getting rid of corrupt and/or incorrect values and normalizing values. Several data scientists have said that 50-80% of their time is spent cleansing data.

Machine learning

  • The most appropriate machine learning algorithm is developed or selected from a pool of well-known algorithms.

Insights & action

  • The results are analyzed for appropriate action.
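
To make the cleansing part of the process concrete, here is a minimal Python sketch. It assumes a hypothetical list of temperature readings in which bad entries show up as missing values, text, NaN, or impossible numbers. The function names, the sample data, the valid range, and the choice of min-max scaling are my own assumptions for illustration; none of them come from LeDell's talk.

```python
import math

def cleanse(values, low, high):
    """Drop missing, non-numeric, NaN, or out-of-range readings."""
    cleaned = []
    for v in values:
        # Skip None, strings, and booleans (bool is a subclass of int).
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            continue
        # Skip NaN and values outside the plausible range.
        if math.isnan(v) or not (low <= v <= high):
            continue
        cleaned.append(float(v))
    return cleaned

def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw readings: only 20.0, 25.0, and 30.0 are usable.
raw = [20.0, None, 25.0, float("nan"), -999.0, 30.0, "n/a"]
clean = cleanse(raw, low=-50.0, high=60.0)
normalized = min_max_normalize(clean)
```

Even in this toy case, most of the code deals with bad input rather than analysis, which fits the remark about how much time cleansing takes.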

What background does a data scientist need?

This is a question asked by many non–data scientists. I have seen it many times, along with many answers. LeDell answered: mathematics and statistics, programming and database, communication and visualization, and domain knowledge and soft skills.
Drew Conway’s answer is well known. Actually, the second speaker referred to it.
Data Scientist Skills

This diagram is available here.

Machine Learning

LeDell classified machine learning into three categories: regression, classification, and clustering.
These algorithms are well known and documented. In most cases, a data scientist uses an existing algorithm rather than developing one.
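
A rough way to tell the three categories apart: regression predicts a number, classification predicts a label, and clustering groups unlabeled data. The sketch below shows one toy example of each in plain Python. The specific methods (a least-squares line, a nearest-neighbor rule, and k-means with two clusters) and all the data are my own choices for illustration, not examples from the presentation.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def nearest_neighbor(train, point):
    """Classification: return the label of the closest labeled example."""
    closest = min(train, key=lambda item: abs(item[0] - point))
    return closest[1]

def two_means(values, iterations=10):
    """Clustering: split unlabeled 1-D values into two groups (k-means, k=2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Regression: predict a number from a number.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])

# Classification: predict a label from labeled examples.
labeled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
label = nearest_neighbor(labeled, 8.2)

# Clustering: find groups without any labels.
clusters = two_means([1.0, 1.5, 2.0, 9.0, 9.5, 10.0])
```

In practice a data scientist would call a library implementation rather than write these by hand, which is the point made above about using existing algorithms.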

Deep and ensemble learning

LeDell introduced two more technologies: deep and ensemble learning.
Deep learning is described as:
“A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)
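
The phrase "multiple non-linear transformations" can be sketched in a few lines of Python. The example below stacks two layers, each a weighted sum followed by a tanh function, so that the output of one layer feeds the next. The layer sizes and weights are hypothetical, hand-picked values for illustration only; a real deep learning model learns its weights from data and has many more of them.

```python
import math

def layer(inputs, weights, biases):
    """One non-linear transformation: weighted sums passed through tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Apply each layer in turn; the output of one feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical weights, chosen by hand (not learned).
network = [
    # 2 inputs -> 3 hidden units
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    # 3 hidden units -> 1 output
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], network)
```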
In ensemble learning, multiple learning algorithms are combined to obtain better predictive performance than any single one of them, at the cost of extra computation time. More details are here.
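
One simple form of ensemble is a majority vote, sketched below in Python. The three "models" are hypothetical, deliberately weak rules for flagging a message as spam, and the sample messages are made up. Majority voting is only one way of combining models and is my choice for illustration; the talk may have covered other methods.

```python
from collections import Counter

def majority_vote(models, x):
    """Ask every model for a label and return the most common answer."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical weak rules; each one is easy to fool on its own.
def has_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else "ham"

def mentions_prize(text):
    return "spam" if "prize" in text.lower() else "ham"

def is_very_short(text):
    return "spam" if len(text) < 15 else "ham"

models = [has_many_exclamations, mentions_prize, is_very_short]

spam_label = majority_vote(models, "You won a PRIZE!!! Click now")
ham_label = majority_vote(models, "See you at lunch tomorrow")
```

Every model has to run on every input, which is where the extra computation time comes from.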
Finally, LeDell gave the following information for more details on the subject.
I skipped some of her discussion here, but I hope this is a good start to understanding what data science is, and that you will dig further into it.

Zen Kishimoto

About Zen Kishimoto

Seasoned research and technology executive with varied functional expertise, including roles as analyst, writer, CTO, and VP Engineering, and in general management, sales, and marketing, across diverse high-tech and cleantech industry segments, including software, mobile embedded systems, Web technologies, and networking. Current focus and expertise are in the application of IT to energy, such as smart grid, green IT, building/data center energy efficiency, and cloud computing.
