June 24th, 2013

Saving Big Data Science is Saving Science

RSS icon RSS Category: Uncategorized
Fallback Featured Image

For time is the ultimate non-renewable resource!

Data Science represents the convergence of Domain knowledge, Data Collection and a series of hypotheses validated or invalidated by use of Math. And Big Data Science takes that one step further into the realm of massive datasets that become necessary and pre-condition in Science and Business. So the business of making this work is the business of doing science. From the scientific process to decision making & solving business problems. Removing the drudgery of Data Science is freeing the brightest minds of our time to ask BIG questions & refine their hypotheses a dozen times a minute. Just like the act of search by Google made each of us use the world of information better than ever before. Big Data Science is set to Change the world in so many dimensions!
(reprinted with permission from the experiments of our math geek!) This is emblematic of issues with state of R, the lingua franca of data science.

---------- Forwarded message ----------
From: Irene Lang <irene@0xdata.com>
Date: Mon, Jun 24, 2013 at 2:44 AM
Subject: Re: more workloads/datasets>
To: SriSatish Ambati <srisatish@0xdata.com>

OK. I need your help with this, please.
I can no longer run even really moderately sized datasets on my laptop. For example – I tried running a straightforward glm validation on a 2MB dataset after generating a model using a training set, and I finally gave up at about 2:30AM after letting it run all afternoon because I needed to get something done today.
Doing this sort of thing on my machine is slowing me down because I don't have the memory to do anything quickly, and it means that I end up taking hours to run a single test – limiting my ability to play with the data in any meaningful way. It also means that I've effectively paper-weighted my computer for several hours, which is kindof painful in terms of getting other tasks done while tests are running – because apparently my whole memory is in use, other functions are nill. So, I know the limitation is my computer and not R, and I know that there has to be a relatively straightforward solution to this. I imagine that if I could run R either on the server, or on amazon web services, I don't have to worry about immediately replacing my computer with one that has more memory (which isn't my first choice, because this one is perfectly good, otherwise.)
So, this experiment will now continue onto EC2 and a bigger server – However the problem is not her computer..

Leave a Reply

An Introduction to Time Series Modeling:
Time Series Preprocessing and Feature Engineering

Time is the only nonrenewable resource - Sri Ambati, Founder and CEO, H2O.ai. Prediction is very

October 26, 2021 - by Adam Murphy
New Features Now Available with the Latest Release of the H2O AI Hybrid Cloud 21.10

The Makers here at H2O.ai have been busy building new features and enhancing capabilities across

October 18, 2021 - by
Time Series Forecasting Best Practices

Earlier this year, my colleague Vishal Sharma gave a talk about time series forecasting best

October 15, 2021 - by Jo-Fai Chow
Improving NLP Model Performance with Context-Aware Feature Extraction

I would like to share with you a simple yet very effective trick to improve

October 8, 2021 - by Jo-Fai Chow
Feature Transformation with the H2O AI Hybrid Cloud

It is well known throughout the data science community that data preparation, pre-processing, and feature

October 7, 2021 - by Benjamin Cox
Introducing DatatableTon – Python Datatable Tutorials & Exercises

Datatable is a python library for manipulating tabular data. It supports out-of-memory datasets, multi-threaded data

September 20, 2021 - by Rohan Rao

Start your 14-day free trial today