The Big Deal on Big Data (Part 2)

by

The Changing Face of Data

The challenge is not only that the amount of data is growing but that the type of data is changing as well. Traditionally, computer information systems have been very good at collecting, processing and analyzing structured data: information that can be described using a structure or schema. The earliest versions of these were known as flat models, originally used by punch-card systems and later mainframe programs, to structure data into fixed-length fields[i]. (These models are still used today in what people call “flat files” that are often used to transfer data between systems.) Today’s systems primarily rely on relational models that structure data as tables (made up of rows and columns) and the relations between them. These are stored in major database systems including Microsoft’s SQL Server, Oracle’s RDBMS and IBM’s DB2.
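To make the difference concrete, here is a minimal sketch of the same two records held first as a fixed-length flat file and then as a relational table. The field layout, the sample customers and the table name are all invented for illustration, and an in-memory SQLite database stands in for the commercial systems named above.

```python
import sqlite3

# Hypothetical fixed-length layout: (field name, start offset, end offset).
LAYOUT = [("customer_id", 0, 6), ("last_name", 6, 18), ("zip_code", 18, 23)]


def parse_fixed_width(line):
    """Slice one flat-file record into named fields, trimming pad spaces."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


# Two made-up records, padded so every field has a fixed width.
flat_file = [
    "000042" + "Codd".ljust(12) + "94301",
    "000043" + "Hopper".ljust(12) + "10001",
]
records = [parse_fixed_width(line) for line in flat_file]

# The same data as a relational table: rows and named columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id TEXT, last_name TEXT, zip_code TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (:customer_id, :last_name, :zip_code)",
    records,
)

# With a schema in place, a question becomes a declarative query
# rather than character-offset arithmetic.
rows = conn.execute(
    "SELECT last_name FROM customers WHERE zip_code = ?", ("10001",)
).fetchall()
print(rows)
```

In the flat file, the meaning of each character position lives in the program that reads it; in the relational table, the structure travels with the data.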

However, the emergence of semi-structured and unstructured data is fueling much of the Internet’s data growth. An example of semi-structured data is an email – it includes structured elements such as from, subject, date and content. However, the message itself can contain anything the user wishes in whatever format they want. Unstructured data examples include pictures, videos, phone conversations, text messages and Tweets. While structured data can be more easily analyzed, semi-structured and unstructured data is more open to interpretation and is more difficult for computer systems to manage. (Just think how hard it would be for you to answer questions about the document you are reading compared to a table of sales statistics.)
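As a rough illustration of what “semi-structured” means, the sketch below parses a made-up email with Python’s standard `email` package. The addresses and message text are invented. The headers come back as cleanly typed fields, while the body is just a string that the computer has no schema for.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# An invented message: structured headers, a blank line, then free-form text.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "Date: Mon, 07 Nov 2011 09:15:00 -0500\n"
    "\n"
    "Hey Bob, tacos at noon? Or whatever works :)\n"
)

msg = message_from_string(raw)

# Structured part: named fields that are easy to filter, sort and count.
sender = msg["From"]
subject = msg["Subject"]
sent_at = parsedate_to_datetime(msg["Date"])

# Unstructured part: the body is arbitrary text with no schema at all.
body = msg.get_payload()

print(sender, subject, sent_at.isoformat())
print(repr(body))
```

Answering “who sent this and when?” takes a field lookup. Answering “is Bob being invited to lunch?” requires interpreting the body, which is the harder problem.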

This is especially relevant to marketers. The rapid increase in consumer-generated data includes online behaviors such as participation in social networks, mobile searching (which now includes location-based data), targeted display ads, data integration across e-commerce and web sites, and digital messaging including email, SMS and texting. As a great example, researchers at Northwestern University used the time taken to respond to emails to glean information about social closeness between users[ii]. The less time it took to respond to an email, the closer their research showed the connection to be. Are you collecting that type of information? There’s a lot of data out there, and there’s a lot of work to fully analyze and understand it.

Today’s Approach to Storing and Processing Data Wasn’t Built for This Explosion

Our heavy reliance on relational data models was built for a different world. In 1970, E.F. Codd wrote his seminal paper[iii] that first described the concepts behind using relational models for data storage. He was primarily concerned with reducing wasted disk space and speeding up searches within larger data sets (larger being relative, as Codd only had to deal with kilobytes and megabytes of data). This was at a time when computer resources were expensive and efficiency was extremely valuable. For example, Intel’s first commercial chip (released in 1971) was capable of 92,000 operations per second, compared with today’s quad-core i7 chips that are capable of 177,730,000,000 operations per second[iv]. Storage costs are another area that has seen amazing efficiencies. In 1971, IBM disk drives cost $17,000,000 per gigabyte in today’s dollars[v]. Today, the cost is under $0.10 per gigabyte[vi] and declining quickly.
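For readers who want to see the scale of those changes, the arithmetic can be sketched as follows. The inputs are the figures quoted above and in footnote [v] (a $16,225 drive holding 5 MB, which is 0.005 GB); the variable names are just for illustration, and the inflation-adjusted figure is taken as given rather than recomputed.

```python
# Processor comparison, using the operations-per-second figures quoted above.
INTEL_4004_OPS_PER_SEC = 92_000
CORE_I7_OPS_PER_SEC = 177_730_000_000
speedup = CORE_I7_OPS_PER_SEC / INTEL_4004_OPS_PER_SEC

# Storage cost in 1971 dollars, from the purchase price and capacity
# cited in footnote [v]. Decimal units: 1 GB = 1,000 MB.
PRICE_1971_USD = 16_225
CAPACITY_MB = 5
MB_PER_GB = 1_000
cost_per_gb_1971 = PRICE_1971_USD * MB_PER_GB / CAPACITY_MB

# Inflation-adjusted 1971 cost versus the current cost, both from the text.
COST_PER_GB_1971_IN_2010_USD = 17_000_000
COST_PER_GB_TODAY_USD = 0.10
storage_ratio = COST_PER_GB_1971_IN_2010_USD / COST_PER_GB_TODAY_USD

print(f"Processing speedup: about {speedup:,.0f}x")
print(f"1971 storage cost: ${cost_per_gb_1971:,.0f} per GB (1971 dollars)")
print(f"Storage cost reduction: about {storage_ratio:,.0f}x")
```

Both ratios run into the millions, which is the gap between the world Codd was designing for and the one we work in now.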

The building blocks for solving the problems outlined by Codd (matched with the reality of how expensive computers were back then) were to centralize the data store and eliminate as much redundancy in the data set as possible. This helped to speed up searches and ensured data integrity. Yet, forty years later, with vast increases in the amount of data and shrinking costs of computer systems, we still rely on these 1970s innovations to manage our data.

An implication of this is seen in challenges related to how we scale our relational database management systems (RDBMSs). The two major approaches to scaling computer systems are vertical scaling and horizontal scaling:

  • Vertical scaling refers to the ability to add more capacity to a single computer node by upgrading things such as the processing power, amount of memory or hard-drive capacity. Think of waiting in line at the grocery store: this approach parallels making the checkout process faster so that people in line behind you wait less.
  • Horizontal scaling refers to the ability to add nodes to manage the workload required of the system. Going back to our grocery store analogy, this approach is similar to adding checkout lines.
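The grocery store analogy can be put into numbers with a deliberately idealized sketch. The function name and the customer counts are made up, and the model assumes work divides perfectly across lanes, which real systems rarely manage.

```python
def minutes_to_clear(customers, lanes, customers_per_minute_per_lane):
    """Idealized time to serve everyone, assuming work splits evenly."""
    return customers / (lanes * customers_per_minute_per_lane)


# Baseline: one checkout lane serving 2 customers per minute.
baseline = minutes_to_clear(120, lanes=1, customers_per_minute_per_lane=2)

# Vertical scaling: keep one lane, but make the cashier twice as fast.
vertical = minutes_to_clear(120, lanes=1, customers_per_minute_per_lane=4)

# Horizontal scaling: keep the same cashier speed, but open a second lane.
horizontal = minutes_to_clear(120, lanes=2, customers_per_minute_per_lane=2)

print(baseline, vertical, horizontal)
```

In this simple model both approaches halve the wait. The difference shows up at the limits: a single cashier can only get so fast, while more lanes can keep being added as long as customers can be split among them.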

Relational database systems have often relied on vertical scaling, which requires expensive hardware and eventually hits the practical limits of what a single computer can process. Horizontal scaling, while sometimes more complex, can scale further and typically costs less. But since RDBMS systems were originally built on a single-computer assumption, they aren’t as amenable to horizontal scale.

As a strategy around this, database architects and administrators have implemented a number of work-arounds to scale horizontally. One approach is to use a master-slave architecture that uses data replication; essentially, data is pushed to other servers that can be used for read-only operations like reporting. Another approach is to use partitioning strategies such as list partitioning, which segregates data across databases (e.g., by country, by first letter of the last name, by zip code group, etc.). This allows a degree of horizontal scaling, but there are several significant drawbacks:

  • Often, it’s up to the developer of the database system to choose how to partition the data. A strategy that works in theory may not hold up in practice. Because the onus is on the application layer, the decision has to be made before the solution is built, and if production performance shows that the wrong strategy was selected, mitigating it requires a fairly significant redesign.
  • The partitions themselves are treated as separate data stores. That’s good news in terms of scale and performance, but if you want to combine information across databases, you have to do some computationally expensive joins across system boundaries. That means query performance can suffer in the name of scale.
  • Management can be expensive. Relying on commercial vendors to provide partitioning in more seamless and manageable ways requires an enterprise suite of tools, technologies and expertise. Building and managing those environments can take a lot of resources, both in licensing and in people costs.
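The first two drawbacks can be seen in a small sketch of list partitioning done at the application layer. Everything here is hypothetical: the country-to-partition list, the `orders` table and the sample data are invented, and in-memory SQLite databases stand in for separate database servers.

```python
import sqlite3

# Hypothetical partition list, chosen by the developer up front.
PARTITION_LIST = {"US": "americas", "CA": "americas", "DE": "europe", "FR": "europe"}


def make_partition():
    """Each partition is its own independent database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, country TEXT, amount REAL)")
    return conn


partitions = {name: make_partition() for name in set(PARTITION_LIST.values())}


def insert_order(customer, country, amount):
    """The application, not the database, decides where a row lives."""
    conn = partitions[PARTITION_LIST[country]]
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (customer, country, amount))


def total_for_country(country):
    """Aligned with the partition key: only one database is touched."""
    conn = partitions[PARTITION_LIST[country]]
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE country = ?", (country,)
    ).fetchone()
    return row[0]


def total_for_customer(customer):
    """Not aligned with the partition key: every database must be queried
    and the partial results combined in application code."""
    total = 0.0
    for conn in partitions.values():
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        total += row[0]
    return total


insert_order("acme", "US", 100.0)
insert_order("acme", "DE", 250.0)
insert_order("globex", "FR", 75.0)

print(total_for_country("US"))
print(total_for_customer("acme"))
```

Questions that follow the partition key stay cheap, but a question that cuts across it (here, totals per customer) has to visit every partition. If most real queries turn out to be of the second kind, the partition list was the wrong choice, and changing it means moving data and rewriting the routing code.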

While these strategies and work-arounds have allowed RDBMS systems to scale to truly impressive levels, the onslaught of data and unnecessary complexity have brought us to an inflection point: managing for “Big Data”.

 


[i] http://en.wikipedia.org/wiki/Flat_file_database#History

[ii] http://news.sciencemag.org/sciencenow/2011/11/e-mail-reveals-your-closest-frie.html

[iii] http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

[iv] http://en.wikipedia.org/wiki/Intel_4004

[v] http://www-03.ibm.com/ibm/history/exhibits/system7/system7_press.html. Original was $3,245,000 per GB based on purchase price of $16,225 and capacity of 5 MB (or 0.005 GB). Using inflation calculator at http://www.westegg.com/inflation comparing 1971 dollars to 2010.

[vi] http://www.mkomo.com/cost-per-gigabyte. Alternatively, go to Amazon.com or other supplier and check yourself – the prices continue to go down every day.
