Note: This post originally appeared in ComputerWorld as three part series.
Data science isn’t new, but the demand for quality data has exploded recently. This isn’t a fad or a rebranding, it’s an evolution. Decisions that govern everything from successful presidential campaigns to a one-man startup headquartered at a kitchen table are now be made on real, actionable data, not hunches and guesswork.
Because data science is growing so rapidly, we now have a massive ecosystem of useful tools. I’ve spent the past month or so trying to organize this ecosystem into a coherent portrait and, over the next few days, I’m going to roll it out and explain what I think it all means.
Since data science is so inherently cross-functional, many of these companies and tools are hard to categorize. But at the very highest level, they break down into the three main parts of a data scientist’s work flow. Namely: getting data, wrangling data, and analyzing data.
Here’s the ecosystem in its entirety.
What’s the point of doing this?
I spend a ton of time talking to data scientists about how they work, what their challenges are, and what makes their jobs easier. There are of course thousands of tools in the data science toolbox, so this ecosystem is by no means exhaustive. But the software and companies I’ve heard most often are all included, as well as the open source programs that often drive and inform the tools themselves.
Data scientists can’t just live in R or Excel. They need tools to get make sure their data is the highest possible quality and applications to do predictive analysis. In fact, this is where I think the distinction between a statistician and a data scientist is something more than semantic. In my view, statisticians take data and run a regression. Data scientists actually fetch the data, run the regression, communicates these findings, show patterns, and lead the way towards actionable, real-world changes in their organization, regardless of what that organization actually does. Since they need to oversee the entire data pipeline, my hope is that this ecosystem shows many of the important tools data scientists use, how they use them, and, importantly, how they interact together.
Let’s get started.
Part 1: Data Sources
The rest of this ecosystem doesn’t exist without the data to run it. Broadly speaking, there are three very different kinds of data sources: databases, applications, and third party data.
Structured databases predate unstructured ones. The structured database market is somewhere around $25 billion and you’ll see big names like Oracle in our ecosystem along with a handful of upstarts like MemSQL. Structured databases store a finite amount of data columns, generally run on SQL, and are usually used by the sort of business functions where perfection and reliability are of paramount concern, i.e. finance and operations.
One of the key assumptions of most structured databases is that queries run against it must return consistent, perfect results. A good example of who might absolutely need to run on a structured database? Your bank. They’re storing account information, personal markers (like your first and last name), loans their customers have taken out, etc. A bank must always know exactly how much money you have in your account, down to the penny.
And then there are unstructured databases. It’s no shock that these were pioneered by data scientists, because data scientists look at data differently than an accountant would. Data scientists are less interested in absolute consistency and more interested in flexibility. Because of that, unstructured databases lower the friction for storing and querying lots of data in different ways.
I’d say that a lot of the popularity of unstructured databases was born directly out of Google’s success. Google was trying to store the internet in a database. Think of how ambitious and utterly gigantic that task is. MapReduce, a technology that powered this database, was in some ways less powerful than SQL, but it allowed the company to adapt and grow their data stores as they saw fit. It allowed Google to use that database in ways they simply didn’t foresee when they were starting out. For example, Google could query across all websites, asking which sites linked to other sites and modify its search results for its customers. This scalable flexible querying gave Google a huge competitive advantage, which is why Yahoo and others massively invested in building an open source version of this technology called Hadoop.
Additionally, unstructured databases often require less server space. There were major internet companies that just a few years ago would wipe their databases every three months because it was too expensive to store everything. This kind of logic is unthinkable now.
Having all that data allows companies to build everything from frighteningly powerful recommendation engines to world-class translation systems to incredibly efficient inventory management. Unstructured databases generally aren’t as infallible as structured databases, but that’s worth the tradeoff for many applications, especially in the data science world. For example, say your unstructured database is running on 1000 machines and one is down. It’s okay if the recommendation engine that’s calling out to those machines uses 99 pieces of data instead of 100 to suggest you watch a Patrick Swayze movie. The priority for this sort of database is flexibility, scale, and speed. It’s okay that it can sometimes be inexact.
One of the more well known examples of a company that creates unstructured database software is Cloudera, which runs on Hadoop. And to show you how much this space is growing, consider this: seven years ago I got calls from VCs that assumed their market would be ten or fifteen companies globally. A year ago, they raised nearly a billion dollars.
As data scientists have become the biggest consumers of data (as opposed to finance and accounting), get used to hearing more and more about unstructured databases.
In the last ten years storing critical business data in the cloud has changed from unthinkable to common practice. This has been maybe the biggest shift in business’s IT infrastructure.
I’ve noted four major examples in the application space in the ecosystem (Sales, Marketing, Product, and Customer), but these days every single business function has several SaaS applications to choose from. The trend probably started withSalesForce. They were the first very successful enterprise data application that decided to create and target their software to their end user, not a CIO. Because of this, SalesForce was building and iterating on software directly for sales teams andnot to the whims of individual CIOs. They built something that worked great for their users, and in the process showed that enterprise customers would be willing to entrust critical company data in the cloud.
Instead of sales data living in-house in a custom-installed database, it now lives in the cloud where a company whose entire lifeblood is based on making that data usable and robust takes care of it. Other companies quickly followed suit. Now, essentially every department of a business has a data application marketed to and made for that department. Marketo stores marketing data, MailChimpruns your email, Optimizely crunches A/B testing data for you, Zendesk lets you know how happy your customers are. The list goes on and on.
Why that’s relevant? Now every department of a business has a powerful set of data for data scientists to analyze and use in predictive analysis. The volume of data is great, but now it’s scattered across multiple applications. Say you wanted to look at a specific customer in yourSugarCRM app. Are you trying to see how many support tickets they’ve written? That’s probably in your ZenDesk app. Are you making sure they’ve paid their most recent bill? That lives in your Xeroapp. That data all lives in different places, on different sites, in different databases.
As business move to the cloud they collect more data but it’s scattered across applications and servers all over the world.
Third Party Data
Third party data is much, much older than unstructured databases or data applications. Dun & Bradstreet is, at its heart, a data seller that’s been around since 1841. But as the importance of data for every organization grows, this is a space that’s going to keep evolving over the coming years.
Broadly, I’ve broken this part of our ecosystem out to four areas: business information, social media data, web scrapers, and public data.
Business information is the oldest. I mentioned Dun & Broadstreet above, but business data sellers are vital to nearly any organization dealing with those businesses. Business data answers the critical question for any B2B company: who should my sales team be talking to? These days, that data has been repurposed for many other applications from online maps to high frequency trading. Upstart data sellers like Factual don’t just sell business data, but they do tend to start there because it’s so lucrative.
Social media data is new but growing rapidly. It’s a way for marketers to prove their efforts are making a tangible impact and getting sentiment analysis on social data is what smart PR firms do to take their temperature of their brands and demonstrate their value. Here, you’ll find everything from Radian6 to DataSift.
Then there’s web scraping. Personally, I think this is going to be a gigantic space. If we can get to the point where any website is a data source that can be leveraged and analyzed by smart data science teams, there’s really no telling what new businesses and technologies are going to be born from that. Right now, some of the players are import.io and kimono, but I think this space is going to explode in the coming years.
Finally, I’d be remiss if I didn’t mention public data. I’m not sure President Obama gets elected without the team of data scientists he employed during his 2004 campaign and I think some of the lessons he learned about the power of data were the reason he spearheaded Data.gov. A lot of local governments have followed that lead. Amazon Web Services houses some amazing public data (everything from satellite imagery to Enron emails). These are giant data sets that can help power new businesses, train smarter algorithms, and solve real-world problems. The space is growing so fast we even see a company, Enigma.io that exists for the sole purpose of helping companies use all the public datasets out there.
Open Source Tools
There has been a massive expansion of the number of open-source data stores, especially unstructured data stores with Cassandra, Redis, Riak, Spark, CouchDB and MongoDB being some of the most popular. This post focuses mostly on companies but another blog post, Data Engineering Ecosystem, An Interactive Map gives a great overview of the most popular open source data storage and extraction tools.
Part 2: Data Wrangling
There was a money quote from Michael Cavaretta, a data scientist at Ford Motors, in a recent article in the NY Times. The piece was about the challenges data scientists face going about their daily business. Cavaretta said: “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff.” Data wrangling is cleaning data, connecting tools, and getting data into a usable format; the sexy stuff is predictive analysis and modeling. Considering the first is sometimes referred to as “janitor work,” you can guess which one is a bit more enjoyable.
In our recent survey, we found that data scientists spent a solid 80% of their time wrangling data. Given how expensive of a resource data scientists are, it’s surprising there are not more companies in this space.
In our last section, I noted how structured databases were originally built for finance and operations while unstructured ones were pushed forward by data scientists. I see a similar thing happening in this space. Since structured databases are an older industry, there were already myriad tools available for operations and finance people who have always worked with data. But there are also a new class of tools designed specifically for data scientists who have many of the same problems, but often need additional flexibility.
We’ll start with an area I know well.
Data enrichment improves raw data. Original data sources can be messy, in different formats, from multiple applications (and so on) which makes running predictive analysis on it difficult, if not impossible. Enrichment cleans that data so data scientists don’t have to.
I’ve broken this category into “human” and “automated,” but both approaches involve both people and machines. Human data enrichment means taking every row of your data set and having a human being transform it, but this requires a lot of computer automation to keep it reliable. Likewise, automated data enrichment involves setting up rules and scripts to transform data but requires a human to set up and check those rules.
Human enrichment relies on the fact that there are tasks people are simply better at than machines. Take image classification, for example. Humans can easily tell if a satellite photo contains clouds. Machines still struggle to consistently do that.
Language is another big use case for human data enrichment. Natural language processing algorithms can do amazing things, but they can’t spot sarcasm or irony or slang nearly as well as a person can. You’ll often see PR firms and marketers analyze sentiment this way.
Human-enriched daa can also be used to train search algorithms, and people can read and collect disparate information better than a machine can. Again, this requires the tasks to be well set up, and for the software to contain quality control safeguards, but if you get thousands of people working in tandem on simple jobs that people do better than machines, you can enrich tons of data at impressive speeds. Our company, CrowdFlower, is in this space, but so are others likeWorkFusion and in some ways, Amazon Mechanical Turk.
Automated enrichment has the same goals, but works with scripts and having machines transform raw data into usable data, instead of people. As I mentioned above, you still need a smart data scientist inputting that information and checking it when enrichment is complete, but automated enrichment can be incredibly powerful if all the i’s are dotted. Data with small errors and inconsistencies can be transformed into usable data near instantaneously with the right scripting.
Notably, automated solutions work well for cleaning data that doesn’t need a human eye. Examples range from simple tasks like name and date formatting to more complicated ones like dynamically pulling in meta data from the internet. Trifacta, Tamr, Paxata, and Pentaho come to mind as great automated solutions, but this is a space I see growing quickly as companies rush in to give data scientists some of their valuable time back.
ETL stands for Extract, Transform, and Load and the name gets to the heart of what the tools in this section of our ecosystem do. Essentially, what ETL/Blending solutions do for data scientists is take dissimilar data sources and marry them so analysis can be done.
Here’s an example of what I’m talking about. Say you have a financial database that contains a list of your customers, how much they pay, and what they buy. That lives in one place. Now say you have a differentdatabase containing each customer’s geographical business address. The companies in this space help combine that data into a single, usable database, so a data scientist can look into questions like which regions buy the most of a certain product, which parts of the country are your target market, etc.
And this is just a simple example; they can get much more complex. But essentially every data scientist will need to do some blending in their day-to-day responsibilities. Multiple data sources are frequently all formatted differently and, if you wanted a holistic view of a client or your enterprise at large, you’d need to blend these sources together to do deep analysis.
Alteryx, Astera, CloverETL, and etleap all have software that can handle this sort of data blending. And though ETL has been around since the days of structured databases, it figures to become increasingly vital. After all: more data sources means more discordant formatting. The promise of big data rests on being able to get both a granular and bird’s eye view of any of this information, for whatever analysis needs doing.
Data integration solutions overlap significantly with ETL/Blending software. Companies in both spaces aim to integrate data, but data integration is more concerned with unifying dataapplications and specific formats (as opposed to working on generic sets of data).
Think of what I mentioned last time, how there are third-party cloud applications that take care of everything from sales and marketing data to social reach and email campaigns. How do you combine each application into a usable data set on which a data scientist can run predictive analysis? With software like ClearStory or Databricks or SnapLogic.
Informatica has been in the space for years and does over a billion dollars of revenue. They also do quite a bit of work in each category of data wrangling as I’ve defined it here. Microsoft actually has two large offerings that would fit in this category: Azure Data Factory and SQL Server Integration Services.
Much like the ETL/blending tools, data integration programs are mainly focused on combining data from the left side of our ecosystem so it can be modeled by software on the right. In other words, integration tools like Apatar or Zoomdata and the like allow you to marry data from cloud applications like Hootsuite or Gainsight so you can get BI from Domo or Chartio.
Lastly, let’s talk about API connectors. These companies don’t focus so much on transforming data as they do on integrating with as many separate APIs as possible. When companies like these started forming, I don’t think many of us predicted how big this space would actually be.
But these can be really, really powerful tools in the right hands. To start with a fairly non-technical example, I think IFTTT is a great way to understand what happens with an API connector. IFTTT (which stands for “if this, then that”) allows someone who posts an Instagram picture to save it immediately to their Dropbox or post it on Twitter. You can think of it as an API connector that a non-data scientist uses to stay on top of their online persona. But it’s important to include here because a lot of data scientists I talk to use it as a lightweight tool for personal applications and for work.
Zapier is like IFTTT but focused on being a lightweight connector for business applications, which may make it more relevant for many data science teams.
MuleSoft, meanwhile, connects all of your business applications. Say a user logs onto your site. Who needs to know about it? Does your sales team need the lead? Does your nurture team need to know that user is back again? How about marketing? Do they want to know their email campaign is working? A single API connector can trigger all these actions simultaneously.
Lastly, Segment.io connects your product to many of the SaaS business applications on the left of the inforgraphic and more.
API connectors simply don’t exist without the abundance of tools in this ecosystem to actually connect to. And while they weren’t totally designed for data scientists, data scientists use them, especially in conjunction with blending and integration tools.
Open Source Tools
There are far fewer open-source data wrangling tools than data stores or in the analytics space. Google open-sourced their very interesting open-refine project. For the most part we see companies building their own ad-hoc tools mainly in Python, though Kettle is open-source ETL tool with some traction.
Part 3: Data applications
Remember that quote I started part two with? About data scientists wanting better tools for wrangling so they could work on the “sexy stuff”? Well, after covering how data is stored, how its cleaned, and how its combined from disparate databases, we’re finally there. Data applications are where the “sexy stuff” like predictive analysis, data mining, and machine learning happen. This is the part where we take all this data and do something really amazing with it.
Broadly, I’ve broken this column of our ecosystem into two main branches: insights and models. Insights let you learn something from your data while models let you build something with your data. They’re the tools that data scientists use to explain the past and to predict the future.
We’ll start with insights.
I’ve segregated these tools into four major categories, namely statistical tools, business intelligence, data mining, and data collaboration. Those first two are large, mature segments with tools that have been around in some cases for decades. Data mining and collaboration aren’t quite brand new, but they are less mature markets I expect to grow dramatically as more organizations put additional focus and budget on data and data science.
Statistical tools focus on ad-hoc analysis and allow data scientists to do powerful things like run regressions and visualize data in a more easily digestible format. It’s impossible to talk about statistical tools and not mention Microsoft Excel, a program used by both data scientists, analysts, and basically everyone else with a computer. Data scientists have done powerful things with Excel and it was one of their best, original tools, so it has serious staying power despite serious flaws. In fact, in CrowdFlower’s recent survey of data science tools, we found that Excel is still the program data scientists use most.
Still, there are plenty of tools after that old mainstay. The programming language R is extremely popular as a way to analyze data and has a vast library of open-source statistical packages. Tableau is a great program for visualizing data used by everyone from businesses to academics to journalists. Mathworks makes Matlab, which is an engineering platform unto itself, allowing users to not only create graphs but also build and optimize algorithms. SPSSand Stata have been around for decades and make it easy to do complicated analysis on large volumes of data.
Business intelligence tools are essentially statistical tools focused on creating clear dashboards and distilling metrics. You can think of them as tools that translate complicated data into a more readable, more understandable format for less technical people in your organization. Dashboards allow non-data scientists to see the numbers that are important to them upfront and make connections based on their expertise. Gartner pegs this as a $14 billion market, with the old guard like SAP, Oracle, and IBM being the largest companies in the space. That said, there are upstarts here as well. Companies like Domo and Chartio connect to all manner of data sources to create attractive, useful dashboards. These are tools created for data scientists to show stakeholders their success in an organization as well as the health of the organization as whole.
Where those business intelligence tools are more about distilling data into easy-to-absorb dashboards, data mining and exploration software is concerned with robust, data-based insights. This is much more in line with the “sexy stuff” mentioned in the quote above. These companies aren’t about just showing data off, they specialize in building something actionable from that data.
Unlike the 3rd party applications I wrote about in part one, these business intelligence tools are often open-ended enough to handle a wide array of use cases, from government to finance to business. For example, a company like Palantir can build solutions that do everything from enterprise cyber security to syncing law enforcement databases to disease response. These tools integrate and analyze data and often, once set up by a data scientist, they can provide the tools for anyone in an organization to become a sort of mini-data scientist, capable of digging into data to look for trends they can leverage for their own department’s success. Platfora is a good example of this, but there are plenty more we’ll see popping up in the coming years.
The last bit of our insights section centers around data collaboration. This is another space that’s likely to be more and more important in future as companies build out larger data science teams. And if open data is going to become the new open source (and I think it has to), tools like Mode Analytics will become even more important. Mode lets data scientists share SQL-based analytics and reports with any member of their (or other organization). Silk is a really robust visualization tool that allows users to upload data and create a wide array of filterable graphs, maps, and charts. R studio offers tools for data scientists to build lightweight ad-hoc apps that can be shared with teams and helps non-data scientists investigate data. The fact that there are companies sprouting up to aid with this level of data collaboration is just further proof that data science isn’t just growing: it’s pretty much everywhere.
Again, it’s hard to draw hard-and-fast lines here. A lot of these tools can be used by non-technical users or create dashboards or aid with visualization. But all of them are based on taking data and learning something with it. Our next section, models, is a bit different. It’s about building.
I need to start this section with a shout-out. Part of the inspiration for this project was Shivon Zilis’s superb look at the machine intelligence landscape, and I mention it now because modeling and machine learning overlap rather significantly. Her look is in-depth and fantastic and if this is a space you’re interested in, it’s required reading.
Models are concerned with prediction and learning. In other words: either taking a data set and making a call about what’s going to happen or training an algorithm with some labeled data and trying to automatically label more data.
The predictive analytics space encompasses tools that are more focused on doing regressions. These tools focus on not simply aggregating data or combining it or cleaning it but instead looking back through historical data and trends and making highly accurate forecasts with that data. For example, you might have a large data set that matches a person’s credit score with myriad demographic details. You could then use predictive analysis to judge a certain applicant’s credit worthiness based on their differences and similarities to the demographic data in your model. Predictive analysis is done by everyone from political campaign managers choosing where and when they need to place commercials to energy companies trying to plan for peaks and valleys in local power usage.
There are a whole host of companies that help with predictive analysis and plenty more on the way. Rapid Insights helps its customers build regressions that give insights into data sets. Skytree focuses on analytics on very large data sets. Companies like Numenta are trying to build machines that continuously learn and can spot patterns that are both important and actionable for the organizations running that data. But at their base level, they’re about taking data, analyzing it, and smartly forecasting events with that information.
Deep learning, on the other hand, is more of a technique than a solution. That said, it has suddenly become a very hot space because it offers the promise of much more accurate models especially at very high volumes of training data. Deep learning seems to work best on images, and so most of the early companies doing deep learning tend to be focused on that. Facebook, in fact, had some early success training algorithms for facial recognition based on the face itself (as opposed to making assumptions about who might be whom based on overlapping friend circles and other relationships). Metamind offers a lightweight deep learning platform available to anyone for essentially any application. Dato packages many of the features in other categories, such as ETL and visualization.
Natural language processing tools, commonly referred to as NLPs, try to build algorithms that understand real speech. Machine learning here involves training those algorithms to detect the nuances of text, not just hunt for keywords. This means being able to identify slang, sarcasm, misspellings, emoticons, and all the other oddities of real discourse. Building these tools requires incredibly large bodies of data but NLPs have the potential to remove a lot of the cost and man hours associated with everything from document processing to transcription to sentiment analysis. And those each are giant markets in their own right.
Probably the best known illustration of NLPs in pop culture was Watson’s performance onJeopardy! That’s actually a very instructive example. When you think of how Jeopardy! clues are phrased, with puns and wordplay and subtleties, the fact that Watson could understand those clues (let alone win its match) is an impressive feat. And that was in 2011; the space has grown immensely since. Companies like Attensity build NLP solutions for a wide variety of industries while Maluuba has a more consumer-facing option that is, in essence, a personal assistant that understands language. Idibon focuses on non-English languages, an important market that is sometimes overlooked. I think we’ll see a lot of growth here in the next decade or so, as these tools have the opportunity to truly transform hundreds of industries.
Lastly, let’s cover briefly about machine learning platforms. While most of the tools above are more like managed services, machine learning platforms do something much different. A tool like Kaggle isn’t so much a concrete product as it is a company that farms data out to data scientists and has them compete to create the best algorithm (a bit like the Netflix prize I mentioned in part one). Microsoft has Azure ML and Google’s Prediction API fit well here because, like Kaggle, they can handle a wide array of data problems and aren’t specifically bucketed into one specific field. Google’s Prediction API offers a black box learner that tries to model your input data while Microsoft’s Azure ML gives data scientists a toolkit to put together pieces and build a machine learning workflow.
Open Source Tools
Probably because this category has the most ongoing research, there is quite a rich collection of open-source modeling and insights tools. R is an essential tool for most data scientists and works both as a programming language and an interactive environment for exploring data. Octave is a free, open-source port of matlab that works very well. Julia is becoming increasing popular for technical computing. Stanford has an NLP library that has tools for most standard language processing tasks. Scikit, a machine learning package for python, is becoming very powerful and has implementations of most standard modeling and machine learning algorithms.
All told, data application tools are what make data scientists incredibly valuable to any organization. They’re the exact thing that allows a data scientist to make powerful suggestions, uncover hidden trends, and provide tangible value. But these tools simply don’t work unless you have good data and unless you enrich, blend, and clean that data.
Which, in the end, is exactly why I chose to call this an ecosystem and not just a landscape. Data sources and data wrangling need to come into play before you get insights and models. Most of us would rather do mediocre analysis on great data than great analysis on mediocre data. Of course, used correctly, a data scientist can do great analysis on great data. And that’s when the value of a data scientist becomes immense.