A few years ago, I realized that I purport to run a data driven organization but I am the least data driven when it comes to my company’s most important process: hiring. I wanted to change that, so I put all the hires I made throughout my career in a spreadsheet and looked for correlations. I graded everyone I hired on a five point scale and wrote down everything I knew at the time of hiring. The process actually changed the way I hire pretty significantly (you can see some of the results in my earlier blog post).
I’ve become much more intuitive about my interview process rather than working through a checklist of desired skills. I’ve also become much more aggressive about pulling people from my own network and I’ve encouraged my employees to do the same. I’m more open to people with unusual backgrounds, but I do a lot more thorough reference checking. I think hiring is a pretty personal thing and there are lots of different ways to do it successfully, so I wondered how much my results would match someone else’s experience. I was actually able to convince a friend, a very successful serial entrepreneur, to run the same experiment I did. He can remain anonymous, or if he’s willing I will post his name.
Anyway, here are some of his results. His overall distribution looks somewhat like mine, but he’s labeled more of his hires as “Superstar” and “Disaster”. Maybe he has a riskier hiring strategy or maybe he’s just more opinionated than me :).
He labeled schools on a 0-2 scale of general “prestigiousness” and referral strength on a 0-3 scale of how strong the referral connection was (I described the scale in my earlier post). I found a weak correlation between prestigiousness of school and employee outcome; he found either no correlation or a weak negative one. Like me, he found a very strong correlation between referral strength and outcome.
He also looked at some things I hadn’t thought of. One thing he checked was how outcome changes over time. He found that he was getting a lot better at hiring. I like to think I’ve gotten a lot better at hiring, but now I want to go back and check. He also found a correlation between dollar compensation and success, which is interesting. I think I would find a negative correlation – I believe that I’ve found executives harder to hire in general, although after recent adjustments I think I’ve gotten better at it. He noticed a weak positive correlation between a competitive hiring process and success.
What’s the big takeaway here? The results are interesting, but they’re probably highly personal. Anyone who does a significant amount of hiring should really spend an hour and run the data on themselves. It’s probably the single best thing you could do with an hour of your time. If you want to share it with me, I’m happy to aggregate it, anonymize it and post it.
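For anyone who wants to try this on their own hires, the core calculation is a correlation between something known at hiring time and the grade given later. Here is a minimal sketch in Python. The six rows are made-up placeholder values, not anyone's real hiring data, and `pearson` is a helper name chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical spreadsheet rows: (referral strength 0-3, outcome grade 1-5).
hires = [(0, 2), (1, 3), (3, 5), (2, 4), (0, 1), (3, 4)]
referral = [r for r, _ in hires]
outcome = [o for _, o in hires]

print(round(pearson(referral, outcome), 2))
```

A value near 1 means the two columns rise together, a value near 0 means no linear relationship, and a value near -1 means one falls as the other rises. With only a handful of hires the number is noisy, so treat it as a prompt for reflection rather than proof.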
This was originally a response to a question on Quora.
Two notes sounding “good” together sounds like a very subjective statement. The songs we like and the sounds we like are incredibly dependent on our culture, personality, mood, etc.
But there is something that feels fundamentally different about certain pairs of notes that sound “good” together. All over the world humans have independently chosen to put the same intervals between notes in their music. The feeling of harmony we get when we hear the notes C and G together and the feeling of disharmony we get when we hear C and G flat together turns out to be part of the universal human experience.
Instead of using subjective notions of “good” and “bad”, scientists call the feeling of harmony “consonance” and the feeling of disharmony “dissonance”. Some cultures and genres of music use a lot more dissonance, but most humans perceive the same relative amounts of dissonance between pairs of notes.
The most consonant pairs of sounds are two sounds that are perceived as having the same “pitch”. In other words, the G key below middle C on my piano is so consonant with the G string on my guitar that they are said to be the same note.
Here is a recording of one second of me playing the G-string on my guitar. This graph shows the waveform of the sound, which is really just a rapid series of changes in the air pressure. Hidden within this waveform are patterns that our ears and brain perceive.
These waves cause little hairs in our ears, called stereocilia, to vibrate, with different hairs vibrating at different frequencies. You can think of sound as the sum of different frequencies of vibrations, and the hairs in our ears extract the amount of each frequency contained in the sound. We can also use math to extract the frequencies contained in the sound, as I did below with something called a Fourier Transform.
We sometimes think of a pitch, like a G, as having a single “frequency”, but the graph shows that sounds are actually composed of various amounts of many different frequencies. In this case the lowest frequency of the string is 196Hz, or 196 vibrations per second, but the string is also vibrating at double, triple and quadruple that frequency. The lowest frequency is called the fundamental frequency. The higher frequencies are called overtones, also known as harmonics when they are at simple multiples of the fundamental frequency. Instruments with vibrating strings, like my guitar, tend to vibrate at multiple frequencies where each frequency is a multiple of the lowest frequency – this is related to the physics of a string and it will be really important here.
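To make the Fourier Transform step concrete, here is a minimal sketch in Python. It does not use the guitar recording: it builds a synthetic tone with a 196Hz fundamental and two quieter harmonics (the amplitudes are made up), then recovers those frequencies with a naive discrete Fourier transform. The sample rate and length are assumptions picked so each frequency falls exactly on an output bin.

```python
import cmath
import math

SAMPLE_RATE = 1960   # samples per second (an assumption chosen to keep the math exact)
NUM_SAMPLES = 490    # a quarter of a second of audio

def synthesize(partials):
    """Sum sine waves given as (frequency_hz, amplitude) pairs."""
    return [
        sum(amp * math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for freq, amp in partials)
        for t in range(NUM_SAMPLES)
    ]

def spectrum(signal):
    """Naive discrete Fourier transform: map frequency in Hz to amplitude."""
    n = len(signal)
    result = {}
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        result[k * SAMPLE_RATE / n] = 2 * abs(total) / n
    return result

# A made-up "string": a 196 Hz fundamental plus two quieter harmonics.
signal = synthesize([(196, 1.0), (392, 0.5), (588, 0.25)])
peaks = sorted(freq for freq, amp in spectrum(signal).items() if amp > 0.1)
print(peaks)
```

The waveform itself looks like a messy wiggle, but the spectrum contains only the three frequencies that were mixed in. Real audio tools use a fast Fourier transform, which computes the same result far more efficiently than this loop.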
Here is a one-second recording of me singing along to the G-string.
This audio waveform looks pretty different from the recording of my guitar, but when we look at the frequencies we can see that the two match up.
I added red dots to this frequency graph to highlight where the harmonic frequencies are and show the uniform spacing. Each dot is exactly 196Hz apart just like in the graph of the guitar’s frequencies.
The lowest or fundamental frequency of the recording of my voice matches the 196Hz of my guitar string shown on the previous graph. It’s amazing that we are able to make our voices harmonize so exactly without even thinking about it.
When I sing the G note along with my guitar my voice and my instrument are causing the same hairs in my ear to vibrate.
The fact that the frequency peaks or red dots are evenly spaced is a physical property of our vocal cords and comes from the fact that our vocal cords are essentially a long tube of air. Other instruments that are like long tubes of air have the same property, such as flutes, saxophones, horns and harmonicas.
When I play my guitar an octave higher I can make a harmony. A one second recording looks like this – again totally different from the previous two.
But when I look at the frequencies in its composition, they are exactly double the frequencies of the low G string or me singing the low G. The red dots show the spikes from our earlier low G graph, the yellow dots are the frequency spikes from the high G sound.
So when you go an octave up, the same hairs will vibrate as with the lower octave, although not all of them. That’s what gives us the sense of two “notes” being the same even when they’re an octave apart.
Almost every culture that has a notion of an octave also has a notion of a “fifth” or note halfway between an octave. Two notes that are a fifth apart are the most consonant of any two notes that are not the same.
The G note is the “fifth” of a C note. In western music, all of the most common chords with a C root have a G note in them. Why does a C and a G fit so well together? Here are the frequencies of playing a C on my guitar.
You can see in red the harmonics (or frequency spikes) of my G note and in yellow the harmonics of my C note. They don’t always line up, but because my G note’s fundamental frequency is 3/2 of my C note’s, they line up at every 3rd harmonic of the C and every 2nd harmonic of the G.
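The lining-up can be checked with a few lines of arithmetic. This is a minimal sketch in Python that assumes an exact 3:2 ratio between the G and the C below it (a real guitar is tuned slightly off that ratio) and looks only at the first 12 harmonics of each note. The function names are illustrative.

```python
from fractions import Fraction

def harmonics(fundamental, count=12):
    """First `count` harmonics: whole-number multiples of the fundamental."""
    return [fundamental * n for n in range(1, count + 1)]

def shared_harmonics(f1, f2, count=12):
    """Frequencies that appear among the first `count` harmonics of both notes."""
    return sorted(set(harmonics(f1, count)) & set(harmonics(f2, count)))

low_g = Fraction(196)
high_g = low_g * 2                 # one octave up
c_below = low_g * Fraction(2, 3)   # the C whose fifth is this G (exact 3:2 ratio assumed)

print([int(f) for f in shared_harmonics(low_g, high_g)])
print([int(f) for f in shared_harmonics(c_below, low_g)])
```

For the octave, every harmonic of the high G in range coincides with a harmonic of the low G. For the fifth, the matches fall at 392Hz, 784Hz and so on: each one is a 3rd, 6th, 9th... harmonic of the C and a 2nd, 4th, 6th... harmonic of the G.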
The two notes that sound most consonant with a C are F and G, corresponding to the “perfect fourth” and “perfect fifth” intervals from C. Why do they line up so well? We can look at how many of the harmonics line up.
You can see that G and F harmonics line up quite frequently with C’s harmonics at the bottom. But notice that G and F’s harmonics don’t line up with each other very frequently. So G and C sound very consonant and F and C sound very consonant but G and F sound much more dissonant. This is why it’s very common to play G and C together or F and C together but it’s less common to play a C, G and F all at once.
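A rough way to put numbers on this is to count shared harmonics for each pair. The sketch below, in Python, assumes exact “just” ratios of 4/3 for F and 3/2 for G relative to C, and compares only the first 12 harmonics of each note. Both choices are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Fundamental frequencies relative to C, using exact "just" ratios (an assumption).
notes = {"C": Fraction(1), "F": Fraction(4, 3), "G": Fraction(3, 2)}

def overlap_count(f1, f2, count=12):
    """Number of frequencies shared by the first `count` harmonics of two notes."""
    first = {f1 * n for n in range(1, count + 1)}
    second = {f2 * n for n in range(1, count + 1)}
    return len(first & second)

overlaps = {
    (a, b): overlap_count(notes[a], notes[b])
    for a, b in combinations(notes, 2)
}
print(overlaps)
```

Under these assumptions C and G share four harmonics, C and F share three, and F and G share only one, which matches the picture: each of them fits well with C but not with the other.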
All of the notes that are consonant with C have intervals with many harmonics overlapping as you can see on this bigger chart.
You can see here that C and E have lots of overlapping harmonics – C, E and G would be a C major chord. C and D# have almost as many overlapping harmonics and C, D# and G would be a C minor chord.
Some notes don’t correspond to any simple fractional interval, and those notes sound very dissonant. For example, playing C and F# together is extremely dissonant because there are no overlapping harmonics (the F# doesn’t quite even line up with 2/5 interval – for more on this see my answer to Why are there 12 notes?).
Some instruments don’t produce these overtones at simple multiples of the fundamental frequency. Drums usually don’t produce simple overtones because the vibrations travel across them in more than one dimension, which creates more complicated patterns. This is why you can’t typically hear drums harmonizing with each other even though they have a recognizable pitch.
We can stop there if we want to, but there are other psycho-acoustic effects that affect consonance vs. dissonance. One effect worth mentioning is the dissonance we hear when two frequencies are close but not overlapping.
When two notes are played close together the waveforms look roughly like this:
When we extend out the waveforms we can see that they move in and out of phase.
Our ear hears the sum of the blue and the orange waveform which looks like this.
Or looking at a longer time period:
When the waveforms are in sync at the beginning they amplify each other, but as they get out of phase they subtract from each other. This creates the beating sound that is very recognizable if you’ve ever heard an out of tune piano or an out of tune guitar.
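The beating follows from a standard trigonometric identity: sin(a) + sin(b) = 2·sin((a+b)/2)·cos((a−b)/2). The sum of two close tones is therefore one tone at the average frequency whose loudness is shaped by a slow cosine envelope, and the loudness peaks |f1 − f2| times per second. Here is a minimal sketch in Python; the function names and the 440Hz and 444Hz tones used to try it out are illustrative assumptions.

```python
import math

def beat_frequency(f1, f2):
    """Beats per second heard when two pure tones are played together."""
    return abs(f1 - f2)

def summed(f1, f2, t):
    """The two sine waves added together at time t (seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def carrier_times_envelope(f1, f2, t):
    """The same signal rewritten as a fast tone times a slow envelope."""
    average = (f1 + f2) / 2
    half_difference = (f1 - f2) / 2
    return 2 * math.sin(2 * math.pi * average * t) * math.cos(2 * math.pi * half_difference * t)

print(beat_frequency(440, 444))
```

Evaluating `summed` and `carrier_times_envelope` at the same time `t` gives the same value, which is the identity at work. The closer the two frequencies, the slower the pulsing.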
To western ears this sounds like an out of tune instrument. Some cultures incorporate this sound into their music. It’s pretty clear that this is an effect associated with dissonance. As other people have mentioned in their answers, two pure sounds with frequencies that are within a note or two are universally heard as dissonant.
This was originally a response to a Quora question that I went down a long rabbit hole answering.
First of all – let’s cover some basic stuff. What is frequency?
Here is the sound wave of a recording of me playing the low E string on my guitar.
There are multiple oscillations happening here and you can use what’s called a Fourier transform to figure out how much of each frequency there is.
You can see that there are multiple frequencies in this sound, but the strongest frequency is at around 82 Hertz or 82 oscillations per second. We perceive this sound as a low E note.
If I play the next higher E note on my guitar the dominant frequency is twice as high (164 Hz). If I play the next higher E the dominant frequency doubles again.
Here the x axis is the perceived pitch or note and the y-axis is the frequencies. Each A note is an octave apart but the frequency is doubling each time.
The perceived pitch difference for a fixed frequency gap goes down as the frequencies go up. Another way of saying this is that the frequency difference between two adjacent notes gets larger as the notes get higher.
Here’s a little illustration of that with the real frequencies of notes.
A scale is generally divided into even pitch increments (this is called “equal temperament”). This means that the ratio of the frequency of a note and the frequency of the next note is always the same.
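In a 12-note equal-tempered scale that constant ratio is the twelfth root of 2, so that twelve steps exactly double the frequency. Here is a minimal sketch in Python; the 82.41Hz value for the low E string is an approximate figure used for illustration.

```python
SEMITONE_RATIO = 2 ** (1 / 12)  # every step multiplies the frequency by this

def note_frequency(base_hz, steps):
    """Frequency of the note `steps` equal-tempered semitones above base_hz."""
    return base_hz * SEMITONE_RATIO ** steps

LOW_E = 82.41  # approximate frequency of a guitar's low E string, in Hz

for steps in (0, 1, 2, 12, 13, 24):
    print(steps, round(note_frequency(LOW_E, steps), 2))
```

The printed values show both facts at once: the ratio between neighboring notes never changes, while the gap in Hertz between neighbors is twice as large one octave up.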
So why 12 intervals?
There’s a second fact about the way we perceive sound which is that two sounds with a simple frequency ratio sound good. There is a lot of fascinating research about when and why this is true – I really enjoyed Music, A Mathematical Offering by Dave Benson that goes really deep into how our ear works and why we perceive sound the way we do. But let’s take this as given.
For example two notes an octave apart have a frequency ratio of 2:1 and they sound very resonant.
Besides an octave, the simplest possible ratio is 3:2 – halfway to the next octave. It’s the basis of all the most common chords and it’s really nice to have this ratio in a scale. But if we want to evenly space the pitch of our notes, we will never get exactly this nice ratio.
For example if we have 6 notes we don’t even get close.
Starting at note “0”, we have no note in our scale that is anywhere near halfway to the next octave.
But if we use twelve notes we happen to get really close.
Note number 7 happens to be almost exactly halfway between our root note zero and the next octave higher.
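This comparison is easy to reproduce. The sketch below, in Python, lists the frequency ratio of every step in an evenly spaced scale and picks the one closest to 3:2. Measuring closeness as a plain difference of ratios is a simplification chosen for illustration.

```python
def closest_step_to_fifth(notes_per_octave):
    """Return (step, ratio) for the scale step whose frequency ratio is nearest 3:2."""
    ratios = [(step, 2 ** (step / notes_per_octave)) for step in range(notes_per_octave + 1)]
    return min(ratios, key=lambda pair: abs(pair[1] - 1.5))

for size in (6, 12):
    step, ratio = closest_step_to_fifth(size)
    print(size, step, round(ratio, 4))
```

With 6 notes the best candidate is step 3 at a ratio of about 1.414, well short of 1.5. With 12 notes step 7 lands at about 1.498.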
This turns out to really just be a happy coincidence. For fun I graphed all the possible scales between one and 24 notes.
It turns out 12 notes happens to have a note that is way closer to the halfway point than any other number of notes. When we get to 24 notes the same note shows up.
Another way of looking at this is just plotting how close each scale gets to the halfway note. This “halfway note” is extremely confusingly often called a “fifth” in music.
Here lower is closer, and you can see that 12 is by far the number of notes that works best.
Since we’ve come this far we might as well see what happens if we try larger scales.
At 29 notes we get a slightly better halfway note. At 41 notes we get one even better. But for small numbers of notes 12 stands out as a much better choice than any other.
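The scan over scale sizes can be sketched in a few lines of Python. The version below measures the miss in cents (1,200 cents per octave), which may not be the unit used in the graphs above, and reports every scale size that beats all smaller ones.

```python
import math

PURE_FIFTH = math.log2(3 / 2)  # the 3:2 ratio measured in octaves (about 0.585)

def fifth_error_cents(notes_per_octave):
    """Distance in cents from the nearest scale step to a pure 3:2 fifth."""
    nearest_step = round(notes_per_octave * PURE_FIFTH)
    return abs(nearest_step / notes_per_octave - PURE_FIFTH) * 1200

def record_scales(max_size):
    """Scale sizes that beat every smaller scale at approximating the fifth."""
    records, best = [], float("inf")
    for size in range(1, max_size + 1):
        error = fifth_error_cents(size)
        if error < best:
            records.append(size)
            best = error
    return records

print(record_scales(41))
```

The last three entries of the result are 12, 29 and 41: nothing between 13 and 28 notes improves on 12, and a 24-note scale only ties it because it contains the same note.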
This is a little off topic now, but I was interested to see how well the other notes in the 12 note scale line up with simple intervals.
The halfway interval or “perfect fifth” lines up the closest with the exact mathematical interval, but there are also notes that correspond closely to almost all of the other simple intervals. It’s interesting that no note seems to correspond with going 3/4 of the way to the next octave.
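The same check can be run for other simple ratios. This is a minimal sketch in Python; the set of ratios tested is a selection made for illustration, with 7/4 standing in for “3/4 of the way to the next octave”, and the miss is reported in cents (100 cents per semitone).

```python
import math

SIMPLE_RATIOS = {"3/2": 3 / 2, "4/3": 4 / 3, "5/4": 5 / 4, "6/5": 6 / 5, "5/3": 5 / 3, "7/4": 7 / 4}

def nearest_semitone(ratio):
    """Return (semitone, error_in_cents) for the 12-note step closest to `ratio`."""
    exact = 12 * math.log2(ratio)       # position of the ratio, in semitones
    semitone = round(exact)
    return semitone, (semitone - exact) * 100  # 100 cents per semitone

for name, ratio in SIMPLE_RATIOS.items():
    semitone, error = nearest_semitone(ratio)
    print(name, semitone, round(error, 1))
```

The fifth and fourth miss by about 2 cents, the thirds and sixths by roughly 14 to 16 cents, and 7/4 misses its nearest note by more than 30 cents, which is why it has no good match in the scale.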
I hope you had as much fun reading this as I did exploring this and making graphs :).
Note: This post originally appeared in ComputerWorld as a three-part series.
Data science isn’t new, but the demand for quality data has exploded recently. This isn’t a fad or a rebranding, it’s an evolution. Decisions that govern everything from successful presidential campaigns to a one-man startup headquartered at a kitchen table are now being made on real, actionable data, not hunches and guesswork.
Because data science is growing so rapidly, we now have a massive ecosystem of useful tools. I’ve spent the past month or so trying to organize this ecosystem into a coherent portrait and, over the next few days, I’m going to roll it out and explain what I think it all means.
Since data science is so inherently cross-functional, many of these companies and tools are hard to categorize. But at the very highest level, they break down into the three main parts of a data scientist’s work flow. Namely: getting data, wrangling data, and analyzing data.
Here’s the ecosystem in its entirety.
What’s the point of doing this?
I spend a ton of time talking to data scientists about how they work, what their challenges are, and what makes their jobs easier. There are of course thousands of tools in the data science toolbox, so this ecosystem is by no means exhaustive. But the software and companies I’ve heard most often are all included, as well as the open source programs that often drive and inform the tools themselves.
Data scientists can’t just live in R or Excel. They need tools to make sure their data is the highest possible quality and applications to do predictive analysis. In fact, this is where I think the distinction between a statistician and a data scientist is something more than semantic. In my view, statisticians take data and run a regression. Data scientists actually fetch the data, run the regression, communicate these findings, show patterns, and lead the way towards actionable, real-world changes in their organization, regardless of what that organization actually does. Since they need to oversee the entire data pipeline, my hope is that this ecosystem shows many of the important tools data scientists use, how they use them, and, importantly, how they interact together.
Let’s get started.
Part 1: Data Sources
The rest of this ecosystem doesn’t exist without the data to run it. Broadly speaking, there are three very different kinds of data sources: databases, applications, and third party data.
Structured databases predate unstructured ones. The structured database market is somewhere around $25 billion and you’ll see big names like Oracle in our ecosystem along with a handful of upstarts like MemSQL. Structured databases store a fixed set of data columns, generally run on SQL, and are usually used by the sort of business functions where perfection and reliability are of paramount concern, i.e. finance and operations.
One of the key assumptions of most structured databases is that queries run against them must return consistent, perfect results. A good example of who might absolutely need to run on a structured database? Your bank. They’re storing account information, personal markers (like your first and last name), loans their customers have taken out, etc. A bank must always know exactly how much money you have in your account, down to the penny.
And then there are unstructured databases. It’s no shock that these were pioneered by data scientists, because data scientists look at data differently than an accountant would. Data scientists are less interested in absolute consistency and more interested in flexibility. Because of that, unstructured databases lower the friction for storing and querying lots of data in different ways.
I’d say that a lot of the popularity of unstructured databases was born directly out of Google’s success. Google was trying to store the internet in a database. Think of how ambitious and utterly gigantic that task is. MapReduce, a technology that powered this database, was in some ways less powerful than SQL, but it allowed the company to adapt and grow their data stores as they saw fit. It allowed Google to use that database in ways they simply didn’t foresee when they were starting out. For example, Google could query across all websites, asking which sites linked to other sites and modify its search results for its customers. This scalable flexible querying gave Google a huge competitive advantage, which is why Yahoo and others massively invested in building an open source version of this technology called Hadoop.
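The “which sites link to which” query is a classic way to illustrate the MapReduce idea. Here is a toy, single-machine sketch in Python. It is not Google’s or Hadoop’s actual code: the three pages are hypothetical, and the function names are chosen for illustration. The point is the shape of the computation: a map step that can run independently on every page, followed by a reduce step that groups the results.

```python
from collections import defaultdict

# Hypothetical crawl results: each page and the pages it links to.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def map_links(page, outgoing):
    """Map step: emit one (target, source) pair per link found on a page."""
    return [(target, page) for target in outgoing]

def reduce_links(pairs):
    """Reduce step: group the emitted pairs by target page."""
    inbound = defaultdict(list)
    for target, source in pairs:
        inbound[target].append(source)
    return {target: sorted(sources) for target, sources in inbound.items()}

emitted = [pair for page, outgoing in pages.items() for pair in map_links(page, outgoing)]
inbound_links = reduce_links(emitted)
print(inbound_links)
```

Because each call to `map_links` looks at only one page, the map work can be spread across as many machines as there are pages, which is where the scalability comes from.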
Additionally, unstructured databases often require less server space. There were major internet companies that just a few years ago would wipe their databases every three months because it was too expensive to store everything. This kind of logic is unthinkable now.
Having all that data allows companies to build everything from frighteningly powerful recommendation engines to world-class translation systems to incredibly efficient inventory management. Unstructured databases generally aren’t as infallible as structured databases, but that’s worth the tradeoff for many applications, especially in the data science world. For example, say your unstructured database is running on 1000 machines and one is down. It’s okay if the recommendation engine that’s calling out to those machines uses 999 pieces of data instead of 1000 to suggest you watch a Patrick Swayze movie. The priority for this sort of database is flexibility, scale, and speed. It’s okay that it can sometimes be inexact.
One of the more well known examples of a company that creates unstructured database software is Cloudera, which runs on Hadoop. And to show you how much this space is growing, consider this: seven years ago I got calls from VCs that assumed Cloudera’s market would be ten or fifteen companies globally. A year ago, Cloudera raised nearly a billion dollars.
As data scientists have become the biggest consumers of data (as opposed to finance and accounting), get used to hearing more and more about unstructured databases.
In the last ten years storing critical business data in the cloud has changed from unthinkable to common practice. This has been maybe the biggest shift in business’s IT infrastructure.
I’ve noted four major examples in the application space in the ecosystem (Sales, Marketing, Product, and Customer), but these days every single business function has several SaaS applications to choose from. The trend probably started with SalesForce. They were the first very successful enterprise data application that decided to create and target their software to their end user, not a CIO. Because of this, SalesForce was building and iterating on software directly for sales teams and not to the whims of individual CIOs. They built something that worked great for their users, and in the process showed that enterprise customers would be willing to entrust critical company data to the cloud.
Instead of sales data living in-house in a custom-installed database, it now lives in the cloud where a company whose entire lifeblood is based on making that data usable and robust takes care of it. Other companies quickly followed suit. Now, essentially every department of a business has a data application marketed to and made for that department. Marketo stores marketing data, MailChimp runs your email, Optimizely crunches A/B testing data for you, Zendesk lets you know how happy your customers are. The list goes on and on.
Why is that relevant? Now every department of a business has a powerful set of data for data scientists to analyze and use in predictive analysis. The volume of data is great, but now it’s scattered across multiple applications. Say you wanted to look at a specific customer in your SugarCRM app. Are you trying to see how many support tickets they’ve written? That’s probably in your Zendesk app. Are you making sure they’ve paid their most recent bill? That lives in your Xero app. That data all lives in different places, on different sites, in different databases.
As businesses move to the cloud they collect more data, but it’s scattered across applications and servers all over the world.
Third Party Data
Third party data is much, much older than unstructured databases or data applications. Dun & Bradstreet is, at its heart, a data seller that’s been around since 1841. But as the importance of data for every organization grows, this is a space that’s going to keep evolving over the coming years.
Broadly, I’ve broken this part of our ecosystem out to four areas: business information, social media data, web scrapers, and public data.
Business information is the oldest. I mentioned Dun & Bradstreet above, but business data sellers are vital to nearly any organization dealing with those businesses. Business data answers the critical question for any B2B company: who should my sales team be talking to? These days, that data has been repurposed for many other applications from online maps to high frequency trading. Upstart data sellers like Factual don’t just sell business data, but they do tend to start there because it’s so lucrative.
Social media data is new but growing rapidly. It’s a way for marketers to prove their efforts are making a tangible impact, and running sentiment analysis on social data is what smart PR firms do to take the temperature of their brands and demonstrate their value. Here, you’ll find everything from Radian6 to DataSift.
Then there’s web scraping. Personally, I think this is going to be a gigantic space. If we can get to the point where any website is a data source that can be leveraged and analyzed by smart data science teams, there’s really no telling what new businesses and technologies are going to be born from that. Right now, some of the players are import.io and kimono, but I think this space is going to explode in the coming years.
Finally, I’d be remiss if I didn’t mention public data. I’m not sure President Obama gets elected without the team of data scientists he employed during his 2008 campaign and I think some of the lessons he learned about the power of data were the reason he spearheaded Data.gov. A lot of local governments have followed that lead. Amazon Web Services houses some amazing public data (everything from satellite imagery to Enron emails). These are giant data sets that can help power new businesses, train smarter algorithms, and solve real-world problems. The space is growing so fast we even see a company, Enigma.io, that exists for the sole purpose of helping companies use all the public datasets out there.
Open Source Tools
There has been a massive expansion of the number of open-source data stores, especially unstructured data stores with Cassandra, Redis, Riak, Spark, CouchDB and MongoDB being some of the most popular. This post focuses mostly on companies but another blog post, Data Engineering Ecosystem, An Interactive Map gives a great overview of the most popular open source data storage and extraction tools.
Part 2: Data Wrangling
There was a money quote from Michael Cavaretta, a data scientist at Ford Motors, in a recent article in the NY Times. The piece was about the challenges data scientists face going about their daily business. Cavaretta said: “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff.” Data wrangling is cleaning data, connecting tools, and getting data into a usable format; the sexy stuff is predictive analysis and modeling. Considering the first is sometimes referred to as “janitor work,” you can guess which one is a bit more enjoyable.
In our recent survey, we found that data scientists spent a solid 80% of their time wrangling data. Given how expensive of a resource data scientists are, it’s surprising there are not more companies in this space.
In our last section, I noted how structured databases were originally built for finance and operations while unstructured ones were pushed forward by data scientists. I see a similar thing happening in this space. Since structured databases are an older industry, there were already myriad tools available for operations and finance people who have always worked with data. But there is also a new class of tools designed specifically for data scientists who have many of the same problems, but often need additional flexibility.
We’ll start with an area I know well.
Data enrichment improves raw data. Original data sources can be messy, in different formats, from multiple applications (and so on) which makes running predictive analysis on it difficult, if not impossible. Enrichment cleans that data so data scientists don’t have to.
I’ve broken this category into “human” and “automated,” but both approaches involve both people and machines. Human data enrichment means taking every row of your data set and having a human being transform it, but this requires a lot of computer automation to keep it reliable. Likewise, automated data enrichment involves setting up rules and scripts to transform data but requires a human to set up and check those rules.
Human enrichment relies on the fact that there are tasks people are simply better at than machines. Take image classification, for example. Humans can easily tell if a satellite photo contains clouds. Machines still struggle to consistently do that.
Language is another big use case for human data enrichment. Natural language processing algorithms can do amazing things, but they can’t spot sarcasm or irony or slang nearly as well as a person can. You’ll often see PR firms and marketers analyze sentiment this way.
Human-enriched data can also be used to train search algorithms, and people can read and collect disparate information better than a machine can. Again, this requires the tasks to be well set up, and for the software to contain quality control safeguards, but if you get thousands of people working in tandem on simple jobs that people do better than machines, you can enrich tons of data at impressive speeds. Our company, CrowdFlower, is in this space, but so are others like WorkFusion and in some ways, Amazon Mechanical Turk.
Automated enrichment has the same goals, but uses scripts and machines, instead of people, to transform raw data into usable data. As I mentioned above, you still need a smart data scientist setting up those scripts and checking the results when enrichment is complete, but automated enrichment can be incredibly powerful if all the i’s are dotted. Data with small errors and inconsistencies can be transformed into usable data near instantaneously with the right scripting.
Notably, automated solutions work well for cleaning data that doesn’t need a human eye. Examples range from simple tasks like name and date formatting to more complicated ones like dynamically pulling in meta data from the internet. Trifacta, Tamr, Paxata, and Pentaho come to mind as great automated solutions, but this is a space I see growing quickly as companies rush in to give data scientists some of their valuable time back.
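To give a flavor of the simple end of that range, here is a minimal sketch in Python of name and date formatting. It is not how any of the products above work. The helper names and the list of accepted date layouts are assumptions, and reading “03/07/2014” month-first is an assumption about where the data comes from.

```python
from datetime import datetime

# Date layouts we expect to see; month-first order is an assumption about the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")

def clean_name(raw):
    """Collapse stray whitespace and use consistent capitalization."""
    return " ".join(part.capitalize() for part in raw.split())

def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None so a person can review it."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(clean_name("  pATRICK   swayze "), clean_date("March 7, 2014"))
```

Returning `None` for anything unrecognized, rather than guessing, is where the human check comes back in: those rows get flagged for review instead of being silently mangled. Names like “O’Neil” or “van Gogh” would need more careful rules than this.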
ETL stands for Extract, Transform, and Load and the name gets to the heart of what the tools in this section of our ecosystem do. Essentially, what ETL/Blending solutions do for data scientists is take dissimilar data sources and marry them so analysis can be done.
Here’s an example of what I’m talking about. Say you have a financial database that contains a list of your customers, how much they pay, and what they buy. That lives in one place. Now say you have a different database containing each customer’s geographical business address. The companies in this space help combine that data into a single, usable database, so a data scientist can look into questions like which regions buy the most of a certain product, which parts of the country are your target market, etc.
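In miniature, that blend is a join on a shared customer ID followed by an aggregation. Here is a sketch in Python with made-up rows. The field layout, the region names and the `revenue_by_region` helper are all hypothetical, and real ETL tools do far more than this (schema mapping, scheduling, error handling).

```python
from collections import defaultdict

# Hypothetical rows from the financial database: (customer_id, product, amount_paid).
purchases = [(1, "widget", 120.0), (2, "widget", 80.0), (1, "gadget", 40.0), (3, "widget", 60.0)]

# Hypothetical rows from the separate address database: customer_id -> region.
regions = {1: "West", 2: "East", 3: "West"}

def revenue_by_region(purchases, regions, product):
    """Join the two sources on customer_id and total one product's sales per region."""
    totals = defaultdict(float)
    for customer_id, item, amount in purchases:
        if item == product:
            totals[regions.get(customer_id, "Unknown")] += amount
    return dict(totals)

print(revenue_by_region(purchases, regions, "widget"))
```

Customers missing from the address data land in an “Unknown” bucket rather than disappearing, which makes gaps between the two sources visible.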
And this is just a simple example; they can get much more complex. But essentially every data scientist will need to do some blending in their day-to-day responsibilities. Multiple data sources are frequently all formatted differently and, if you wanted a holistic view of a client or your enterprise at large, you’d need to blend these sources together to do deep analysis.
Alteryx, Astera, CloverETL, and etleap all have software that can handle this sort of data blending. And though ETL has been around since the days of structured databases, it figures to become increasingly vital. After all: more data sources means more discordant formatting. The promise of big data rests on being able to get both a granular and bird’s eye view of any of this information, for whatever analysis needs doing.
Data integration solutions overlap significantly with ETL/Blending software. Companies in both spaces aim to integrate data, but data integration is more concerned with unifying data applications and specific formats (as opposed to working on generic sets of data).
Think of what I mentioned last time, how there are third-party cloud applications that take care of everything from sales and marketing data to social reach and email campaigns. How do you combine each application into a usable data set on which a data scientist can run predictive analysis? With software like ClearStory or Databricks or SnapLogic.
Informatica has been in the space for years and does over a billion dollars of revenue. They also do quite a bit of work in each category of data wrangling as I’ve defined it here. Microsoft actually has two large offerings that would fit in this category: Azure Data Factory and SQL Server Integration Services.
Much like the ETL/blending tools, data integration programs are mainly focused on combining data from the left side of our ecosystem so it can be modeled by software on the right. In other words, integration tools like Apatar or Zoomdata and the like allow you to marry data from cloud applications like Hootsuite or Gainsight so you can get BI from Domo or Chartio.
Lastly, let’s talk about API connectors. These companies don’t focus so much on transforming data as they do on integrating with as many separate APIs as possible. When companies like these started forming, I don’t think many of us predicted how big this space would actually be.
But these can be really, really powerful tools in the right hands. To start with a fairly non-technical example, I think IFTTT is a great way to understand what happens with an API connector. IFTTT (which stands for “if this, then that”) allows someone who posts an Instagram picture to save it immediately to their Dropbox or post it on Twitter. You can think of it as an API connector that a non-data scientist uses to stay on top of their online persona. But it’s important to include here because a lot of data scientists I talk to use it as a lightweight tool for personal applications and for work.
Zapier is like IFTTT but focused on being a lightweight connector for business applications, which may make it more relevant for many data science teams.
MuleSoft, meanwhile, connects all of your business applications. Say a user logs onto your site. Who needs to know about it? Does your sales team need the lead? Does your nurture team need to know that user is back again? How about marketing? Do they want to know their email campaign is working? A single API connector can trigger all these actions simultaneously.
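The "one login, many interested teams" idea can be sketched as a tiny publish/subscribe dispatcher. This is only an illustration of the fan-out pattern, not MuleSoft's API: the subscribe and publish functions, the event name, and the three notify handlers are hypothetical, and the handlers run one after another here rather than truly simultaneously.

```python
from collections import defaultdict

# Maps an event name to the handlers subscribed to it.
subscribers = defaultdict(list)


def subscribe(event_name, handler):
    subscribers[event_name].append(handler)


def publish(event_name, payload):
    """Send one event to every subscribed handler and collect their results."""
    return [handler(payload) for handler in subscribers[event_name]]


# Hypothetical stand-ins for calls to sales, nurture, and marketing applications.
def notify_sales(user):
    return f"sales: new lead {user['email']}"


def notify_nurture(user):
    return f"nurture: {user['email']} is back"


def notify_marketing(user):
    return f"marketing: campaign {user['campaign']} brought a visit"


subscribe("user_login", notify_sales)
subscribe("user_login", notify_nurture)
subscribe("user_login", notify_marketing)

results = publish("user_login", {"email": "pat@example.com", "campaign": "spring"})
```

The site only has to announce the login once. Adding a fourth team later means adding one more subscribe call, with no change to the code that publishes the event.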
Lastly, Segment.io connects your product to many of the SaaS business applications on the left of the infographic and more.
API connectors simply don’t exist without the abundance of tools in this ecosystem to actually connect to. And while they weren’t totally designed for data scientists, data scientists use them, especially in conjunction with blending and integration tools.
Open Source Tools
There are far fewer open-source tools for data wrangling than for data stores or analytics. Google open-sourced their very interesting OpenRefine project. For the most part we see companies building their own ad-hoc tools, mainly in Python, though Kettle is an open-source ETL tool with some traction.
Part 3: Data applications
Remember that quote I started part two with? About data scientists wanting better tools for wrangling so they could work on the “sexy stuff”? Well, after covering how data is stored, how it’s cleaned, and how it’s combined from disparate databases, we’re finally there. Data applications are where the “sexy stuff” like predictive analysis, data mining, and machine learning happen. This is the part where we take all this data and do something really amazing with it.
Broadly, I’ve broken this column of our ecosystem into two main branches: insights and models. Insights let you learn something from your data while models let you build something with your data. They’re the tools that data scientists use to explain the past and to predict the future.
We’ll start with insights.
I’ve segregated these tools into four major categories, namely statistical tools, business intelligence, data mining, and data collaboration. Those first two are large, mature segments with tools that have been around in some cases for decades. Data mining and collaboration aren’t quite brand new, but they are less mature markets I expect to grow dramatically as more organizations put additional focus and budget on data and data science.
Statistical tools focus on ad-hoc analysis and allow data scientists to do powerful things like run regressions and visualize data in a more easily digestible format. It’s impossible to talk about statistical tools and not mention Microsoft Excel, a program used by data scientists, analysts, and basically everyone else with a computer. Data scientists have done powerful things with Excel and it was one of their best, original tools, so it has serious staying power despite serious flaws. In fact, in CrowdFlower’s recent survey of data science tools, we found that Excel is still the program data scientists use most.
Still, there are plenty of tools after that old mainstay. The programming language R is extremely popular as a way to analyze data and has a vast library of open-source statistical packages. Tableau is a great program for visualizing data used by everyone from businesses to academics to journalists. MathWorks makes MATLAB, which is an engineering platform unto itself, allowing users to not only create graphs but also build and optimize algorithms. SPSS and Stata have been around for decades and make it easy to do complicated analysis on large volumes of data.
Business intelligence tools are essentially statistical tools focused on creating clear dashboards and distilling metrics. You can think of them as tools that translate complicated data into a more readable, more understandable format for less technical people in your organization. Dashboards allow non-data scientists to see the numbers that are important to them upfront and make connections based on their expertise. Gartner pegs this as a $14 billion market, with the old guard like SAP, Oracle, and IBM being the largest companies in the space. That said, there are upstarts here as well. Companies like Domo and Chartio connect to all manner of data sources to create attractive, useful dashboards. These are tools created for data scientists to show stakeholders their success in an organization as well as the health of the organization as a whole.
Where those business intelligence tools are more about distilling data into easy-to-absorb dashboards, data mining and exploration software is concerned with robust, data-based insights. This is much more in line with the “sexy stuff” mentioned in the quote above. These companies aren’t about just showing data off, they specialize in building something actionable from that data.
Unlike the third-party applications I wrote about in part one, these tools are often open-ended enough to handle a wide array of use cases, from government to finance to business. For example, a company like Palantir can build solutions that do everything from enterprise cyber security to syncing law enforcement databases to disease response. These tools integrate and analyze data and often, once set up by a data scientist, they can provide the tools for anyone in an organization to become a sort of mini-data scientist, capable of digging into data to look for trends they can leverage for their own department’s success. Platfora is a good example of this, but there are plenty more we’ll see popping up in the coming years.
The last bit of our insights section centers around data collaboration. This is another space that’s likely to be more and more important in the future as companies build out larger data science teams. And if open data is going to become the new open source (and I think it has to), tools like Mode Analytics will become even more important. Mode lets data scientists share SQL-based analytics and reports with any member of their (or another) organization. Silk is a really robust visualization tool that allows users to upload data and create a wide array of filterable graphs, maps, and charts. RStudio offers tools for data scientists to build lightweight ad-hoc apps that can be shared with teams and help non-data scientists investigate data. The fact that there are companies sprouting up to aid with this level of data collaboration is just further proof that data science isn’t just growing: it’s pretty much everywhere.
Again, it’s hard to draw hard-and-fast lines here. A lot of these tools can be used by non-technical users or create dashboards or aid with visualization. But all of them are based on taking data and learning something with it. Our next section, models, is a bit different. It’s about building.
I need to start this section with a shout-out. Part of the inspiration for this project was Shivon Zilis’s superb look at the machine intelligence landscape, and I mention it now because modeling and machine learning overlap rather significantly. Her look is in-depth and fantastic and if this is a space you’re interested in, it’s required reading.
Models are concerned with prediction and learning. In other words: either taking a data set and making a call about what’s going to happen or training an algorithm with some labeled data and trying to automatically label more data.
The predictive analytics space encompasses tools that are more focused on doing regressions. These tools focus on not simply aggregating data or combining it or cleaning it but instead looking back through historical data and trends and making highly accurate forecasts with that data. For example, you might have a large data set that matches a person’s credit score with myriad demographic details. You could then use predictive analysis to judge a certain applicant’s credit worthiness based on their differences and similarities to the demographic data in your model. Predictive analysis is done by everyone from political campaign managers choosing where and when they need to place commercials to energy companies trying to plan for peaks and valleys in local power usage.
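As a rough illustration of "looking back through historical data and making a forecast," here is a minimal one-variable least-squares regression in plain Python, loosely modeled on the energy example. The temperature and demand numbers are invented and deliberately fall on a perfect line; real forecasting tools use many more inputs and far more careful validation. The function names fit_line and forecast are my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations around the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope, intercept, x):
    """Predict the output for a new input using the fitted line."""
    return intercept + slope * x


# Made-up history: daily high temperature (C) and peak power demand (MW).
temperatures = [20, 25, 30, 35]
peak_demand = [50, 60, 70, 80]

slope, intercept = fit_line(temperatures, peak_demand)
predicted = forecast(slope, intercept, 40)
```

With this toy history, each extra degree adds two megawatts of peak demand, so the fitted line forecasts a peak of 90 for a 40-degree day. The commercial tools in this category are doing a much more sophisticated version of the same move: fit on the past, then extrapolate.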
There are a whole host of companies that help with predictive analysis and plenty more on the way. Rapid Insights helps its customers build regressions that give insights into data sets. Skytree focuses on analytics on very large data sets. Companies like Numenta are trying to build machines that continuously learn and can spot patterns that are both important and actionable for the organizations running that data. But at their base level, they’re about taking data, analyzing it, and smartly forecasting events with that information.
Deep learning, on the other hand, is more of a technique than a solution. That said, it has suddenly become a very hot space because it offers the promise of much more accurate models especially at very high volumes of training data. Deep learning seems to work best on images, and so most of the early companies doing deep learning tend to be focused on that. Facebook, in fact, had some early success training algorithms for facial recognition based on the face itself (as opposed to making assumptions about who might be whom based on overlapping friend circles and other relationships). Metamind offers a lightweight deep learning platform available to anyone for essentially any application. Dato packages many of the features in other categories, such as ETL and visualization.
Natural language processing tools, commonly referred to as NLPs, try to build algorithms that understand real speech. Machine learning here involves training those algorithms to detect the nuances of text, not just hunt for keywords. This means being able to identify slang, sarcasm, misspellings, emoticons, and all the other oddities of real discourse. Building these tools requires incredibly large bodies of data but NLPs have the potential to remove a lot of the cost and man hours associated with everything from document processing to transcription to sentiment analysis. And those each are giant markets in their own right.
Probably the best known illustration of NLPs in pop culture was Watson’s performance on Jeopardy! That’s actually a very instructive example. When you think of how Jeopardy! clues are phrased, with puns and wordplay and subtleties, the fact that Watson could understand those clues (let alone win its match) is an impressive feat. And that was in 2011; the space has grown immensely since. Companies like Attensity build NLP solutions for a wide variety of industries while Maluuba has a more consumer-facing option that is, in essence, a personal assistant that understands language. Idibon focuses on non-English languages, an important market that is sometimes overlooked. I think we’ll see a lot of growth here in the next decade or so, as these tools have the opportunity to truly transform hundreds of industries.
Lastly, let’s briefly cover machine learning platforms. While most of the tools above are more like managed services, machine learning platforms do something much different. A tool like Kaggle isn’t so much a concrete product as it is a company that farms data out to data scientists and has them compete to create the best algorithm (a bit like the Netflix prize I mentioned in part one). Microsoft’s Azure ML and Google’s Prediction API fit well here because, like Kaggle, they can handle a wide array of data problems and aren’t specifically bucketed into one specific field. Google’s Prediction API offers a black box learner that tries to model your input data while Microsoft’s Azure ML gives data scientists a toolkit to put together pieces and build a machine learning workflow.
Open Source Tools
Probably because this category has the most ongoing research, there is quite a rich collection of open-source modeling and insights tools. R is an essential tool for most data scientists and works both as a programming language and an interactive environment for exploring data. Octave is a free, open-source alternative to MATLAB that works very well. Julia is becoming increasingly popular for technical computing. Stanford has an NLP library that has tools for most standard language processing tasks. Scikit-learn, a machine learning package for Python, is becoming very powerful and has implementations of most standard modeling and machine learning algorithms.
All told, data application tools are what make data scientists incredibly valuable to any organization. They’re the exact thing that allows a data scientist to make powerful suggestions, uncover hidden trends, and provide tangible value. But these tools simply don’t work unless you have good data and unless you enrich, blend, and clean that data.
Which, in the end, is exactly why I chose to call this an ecosystem and not just a landscape. Data sources and data wrangling need to come into play before you get insights and models. Most of us would rather do mediocre analysis on great data than great analysis on mediocre data. Of course, with these tools used correctly, a data scientist can do great analysis on great data. And that’s when the value of a data scientist becomes immense.
Note: I wrote this article for Computer World
In the past 10 years, the focus of data has been on amassing and storing: the more data collected, the better. But while we all became expert data gatherers, what we actually ended up with was a glut of data, a shred of the insights we expected to get, and a very expensive problem.
Data scientists, the very people who are passionate about interpreting data, are doing less analyzing and more cleaning of messy plumbing. In fact, 80 percent of their time is spent struggling with inefficient cleaning processes they must complete to make data usable. You can call it “data wrangling” or “data janitor work,” but it’s both incredibly time-consuming and a huge factor in preventing organizations from cashing in on the promise of big data.
Companies simply can’t afford to continue on this path. They need to pull themselves out of the mire of messy plumbing.
The first step is to refocus their big data lens into rich data clarity – the hidden bounty that’s shrouded within your data warehouse. Gathering and storing data are key, but for companies to truly understand and embrace what rich data can do for them, the big data conversation has to shift completely.
Here are three reasons why:
1. Big data consolidates information. Rich data drives actual growth.
Consider the case of customer databases. Every time a customer downloads a whitepaper or signs up for a newsletter from a B2B company, their activities are being recorded in a massive CRM system somewhere. But any time customer information is added, a data mess results. Does the customer record say YouTube or Google or Google, Inc? Did the customer enter “California” or “CA” as their location? These are detailed data nuances that computers don’t resolve. Imagine how much more effective businesses could be in driving customer retention and achieving revenue goals if their customer database were full of rich data, fully deduped, complete and accurate.
2. Big data gathers the big picture. Rich data makes it meaningful.
Skybox Imaging launches low-cost satellites into orbit. They take pictures of every spot on the globe each day, images full of rich economic information. Owning massive databases with trillions of pixels capturing the entire world is one thing, but what might be even harder than launching those satellites and storing those terabytes is figuring out what’s actually in those images.
One way the data scientists are using this information involves building algorithms that detect the amount of oil in Saudi Arabian oil drums. Those, in turn, can be used to predict future gas prices. Data scientists can’t afford to sit there for hours upon hours marking where items — in this case, oil drums — are in images to train their algorithms. Big data is all those countless pictures full of amorphous shapes; rich data is knowing the precise number of oil drums in every image. Once they know that, the data can be analyzed to ultimately determine gas prices months in advance. Coming to that conclusion can be transformative.
3. Big data quantifies the world. Rich data changes it.
This is a pretty bold statement, but just look at the health industry. While companies can access hundreds of thousands of anonymized patient records, suppose they actually want to figure out if a new cancer treatment is effective. Big data is those thousands upon thousands of records with different codings and different date formats and doctors’ notes written in text; rich data reveals who received what treatment and who got better from it. Rich data helps change the status quo of medicine by informing ongoing research, development and innovation in medical research.
When it comes to big data, pretty visualizations aren’t enough. An ugly visualization on rich data is far more useful than a beautiful visualization of messy and incomplete data. Companies that are serious about rich data should look to open source tools like OpenRefine, which enable data scientists to create semi-automated processes to clean, enrich and de-duplicate data sets. Tools such as MuleSoft, IFTTT and Zapier are also starting to make it easier to import large sets of disparate data sources into the same place. In other words, we’ve got the medicine we need to cure ourselves of Messy Plumbing Syndrome; we just need to use it.
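To show what a semi-automated cleaning pass can look like, here is a minimal sketch in plain Python that handles the “California” versus “CA” and “Google” versus “Google, Inc” cases from the customer database example. It is not how OpenRefine works internally; the lookup tables, field names, and functions are hypothetical and far too small for real use.

```python
# Hypothetical lookup tables; a real cleanup would need far larger ones
# or a human in the loop for ambiguous cases.
STATE_ABBREVIATIONS = {"california": "CA", "new york": "NY"}
COMPANY_SUFFIXES = ("inc", "llc", "corp")


def normalize_state(value):
    """Map a full state name to its abbreviation; otherwise just upper-case it."""
    cleaned = value.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())


def normalize_company(value):
    """Lower-case the name, drop punctuation and a trailing suffix such as Inc."""
    words = value.lower().replace(",", " ").replace(".", " ").split()
    if len(words) > 1 and words[-1] in COMPANY_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


def dedupe(records):
    """Keep the first cleaned record seen for each (company, state, email) key."""
    seen = set()
    unique = []
    for record in records:
        cleaned = {
            "company": normalize_company(record["company"]),
            "state": normalize_state(record["state"]),
            "email": record["email"].strip().lower(),
        }
        key = (cleaned["company"], cleaned["state"], cleaned["email"])
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


raw = [
    {"company": "Google, Inc", "state": "California", "email": "Pat@example.com"},
    {"company": "google", "state": "CA", "email": "pat@example.com "},
    {"company": "Acme Corp.", "state": "ny", "email": "lee@example.com"},
]
clean = dedupe(raw)
```

The first two rows collapse into one record because they normalize to the same key. Rules like these cannot tell that a YouTube record and a Google record might belong to the same customer; that kind of judgment still needs a person or a much richer reference data set.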
Our ability to gather and store data is rapidly outpacing our ability to make sense of it. Companies that choose to invest in the tools, people, and processes that turn big data into rich data are the ones that will come out ahead.