Starting a Second Machine Learning Tools Company, Ten Years Later

I’ve spent the last six months heads down building a new machine learning tool called Weights and Biases with my longtime cofounder Chris Van Pelt, my new cofounder and friend Shawn Lewis, and brave early users at OpenAI, Toyota Research, Uber and others. Now that it’s public, I wanted to talk a little bit about why I’m (still) so excited about building machine learning tools.

I remember the magic I felt training my first machine learning algorithm. It was 2002 and I was taking Stanford’s 221 class from Daphne Koller. I had procrastinated so I spent 72 hours straight in the computer lab building a reinforcement learning algorithm that played game after game of Othello against itself. The algorithm started off incredibly dumb, but I kept fiddling and watching the computer struggle to play on my little ASCII terminal. In the middle of the night, something clicked and it started getting better and better, blowing past my own skill level. It felt like breathing life into a machine. I was hooked.

When I worked as a TA in Daphne’s lab a few years later during grad school, it seemed like nothing in ML was working. The now famous NIPS conference had just a few hundred attendees. I remember Mike Montemerlo and Sebastian Thrun had to work to get skeptical grad students excited about a self-driving car project. Out in the world, AI was mostly being used to rank ads.

Unveiling CrowdFlower at Tech Crunch in 2009

After working on search for a few years, by 2007 it was clear to me that the biggest problem in Machine Learning in every company and lab was access to training data. I left my job to start CrowdFlower (now Figure Eight) to solve that problem. Every researcher knew that access to training data was a major problem, but outside of research it wasn’t yet clear at all. We made tens of millions of dollars creating training data sets for everything from eBay’s product recommendations to Instagram’s support ticket classification, but until around 2016, nearly all VCs were adamant that machine learning wasn’t a legitimate vertical worth targeting.

Ten years later the company is thriving and spawned a field full of competitors. But it turned out that one of our core doctrines was wrong. My strong bias was always that algorithms don’t matter. Over and over I had worked with people who promised a magic new breakthrough algorithm that would fundamentally change the way AI worked. It was never true. It was painful watching companies pour resources into improving algorithms when simply collecting more training data would have had a much bigger impact.

Training data has become a mainstream concept (apparently I find that painful?)

The first sign something had changed came in 2012, when I heard from Anthony Goldbloom that neural nets — the darling of 70s-era AI professors — were winning Kaggle competitions. In 2013 and 2014 we started seeing an explosion of image labeling tasks at CrowdFlower. It became undeniable that these “new” algorithms that people were calling deep learning were working in practical ways on applications where ML had never worked before.

These cheap robots do object recognition better than any supercomputer on the planet just a few years ago.

I stepped down as CEO of Figure Eight and went about building my technical chops in deep learning. I spent days in my garage, building robots running TensorFlow on a Raspberry Pi. My friend Adrien Treuille and I locked ourselves in an Airbnb and implemented backpropagation for perceptrons and then Convolutional Neural Nets and then more complicated models. To sharpen my thinking, I taught “introduction to deep learning classes” to thousands of engineers. I somehow got myself an internship at OpenAI and got to work with some of the best people in the world. I pair programmed with twenty-four year old grad students who intimidated the hell out of me.

Stepping back into being a practitioner gave me a view on a new set of problems. When you write (non-AI/ML) code directly, you can walk through what it does. You can diff it and version it in a meaningful way. Debugging is never easy, but we have seventy years of debugging expertise behind us and we’ve built an amazing array of tools and best practices to do it well.

Machine Learning classes for engineers are super popular.

With machine learning, we’re starting over. Instead of programming the computer directly, we write code that guides the computer to create a model. We can’t modify the model directly or even easily understand how it does what it does. Diffs between versions of the model don’t make sense to humans: if I change the functionality even slightly, every single bit in the model will likely be different. From my experience at Figure Eight, I knew all the machine learning teams were having the same problem. All of the problems machine learning always had are becoming worse with deep learning. Training data is still critically important, but because of this poor tooling, many teams that should be deploying a new model every day are lucky if they deploy twice a month.

I started Weights and Biases because, for the second time in my career, I have deep conviction about what the AI field needs. Ten years ago training data was the biggest problem holding back real world machine learning. Today, the biggest pain is a lack of basic software and best practices to manage a completely new style of coding. Andrej Karpathy describes machine learning as the new kind of programming that needs a reinvented IDE. Pete Warden writes about AI’s reproducibility crisis: there’s no version control for machine learning models, and it’s incredibly hard to reproduce one’s own work, let alone someone else’s. As machine learning rapidly evolves from research projects to critical real-world deployed software, we suddenly have an acute need for a new set of developer tools.

Face Recognizing drone tracks down Chris

Working on deep learning, I had that same sense of wonder — that I was breathing life into a machine — that had first hooked me on to machine learning. Machine learning has the potential to solve the world’s biggest problems. In just the past couple of years, image recognition went from unsolvable to solved, voice recognition became a household appliance. Like Pete Warden said, software is eating the world and deep learning is eating software.

I love working with people working on machine learning. In my view the work they do has the highest potential to impact the world and I want to build them tools to help them do that. Like every powerful technology machine learning will create lots of problems to wrestle with. Every machine learning practitioner I know wants their models to be safe, fair and reliable. Today, that’s really hard to do.

You can’t paint well with a crappy paintbrush, you can’t write code well in a crappy IDE, and you can’t build and deploy great deep learning models with the tools we have now. I can’t think of any more important goal than changing that.

Check out Weights & Biases at

Thanks Noga Leviner, Michael E. Driscoll, Yanda Erlich, Will Smith and James Cham for feedback on early drafts.

The Voice Controlled, Face Recognizing, Drone Journey

I’ve been writing a fun column on machine learning and cheap hardware for O’Reilly. One of the articles was How to build an autonomous, voice-controlled, face-recognizing drone for $200. It was a really fun project, but the coolest part is that a Microsoft employee, Mark Torr, took the project, improved it, and then wrote up a super thorough guide for how to do it. I’m putting it here to make it easier to find his great work:

Here’s my youtube video of my drone

And here’s Mark’s

Pretty cool! Mark’s writing is way more thorough and much easier to follow along. Seeing his work reminds me of the joy of someone taking one of my open source projects and running with it. I would love to see open source apply to more than code.

Artificial Intelligence and the Future of Work

Technology makes some types of jobs obsolete and creates other types of jobs — that’s been true since the stone age. While in the past, machines have replaced people in jobs that require physical labor, we’re increasingly seeing traditionally white collar jobs augmented by machines: financial analysts, online marketers, and financial reporters, just to name a few. Of course, these advances also create new jobs. The electronic computers that we know today, for example, replaced human beings performing the actual calculations, but in the process created all kinds of new types of work.

Artificial intelligence seems like it might work the same way, creating jobs for artificial intelligence researchers and slowly displacing all other kinds of knowledge work. And while this might be where we end up a century from now, the path to get there won’t quite look the way people think. We can see where we’re going from AI design patterns used at Google, Facebook and other companies investing heavily in artificial intelligence. In the most common design patterns, AI can actually increase demand for exactly the kind of work that it is automating.

Design Pattern 1: Training Data

By far the most common kind of artificial intelligence used in the business world is called supervised machine learning. The “supervised” part is important: it means that an algorithm is learning from training data. Algorithms still don’t learn anywhere near as efficiently as humans, but they can make up for it by processing far, far more data.

The quantity and quality of training data is actually the most important factor for ensuring a machine learning algorithm works well, and the best companies take this training data collection process very, very seriously. Many people don’t realize that Google pays for tens of millions of man-hours collecting and labeling data that they feed into their machine learning algorithms.

Collecting training data is a never-ending process. Every time Twitter invents a new word or emoji, machine learning algorithms have no way of understanding it until they see many examples of its usage. Every time a company wants to expand into a new language or even a new market with slightly different patterns, they need to collect a new set of training data or their machine learning algorithms are working under dubious circumstances.

As machine learning becomes more well understood and high quality algorithms become something you can buy off the shelf, training data collection has become the most labor intensive part of launching a new machine learning algorithm.

Design Pattern 2: Human-in-the-loop

Of course, some problems (like spreadsheet math) are incredibly easy for computers and some problems (like walking on two feet) are incredibly hard. It’s the same with machine learning. In every domain where machine learning works, there are situations the algorithms figure out right away and situations where it is maddeningly difficult to get them to perform well. This is why machine learning algorithms are famously easy to get to 80% accuracy and really, really tough to get to 99% accuracy.

Luckily, good machine learning algorithms can tell the cases where they are likely to do well and likely to struggle. Machine models have no ego, so they’re happy to tell you when their confidence is low. This is why the “human-in-the-loop” design pattern has become very widespread: humans get passed the processes and decisions that a machine can’t confidently make.
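To make the hand-off concrete, here is a minimal sketch of the routing step, assuming the model reports a confidence score between 0 and 1 alongside each prediction. The `Prediction` class, the function names and the 0.9 threshold are all made up for illustration; a real system would tune the threshold against its own error costs.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed: model's estimated probability for `label`, in [0, 1]


def route(prediction: Prediction, threshold: float = 0.9) -> str:
    """Return 'machine' if the model is confident enough, otherwise 'human'."""
    return "machine" if prediction.confidence >= threshold else "human"


def split_by_confidence(predictions, threshold=0.9):
    """Separate predictions the machine can act on from those needing human review."""
    automated, needs_review = [], []
    for p in predictions:
        if route(p, threshold) == "machine":
            automated.append(p)
        else:
            needs_review.append(p)
    return automated, needs_review
```

Raising the threshold sends more work to people and lowers the error rate of the automated path; lowering it does the opposite.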

For years people have dreamed of a robot personal assistant, and products like Facebook M and Clara Labs are making this a reality. But they don’t automate everything. Instead they have algorithms handle emails and scheduling issues where the intent is clear to them and hand more complicated messages and requests to human beings.

This design pattern has taken off far faster than anyone expected. Self driving cars don’t immediately replace human drivers; they take over in certain situations (like parallel parking) and hand back control to the human driver when things get complicated (such as on a busy street with construction). ATMs don’t automatically read every check you deposit, only the ones where the handwriting is clear. In both instances, machines handle a sizeable percentage of the work but when they’re unsure if they can perform well, human input is needed.

Instead of machine learning replacing one job function at a time, machine learning actually replaces pieces of every job function. This makes the person doing the job increasingly efficient. In some cases, this can lead to fewer jobs, but in others, this can create new markets and create more jobs for the same type of work. If one personal assistant can now handle twenty customers at once, personal assistants become much less expensive and maybe one hundred times as many people will work with one.

Design Pattern 3: Active Learning

Active learning is a design pattern that combines the first two patterns. The training data collected by the “Human in the Loop” can be fed back into the algorithm to make it better. Algorithms learn like people: novel, complicated situations help them learn much faster. So the examples that the algorithm can’t handle, which then get labeled by a human, are the perfect examples to help the algorithm improve.
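One common way to implement this loop is uncertainty sampling: ask a human to label the examples the model is least sure about, then retrain. The sketch below is only an illustration for a binary classifier. The `train` and `ask_human` callables are hypothetical stand-ins for a real training routine and a real labeling workflow.

```python
def uncertainty(prob_positive: float) -> float:
    """Score in [0, 1]; highest when a binary model's probability is 0.5."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_labeling(pool, predict_proba, k):
    """Pick the k unlabeled examples the model is least sure about."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)), reverse=True)
    return ranked[:k]


def active_learning_round(labeled, pool, train, ask_human, k=2):
    """Run one round: train, query a human on the hardest examples, retrain.

    `train(labeled)` is assumed to return a function mapping an example to
    the probability of the positive class. `ask_human(x)` is assumed to
    return the true label. Both lists are modified in place.
    """
    predict_proba = train(labeled)
    for x in select_for_labeling(pool, predict_proba, k):
        labeled.append((x, ask_human(x)))
        pool.remove(x)
    return train(labeled)
```

Each round moves a few of the hardest examples from the unlabeled pool into the training set, so the human effort goes where the model needs it most.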

In the future, as we do our jobs, we may be simultaneously teaching the same system that is slowly replacing us. On the other hand, we could see it as getting more and more leverage out of our work. It’s really a matter of your point of view.

It’s coming sooner and faster than you think

Most knowledge work has been spared from the effects of artificial intelligence because the upfront costs of building a machine learning algorithm have historically been so high. Unlike software, every machine learning model has to be custom-built for every individual application. So the only business applications that machine learning automated were massively profitable or cost-saving undertakings, like predicting energy usage or targeting ads.

But all that is changing. Two trends have been rapidly bringing the cost of machine learning down. For one, computing power is getting cheaper, as it always does. For the second, machine learning algorithms are becoming productized. In 2015 alone, Alibaba, Microsoft, Amazon and IBM all launched general-purpose cloud machine learning platforms. Companies no longer need Google-like R&D budgets to use machine learning internally.

What this means is that many smaller scale business functions are about to feel the effects of machine learning. When it costs a million dollars to build an algorithm, only the largest companies apply machine learning to classifying their support tickets, organizing their sales database, or handling collections. But when it costs twenty dollars a month, everyone will do it. And with all of the machine learning platforms launched in the last year that moment might have just happened.

What we can learn from AI’s mistakes

AI has been making a lot of progress lately by almost any standard. It has quietly become part of our world, powering markets, websites, factories, business processes and soon our houses, our cars and everything around us. But the biggest recent successes have also come with surprising failures. Tesla impressed the world by launching a self driving car, but then crashed in cases a human would have easily handled. AlphaGo beat the human champion Go player years before most experts thought possible, but completely collapsed after its opponent played an unusual move.

These failures might seem baffling if we follow our intuition and think of artificial intelligence the same way we think about human intelligence. AI competes with the world’s best and then fails in seemingly simple situations. But the state of the art in artificial intelligence is different from human intelligence, and it’s different in a way that really matters as we start deploying it in the real world. How? Machine learning doesn’t generalize as well as humans.

Tiny autonomous car running tensorflow

The two recent Tesla crashes and the AlphaGo loss highlight how this plays out in real life. Each of the Tesla crashes happened in a very unusual situation — a car stopped on the left side of a highway, a truck with a high clearance perpendicular to the highway, and a wooden stake in an unpainted highway. In the game AlphaGo lost, it fell apart when the Go champion Lee Sedol played a highly unusual move that no expert would have considered.

Why is it that AI can look so brilliant and so stupid at the same time? Well, for starters, it knows less about what’s going on than you think. Let’s look at a simple example to explain. AI can get spectacularly good at distinguishing between the use of the word “cabinet” to refer to a wooden cabinet and to refer to the president’s cabinet. Our intuition, based on our understanding of human intelligence, is that a machine would have to “understand” these two cabinet concepts to make this distinction so consistently. The human approach is to understand two different concepts by learning about politics and woodworking. Machine learning doesn’t need to do this. It can look at 1,000 sentences containing the word cabinet, each labeled (by a human) as corresponding to one or the other meaning. It learns how frequently words like “wood” or “storage” or “secretary” occur nearby in each case. So it knows that when the word “wood” is present, chances are extremely high that we’re referring to a storage cabinet. But if Obama starts talking about how he’s getting into woodworking, the AI may fail completely.
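Here is a toy version of that idea, sketched as a word-count (naive Bayes style) classifier. The four training sentences and the sense names are made up for illustration, and a real system would learn from far more labeled examples. The point is that the model only counts which words appear near “cabinet” for each meaning. It has no concept of furniture or government.

```python
import math
from collections import Counter, defaultdict

# Made-up training sentences, each labeled by a human with the intended sense.
TRAINING = [
    ("the wood cabinet has storage for dishes", "furniture"),
    ("he built an oak cabinet with wood shelves", "furniture"),
    ("the president met his cabinet and the secretary of state", "government"),
    ("a cabinet secretary resigned from the administration", "government"),
]


def train(examples):
    """Count how often each word appears in sentences of each sense."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    for sentence, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(sentence.split())
    return word_counts, sense_counts


def classify(sentence, word_counts, sense_counts):
    """Return the sense whose word frequencies best match the sentence."""
    vocab = {w for counts in word_counts.values() for w in counts}
    n_examples = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        total = sum(word_counts[sense].values())
        score = math.log(n / n_examples)
        for w in sentence.split():
            # Add-one smoothing so an unseen word doesn't rule out a sense.
            score += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


word_counts, sense_counts = train(TRAINING)
print(classify("the secretary joined the cabinet", word_counts, sense_counts))
print(classify("a wood cabinet for storage", word_counts, sense_counts))
```

Because the decision rests entirely on nearby words, a sentence that mixes cues from both senses (a president talking about wood) can push the score either way.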

Artificial intelligence can work as well as it does without “knowing” the way humans “know” for a simple reason: machines can process far more training data than a human. Peter Norvig, Google’s head of research, famously highlighted this idea in a paper and talk called “The Unreasonable Effectiveness of Data”. This is how modern machine learning works in general: it pores over massive datasets and learns to generalize in smart ways, but not in the same smart way that humans generalize. As a result, it can be brilliant and also get very confused.

So how should we take all of this into account when we manage artificial intelligence in the real world?

1) Play to AI’s strengths: Collect more training data

Why does Facebook have such amazing facial recognition software? They have fantastic researchers, but the core reason is that they have billions of selfies. Why did Google build a better translation system than the CIA as a side project? They scraped more websites than anyone else, so they had more examples of translated documents.

AI improves more and more as it sees more and more data

Real breakthroughs in machine learning always come when there are new data sets. Deep learning isn’t much better than other algorithms on small amounts of data but it continues to improve on larger and larger data sets better than any other method.

2) Cover for AI’s weaknesses: Use human-in-the-loop

Artificial intelligence has a second advantage over human intelligence: it knows where it is having trouble. In the latest Tesla crash, the autopilot knew it was in an unusual situation and told the human repeatedly to take the wheel. Your bank does the same thing when it reads the numbers off a check. As of a few years ago, AI reads numbers off of almost all deposited checks, but checks with particularly bad handwriting still get handed off to a human for review. And more than fifteen years after Deep Blue beat Kasparov, there are still situations where humans can outplay computers at chess.

When done well, keeping a human-in-the-loop can give the best of both worlds: the power and cost savings of automation, without the occasional unreliability of machine learning. A combined system has the power to be more reliable, since humans and computers make very different kinds of mistakes. The key to success is handing off between humans and computers in smart ways, which may very well require new types of interfaces to take advantage of their relative strengths and weaknesses. After all, what good is a near perfect self driving car AI that hands off control to a human it has let fall asleep?

The Best Organization Tool for a Disorganized Person

I love workflowy. I’ve used it every day for years. I think if everyone used it, the world would be a more productive, happier place.

If you haven’t tried it, it’s basically Gmail for your to do lists.

Remember when you had to put your emails in folders so you could find them? I really tried to keep organized folders because it was so painful when I had to search for an email. Some people seem to take great joy in organizing things, but I am not one of them. Foldering emails was my least favorite thing so I did it only in spastic fits of frustration. I would name the folders awful things like “MSFT – Misc” or “Legal BS” that made sense to me in the moment, but never made sense again.

Then gmail came along with essentially unlimited storage and awesome search and I never had to worry about categorizing emails again.  It was so powerful that if I wanted to remember something I would email it to myself and add a bunch of keyword tags to help me find it in the future.  This made my life so much better.

When I started my company, I tried organizational tools and processes in the same haphazard way that I organized my email.  I would keep track of performance reviews in google docs or word docs in my dropbox.  I would try to track engineering todos in Jira.  I tried a million tools to keep myself focused and none of them worked.  I reverted to using a physical notebook.

But workflowy is like a notebook that’s always with you and more importantly, that you can search.  This is so powerful because even when you change your processes you can still find everything you wrote down.  In my 1:1s with employees sometimes we talk about urgent things and sometimes we talk about their career goals.  Sometimes we talk about their comp and sometimes we talk about their organizational concerns.  There’s no single good way to organize everything, because often I don’t know in advance what my employees are going to care about.  I write everything down in my workflowy in the haphazard, disorganized way that it comes at me.  I rarely refactor my notes and yet I can still find everything that someone has said to me.  With my longer tenured employees we can reflect on what their goals were in 2012 and how they’ve evolved.  I can pull up every conversation we’ve had about compensation when I do a comp review.  Most importantly, I don’t have to ask people the same questions multiple times.

I scribble notes in my workflowy about ideas I have for my blog and ideas I have for our conference and things I want to accomplish.  I write down things I want to say to customers the next time I see them and things I want to say to my mom.  Then I purge these thoughts from my mind until I see this person.  It’s amazing how freeing this feels.

I love workflowy because it doesn’t tell me how to organize things or do things but it’s made me so much more organized and effective.

Metrics and Hiring Part 2

A few years ago, I realized that I purport to run a data driven organization but I am the least data driven when it comes to my company’s most important process: hiring. I wanted to change that, so I put all the hires I made throughout my career in a spreadsheet and looked for correlations.  I graded everyone I hired on a five point scale and wrote down everything I knew at the time of hiring.  The process actually changed the way I hire pretty significantly (you can see some of the results in my earlier blog post).

I’ve become much more intuitive about my interview process rather than working through a checklist of desired skills. I’ve also become much more aggressive about pulling people from my own network and I’ve encouraged my employees to do the same. I’m more open to people with unusual backgrounds, but I do a lot more thorough reference checking. I think hiring is a pretty personal thing and there are lots of different ways to do it successfully, and I wondered how much my results would match someone else’s experience. I was actually able to convince a friend, a very successful serial entrepreneur, to run the same experiment I did. He can remain anonymous, or if he’s willing I will post his name.

Anyway, here are some of his results:

His overall distribution looks somewhat like mine, but he’s labeled more of his hires as “Superstar” and “Disaster”. Maybe he has a riskier hiring strategy or maybe he’s just more opinionated than me :).

He labeled schools on a 0-2 scale of general “prestigiousness” and referral strength on a 0-3 scale of how strong the referral connection was (I described the scale in my earlier post). I found a weak correlation between prestigiousness of school and employee outcome; he actually found no correlation or a weak negative correlation. Like me, he found a very strong correlation between referral strength and outcome.

He also looked at some things I hadn’t thought of. One thing he checked was how outcome changes over time. He found that he was getting a lot better at hiring. I like to think I’ve gotten a lot better at hiring, but now I want to go back and check.

He also found a correlation between dollar compensation and success, which is interesting. I think I would find a negative correlation: I believe that I’ve found executives harder to hire in general, although after recent adjustments I think I’ve gotten better at it. He noticed a weak positive correlation between a competitive hiring process and success.

What’s the big takeaway here? The results are interesting, but they’re probably highly personal. Anyone who does a significant amount of hiring should really spend an hour and run the data on themselves. It’s probably the single best thing you could do with an hour of your time. If you want to share it with me, I’m happy to aggregate it, anonymize it and post it.
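If you want to try this on your own hires, a rough sketch of the calculation looks like the code below. The rows are invented purely to show the shape of the data (they are not my numbers or my friend’s), and the column layout is an assumption. The function computes a plain Pearson correlation between one attribute and the outcome grade.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Raises ZeroDivisionError if either list is constant (zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rows: (referral strength 0-3, school prestige 0-2, outcome grade 1-5)
hires = [
    (3, 1, 5),
    (2, 0, 4),
    (0, 2, 2),
    (1, 2, 3),
    (3, 0, 4),
    (0, 1, 1),
]

outcome = [row[2] for row in hires]
for name, column in [("referral strength", 0), ("school prestige", 1)]:
    r = pearson([row[column] for row in hires], outcome)
    print(f"{name} vs outcome: r = {r:.2f}")
```

With only a handful of hires, any correlation is noisy, so treat the numbers as a prompt for reflection rather than proof.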