5. Solving Social Problems with Data Science
Hosted by Jonan Scheffler, with guest Isaac Slavitt.
There's data being generated and collected all around us, from the shows we binge watch to the shoes we buy online. Isaac Slavitt has a different concern: can data scientists use their methodologies to prevent diseases, combat pollution, or track wildlife migration patterns?
Isaac Slavitt is the co-founder of DrivenData, a platform where organizations can solicit help from data scientists to solve real-world problems. DrivenData does this by running "competitions" that ask teams to comb through data sets and solve problems for cash rewards.
One such competition was Zamba. Researchers set up cameras in African forests and asked engineering experts to develop AI software that could classify the animals captured on film. This helps with research and conservation efforts without disturbing the natural ecosystem. Another such competition is DengAI, which seeks ML techniques to predict future outbreaks of dengue fever.
Isaac concludes the interview by talking about DrivenData's tech stack. He discusses the uses of both R and Python in the data science community. He notes that many computationally intensive tasks, such as ML classification and testing, can be offloaded to a service like Paperspace, while the majority of their platform runs on Heroku.
Jonan Scheffler: Hello, and welcome back to Code[ish]. My name is Jonan Scheffler. I'm a developer advocate here at Heroku, and I am joined here today by my friend Isaac from DrivenData. Isaac, introduce yourself, please.
Isaac Slavitt: Hi, I'm Isaac Slavitt. I'm a co-founder and data scientist at DrivenData, and I work with mission-driven organizations to help figure out the best ways to use their data for the kind of social impact problems that they're working on.
Jonan Scheffler: That is a very polished elevator pitch. I feel like it was very casual, too. This is a thing you have clearly said many times.
Isaac Slavitt: I like to change it up a little bit just so it sounds fresh.
Jonan Scheffler: Sounds great.
Jonan Scheffler: As I understand, DrivenData came out of the Harvard Innovation Lab in 2014.
Isaac Slavitt: Yeah, that's right. The original idea came from a grad school project. My partner, Peter, and I were looking at how we might find an interesting data set that had some sort of social angle and work on that. We really had a lot of trouble finding it, so we figured the next best thing to working on a social-impact data set was building a platform to collect social-impact data sets.
Jonan Scheffler: That was a brilliant step, actually. This was a graduate school project for you. What were you studying?
Isaac Slavitt: We were in a computational science and engineering program, which is sort of like a computer science and applied math flavor of what we now call data science.
Jonan Scheffler: This was 2014. When do you think the field of data science really started to explode? It's still exploding. I think, though, in the beginning, these were pretty select skills. If you were going to become a data scientist, you were probably working at one of the big companies if you wanted to do a lot of machine learning work or kind of bleeding-edge machine learning, artificial intelligence kind of work, but I guess there are still plenty of roles on business intelligence teams across the country, at different corporations.
Isaac Slavitt: Yeah, you're totally right. I think the names have always been kind of an interesting progression. Even in 2014, people were talking about data science, but it was still relatively new. It was only five years ago, but very few people had that title, and most of them were working at successful startups in the San Francisco Bay area. Even now, just a few years later, it's really exploded in popularity.
Jonan Scheffler: Today, there are six of you at DrivenData-
Isaac Slavitt: That's right.
Jonan Scheffler: ... doing all of this work yourselves and a tremendous amount of work. I was shocked when I heard that number, actually. I think you have, right now on your site, three active competitions. You're running these competitions all the time on these data sets, so tell me how that works. How do you come up with these ideas for these competitions? What do they do for you?
Isaac Slavitt: Sure. We always like to have some competitions that people can work on. A big part of our user base is people who are getting into data science, and so in addition to data science practitioners and academic researchers and grad students, there are a whole lot of people who are in quantitative fields, or they're kind of data science adjacent, but they're not data scientists in their day jobs. They're looking for interesting problems to work on, and since we understand that desire so well, we always like to have some competitions on the site, even if they're just for fun and not for a prize. The competitions-
Jonan Scheffler: Occasionally, you run them just to put the knowledge out there, and so I come and I contribute my model that I've built. Forgive me. I am not a data scientist, so I'm very likely to misuse words in this discussion, but I come. I find your competition. I download the data set, and I train up a model that I think accurately predicts something. As an example, one of the competitions up right now is to predict dengue fever infections, I guess. As the climate changes across mosquito-infested regions, you're able to predict, sometimes based on climate and weather patterns, where the next dengue fever outbreak is going to be.
Jonan Scheffler: I train up my model on this problem, and I submit it to you under an open source license, which I applaud you for, by the way. It's MIT, is that correct?
Isaac Slavitt: Yeah, MIT.
Jonan Scheffler: Okay. Then, even if, for example, this dengue fever one doesn't offer a prize, we're still building the knowledge of the science around this dengue fever outbreak, and we're contributing back to the world generally, right?
Isaac Slavitt: Yeah, definitely. We sort of look at these as really fun warmups for folks who are interested in competing in the competitions or learning about data science. I think for a lot of people who are getting into the field, one of the biggest early roadblocks is not necessarily learning particular skills, because with online courses and just YouTube videos and blogs and other resources, there's so much information out there.
Isaac Slavitt: What they're really struggling to find is an applied project where they can get feedback, so it's kind of like if I think back to my college calculus course, I think the even exercises had solutions in the back, but the odd ones didn't. If you're trying to learn by yourself, and you don't know how well you're doing, that's a real roadblock to moving your skills forward. These competitions are always out there for folks to work on, and we also have a pipeline of four prize competitions that get developed and released on a relatively regular schedule.
Jonan Scheffler: I like very much that sense of community that evolves around these competitions. You're able to get immediate feedback because you can quantitatively compare your results against what other people in the competition have submitted, so you know immediately who you're racing against. There's this feeling of being on a team on behalf of science. You're all working towards this common, very altruistic goal. I imagine that's a very fulfilling way to work. I have never had the opportunity as a developer to work for a company that did this kind of thing.
Jonan Scheffler: I find value in what I do. Don't get me wrong, and I think people are good at justifying the value of what they do, but this is very clearly changing the world. I applaud you. That's got to be a very good feeling.
Isaac Slavitt: It feels great. Actually, one of the most gratifying things about doing this has been to see that even for competitions where there's a really robust prize ... There's a lot on the table that people are competing for. A pattern that we often see is that somebody who's winning ... they're in the top three, and they have a good shot of taking home a big piece of that prize ... will go on the forum and share with the other people who are working on it some pointers and tips and tricks from their exploratory analysis and modeling, just because it feels like one of those pay-it-forward things where they learned from other people, and they've learned a lot from these competitions. It doesn't feel like a zero-sum game, even though there's prize money on the table.
Jonan Scheffler: I really appreciate that mentality, this sense of team that appears, and you mentioned even when there is a lot on the line. I'm very curious to know what is the largest prize you've offered for one of these.
Isaac Slavitt: Sure. I think there was a competition that wasn't exactly the same as our drivendata.org competitions in that it wasn't people submitting predictions to a straight predictive-modeling competition, but we ran a kind of online challenge called Concept to Clinic. This was, I would say, a year or two ago, where people were kind of taking models that had been developed in a previous data science competition and actually writing software. We opened a GitHub repo kind of around the model, and we stubbed out an application that would let ... In this case, it was clinical researchers who were working on detecting lung cancer from early screening scans who would take the model and then kind of move that forward by writing the software around the model so that it could be used, so that it could be fed new scans. For that competition, there was a $100,000 prize pool.
Jonan Scheffler: Wow.
Isaac Slavitt: A ton of people were working on this. Again, you'd expect maybe people would be hoarding information and trying to work on their little corner of the project. What we actually saw was it looked like a regular open-source project. People were opening issues and discussing things and reviewing each other's pull requests. It was really nice to see.
Jonan Scheffler: That's fascinating. The whole time, you're moving forward cancer research. You're providing early detection software to cancer researchers. These clinicians, they don't necessarily have the coding ability to put together these models or get them online, but you are making it accessible to them. Not only are you setting up the model, but you're providing a web application template that someone could deploy on a place like, for example, Heroku. I don't know if you've heard of Heroku. They're pretty great. You could put this application online there, and then the clinicians have ready access to the research.
Isaac Slavitt: That's right. What we were trying to do here was develop a proof of concept, and we were working with the Addario Lung Cancer Foundation, which is a major lung cancer research organization and funder in the United States. We have a ton of respect for them. They're very forward thinking in what sorts of research they fund in addition to very traditional clinical research.
Isaac Slavitt: Competitions are awesome. They get a ton of engagement, and they really get you to the state of the art. It's hard to beat when you have all sorts of individuals with academic backgrounds and practitioner backgrounds who are working on a hard problem and trying all sorts of different things. You really explore the solution space, so you can be pretty confident that what you end up with at the end is probably as close as you can get to separating the signal from the noise.
Isaac Slavitt: What happens, though, is a model is just a bunch of files in a folder with the source code that made them, and so it's a huge, huge challenge to get good models, but it's also an interesting problem to take those models and make them usable. You have to write the software so that people can use them.
Jonan Scheffler: The example that I initially found when I was looking at DrivenData was a project called ... It's called Project Zamba, I think. I wonder if you could tell me more about that.
Isaac Slavitt: Sure. Project Zamba started with a competition, and the idea behind the competition is that a lot of researchers who are looking into environmental conservation and animal behavior rely on footage gathered by camera traps. These are little motion-detector-activated cameras that researchers can put up in trees or on man-made structures.
Jonan Scheffler: Some kind of hideout in the forest, and I park this along a trail and wait for something to move. It turns on and gives me a short clip of video that I review later, right?
Isaac Slavitt: That's exactly right. If you think about having hundreds of these or thousands of these distributed throughout an area that you're investigating, because the wind blows and moves leaves, and because there's a lot of animal and sometimes human activity in these areas, you end up with a ton of footage. The sort of traditional way to solve this problem is to throw grad students at it or spend researcher time painstakingly going through these and classifying the videos, and saying-
Jonan Scheffler: You're literally just watching hours and hours of footage and tagging, all right, at minute 35, there was a pangolin, right? Manually.
Isaac Slavitt: That's exactly right.
Jonan Scheffler: That could take a long time to accrue data, I imagine, but you had a pretty significant set when you launched this competition.
Isaac Slavitt: Yeah, we had a big set of data. The thing about these camera traps is that if you keep them out there, you get a ton of information. It actually becomes a kind of a race. How quickly can you label all of these videos? More keeps coming in, so this is one of those things where, for years, people have wanted to automate this in some way, but either the algorithms or the hardware weren't up to the task. We're finally at a place where it makes sense to see whether a lot of this classification work can be pushed to the computer first and then just verified by a human.
Jonan Scheffler: I think you told me there were 300,000 clips in the initial data set. Is that right?
Isaac Slavitt: That's right. Yeah, 300,000.
Jonan Scheffler: The 300,000 clips were already classified manually by people, but even if, for a new clip, you were only able to tell me whether it was an animal or a human ... I know that I am oversimplifying things, but to my non-data-scientist brain, that seems like a relatively easy classification problem compared to identifying the difference between a raccoon and a pangolin, for example, which is in fact exactly what you have done.
Jonan Scheffler: Your winning models are able to classify a set of species specifically out of any given clip. I can upload a video now and find out if there's a pangolin in it. Am I correct?
Isaac Slavitt: That's exactly right, but there's actually one win that's even easier that's a little bit further upstream. The first task is figuring out whether the videos have an animal at all or whether it's just kind of background motion. That's something that is actually pretty doable with computer vision. It's not necessarily the kind of thing where you need a deep neural network.
Isaac Slavitt: Certainly, deep neural networks are good at that, but that's the kind of thing people have been doing for a little while. In order, like you said, to classify between a raccoon and a pangolin, which can be about the same size and have, if you're squinting or it's dark, roughly the same kind of outline, that's really hard to do with traditional deterministic methods. That's where we're just at a point where the best-performing neural networks do very well at that, close to what a human would do by looking and classifying them manually.
Jonan Scheffler: When you were talking about doing this without machine learning, if I were to take kind of still images of these animals ... I have a silly example, maybe, from my career on the speaking circuit. I go around to these Ruby conferences, and I talk about my silly projects. One of them was a terrible, terrible, do-it-yourself home security system that used a Raspberry Pi. It would just snap a still image every second, and it would then turn that image into a set of zeroes and ones by some method.
Jonan Scheffler: It actually would minify that, and I would end up with this 64-bit hash of what the image was. That's called perceptual hashing, which I was using to reduce the image down to a 64-bit number, and then I would use the Hamming distance between those numbers to tell the difference between two images. Obviously, being able to accomplish that without the machine learning piece doesn't take away the value of having not only these models up there but the tools to use them. Even if we know how to accomplish those things easily, as you said, without the machine learning piece, it's still not accessible to the researchers, so the information is invaluable.
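The trick Jonan describes can be sketched in a few lines of plain Python. This is a toy version of average hashing, the simplest perceptual hash, with made-up 8x8 grayscale frames standing in for real camera stills; libraries like ImageHash do this properly against real images.

```python
def average_hash(pixels):
    """Reduce an 8x8 grid of grayscale values (0-255) to a 64-bit hash:
    each bit records whether that pixel is brighter than the average."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming_distance(h1, h2):
    """Count differing bits; a small distance means similar-looking frames."""
    return bin(h1 ^ h2).count("1")

# Two nearly identical synthetic frames: frame_b differs in one corner pixel.
frame_a = [[10 * (r + c) for c in range(8)] for r in range(8)]
frame_b = [row[:] for row in frame_a]
frame_b[0][0] = 200

print(hamming_distance(average_hash(frame_a), average_hash(frame_b)))  # → 1
```

Isaac's point holds here: this cheaply separates "frame changed" from "frame didn't," but it has no way to learn that two different dark blobs are a raccoon versus a pangolin.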
Isaac Slavitt: Yeah. You could get pretty far using more traditional computer vision methods. People have been working on these and perfecting them for a very long time, but there's a jump after that where your computer vision will kind of plateau, and you need a more kind of probabilistic model that can learn from data. It's hard once you start bumping up against the fundamental limitations of the data you have, like if you think about two pictures that are kind of in the dark that have similar-looking animals. It's not clear how you could just use a filter or a hash to try to figure out which is which. That's where you just need to have it look at a ton of different examples and build its own features so that it can classify better in the future.
Jonan Scheffler: This is why I think the future looks bright, maybe, for data scientists from Harvard, for example.
Isaac Slavitt: I like to think so.
Jonan Scheffler: I think you may be employed for some time to come, Isaac.
Isaac Slavitt: We've been reading articles like "data scientist is the sexiest job of the 21st century" and things like that for a while. I love that this skillset is getting more attention. I think that it's actually more that the general public's attention to quantitative methods has started to catch up. My undergraduate major was operations research, which is not something that a whole lot of people have heard of, but really it's just sort of applied math applied to real-world business problems.
Jonan Scheffler: Like operations problems inside of a corporation, or I've got a factory, and you're doing optimization somehow.
Isaac Slavitt: Yeah. The history of it goes back to World War II, where people were trying to figure out, given certain constraints on fuel and distance, where do I put my planes and all sorts of tasks like that, and then if you think about any kind of company that deals with constraints and optimization. Just think about UPS or FedEx. They're just a fractal of these problems. The closer and closer you look at what they need to do and figure out and optimize, everything becomes an optimization problem, so that's going back to the 1940s, that sort of tool set.
Isaac Slavitt: Statistics is a much older field, and people have been using applied statistics for a very long time, so all of that was a kind of roundabout way of saying data science is a newish term, but it's really an umbrella term holding a lot of different fields. I do think it's special when you take kind of traditional quantitative analysis tools and you combine them with software and computer science skills. That's where a lot of the power comes in, when you're working in workplaces where everyone uses a computer, but some of this is just the terminology catching up to what has been important for a very long time.
Jonan Scheffler: Now, we also have the technology to catch up as well, right? We have these cloud computing platforms that are capable of handling incredibly large workloads. I have a friend who works at Google, another developer advocate over there, who just recently calculated a new world record for digits of pi using these cloud platforms that are now available, which is not something that could have been accomplished before. The pace of innovation around these platforms, and now you have GPUs available for you all the time to run your machine learning and train up your models. It's a brilliant and bright future, I think, for data science and for software. I'm looking forward to seeing where things go.
Isaac Slavitt: Yeah, and I think it would be difficult to overstate how important it is that open source has exposed so many people to these tools. It used to be the case where if you wanted to work on a GIS system, or you wanted to work on a good relational database, or you wanted to work on time series forecasting, you had to really get a job at one of the companies that either built these tools or had a $100,000 license to use them.
Isaac Slavitt: The democratizing effect of open source, and not just open source but now having platforms like Heroku where you can create a free account and experiment with getting your stuff out there, or if you need to run a computation, there are platforms where you can run very computationally expensive code that previously you could only do if you were really a graduate researcher at an institution that owned a cluster. I think that has really pushed the field ahead quite a bit and opened it up to a lot of people who wouldn't have been able to participate before.
Jonan Scheffler: I do want to ask you a little bit about the infrastructure there at DrivenData. So far, we've talked about hosting your models on Heroku. Now, when you're training a model, this is a pretty specialized task, and there are specialized tools for doing this, for example TensorFlow. I come away with my trained model, and I can put that up anywhere, for example in the Zamba use case to allow researchers to upload their videos and find out if there's a pangolin in them. You've seen your users using Heroku for that. Is that correct?
Isaac Slavitt: That's right. One of those interesting asymmetries is that it can be incredibly expensive to train a model. For some of the papers that get published now about AI applications, the researchers may have used a cluster with 100 GPUs that was running for a week to train a model, but just because the model is expensive to train doesn't mean that it's necessarily that expensive to get predictions out. Just for folks who are not familiar with the terminology, training is when you're exposing the model to lots of different examples, so the training data is data where you have both the input and the known output. Then, you're testing it and vetting it on data where you give it just the inputs, but it hasn't been trained on the outputs, so you kind of see how well it does generalizing to new examples. That's-
Jonan Scheffler: I'll take a data set, and I'll run maybe 80% of my data through with the input and the output both present so that my model adjusts its weights to match the existing identifications. Then, the remaining 20% I'm allowed to use to test my model, right?
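The 80/20 split Jonan describes is easy to sketch. This is a minimal hand-rolled version (in practice most people reach for something like scikit-learn's `train_test_split`); the clip filenames and labels are invented placeholders.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle labeled rows, then hold out a fraction for testing."""
    rows = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(rows)    # fixed seed for reproducibility
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]        # (train, test)

# 100 labeled clips: (input, known output) pairs, labels invented here.
labeled = [(f"clip_{i}.mp4", "animal" if i % 2 else "blank") for i in range(100)]
train, test = train_test_split(labeled)
print(len(train), len(test))  # → 80 20
```

The model only ever adjusts its weights on `train`; `test` stays unseen, so the score reflects generalization rather than memorization.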
Isaac Slavitt: That's right, and so the process of finding those weights, when you really boil down most statistical modeling, we're just trying to get a bunch of numbers that either push a prediction into a certain class, or push it towards a yes or a no in the case of binary classification, or try to find a certain number in the output. It all comes back to just finding these weights, but the optimization algorithms that you need to run to find those weights when you're feeding the model new examples, that can be extremely expensive to do. Once you have the weights, it's generally just a process of feeding in the input, plugging it in, and then you get out your answer.
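Isaac's asymmetry is easy to see in code. Finding the weights is the expensive optimization step; once you have them, a prediction is just a weighted sum and a squashing function. The weights and inputs below are invented for illustration, not from any real model.

```python
import math

# Pretend these weights came out of an expensive training run.
WEIGHTS = [0.8, -1.2, 0.3]
BIAS = 0.1

def predict(features):
    """Cheap inference for binary classification: a dot product plus bias,
    then a sigmoid to turn the score into a probability between 0 and 1."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1 / (1 + math.exp(-z))

# No optimization loop needed at prediction time -- just arithmetic.
print(round(predict([2.0, 0.5, 1.0]), 3))  # → 0.802
```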
Jonan Scheffler: To your point, then using that data becomes much, much less expensive, because you're just putting the trained model up on a site like Heroku, and you're only paying us whatever you're paying for your dynos and your database, right?
Isaac Slavitt: That's right. To bring this example back to your question, we talked about the Zamba competition where people were looking at trying to take these videos of animals and classify them into what exact species it was. At the end of the competition, the organization was interested in developing that further. It's great to have the model, but they wanted people who aren't machine learning researchers to be able to take their new footage, plug it in, and get out a spreadsheet so that would kind of fit into their current workflow of how they assemble their research data.
Isaac Slavitt: We posed a question to our team. "How might we build a system that's a thin wrapper around this machine learning model that has an intuitive user interface so that researchers can just sort of upload their videos and get out spreadsheets of predictions?" They don't have a strong preference as to what output it's in. They just need to be able to work with it, and everybody knows how to work with spreadsheets, if that makes sense.
Jonan Scheffler: Well, and more than that, they're able to write little formulas into their spreadsheets to further extrapolate on the data. This is part of the value of these tools. I think of these interstitial products that kind of glue companies together. You see a lot of startups around this space, where small software companies have problems finding internal documents, and so a lot of these startups built intermediary services that simplify this process.
Jonan Scheffler: Dataclips is a very good example of this kind of thing where I as a developer, I have access to the database, and I can get into the data, but it becomes quite a task for me when everyone on the team needs business intelligence data, and they've got to come to me, and I've got to write these things. If I could just drop a bit of SQL into a website that then they can tweak a little bit, they don't have to come back to me if they decide they also want to search New York. They can just add it to their query, and it empowers people in the organization to kind of shift the load of work around.
Isaac Slavitt: That's exactly right, and we've actually used Dataclips in that way. Since we use Heroku Postgres for our hosted database, we can connect this. For people who aren't familiar with the tool, there's a kind of clean user interface where you can put in a SQL query, and then you can see the results below, and you can export them as a spreadsheet or in a variety of other formats, I think.
Isaac Slavitt: When we've had partners say, "Hey, we want to see where the leaderboard is as of today," we don't want to write custom code to do this all the time, and we don't want to go in and poke around in our production systems, so it's nice to have a dataclip that we can just grab a quick download and send over to them.
Jonan Scheffler: Yeah. I love Dataclips. I actually didn't know of the existence of that product before I came to work at Heroku, but it has been incredibly valuable to me. Tell me a little bit about the structure of DrivenData's applications. You have drivendata.org, the actual web application. Is that a Rails app?
Isaac Slavitt: It's a Django app.
Jonan Scheffler: Okay. This makes sense to me, of course, and should have been my first guess, because if I am not mistaken, the machine learning community is all but entirely dominated by Pythonistas. You don't think that's true anymore, maybe?
Isaac Slavitt: Sort of.
Jonan Scheffler: Is it changing?
Isaac Slavitt: I think it's ... No, I think it's going in that direction. I would say it's probably predominantly the Python/PyData ecosystem. There are still a whole lot of very serious users of R in the data science world, especially people who are in research or academia. R is very popular among data scientists who have more of a statistics background, but especially with tools like TensorFlow and Keras being very Python-centric, I would say that the majority of at least the AI, if not most machine learning or data scientist practitioners, have been moving in the Python direction.
Jonan Scheffler: I think I spoke to some data scientists once who told me that they would write... They would come up with their models in Python, but then they would port them to Java so that they could run more quickly on the JVM in order to train up their models. Does that pattern still apply?
Isaac Slavitt: Yeah. I think there's an interesting trend in our field where at first, you kind of just had data scientists, which pessimists were saying was just a rebranding of other titles. I think there's a legitimate truth that a lot of the skillsets are older than the term data science, but five years ago, a data scientist was doing everything, so they were doing the exploratory data analysis, building models, and then trying to figure out, maybe, how to get this working in production in some way. Either that, or their work would end once they built the model, and then they'd throw it over the wall, and the traditional software engineering structure would have to pick that up and run with it.
Isaac Slavitt: Now, especially in the last, I would say, two to three years, we're seeing more and more organizations have data scientists and data engineers who sit between the data scientists and the software engineering side. The data engineer's job is to get data out of all of these varied systems into a format that the data scientists can use, and then also to help convert the resulting research code, really, that comes from the data science process to be more like production code.
Isaac Slavitt: This is very organization specific. Some organizations just have software engineers who do a little data science. Some have very clear firewalls, and the data scientists just finish up and then throw it over the wall. Some organizations have a sort of hybrid structure where data engineers bridge that gap, or they push the responsibility to the data scientists to get up to speed on software best practices so that their work can be more directly adopted into the engineering organization.
Jonan Scheffler: You are using Heroku to ship drivendata.org, and the Zamba project uses Heroku to host the model. Is that correct?
Isaac Slavitt: Yeah. The Zamba project hosts the web application and the database and the queuing layer on Heroku. The only part that isn't on Heroku is the compute task that produces the model outputs, the model predictions, and that happens on a service called Paperspace, which provides sort of ephemeral containers that are specially suited for GPU-enabled, compute-intensive tasks. The entire platform is on Heroku. The only part that gets computed elsewhere is the GPU-intensive prediction part of it.
Jonan Scheffler: You're using these Paperspace instances. Is it more of like a function as a service style platform? Are you just giving it a bit of code to run, or are you configuring and setting up a server yourself?
Isaac Slavitt: Yeah, it's sort of ... They have a few different offerings, and I don't want to try to sum up all of them. I'm not totally familiar with all of them. They have some offerings that seem to be more of a permanent computing environment, but the one that we're using is, I believe, called Gradient, and it's for these kind of batch jobs where you just need some compute for a certain amount of time.
Isaac Slavitt: It's a little like a function as a service, so you give it a container. It knows how to run that container. It has whatever new inputs you're giving it, and then it puts the outputs wherever you want. For us, we ingest those back into the web application so that we can display those outputs in a helpful way for researchers and let them kind of manipulate that data and export it.
Jonan Scheffler: At some point, an API call comes back, and you set about making a spreadsheet for someone.
Isaac Slavitt: Yeah, that's exactly right. The way it works is, let's say I'm a researcher at Max Planck Institute, and I just got my new batch of videos back from the field. What I want to do now is I want to get these all classified, so I go on Zamba Cloud. I either upload the files directly if they're small enough, or I point the web application to an SFTP server that has all of the files. Then, the application takes all those files, copies them over to Amazon S3, which is where the data is stored. Then, it kicks off a job, one of these compute jobs on Paperspace we were just talking about.
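The ingestion flow Isaac describes — small uploads go straight to the app, big batches are pulled from an SFTP server, and everything lands in S3 before a compute job is kicked off — can be sketched roughly like this. The names (`choose_transfer`, `make_s3_key`, the 500 MB threshold) are illustrative assumptions, not Zamba Cloud's actual code:

```python
import posixpath

# Assumed threshold for direct browser upload vs. SFTP pull; the real
# application's limit is not stated in the episode.
DIRECT_UPLOAD_LIMIT = 500 * 1024 * 1024  # 500 MB

def choose_transfer(file_size):
    """Small files are uploaded directly; larger batches are fetched from SFTP."""
    return "direct" if file_size <= DIRECT_UPLOAD_LIMIT else "sftp"

def make_s3_key(batch_id, filename):
    """Lay files out by batch in S3 so one compute job can process a whole batch."""
    return posixpath.join("batches", str(batch_id), filename)

print(choose_transfer(10 * 1024 * 1024))   # direct
print(make_s3_key(7, "cam01.mp4"))         # batches/7/cam01.mp4
```

In the real pipeline, the copy to S3 and the Paperspace job submission would follow these decisions; those calls are omitted here since they depend on external services.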
Isaac Slavitt: At that point, what we kind of need to do is asynchronously babysit a process that is happening elsewhere, so we have this Heroku scheduler script that on a certain heartbeat will ask Paperspace, "Hey, which jobs do you still have running, and what status are they? Have they succeeded? Have they failed," so that it can update our state on the web application side.
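The "asynchronous babysitting" pattern — a Heroku Scheduler script that polls on a heartbeat and folds remote job statuses into local state — might look something like this minimal sketch. The status names and the shape of the remote snapshot are assumptions for illustration, not Paperspace's real API:

```python
# Statuses we treat as finished; assumed names, not Paperspace's actual states.
TERMINAL = {"succeeded", "failed"}

def reconcile(local_jobs, remote_statuses):
    """Fold a {job_id: status} snapshot from the remote service into local state.

    local_jobs: dict of job_id -> last known status for jobs we think are live.
    remote_statuses: dict of job_id -> current status reported by the service.
    Returns (still_running, finished) dicts.
    """
    still_running, finished = {}, {}
    for job_id, status in local_jobs.items():
        # If the service didn't report on a job, keep our last known status.
        new_status = remote_statuses.get(job_id, status)
        if new_status in TERMINAL:
            finished[job_id] = new_status
        else:
            still_running[job_id] = new_status
    return still_running, finished

running, done = reconcile(
    {"job-1": "running", "job-2": "running"},
    {"job-1": "succeeded"},
)
print(running, done)  # {'job-2': 'running'} {'job-1': 'succeeded'}
```

A scheduled task would call something like this on each heartbeat, then persist the updated states back to the web application's database.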
Jonan Scheffler: You're basically using promises, just as API calls: connect to an SFTP server, get the data up into S3. Then, when the API tells you that your job is finished, get that data back and put it into a spreadsheet. It's simple, really. That's all you have to do.
Isaac Slavitt: Yeah, that's exactly right. We even use that sort of promises concept in the application. What we call the Heroku scheduler script, its job is to poll, so it's trying to update the state of these long-running operations that it doesn't have any visibility into directly.
Jonan Scheffler: That's interesting, but Paperspace being a very specialized tool sounds like it's exactly what you would want to use for this thing, so that's a good thing to keep in mind. You've got this Django app for Zamba running up on Heroku, and you mentioned that you're using Postgres. Is this the database that you use across your applications? As far as Driven Data is concerned, Postgres is the one true database.
Isaac Slavitt: That's the one true database. I'll fight anyone who says otherwise.
Jonan Scheffler: I have actually offered to fight people live on the floor at Dreamforce. As I'm giving a demo in their booth, I said, "Postgres is the one true database. It supplants all other databases. Come fight me."
Isaac Slavitt: Yeah, and I'm not a database expert, but some of those advanced features, it just feels like we couldn't live without. The JSONB fields, we make heavy use of in the Zamba application, because we have our own database representation of these processing jobs and predictions and all of those things, but it's also occasionally important to us to look at the last payload that we got back from the API. It's just awesome that all we have to do is stuff that into a JSONB field, so there's no ... These days, you don't really have to choose between structured or unstructured, or SQL and NoSQL. You can kind of have the best of both worlds, especially if you use a modern database like Postgres.
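The pattern Isaac describes — structured columns for the fields you query, plus the raw API payload stashed alongside them — is easy to sketch. In the real application this would be a Postgres JSONB column (for example, Django's `JSONField` maps to `jsonb` on Postgres); the sqlite3 TEXT column below is just a stand-in to show the shape of the idea, and the table and payload contents are invented for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prediction_job (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,   -- structured: cheap to index and filter on
        last_payload TEXT       -- unstructured: whatever the API last returned
    )
""")

# Pretend this came back from the compute service's API.
payload = {"jobId": "ps-123", "state": "succeeded", "metrics": {"frames": 4182}}
conn.execute(
    "INSERT INTO prediction_job (status, last_payload) VALUES (?, ?)",
    (payload["state"], json.dumps(payload)),
)

# The app queries the structured column; the raw payload is there for debugging.
status, raw = conn.execute(
    "SELECT status, last_payload FROM prediction_job WHERE id = 1"
).fetchone()
print(status, json.loads(raw)["metrics"]["frames"])  # succeeded 4182
```

With actual JSONB you additionally get operators like `->>` to query inside the payload itself, which is what collapses the structured-versus-unstructured trade-off.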
Jonan Scheffler: I 1,000% agree, and this is actually almost becoming a best practice for me in my applications where I do store my API payloads. It's so useful when you're trying to debug across a service-oriented architecture. I've got multiple applications in my microservices architecture, and I'm trying to debug a request across each of those. Having the actual payloads that were returned from the various APIs stored in a JSONB field is incredibly valuable, and I do try to set it up for at least all of my lower-volume API calls in my applications. It's super useful.
Isaac Slavitt: Yeah. To be honest, I think Postgres is not the only relational database. I think at this point, if you choose any of them, you're probably fine. They have a good amount of parity these days. What I think is funny and kind of my favorite genre of post on Hacker News or other places where developers talk about stuff is when people are about to start a new business or something, and they're talking about using the hot, new event-sourcing data scheme or something like that, a true big-data problem that Google is grappling with when they don't even have five users whose data they need to store yet. I kind of think that that's like the Dilbert cartoon of our time.
Jonan Scheffler: Any other interesting corners of the Heroku platform you want to ... anything you feel like you're doing is maybe novel you want to share with us?
Isaac Slavitt: I have to be completely honest in saying that nothing that we are doing on Heroku is novel, and I consider that a good thing.
Jonan Scheffler: I agree with you. I agree with you. I was thinking this earlier when you were talking about the database, that people are lining up to use these new technologies and these new tools, and a lot of times, we're reinventing things that already existed, or we're making small, progressive steps forward. It's not necessary to jump to the newest, hottest thing all the time.
Jonan Scheffler: Use boring technology. I used Heroku long before I ever worked here, and I never had to think about it. It would run my application, and occasionally I'd get an email, and it would say, "Hey, there was a critical error with your database. Everything went to hell, and your whole production database was deleted, but we saved you, and here's your email letting you know that we restored it from backup while you were sleeping. Now, you have no further obligations to pursue." That is the kind of technology that I want.
Isaac Slavitt: Yesterday, I got an email that said, "Hey, did you know that one of the indexes on your primary production database is corrupted from the application layer? Hey, by the way, the way you fix this is you just run this command." I ran that command, and it worked out well.
Jonan Scheffler: See, this is exactly what I need more of in my life. So much of my day is just figuring out what obscure bug I've managed to encounter. My worst days as a developer are when I spend my first four or six hours of the day screwing around with my development environment, everything's broken, trying to get my pipeline set up so that I can ship my applications effectively. There are so many things that will already go wrong with your development environment. Simplify the pieces that you can. Use boring technology.
Isaac Slavitt: We've had people say, "You could just have a co-located server somewhere, and you could do this, that, and the other thing." I always say, "I know my limits. I'm not a dumb person on my good days, but I don't really want to be responsible for applying critical security updates to 8,000 components of this Linux system running somewhere in the cloud that I don't really understand, so why don't I focus on the data science and software development parts and let somebody smart take care of the rest?"
Jonan Scheffler: I love playing around with security, but I'd never trust myself to harden a production server. The idea is just crazy to me.
Isaac Slavitt: You don't want to play with security. You don't want to put your living in jeopardy dealing with that if you're not a security expert.
Jonan Scheffler: Exactly. Exactly. Well, Isaac, I think I have mostly run out of things to ask you about. Actually, that's never going to be true, because I am fascinated by Driven Data, and I'm going to keep watching these competitions. I'm looking forward to a day where I know enough machine learning that I can dabble in some of these, because I'm really impressed by the work you're doing. Again, I applaud you for your commitment to open source. I really, really appreciate that you're releasing these models under the MIT license specifically, my favorite license, the set-and-forget open-source license that actually contributes back to the world and lets people use it however they want.
Jonan Scheffler: You're doing great work, Isaac, and I thank you so much for joining me. I really appreciate your time.
Isaac Slavitt: My pleasure, and thank you for supporting the Zamba project with the Heroku credits that you gave us.
Jonan Scheffler: I am happy to do it, and we will continue supporting Zamba as long as we are able. For science.
Isaac Slavitt: For science.
Jonan Scheffler: Have a good day, Isaac. Bye-bye.
Isaac Slavitt: You, too.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Developer Advocate, Heroku
Jonan is a developer at Heroku and an aspiring astronaut. He believes in you and your potential and wants to help you build beautiful things.
Co-Founder & Data Scientist, DrivenData
Isaac is a co-founder and data scientist at DrivenData. He holds a master's in Computational Science and Engineering, and a BS in Operations Research.