82. Processing Large Datasets with Python

Deeply Technical
August 4th, 2020
Episode 82
34:27

Hosted by Greg Nokes, JT Wolohan

J.T. Wolohan is the author of "Mastering Large Datasets with Python," a book that helps Python developers adopt functional programming styles in their project prototyping, in order to scale up towards big data projects. Greg Nokes, a Master Technical Architect with Heroku, initiates their conversation by laying out what Python is and what it's being used for. As a high-level scripting language, Python was primarily used by sysadmins as a way to quickly manipulate data. Over the years, an ecosystem of third-party packages has grown up around scientific and mathematical approaches. Similarly, its web frameworks have shifted towards asynchronous flows, allowing developers to ingest data, process it, and handle traffic in more efficient ways.

J.T.'s book is all about how to move from small datasets to larger ones. He lays out three stages which every project goes through. In the first phase, a developer can solve a problem on their individual PC. This stage typically deals with datasets that are manageable, and can be processed with the compute hardware on hand. The second phase is one in which you still have enough compute power on your laptop to process data, but the data itself is too large. It's not unreasonable for a machine learning corpus to reach five terabytes, for example. The third phase proposed is one where an individual developer has neither the compute resources to process the data nor the disk space to store it. In these cases, external resources are necessary, such as cluster computing and some type of distributed data system. J.T. argues that by exercising good programming practices in the first phase, the third "real world" phase will require little modification of your actual data processing algorithms.

Links from this episode

"Mastering Large Datasets with Python" teaches you to write code that can handle datasets of any size
Amazon EMR is a popular way to parallelize data processing in the cloud

Show Notes

Greg:
Welcome to Code[ish]. This is Greg Nokes, Master Technical Architect with Heroku. I've got J.T. Wolohan on with me, who is the author of "Mastering Large Datasets with Python". Hi, J.T.

J.T.:
Hey, Greg.

Greg:
We're going to talk about large datasets with Python. Now, I'm a Rubyist at heart, and I have been for 10-15 years now. I know Python is a language, but the only thing I really know about it is it was named after Monty Python. But I understand that it's somewhat like Ruby, but other than that, I don't know much about it. So can you tell me a little bit about Python and why people choose it?

J.T.:
Sure, yeah. So Python has a lot of the same benefits as Ruby, being a high level scripting language. They both allow you to do a lot of powerful things relatively concisely with code. The main benefits of Python are the data processing ecosystem that's risen up around the language and within the language. It's really been a place that the scientific, data science, and machine learning communities have gathered. And when you combine that with the lightweight web frameworks, pretty similar to what you find in Ruby, and even some of the more fully featured web frameworks, you get a really powerful combination, right?

J.T.:
So you can do data processing, you can do data analytics, and you can use it for your web applications and databases and all that stuff as well. Super versatile in that sense.
Greg:
That's nice. I am familiar with the Django web framework. That's probably the 800 pound gorilla in the Python ecosystem.

J.T.:
Yeah, that's the big one. It's the popular one. It's been around for a while. It's very robust. Most of the other Python web frameworks are less opinionated than Django or at least lighter than Django where they don't come with batteries included. Popular ones include Flask and Pyramid. And then more recently, folks are starting to release asynchronous web frameworks that can make asynchronous calls like Quart. Quart is a Flask port that makes asynchronous calls and Starlette is a Flask-like web framework that is asynchronous first.

Greg:
That's pretty cool. That's a good intro into the Python ecosystem. Thank you. I've looked at Django in the past and I already know Rails pretty well, and I didn't really want to learn another large opinionated framework like that. So maybe I'll take a look at one of those lightweight ones like Starlette. That sounds interesting to me.

J.T.:
Yeah. If you think about Python as a data processing language or a data analytics language, the lightweight frameworks make a lot more sense because all they're really intended to do is serve data up through an API. So if you're deploying a machine learning algorithm and you just need to layer an API on top, something like Starlette is great because it can make calls to your machine learning routines, which are also written in Python, but it doesn't necessarily… It doesn't have any sense of admin rights or privileges in it, right? You can't like build a blog out of the box with it.

Greg:
That makes a lot of sense, because half of the battle of… Machine learning is great, but if you can't display that data to systems that are going to consume it or websites that are going to consume it, or whatever, then it's pretty useless to do all that work on it.

J.T.:
Right. Yeah. That's exactly right. There's certainly a place for data science as research and machine learning as research. And you can learn things by doing machine learning. It's got strong roots in the academic community. But in industry, what we really want to do is we want to embed machine learning in our systems, typically those are web applications of various stripes. And Python has a really convenient way of doing that because you can write your web applications or the web application integration in the same language as you write your machine learning software.

Greg:
I often work with companies that are designing streaming services, streaming-based architecture, with Kafka as a backbone. Do you see folks using that sort of an architecture much with machine learning, or are you more in the exposing it to a web applications via APIs?

J.T.:
Most of what I see is folks exposing it to web applications via APIs. Machine learning algorithms are… You can use them on the streaming data, right? Kind of like piece by piece as things come in, but you can't as easily train most of the algorithms on streaming data. They usually have some batch component. If you look at what folks do for… They're updating their recommender systems or something like that, right? They'll update them every few hours and then update them all at a time. They won't update them action by action by action. Some people do do that, but most of them will update the systems as a whole. That doesn't mean the systems can't take advantage of new information that they have about you.

J.T.:
It's just that the underlying model that's being used to assess the information that's available hasn't changed.

Greg:
You wrote "Mastering Large Datasets with Python." I assume the book is about machine learning and using Python to accomplish that?

J.T.:
Sort of, yeah. The book is about how individuals can adopt a scalable programming style that allows them to prototype on small datasets and move to increasingly large datasets up to the type of environment that you would see if you're working on a web scale, industry scale, enterprise scale problem.

Greg:
That's interesting. Because in my world, often we talk about three different levels of an application. The first is development where you've got mock data, very small sets, light resources. The next is staging where you're running smoke tests and maybe some load tests against it. Then of course, the third is production where you're turning everything up to 11 and going as fast as you can. So it sounds like you're using a similar thought process for building machine learning.

J.T.:
That's exactly right. And I actually think about it in three stages as well based on the types of parallelism you need to solve the problems in a satisfiable way. I imagine the first phase in which you can solve the problem on your individual laptop or your individual PC. And so this is a problem in which the dataset and the compute hardware are contained to your machine. So this is any prototype scale problem, right, will fit on a laptop and you'll be able to solve it with the compute resources that you have. The second phase is one in which you have enough compute resources to process it on your laptop in a satisfiable timeframe, but the data itself doesn't fit on your laptop.

J.T.:
So this is a problem where you might want to process, I don't know, five terabytes of data, right? You can't store that on a standard issue laptop, but you could certainly process it on a laptop in a reasonable amount of time, certainly less than 24 hours. And then the third problem I have or I propose is that you've got a problem where both the compute resources and the data needs have to be external to whatever machine it is that you're working on, right? So you basically need cluster computing and some type of distributed data system. And this is where you have maybe hundreds of terabytes of data that you need to process, and this is where you really start thinking about big data problems.

Greg:
And how do you approach these big data problems? Two parts of that question. First is how do you accomplish the clustering? Are you doing parallelism across different instances, or are you packing a whole bunch of resources into one huge instance? And then how do you approach the data problem? How do you store that much data? What sort of tools do you see people use?

J.T.:
So storage becomes less of a problem in the modern age, right, with the cloud. In the book, I talk about S3 and EMR as examples of how you might solve this problem. We actually released a free companion book all about object storage. But for those who aren't familiar with object storage, it allows you to store data in pretty much whatever format you want, and then you can ingest those data files with either individual scripts or scripts being run by a Hadoop cluster or something like that, right? We talk about how you can use EMR, which is Amazon's Elastic MapReduce service. It's a virtualized Hadoop cluster that you can rent by the second, and it'll run scripts of your choosing against data that's stored in S3.

J.T.:
The nice thing about this is that you can store as much data as you want in S3 and allocate as many resources as you need for EMR. And so the sky is the limit or, I guess, the budget is the limit on what you can do with the system. And of course, I always feel the need to add that Microsoft and Google both have synonymous or analogous systems. So you're not locked in to AWS or any of the cloud vendors if you choose to go with them, but that will typically be how you deal with kind of these big distributed processes.

Greg:
So I've heard the term MapReduce a lot over the years, and I've done a little bit of research into it, but I've never actually had to use it in a production system or a system at all. Most of the datasets I work with fit well within a terabyte, so I just use Postgres. Can you explain a little bit about what MapReduce is? Maybe like a 60 second version?

J.T.:
Sure. Yeah. So MapReduce is a programming pattern where you take one function and you apply it over a large amount of data to transform that data, and then you take another function and you apply it over the transformed data to collect all of the data into some new format. Classic example is counting words in documents. So the first function translates each document into a list of words and counts, and then the reduce function basically accumulates all of those counts. So at the end, you have a list of words and the accumulated counts from all the documents.
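
The word-count example J.T. describes can be sketched in a few lines of plain Python. This is a minimal illustration rather than code from the book: the function names `count_words` and `merge_counts` and the sample documents are made up for the example, and it uses only the standard library's `map`, `functools.reduce`, and `collections.Counter`.

```python
from collections import Counter
from functools import reduce

def count_words(document):
    """Map step: turn one document into a Counter of word -> count."""
    return Counter(document.lower().split())

def merge_counts(total, counts):
    """Reduce step: fold one document's counts into the running total."""
    return total + counts  # Counter addition returns a new Counter

# Hypothetical sample data standing in for a real corpus.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]

per_document = map(count_words, documents)                   # transform each document
word_counts = reduce(merge_counts, per_document, Counter())  # accumulate the results

print(word_counts.most_common(3))
```

The map step never looks at more than one document, and the reduce step only ever combines two sets of counts, which is what lets the same two functions be reused when the work is spread across many machines.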

Greg:
Okay. That's cool. Thank you. That helps a lot. So you could do the same sort of stuff in traditional 30 year old SQL. It would just take a lot more work in the programming language in between somewhere.

J.T.:
Yeah. And the big benefit of Map and Reduce is that it constrains the work that you want to do in such a way that it makes it inherently parallelizable. So you can trivially parallelize your MapReduce code. You can use the pattern on problems that don't require parallelism, and then when you do need parallelism, you can invoke it using the same pattern. So there becomes no transition between I'm working on a prototype to now I'm working on a problem that's kind of in this in between space to now I'm working on a large data problem where I need cloud computing resources.
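
One way to picture "the same pattern" is a function that accepts whichever map implementation you hand it. The sketch below is an assumption-laden illustration, not the book's code: `document_length` and `total_words` are hypothetical names, and a thread pool from `concurrent.futures` stands in for a real parallel backend. In CPython, threads mostly help with I/O-bound work; for CPU-bound work a process pool's `map` has the same shape and could be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def document_length(document):
    """Map step: number of words in one document."""
    return len(document.split())

def total_words(documents, mapper=map):
    """Apply the map step with whichever `mapper` is supplied, then reduce."""
    return reduce(add, mapper(document_length, documents), 0)

# Hypothetical sample data.
documents = ["the quick brown fox", "the lazy dog", "hello"]

# Serial: the built-in map.
serial_total = total_words(documents)

# Parallel: the same functions, only the mapper changes.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_total = total_words(documents, mapper=pool.map)

print(serial_total, parallel_total)
```

Nothing in `document_length` or the reduce step had to change between the two calls; only the thing doing the mapping did.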

Greg:
That's really useful. I always say that development, staging, and production should look exactly the same from an infrastructure perspective, since I'm an infrastructure person. And it sounds like MapReduce gives you the ability to make that code work almost exactly the same, no matter what size you're working on all the way up to perhaps even petabytes of data.

J.T.:
Right. That's exactly right. It's almost the de facto standard for your massive scale computing tasks. The Hadoop ecosystem and Spark ecosystem are the default standards for the petabyte scale data tasks. Their explicit support for the MapReduce style makes it an obvious choice for anybody who thinks that they're going to need to scale to that size.
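
As a rough picture of what a MapReduce-style script for a Hadoop cluster can look like, here is a sketch in the shape used by Hadoop Streaming, where a mapper and a reducer exchange tab-separated key/value lines and the framework sorts by key in between. Hadoop Streaming is not named in the episode, so treat that choice as an assumption. The functions `mapper` and `reducer` are illustrative, and the `sorted` call is only a local stand-in for the cluster's shuffle-and-sort step. In a real job the two functions would read from standard input and print to standard output.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Hypothetical sample input; sorted() stands in for the cluster's sort step.
text = ["the quick brown fox", "the lazy dog"]
result = list(reducer(sorted(mapper(text))))
print(result)
```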

Greg:
Fundamental question here, is Python a functional language or is it an object-oriented language?

J.T.:
Python is object-oriented, and I talk about some of the opportunities to use Python in a functional way in the book. The Python community is a little split about what the language wants to be. In a lot of ways, they're trying to be everything for everyone. There's strong functional support in Python, but it's not optimized for functional programming in any way. Sometimes you're doing things out of step with how the language was designed. That said, you can write clean functional code in Python. You can write functional looking code in Python. There are libraries in the Python ecosystem for all of the functional built-ins that you would find in a standard functional language.

J.T.:
So Python is not a functional language, no, but it's got functional language support. It's a little bit intentional, but it's certainly not the preferred way to write Python, but there are communities and pockets of the Python community that support the functional style.
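
To give a flavor of the functional support J.T. mentions, the sketch below uses only the standard library: the built-ins `map` and `filter`, plus `functools`, `itertools`, and `operator`. The `compose` helper and the other names are hypothetical, written for this example rather than taken from any library.

```python
from functools import partial, reduce
from itertools import takewhile
from operator import mul

def compose(*functions):
    """Compose single-argument functions, applied left to right."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)

double = partial(mul, 2)          # partial application: mul(2, x)
increment = lambda x: x + 1
double_then_increment = compose(double, increment)

evens = filter(lambda n: n % 2 == 0, range(10))  # lazy: 0, 2, 4, 6, 8
small = takewhile(lambda n: n < 7, evens)        # stops at the first n >= 7
result = list(map(double_then_increment, small))
print(result)  # [1, 5, 9, 13]
```

None of these steps modifies its input; each one builds a new lazy iterator over the previous one, and nothing is computed until `list` consumes the chain.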

Greg:
That's interesting that Python is so mutable that you can approach both or you can support both approaches to programming.

J.T.:
Yeah. And that's a strength of Python, but it's also a weakness. I think one of the reasons that I wanted to write the book was because I saw that people who were being trained in data science programs weren't getting any opinionated way of writing code or programming because they're coming up through these data science programs, maybe haven't taken some computer science classes, maybe having a computer science degree, but usually not having been full-time developers. And so they didn't have a traditional object-oriented style. They didn't have a traditional functional style.

J.T.:
What they got using Python was just a mishmash of I know how to use this library. I know how to use this library. I know how to use this library, but I don't have an overarching philosophy about how to write code and what good code should look like. So part of what I'm trying to do here is propose a semi-functional approach to writing code for data science and analytics and data processing.

Greg:
That's really cool because, coming out of the Ruby community, style and code style can be the genesis of many interesting conversations, if you want. So it's interesting that I am a very opinionated person, tabs not spaces or spaces not tabs. I don't want to start that conversation. So I get the opinionation, absolutely, and I get that communities need to have kind of an agreement on what code is going to look like and how it's going to be written both from a readability perspective, but also from a perspective of, if I'm going to write an add-on for a language or a Python extension, it should be written in a way that anybody else can look at it and understand it, and it will work correctly with the rest of the ecosystem. So yeah, I absolutely get that.

J.T.:
Yeah, it's interesting. The Python ecosystem is so diverse that you get quite a hodgepodge of different styles, even in areas where the community has tried to define standards. There will be rogue developers who go by their own standards. And then just because their library or package is quite useful, it'll get adopted and nobody will take the time to go back and redo the code to fit the conventions. You can get things that are quite out of convention, which is one of the more frustrating pieces about Python.

Greg:
The first time I actually heard of Python was… I don't want to say how long ago, but it was being used mainly as a scripting language for sysadmins. Because that's my background. So that's kind of when I first bumped up against it. I had just learned Ruby though, so I was like, well, you guys do your stuff in Python, I'll do my stuff in Ruby, and we'll all be happy. I had the ability where I worked at the time to do that, so they kind of left me alone. It's been around for a while, so I can see it has a very diverse community behind it: sysadmins, web front-end developers, big data people. And the Ruby community definitely doesn't have that. We're primarily I think web folks.

Greg:
I think that's actually a pretty good strength of the Python community is you do have that diverse ecosystem behind it.

J.T.:
Yeah, it's definitely a strength. It gives people the sense that in Python they can solve any problem that they need, which is doable in any of the modern languages, right? But in Python, there tend to be tools out there to support you on that journey and examples of people trying to solve the same problem that you are with Python.

Greg:
So I just realized that some folks listening might be more object-orientated and not really understand functional or might not even be coders at all. So could you give a quick outline of the difference between an object-orientated and a functional style?

J.T.:
Yeah. I think the names draw the distinction. In an object-oriented style, you spend a lot of time defining your data structures and types and what they do, and everything is about these things doing other things. So it's about your objects more than your verbs. Functional programming kind of flips that around and it's really about your verbs. You define the verbs and then the nouns are kind of just there. You pay less attention to the types of data and the different structures, so you'll tend to have less… You won't spend time defining lots and lots of classes typically, although you might define types, you'll define lots of functions that operate on data and do things with that data.

J.T.:
Typically, functional languages won't operate on things in place. That's highly frowned upon. What functional languages prefer to do is return new and updated objects. Languages that really support the functional style have lots of compiler niceties that make that lower overhead. One of the reasons why the functional style is picking up in popularity is because the increasing RAM that we have with modern machines supports the higher overhead that comes with returning new data types and new data structures all the time. Whereas with object-oriented programming, you can operate on everything in place and everything is really… It can be really efficient.

J.T.:
If you're constantly returning new objects and creating new data structures in a language like Python that isn't compiled, you're going to have a lot more overhead. Even if you get some benefit out of it, it might not have been possible to use a functional style to solve certain problems 15 years ago because you might just not have enough RAM.
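
The contrast between operating in place and returning new objects can be shown side by side. This is a minimal sketch with made-up names (`MutablePerson`, `Person`, `with_address`). It relies on the standard library's `dataclasses` module, where `frozen=True` blocks attribute assignment and `replace` builds an updated copy.

```python
from dataclasses import dataclass, replace

class MutablePerson:
    """Object-oriented style: the object owns its data and changes it in place."""
    def __init__(self, name, address):
        self.name = name
        self.address = address

    def change_address(self, new_address):
        self.address = new_address  # mutates the existing object

@dataclass(frozen=True)
class Person:
    """Functional style: plain immutable data, with behavior in separate functions."""
    name: str
    address: str

def with_address(person, new_address):
    """Return an updated copy; the original is left untouched."""
    return replace(person, address=new_address)

alice = MutablePerson("Alice", "1 Old Road")
alice.change_address("2 New Street")   # the old address is gone

bob = Person("Bob", "1 Old Road")
moved_bob = with_address(bob, "2 New Street")
print(bob.address, "->", moved_bob.address)
```

The second version allocates a new object on every update, which is the memory overhead J.T. refers to. In exchange, nothing else holding a reference to `bob` can be surprised by the change.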

Greg:
I learned object orientation, again, longer ago than I'd like to admit when I first picked up C++. And the instructor I had at the time told me that you model the world with objects. So you have a person object and that person object contains all the data about a person in records. And then you can have like, I want to change my mailing address as a function inside of that person object. It's hard to break out of that thinking because I tried to pick up a language called Elixir a few years ago, and my first project on a new language is, of course, I'm going to write a blog. And I sat down and I started trying to figure out how to write a blog without objects.

Greg:
And honestly, I couldn't. I'm so entrenched in object orientation that I really couldn't come up with how to do that. It's interesting that that sort of paradigm really can define how you look at coding and how you look at the world. I think for some people it can be difficult to break out.

J.T.:
I think that the different paradigms have different tendencies and are better suited for different areas of programming or challenges of programming, right? And there are probably functional purists who would disagree with this, but I think object-oriented programming tends to be better for web programming, for tasks like building a blog. But if you are processing large amounts of data or you are defining mathematical operations that you want to apply to data, then functional programming is a more suitable paradigm because the logic gets to be separate from whatever task you're trying to solve.

Greg:
Sort of like Unix tooling, which is near and dear to my heart.

J.T.:
That's exactly right.

Greg:
Data goes into a Unix tool, and it shouldn't need to know where it is or what it's doing. That almost brings me to thinking, maybe functional programming is what you should be using for things like event streaming architectures, because you just have these events being fired at you and they might change a little bit. Maybe you don't have an object defined about what this event looks like. You're definitely not storing it. You're just processing and putting a result back into the stream. So maybe I should start looking at that for those sorts of architectures.

J.T.:
I think that's right. I think Scala is a really popular language for problems like that, and it's got great streaming support. And I don't know how familiar you are with it, but it's a functional derivative of Java. Compiles to the JVM. So you can work with the Java ecosystem and all the benefits there. And a lot of streaming oriented, big, free, and open source software projects are being written in Scala for exactly that reason. And it brings all the benefits of parallelism that you would get from a functional language, so trivial parallelism with Map and Reduce.

J.T.:
You also get pretty fast code because it compiles really, really nicely down to Java bytecode, and the functional methods that you come up with will be compiled really nicely into Java bytecode and Java classes.

Greg:
One thing that I'm kind of reflecting on right now though is with Python's sort of two-headedness, where it has some functional support and it is object-orientated as well, I think that gives you a… And I want your opinion on this, but it might give you an interesting tool set to approach especially machine learning problems, because you can lean on the object orientation when you need to deal with data structures, but then you can use the functional side of it when you do need to do that sort of functional processing.

J.T.:
Yeah, and a lot of people… So the machine learning process is interesting. Machine learning kind of flips the script on the Map and Reduce paradigm. It's more of a reduce and then map process where you need to create something from lots and lots of data and then you need to apply that thing to lots and lots more data. Typically, people have used classes for machine learning primarily because a lot of the early machine learning software comes out of the Java community and a lot of the software developers were familiar with the object-oriented style. I think there's a real opportunity to distill it into a functional style.
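
The "reduce and then map" shape can be illustrated with a deliberately trivial model. In this sketch the "training" is nothing more than reducing a list of numbers to their mean, and "prediction" is mapping a threshold function over new values. All names (`accumulate`, `fit_mean`, `make_predictor`) and the data are hypothetical; a real learning algorithm would do far more in the reduce step.

```python
from functools import reduce

def accumulate(state, value):
    """Reduce step: fold one training value into a running (count, total)."""
    count, total = state
    return count + 1, total + value

def fit_mean(training_values):
    """'Train' a trivial model: the mean of the training data."""
    count, total = reduce(accumulate, training_values, (0, 0.0))
    return total / count

def make_predictor(mean):
    """Return a function that labels a value relative to the learned mean."""
    return lambda value: "high" if value > mean else "low"

# Reduce: lots of data in, one small model out.
training = [2.0, 4.0, 6.0, 8.0]
model = make_predictor(fit_mean(training))

# Map: apply that model independently to each new data point.
labels = list(map(model, [1.0, 5.0, 9.0]))
print(labels)
```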

J.T.:
And I think increasingly you're seeing functional programming communities start to build machine learning ecosystems. And I think when one of them gets it right, if they could ever mount the momentum to chip away at Python's pretty big headstart on becoming the language of the machine learning community, I think there's a real opportunity there. The one tricky area is that functional languages tend to be descriptive languages, so you tend to describe what you want to do, and then a lot of the actual work is defined by the compiler. That works well when you need to do general data transformations.

J.T.:
It works less well when you are doing huge matrix algebra and you need to optimize the moving around of bits or you're working with a GPU and you're optimizing the moving around of bits again, right, which is a lot of what deep learning is. How applicable the functional languages are to deep learning is a question. There have been some strides. There are some functional GPU support libraries that are competitive with the most highly optimized kind of procedural libraries. But I still wouldn't say that they're the default or even… I can't recommend them in good faith for somebody who has a production use case in mind.

Greg:
If you were a new programmer or new to the machine learning environment, besides reading your book, of course, what steps would you advise someone to take to learn more about this topic?

J.T.:
So I think one of the best things that anybody who's new to machine learning or programming can do is to learn several different languages, because this will give you a broader perspective on whatever language you do end up choosing to work in, and you'll learn more ways of solving the problems. You also learn the patterns. When you only work in one language, I don't think you get the opportunity to see how the things you are doing in that language are part of more general patterns that programmers use to solve problems.

J.T.:
But if you work in Python and then you learn some Java and then say you learn a functional language like Scala or Clojure, you'll learn different ways of solving problems, and you'll see the different patterns of solving problems. And that'll help you when you need to go off and solve challenging problems that you haven't faced before, which tends to be what trips us up as programmers.

Greg:
Yeah, absolutely. I did not follow that advice as a young programmer and that's why I got into sysadmin work. I learned C++ and then basically stopped. So I'm still suffering from that because like I said earlier, when I look at a problem, I see objects. I have a really hard time kind of understanding the functionality of it. And I think that the streaming architectures that I'm seeing people start to build now are sort of opening my eyes to the other side of it, and it's making me want to maybe pick up some skills in that side of the world, and I'm looking forward to doing that.

J.T.:
It's definitely a challenge. When I started learning Haskell, learning how to solve problems without using for loops broke my brain a couple of times. But once you kind of break it down and understand recursion and how to work without side effects, you start to see your other problems in a different light. So even when you're working in Python and you can really do whatever you'd like, you understand the problem that you're dealing with in a different way.
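
For readers who have not tried it, here is what solving a problem without a for loop can look like in Python. The functions `total` and `count_matching` are illustrative names. Python does not optimize tail calls and has a default recursion limit of about a thousand frames, so treat this as a thinking exercise rather than a recommendation for large inputs.

```python
def total(numbers):
    """Sum a sequence recursively: no loop and no mutation."""
    if not numbers:              # base case: an empty sequence sums to 0
        return 0
    head, *tail = numbers        # split into the first item and the rest
    return head + total(tail)

def count_matching(predicate, items):
    """Count the items satisfying `predicate`, again without a loop."""
    if not items:
        return 0
    head, *tail = items
    return (1 if predicate(head) else 0) + count_matching(predicate, tail)

print(total([1, 2, 3, 4]))
print(count_matching(lambda n: n % 2 == 0, [1, 2, 3, 4]))
```

Each call describes the answer in terms of a smaller version of the same problem, and no variable is ever reassigned along the way.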

Greg:
Yeah, exactly. And I think that's the important part is I've given folks that advice of, hey, learn a couple languages. Don't just learn Ruby or whatever it is you're learning. Try to spread your wings a little bit. I think that's the important part of that is you can learn different ways of solving the same problem, and then when you come back to the language that you want to use, whichever one that is, you can bring that skillset, that other way of looking at it. And then not everything's the for loop that you're going to… You're not going to hammer every problem with a for loop.

Greg:
You might say, "Hey, it's actually more important or more efficient to use recursion here. I can do that in Ruby, so I will." That makes a lot of sense.

J.T.:
You don't want to get in the position where you're a developer and you only know one tool, and so every problem has to be solved with that tool, right? My teams all run kind of polyglot shops where we'll use JavaScript for the JavaScript problems, we'll use Python for the Python problems, we'll use Scala for the Scala problems, R for the R problems, and mix in shell scripting where we can't use anything else.

Greg:
So if someone's interested in your book, where can they get it?

J.T.:
Manning is the publisher and they're at manning.com. You can also search for it on Amazon by just searching the title Mastering Large Datasets with Python, or you could pick a copy up at my website, which is jtwolohan.com.

Greg:
Cool. And isn't there a discount code for it?

J.T.:
Yes, there is. I think it's podish19.

Greg:
It is. Yeah, podish19 is the discount code for 40% off the book.

J.T.:
And all other Manning products I believe as well, so that includes other books. They've got a great lineup of books on different data science topics, Python programming topics, as well as some functional languages too, if you're interested in that. They've got books on Clojure and even functional JavaScript, I believe, if you wanted to do something really strange.

Greg:
That's something… I try to stay away from JavaScript.

J.T.:
Yeah, I don't blame you there.

Greg:
After this podcast actually comes out, you and I will probably be tweeting out a few opportunities to get some codes to get the book for free. So if you're listening, go ahead and like, follow, and subscribe, and you can maybe get a code for the book for free. And J.T., thank you so much. It was fantastic talking to you. I really appreciate the time and hopefully we can chat again.

J.T.:
Thanks so much, Greg.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted By:

Greg Nokes

Director of Product Management, Heroku Data, Heroku
@tsykoduk

with Guest:

JT Wolohan

Sr. Lead Data Scientist, Booz Allen Hamilton
@jwolohan