18. The Making of Trailhead
Hosted by Chris Castle, with guests Tyler Montgomery and Shaun Russell.
Trailhead is an e-learning platform that teaches a wide range of topics, from Git and Swift, to Marketing and Analytics. It was also built and launched in just six short weeks. Tyler Montgomery and Shaun Russell join us from Salesforce to talk about the genesis of the project and the scaling issues they've faced over the last few years.
Tyler Montgomery, Trailhead's engineering director, and Shaun Russell, its principal engineer, kick off a conversation with Chris Castle as to how Trailhead came about. One of Salesforce's developer evangelists, Josh Burke, wanted to create some teaching material for classes he taught. The idea was that students wouldn't just read some content and take a quiz; they would perform real actions, such as making a dummy user an admin, and an API call would assert that they accomplished the task successfully.
Due to its tight deadline of just six weeks before Dreamforce, the Trailhead team built the app using Ruby and Rails, and hosted the site on Heroku. Although they've seen huge growth, a lot of naive technical decisions have led to a mix of addressing performance issues as well as launching new features over the past few years. Their largest near-outage came about when hundreds of thousands of students in India began to use the site all at the same time in order to participate in a competition. Heroku was able to scale up, but this exposed many problems which, when the traffic subsided, better prepared the Trailhead team for future scaling issues after all the code fixes were in place.
The conversation concludes with advice on scaling up an application on Heroku. Shaun suggests an APM tool like New Relic to stay on top of performance problems before they become an issue. Tyler suggests implementing an entire pipeline of tooling--PagerDuty, errors logged into Slack, segmented environments for staging and production--before continuing work on any feature code.
Chris: Hello, and welcome to Code[ish]. We have another great episode for you today. We're going to be talking with some engineers behind Trailhead today. And what is Trailhead? Well, I'll let our guests answer that, but first, let's do some introductions.
Chris: Tyler, can you tell us a little bit about who you are, what you do, what you work on, and any other little snippets or nuggets that you want to share about yourself?
Chris: Gotta have people organizing the people. So, Shaun, how about you? Who are you? What do you do?
Shaun: Yeah, I'm Shaun Russell. I'm a principal engineer at Trailhead. I've been with Salesforce for three years, and before that, I actually worked with Tyler at LivingSocial. I've played many roles throughout my career, but right now, I'm a full stack with a focus on UI. I'm working with Trailhead's design system right now, and I'm a huge fan of Heroku. I've been using your product since 2010 pretty regularly.
Chris: Oh, well thanks. I think I'm similar there, even before I started working at Heroku, I looked up my creation date in the Heroku database, and it was the end of 2009 or 2010 or something like that. What programming languages have you had professional experience with, Shaun?
Shaun: Yeah, so I've been programming professionally for a while. So, I started my first paying gig in 2003, and it was PHP, because that's what everything was back then.
Chris: Right. Java or PHP.
Chris: Cool. So, how about ... Let's answer: what is Trailhead? Tyler, what is Trailhead, from an end-user's perspective?
Tyler: So, it's a hands-on learning platform. You come in, you learn about all these different subjects, and you can actually test out your knowledge on a real Salesforce environment to figure out how to do some stuff.
Tyler: And so, I would have loved this back in the day, trying to work with Salesforce data and trying to work with Salesforce APIs as a Rails developer, Ruby developer. We didn't have it then, so now, it's pretty powerful to be able to integrate this business data that a lot of companies have in Salesforce with your cool apps that you have on Heroku. And so we really see it as a way to empower everyone to ... how to use Salesforce better, and how to really take advantage of everything that Salesforce has to offer, so ...
Shaun: Yeah, it's really a lot more fun and interactive way of learning something than reading documentation. We have content even about learning Git and Swift and other technologies on there; it's not just Salesforce.
Tyler: Yeah, we have other partners. Atlassian and GitHub have contributed content, Apple has contributed content. So it's not just about Salesforce. We've kind of expanded beyond the Salesforce ecosystem, and our partners are starting to see, hey, this is a really interesting way for people to learn, and they've started contributing.
Chris: That's interesting. I think there's something, maybe you know this, but I think there's something really powerful in what you folks have figured out in that the technology industry or technology world, whether that's a Salesforce developer or a Salesforce admin as the examples of that.
Chris: Presently there are a lot of barriers for people to get into technology, whether that's very complex documentation or a male-dominated culture who communicate in ways that males identify with. I think it's pretty interesting and cool what Salesforce is trying to do here, has done and is trying to do as other topics outside of Salesforce are added to this, in just making technology and building technology more accessible to others.
Tyler: Yeah, it's one of our big values, trying to empower everyone. There's a bunch of great stories. You can go on trailhead.salesforce.com or, to make it easy, trailhead.com, and you can read through a bunch of the stories that we have for people making these transitions.
Tyler: And it's really cool to see people go from non-technical jobs within a year or two to these really well paying technical jobs, and that's why I love doing it every day.
Chris: That actually is a good point to build cool stuff or helping people build cool stuff, because that's kind of the mission or a value of Heroku too to help developers build cool stuff.
Tyler: Yeah, it's been really useful for us to be there. But yeah, Trailhead is built on Heroku. We have a number of pipelines, I was trying to count them up, it's too many to count now quickly. But we have a number of applications all running on mostly Ruby on Rails. We have some JRuby apps, we have some Node apps, kind of all running on Heroku.
Tyler: So it makes it really easy for us, as we've grown over the last five years, to really try things out and try and figure out how to expand and how to get bigger, because this was a side project that got out of control.
Chris: Yeah, I was just going to ask that. Before we go into the more deeper technical pieces. How did Trailhead start up or where did the idea come from?
Tyler: Yeah, so before Shaun and I's time, this is about five years ago now, one of our developer evangelists, Josh Burke, he's still with the company, he's going around the country as developer evangelists do, they go teach classes, they show up at user groups, they go to conferences, and he wanted a way to know what level of Salesforce skills that the group he was presenting to or the class he was teaching had before he got there.
Tyler: He just wanted a way to test their knowledge, and so he fired up a little Node application to connect to the Salesforce API, so you'd sign up for what we call a Developer Edition. So it has all of the same seed data in it, you go sign up for it, you grab it, it's yours, it has a couple of logins, and then you OAuth with his app. And then his app can use the Salesforce APIs to test your knowledge.
Tyler: And you can do simple things like hey, go log in and make Chris an Admin. Create a user and make Chris an admin on the Salesforce instance. And so you come back to the application, you click check the challenge, it reaches out to the APIs, and says oh no, Chris isn't an admin, it's the wrong license type. And then you go back in and you fix it, and then yay, you pass, you win.
Tyler: And so Josh showed this to some of the other developer evangelists. They're like this is awesome. We need to make this, and go and do a thing, we need to make this real. So they got in a room, they said okay. And we're going to debut this at Dreamforce which is Salesforce's big conference in the fall. And they said that's in six weeks. Ready, set, go.
Tyler: So they did an initial sprint, and that's how we ended up on Heroku, was developer.salesforce.com was already hosted on Heroku, was already a Rails app. And they said well, this is a prototype. Essentially, we have a prototype in Node, but we already have this thing running, we already kinda have all the URLs, all the stuff hooked up. So let's just build it.
Tyler: So it lived at /trailhead on developer.salesforce.com, and it was a Rails app already on Heroku because it was just easy to get something up and running on this in six weeks. And that was the constraining factor there. So they spiked on it, they got it done, they shipped it for Dreamforce. They had one slide in the developer keynote announcing it. And that was it.
Tyler: And then it was just kind of like we'll see what happens. So from there, we talk about you earn badges on trailhead. Over the next year, so that was in 2014, so we had 11 badges that we launched. And over the next year, 300 thousand badges were earned in the community in the next year so it just kind of caught fire. It was like whoa, this is crazy.
Tyler: So at this point we'd earned 300 thousand badges in all of 2015. Fast forward, we're doing more than 600 thousand badges a month. So the scale we've grown is just more than 10x in the last couple of years, especially in the last three year period we've been growing 10x. So everything around us has been growing like that, so the team has doubled every year that I've been here. And we're just going like crazy.
Tyler: So and being on Heroku has really helped us with that right because it's pretty easy just to scale as we go. Not that it's been easy, scaling a web app and growing is not easy. It can be really difficult.
Tyler: So that's kind of the genesis story, that's where our company came from.
Chris: That makes sense. I was going to ask, yeah Shaun, about scaling. Of course many people know that yeah, it's easy to just drag a slider, run a CLI command or make an API call to horizontally scale. Or you could similarly scale vertically by going up a dyno size.
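As a rough illustration of the "make an API call" option Chris mentions, the sketch below builds (but does not send) a request against the formation endpoint of the Heroku Platform API. The app name, token placeholder, and dyno size value are assumptions made up for the example; nothing here is taken from Trailhead's actual setup.

```python
import json
import urllib.request


def build_scale_request(app_name, process_type, quantity, size=None,
                        token="HYPOTHETICAL_API_TOKEN"):
    """Build a PATCH request that would scale one process type of an app.

    quantity changes the dyno count (horizontal scaling); size, when given,
    changes the dyno size (vertical scaling). The request is only
    constructed here, never sent.
    """
    body = {"quantity": quantity}
    if size is not None:
        body["size"] = size
    return urllib.request.Request(
        url=f"https://api.heroku.com/apps/{app_name}/formation/{process_type}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Accept": "application/vnd.heroku+json; version=3",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical app name and dyno count, for illustration only.
request = build_scale_request("example-learning-app", "web", 8)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The CLI equivalent of the same change would be along the lines of `heroku ps:scale web=8 --app example-learning-app`.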
Chris: But I imagine there's other things, or there's always other things you have to do to scale. It's not always that simple. What are some of the other things or other challenges you've faced in scaling?
Shaun: One of the things that we've struggled with, and it's kind of typical of Ruby on Rails applications, is N+1 queries. So you have to make sure that you have good developer practices to limit those and good monitoring--we use New Relic--to catch those early and stomp them out.
Shaun: And once you have those under control, it's looking at request queuing, making sure we're provisioned correctly and that our Puma configuration is adequate and appropriate and our garbage collection is tuned, both the object allocation and things like that.
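To make the N+1 pattern Shaun describes concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Rails. The table names, columns, and row counts are invented for the illustration. The first function issues one query for the users plus one more query per user; the second fetches the same data with a single join, which is the idea behind eager loading in Rails (for example via `includes`).

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts how many SQL statements are executed."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def run(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def build_demo_db(user_count=50, badges_per_user=3):
    # Hypothetical schema: users who have each earned a few badges.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE badges (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    """)
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, user_count + 1)],
    )
    conn.executemany(
        "INSERT INTO badges (user_id, title) VALUES (?, ?)",
        [(i, f"badge-{i}-{n}")
         for i in range(1, user_count + 1)
         for n in range(badges_per_user)],
    )
    return CountingDB(conn)


def badges_n_plus_one(db):
    # One query for the list of users, then one extra query per user.
    users = db.run("SELECT id, name FROM users ORDER BY id")
    return {
        name: [title for (title,) in db.run(
            "SELECT title FROM badges WHERE user_id = ? ORDER BY id", (user_id,))]
        for user_id, name in users
    }


def badges_single_query(db):
    # The same data fetched with a single JOIN.
    rows = db.run(
        "SELECT u.name, b.title FROM users u "
        "JOIN badges b ON b.user_id = u.id ORDER BY u.id, b.id"
    )
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


db = build_demo_db()
slow_result = badges_n_plus_one(db)
slow_queries = db.queries

db.queries = 0
fast_result = badges_single_query(db)
fast_queries = db.queries

print(f"N+1 version: {slow_queries} queries, join version: {fast_queries} query")
```

The query count of the first version grows with the number of users, which is why pages like this get slower as data grows and why a monitoring tool tends to surface them.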
Shaun: People come on board and everyone has something to contribute from their past experiences using it. And everyone knows how it works. They know the CLI, it's great.
Chris: Yeah, I remember you saying to me earlier that one of, when you saw that Trailhead was using Heroku that that was a positive from a hiring perspective.
Shaun: Yeah, at least for me it was a big positive. So, I worked at a startup in 2010, and we were initially on Heroku. And I took that team from a solo developer, me, through 25 people. And we grew on Heroku the whole way up until we had to become HIPAA compliant.
Chris: Oh yes okay.
Shaun: And then that came in and Heroku Shield wasn't a thing then.
Chris: Right yep.
Shaun: So I had to port that all over to AWS with OpsWorks and Chef and all that. And never again. So yeah.
Chris: Yeah, you seemed deeply. It's almost a perfect experiment that you've done. You've seen the same end user result implemented in two places, and not just like a simple Rails app with a database but something that is used by many users.
Shaun: Yeah, I had to transition that off of Heroku and try to replicate the developer experience too while supporting a growing team on AWS. And we just simply weren't able to do that. This was before Dokku was a thing, and Elastic Beanstalk, so it was a unique experience that I don't have to think about anymore.
Chris: Sounds good. That's true. So there was an event that required sudden and unexpected scale. Can you tell us the story about this event?
Tyler: Yeah, so we had an event, it is now known as Astrogedden. Astro is our mascot, and it's Astrogedden. And so what happened is we have a group called trail interest students. And they work with universities on getting students learning Trail-ed.
Tyler: And they set up an event in India with a hundred different colleges in India, and said over two days, the college that earns the most badges is going to win some prize money, and the person who earns the most badges, the top three badge earners in 24 hours will earn some prize money and the fame and adoration of their peers.
Tyler: Now, is learning like this in 24 hours, 48 hours going to stick? Who knows. But that was kind of the thing, and everybody on this thing. And for us it was just, let's get a bunch of new people in Trailhead and see what sticks.
Tyler: And unfortunately they forgot to tell us, the engineering team about it. And so I'm sitting there on I think it was Wednesday night. I'm sitting on the couch with my wife at 9:30 PagerDuty goes off saying hey, the site's having some availability problem. And she looks over and says hey, what's happening in India because she knows, India is our second largest user group outside of America.
Tyler: And so we've had issues, it's 9:30, that's 10 AM India time. 9:30 PM is 10 AM India, so she knows the drill. I think every spouse knows and hates the PagerDuty sound. And so I get online, I start looking, hmm, the site is super super slow. So I started pinging people to get online, and I start looking at Google Analytics and stuff like that. Oh my gosh, all of India is on Trailhead right now, is what it felt like to us just in terms of scale.
Tyler: And it was pretty amazing, and so we were able to throughout that event throughout those two days, we ended up scaling our dynos almost 100x to keep up with the load. We had some really crappy queries that were slowing things down on a couple of pages.
Tyler: And the problem that we had is the students were all following this workflow, so everyone, all of them are all signing up at one time. So if you want to load test your site, this is a great way to do it. Just organize an event in India, get several thousand people all from real, it was entire computer labs, you can imagine your entire university computer labs, your libraries, everyone just sitting there in a big hall doing the same thing.
Tyler: So they're all going to sign up, and they're all going to, we have a feature called Trail Mixes, and it's a playlist of content. So they had a Trail Mix they made for the event, and so we had some problems with signup at the time, some queries that were not performant. Like Shaun was talking about, N+1 queries and stuff are always the bane of existence. And then Trail Mix had this query problem, we knew it was a problem: the more stuff you added to the playlist, the slower it got. And they happened to just add a whole ton of stuff to their Trail Mix.
Tyler: And it was really slow. Since then, we've fixed these problems, right, and this is part of doing the post-mortem, doing the root cause analysis stuff, which was great. So this was a really great load test to figure out and really highlight some of these issues.
Shaun: Made us think a lot more about caching, and I believe also we may have had some synchronous API calls in the sign up process.
Tyler: Yeah, so at the time we were integrating with the Salesforce org to do single sign on. So Salesforce has a single sign on product, so our Rails app didn't have to manage talking to LinkedIn and Twitter and Facebook and all that, so it just talks to the Salesforce org, and so that we had some synchronous queries with that Salesforce org.
Tyler: And so we were just DDOSing ourselves every time we'd go through this different flow. So and you know, we fixed a lot of this stuff. This is two years ago at this point. I think Shaun you went on a rampage on our homepage trying to make it faster. I think you're still on that project.
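The conversation does not say exactly how the synchronous calls in the sign-up flow were fixed. One common way to take a slow external call off the request path is to enqueue it for a background worker, sketched below with Python's standard library. The `slow_identity_sync` function is a hypothetical stand-in for the call to an external single sign-on service, not a real API.

```python
import queue
import threading
import time


def slow_identity_sync(user_id):
    """Hypothetical stand-in for a slow call to an external identity service."""
    time.sleep(0.01)
    return f"synced:{user_id}"


sync_jobs = queue.Queue()
synced = []


def worker():
    # Drain jobs until a None sentinel arrives.
    while True:
        user_id = sync_jobs.get()
        if user_id is None:
            sync_jobs.task_done()
            break
        synced.append(slow_identity_sync(user_id))
        sync_jobs.task_done()


def sign_up_blocking(user_id):
    # The request waits for the external call before it can respond.
    slow_identity_sync(user_id)
    return {"user_id": user_id, "status": "created"}


def sign_up_deferred(user_id):
    # The request only enqueues the work and responds immediately.
    sync_jobs.put(user_id)
    return {"user_id": user_id, "status": "created"}


thread = threading.Thread(target=worker, daemon=True)
thread.start()

responses = [sign_up_deferred(user_id) for user_id in range(20)]

sync_jobs.join()      # wait for the background work to finish
sync_jobs.put(None)   # tell the worker to stop
thread.join()

print(len(responses), "sign-ups accepted;", len(synced), "syncs completed in the background")
```

The trade-off is that the sign-up response no longer guarantees the external system is up to date, so the rest of the flow has to tolerate a short delay.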
Tyler: It's kind of never-ending.
Shaun: Cache all the things.
Tyler: Yeah, so and that's one of those things, as you grow, you don't expect these things to happen, and they're interruptive and they kind of highlight some of the weak spots that most developers kind of know, that thing's kind of crappy but you know it's not used that often.
Tyler: But then you have one of these events that really highlights that. So it was really good to get some of this technical debt on the roadmap and really get the PMs and the business owners to see that this thing is really important. We need to invest time, and we need to take away from maybe some of the feature work.
Tyler: Because when you're in something that's growing this quickly. Maybe Shaun you could talk to this too. We are always trying to do the next feature the next thing. And trying to push push push to keep going and you have this wave, it feels like you're just surfing a wave of attention and interest. And you want to keep going on that wave, so it's really hard to stop.
Tyler: So Shaun I don't know if you can talk about that at all, what it's been like to make these trade offs as we've grown.
Shaun: Like Tyler was saying before, we had a lot of momentum that was unexpected at the beginning, right? And we want to maintain that, so it's pushing feature after feature for a while. We kind of operate a lot like a startup within the company.
Shaun: So when you're building features fast, you're not necessarily thinking about optimization, and some of that goes out the window to hit deadlines and whatnot.
Shaun: And so I think events like that prompt you to kind of regroup and address some things and to try to prioritize that into future backlog and whatnot.
Tyler: Yeah, so to sort of round out this event, what happened, right, was we scaled our dynos to keep up, and we got to the point where Redis started falling over because we had so many connections and so many dynos, and we were like, what is going on.
Tyler: And so the cool thing was is on our team, we don't have anyone that does DevOps. We have people at varying levels knowing how to manage stuff on Heroku, but we have no dedicated DevOps engineers and we've never hired anyone in that capacity which is really cool from a management perspective with using Heroku.
Tyler: I can just have people focused on building features and optimizing things, and not necessarily worrying about networking and all the other stuff Heroku provides for us, which I think most people I know use Heroku you know that's why.
Tyler: So Redis starts falling over, oh my gosh, we're keeping the site alive as we're going. And it's kind of coming in waves as students are going to sleep and coming online. We could tell they're working late through the night, and they're not signing off until two AM, and then the other people that signed up earlier are getting up early at five AM to get on the competition. So we really didn't have a whole lot of, you didn't have any of that nice go to sleep, wake up curve that you typically see in your volume. And we got a little bit of reprieve through the middle of the afternoon, our time, here on the west coast.
Tyler: And so, Redis starts falling over. We have two engineers do a hot swap to the next version of Redis above where we're at just to get more availability. And we did it without any downtime, which was amazing. And we kept the site alive.
Tyler: And then we got to the point where we scaled down, and we were literally having problems with Postgres connections and connection pooling. At that point we were like, we're doing everything we can to keep the site alive, and our technical architect made the point, at this point most apps sort of just said we're turning off and we can't complete the event, please just stop.
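A back-of-the-envelope sketch of why adding dynos can run into database connection limits: if every web thread can hold its own connection, the connection count grows with dynos times processes times threads. All of the numbers below (process and thread counts, the connection limit, the reserved headroom) are hypothetical and are not taken from Trailhead's configuration.

```python
def connections_needed(dynos, processes_per_dyno, threads_per_process):
    """Upper bound on connections when each thread may hold one connection."""
    return dynos * processes_per_dyno * threads_per_process


def max_dynos(connection_limit, processes_per_dyno, threads_per_process, reserved=0):
    """Largest dyno count that stays within the limit, keeping some headroom."""
    per_dyno = processes_per_dyno * threads_per_process
    return (connection_limit - reserved) // per_dyno


# Hypothetical figures for illustration only.
PROCESSES_PER_DYNO = 2
THREADS_PER_PROCESS = 5
CONNECTION_LIMIT = 500
RESERVED_FOR_JOBS_AND_CONSOLES = 20

wanted = connections_needed(100, PROCESSES_PER_DYNO, THREADS_PER_PROCESS)
ceiling = max_dynos(CONNECTION_LIMIT, PROCESSES_PER_DYNO, THREADS_PER_PROCESS,
                    reserved=RESERVED_FOR_JOBS_AND_CONSOLES)

print(f"100 dynos could open up to {wanted} connections")
print(f"a limit of {CONNECTION_LIMIT} supports at most {ceiling} dynos without pooling")
```

Under these assumptions the database, not the dyno count, becomes the ceiling, which is where a connection pooler or a larger database plan usually enters the discussion.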
Tyler: And we were able to get through it. We were able to get through it and the site was slow, but they were able to survive and then afterwards the traffic went away, and then we were able to scale everything back down.
Tyler: I've talked about it before with Heroku, there's no huge capital outlay for us. There's no, "We've got to buy new servers and rack them as fast as we can!"
Chris: And then the traffic goes away and they just sit there.
Tyler: And then the traffic goes away and you're just sitting with this monster server that does nothing. I've been there before in the past. So yeah, it was really cool.
Chris: It was a 24 hour event?
Tyler: It was 48 hours, so it was two days of this.
Chris: You got paged, I think you said it was Wednesday at nine PM you got paged, so were you just awake for 48 hours or did you have kind of tag teams?
Tyler: We kind of tag teamed it, so we are a distributed team but we're not around the globe. We're basically just in the Western hemisphere, and so we got things that first night just patched up to get through the night. And I think we all signed off around two AM or so.
Tyler: And then we came back on the next morning a little groggy and okay, things are okay. People are asleep. Traffic is starting to die down a little bit. And then it picked up that night, that Thursday night, it picked up a little bit more. And so I think we were online until two or three AM that night, and then we just called it, okay, Postgres, we can't do a zero downtime swap of our Postgres instance at this point of where we are at at that point.
Tyler: So it was like okay, good enough. We've gone 100X; I think that's all we can do at this point.
Tyler: So we said okay, it's good enough, PagerDuty and stuff going off throughout the night, we'll keep an eye on it, make sure the site's still available. You have different little pings and stuff to make sure the site's alive. Those were working, so we were happy.
Shaun: I'd be interested to see how it would have performed if we had auto-scaling enabled and we were on a bigger Redis instance, how much they would have noticed.
Chris: Yeah, well, and I'm curious: from that experience, has that kind of informed other explicit policies or an implicit buffer that you kind of maintain in your infrastructure services so that it can absorb an event like this, or even maybe something smaller? Do you think about when Dreamforce happens? Maybe everybody stops visiting Trailhead when Dreamforce happens, because those are all your users.
Tyler: Everyone's coming to San Francisco.
Chris: At these Salesforce events are there big spikes that you have to kind of prepare for ahead of time.
Tyler: We have done some of that prep for Super Bowl moments where we're going to be on the actual Marc Benioff keynote, and then the thing is, everyone in the audience watching, they're not coming to the site, so it's kind of like there is no huge rush to it. And Dreamforce week is a normal week for us because most of our trailblazers are at Salesforce, but we're getting a lot of new people signing up on site and checking it out, so it's a normal week of volume.
Tyler: But I don't know, Shaun maybe you could talk about how we've approached auto-scaling and how we think about scaling.
Shaun: Yeah, so I believe we're a bit over provisioned right now probably just in case there are any events that weren't communicated. However, I actually am not positive if we have auto-scaling enabled at the moment. I should probably go check that. I know some of that's been in flux, the conversation around that and it's not exactly my ownership duty.
Shaun: Yeah, but I think now our average request response time is down to like 200 milliseconds or something, so we're in a really good place right now to handle increased load. It should scale pretty linearly with hardware, I think.
Tyler: I think we kind of touched on it earlier. We're fairly vanilla. Mostly the web app when you log in on Salesforce.com people see that it's one web app, one Rails app running on Postgres and Redis. And I said we keep Trailhead weird on the engineering side, to keep Trailhead boring.
Tyler: And we didn't always do that, it was, we had all kinds of add-ons and the developers were trying all kinds of stuff early on. We had CouchDB and we had memcached. And we had all of these add-ons and stuff. And over the years, what we've done is we've really tried to standardize on the tools that Heroku provides because we know that, one, they're going, and they're probably going to solve for most of the problems that we're going to have. Postgres is pretty darn good and Redis is pretty darn good, and Kafka is pretty darn good.
Tyler: We haven't even implemented Kafka yet, we keep talking about it, we keep trying to find use cases for it. But we're getting there as we grow. And what allowed us to survive Astrogedden: a few months before that we had actually just gotten off of a memcached provider. They were having availability problems, and we had this weird piece of code that, if it didn't load, it was trying to store JSON in memcached, which is not necessarily a great idea.
Tyler: And sometimes the JSON document was too big to fit in the one megabyte limit for a key on memcached. And so the code if it couldn't come back all of the different pieces, so it's stored in multiple key values, it would chunk up the JSON, and if it couldn't fetch the whole thing back, then it didn't have a JSON document to parse. So it would just say, it was really funny, it just would say, delete everything. And start over. And then it would fetch the documents from CouchDB because we were on a shared hosting thing for CouchDB, and so it had really huge variability.
Tyler: And so we were like, okay, just cache it, and the developers at the time decided to cache it in memcached, so this memcached provider, they would have these millisecond outages. So it just depended if we were in the middle, if we were fetching one of these large documents, it would end up just deleting everything from memcached and starting over, and it would spark all the traffic over to CouchDB, and then it would just slow down the whole site.
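The chunking scheme Tyler describes can be sketched as follows. This is not the team's code: a plain dict stands in for the cache client, the key naming is invented, and the chunk size is tiny so the demo actually splits the document (memcached's default item limit is 1 MB). The point of the sketch is the read path: when a chunk is missing, it reports a miss for that one document instead of wiping the whole cache.

```python
import json

CHUNK_SIZE = 16  # tiny on purpose for the demo; real limits are far larger


def cache_set_chunked(cache, key, document, chunk_size=CHUNK_SIZE):
    """Store a JSON document as several chunks plus a chunk count."""
    # json.dumps escapes non-ASCII by default, so characters equal bytes here.
    payload = json.dumps(document)
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    for index, chunk in enumerate(chunks):
        cache[f"{key}:chunk:{index}"] = chunk
    # Write the count last so a reader never sees a count without its chunks.
    cache[f"{key}:count"] = len(chunks)


def cache_get_chunked(cache, key):
    """Return the document, or None if any piece is missing (a cache miss)."""
    count = cache.get(f"{key}:count")
    if count is None:
        return None
    parts = []
    for index in range(count):
        chunk = cache.get(f"{key}:chunk:{index}")
        if chunk is None:
            # Partial read: treat as a miss for this key only and leave
            # every other cached entry untouched.
            return None
        parts.append(chunk)
    return json.loads("".join(parts))


cache = {"unrelated:key": "still here"}
document = {"trail": "demo", "modules": list(range(10))}

cache_set_chunked(cache, "content:1", document)
complete = cache_get_chunked(cache, "content:1")

del cache["content:1:chunk:1"]  # simulate one chunk going missing
partial = cache_get_chunked(cache, "content:1")

print(complete == document, partial, cache["unrelated:key"])
```

On a miss the caller would refetch that one document from the backing store and write it again, which keeps a brief cache hiccup from turning into a full cold start.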
Tyler: And so we had just worked on, a few months before that, moving all the cache to Redis. All the data that was in CouchDB, we had worked on moving all of that into Postgres, putting some APIs up for internal for our publishing system. Our publishing was writing everything in the CouchDB, and then we'd have a script that pulled it into Postgres. We moved it and did the thing, put an API in front of it, have a nice interface, and then nobody cares what's behind the scenes except for us.
Tyler: And so, we were able to do that so during that event we were able to swap on Redis and just keep pushing Postgres as far as we could possibly take it, which is cool if we hadn't done that work we wouldn't have been able to survive that incident. So that probably.
Tyler: Yeah, and the other thing, we're starting to make this change, and this future growth that we're working on is trying to change into being an API-first organization. We've really just relied on being a web app. But things have become more complicated and more complex as the team has grown and as what we need to do has grown. We're starting to go down this road of microservices, or just services, and services talking to each other.
Tyler: So we really had to enable that, say okay, services have to respond within a certain amount of time. 200 milliseconds is kind of the start of our max, because if you have chained multiple services together, that can make a really long request. So we're starting to make those changes as an organization and really relying on sensible policies, since we know some of this stuff is going to happen in the future.
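A small worked example of why a per-service ceiling matters once services are chained: if calls happen one after another, worst-case latency adds up with each hop. The 200 millisecond ceiling comes from the conversation; the service names and measured latencies below are hypothetical.

```python
from dataclasses import dataclass

MAX_SERVICE_MS = 200  # the per-service ceiling mentioned in the conversation


@dataclass
class ServiceCall:
    name: str
    latency_ms: float


def worst_case_chain_ms(depth, per_service_max_ms=MAX_SERVICE_MS):
    """Worst-case latency of a sequential chain where every hop hits its ceiling."""
    return depth * per_service_max_ms


def over_ceiling(calls, per_service_max_ms=MAX_SERVICE_MS):
    """Names of the calls that exceeded the per-service ceiling."""
    return [call.name for call in calls if call.latency_ms > per_service_max_ms]


# Hypothetical chain of three sequential service calls.
chain = [
    ServiceCall("profile", 80),
    ServiceCall("content", 240),
    ServiceCall("badges", 150),
]

total_ms = sum(call.latency_ms for call in chain)
print("measured total:", total_ms, "ms")
print("worst case for three hops:", worst_case_chain_ms(len(chain)), "ms")
print("over the ceiling:", over_ceiling(chain))
```

Even with every service inside its ceiling, three sequential hops can approach 600 milliseconds, so the ceiling is a starting point rather than a guarantee for the end-to-end request.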
Chris: Yeah makes sense. Totally, well that's the, I feel like there are waves that come in the keep your stack or keep your technology boring ethos where it kind of comes and goes right? Because I think as many developers in the way that other cloud providers market their stuff is it's the new shiny thing you've got to try it out. It's the new block storage service or it's the new ElastiCache service. Or the new, I don't know, whatever they are, all the other cloud infrastructure providers are constantly pumping out new services or new things to use, AI. Image classification things.
Chris: And we want to use them. We want to use the new hotness as a developer, but there's kind of, I see it coming in multiple waves and different developer communities this idea of that's great but keep production boring. Have your stuff in production be boring, don't get too fancy. Maybe another analogy of that is in the Ruby community where with Ruby you can do all of this crazy meta-programming but do you really want to do that and want that in production so that you or someone else has to debug that when something goes wrong. I don't know if you've run into that Shaun at all? Either one?
Shaun: Yeah. I feel like you really want to be pragmatic. You want to use the stack that's going to keep your team productive, and not just follow the shiny. And that's kind of what we're doing. And Ruby on Rails it might not be the newest thing, but it's not the oldest thing either. And it's definitely a pleasure to work with.
Shaun: Yeah, and to your point about the meta-programming, I'm not someone who wants to work with people who get all over that either.
Chris: Yes. Yeah, their intent wasn't to make your life stressful but it can when that suddenly is a huge problem because it's so much harder to debug.
Shaun: Yeah, at some point it's people showing off how clever they are.
Tyler: You also have a flexibility too in that, where the tools that we're choosing and using on a daily basis, everyone is trying new stuff and wants to be more productive, and so we do have some freedom to that. And as a manager it's important to empower your team to explore. But it's also important, we've got to support this thing in production, and I've been on plenty of projects that have been just sunk because of this.
Tyler: Like let's try this new cool thing, and it's just like now we're spending all of our time figuring out what's going on with this new fancy NoSQL database and why it's performing weird on this version of CentOS, and whatever. That's just not fun. It stops being fun after a while. It's fun at the beginning, and then it stops being fun when it's in production and you're up all night trying to deal with it.
Chris: Yeah, but you can't use the new thing or the next new thing. You have to maintain.
Tyler: And right now we're working on a new project where we needed, there was no support, there was no Ruby support for this project we're working on. And there's really great support in the Java community, and we're like okay, can we get something to work with JRuby. Can we pull in this package and make it work, and we're using a little bit of meta-programming in Ruby to make it easier for all of the other developers on the team to be able to contribute in Ruby but not have to worry about JRuby and Java and having to know all of that, because most of our team are Rubyists.
Tyler: And so that was really cool to be able to do that. We got some great support from folks at Heroku in trying to set up our first JRuby instance, and how to run with that and getting that going, which was good, and it was good that we could. But we had to be real pragmatic about it. It's not easy, it's not set and forget, but it's also not just trying a new thing just to try the new thing. It's really to solve this specific problem that we're having, and there's just no support in Ruby for this thing we're trying to do.
Chris: That's cool. Well, let's switch to Trailhead as a commercial product. This is a new thing, right?
Tyler: Yeah, this is a new thing, we just launched back in March. We just GA-ed My Trailhead, which is cool, so now you can have Trailhead all to yourself and put all of your own content in there.
Chris: Sorry, to be clear, right now when you go to trailhead.com or trailhead.salesforce.com, it's all hosted and managed by Salesforce. My Trailhead will be a SaaS product, right, so it's still hosted by you and managed by you, like my company's separate and private instance of Trailhead. Is that correct?
Tyler: Yeah, that's pretty close. Again, this is one of those things that happens: people at the company were like, hey, this Trailhead thing is cool. We really like it for learning, but it'd be cool if we could put, I think a lot of good products start with why couldn't we do blah blah. So it was, why couldn't we put our employee training in here? And can we make it so that when you're a Salesforce employee, you could log in to Trailhead, and see--and back then you would see a mix of the Salesforce employee content and the public Trailhead content.
Tyler: And I said, well, why couldn't we do that, and we said, well, sure, let's try it out. So we did that, and the interesting thing that we found doing that was that the big group that was producing content for it was a group at Salesforce called Market Readiness, and they are the people that train our salespeople. And at Salesforce, we have a whole lot of salespeople. That's the majority of the size of our company. And when Shaun and I worked at Living Social, it was like we had four thousand salespeople and a few hundred developers.
Tyler: And the hard part, and I didn't know that because I didn't really work with salespeople before, I kind of know, salespeople, they're going to come with features, tell us to do stuff that we don't want to do. Right, so we kind of have an adversarial relationship with salespeople.
Tyler: But trying to empathize with what they have to deal with at an enterprise company the size of Salesforce: we've made how many acquisitions, there's how many things to sell. How do you train several thousand salespeople to talk about the company and talk about the product all in the same way? And that was really the problem they were trying to solve.
Tyler: And that's what we somehow hit on with Trailhead, and this kind of ethos of learner first. So it's not a learning management system. It's not go do this training. It's we're going to write really good training materials for the sales team or our HR division. Write stuff, too. We also do all of our compliance training now in Trailhead.
Tyler: And so, we're going to make it really good, really accessible. Yes, we are going to sometimes say it's required for you to do, but most of the time it's not. And most of the time we just put it out there, and managers will encourage teams to do it and they'll put goals around it and stuff. But it's not top-down driven, it's more bottom-up.
Tyler: And so what we found in doing that is we launched this thing. This is really the first big feature we worked on after Shaun and I joined the team in 2016. So we launched that and superbadges for the first TrailheaDX. And we launched that, and it started just growing and growing and growing inside the company. And what we found was that the more badges people earned on Trailhead, of their own volition, the better they were at talking to customers about stuff. They just knew all of the different products we were going to sell. And it's just a better customer experience.
Tyler: As a customer you come with a question, and your salesperson can, one, tell you. If you're using Sales Cloud for instance, you're just using the normal CRM product. You're like, hey, we want to do email marketing, I heard you guys do that? The salesperson, instead of saying hey, let me get back to you with a subject matter expert in a couple weeks, if they've done the Marketing Cloud trainings, they kind of know how to talk about it. And so it's a different conversation. And then they can link the customer to the publicly available stuff to do the hands-on challenges and see if it will solve their needs, instead of this traditional let me find someone and then we'll pitch you. It's more of a collaboration, working together.
Tyler: So that was really cool, and so our customers went, wait, how are you doing this? And the salespeople went, oh, we use Trailhead internally. And the customer says, well, we want it. And so we said okay, well, we have to do a lot of things to get it there. And so a big thing for us initially was, how do we do multi-tenancy? How do we plan for multi-tenancy?
Tyler: And so we figured that out from the beginning, but we definitely made some changes along the way to make sure a customer's data is secure and only that customer can see it. And that's our big thing at Salesforce, trust being the number one value at Salesforce: how do we make sure that your customer data is secure? It's on Heroku, which is kind of a new thing, we're one of the first products on Heroku. And so that's been a really cool thing for us to do, and to launch a product at Salesforce on Heroku has been a really neat partnership between the two worlds.
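The transcript does not describe how Trailhead isolates tenants. As a rough illustration of the general idea Tyler mentions (only a customer can see that customer's data), here is a minimal sketch in which every read must name a tenant. The `Content` and `TenantScopedStore` names, the `tenant_id` field, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    tenant_id: str  # hypothetical field naming the owning customer
    title: str


class TenantScopedStore:
    """Hypothetical store whose only read path requires a tenant id."""

    def __init__(self, records):
        self._records = list(records)

    def for_tenant(self, tenant_id):
        # Refuse unscoped reads so a missing tenant id cannot return everything.
        if not tenant_id:
            raise ValueError("tenant_id is required")
        return [r for r in self._records if r.tenant_id == tenant_id]


store = TenantScopedStore([
    Content("acme", "Onboarding basics"),
    Content("globex", "Compliance 101"),
])
acme_titles = [c.title for c in store.for_tenant("acme")]
```

The design choice here is that there is no "list everything" method at all, so forgetting the tenant filter becomes an error instead of a data leak.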
Tyler: So yeah, so that's how it's been going. And the last couple years, it's been how do we implement this thing so our customers can use it to create content and develop it and release it, and all of the kind of pieces that we just piece-mealed together and duct-taped together to get it to work internally. It's how do we productize this whole process. And that was a really long, it was a two-year journey for us to do that.
Chris: What piece of advice would you give to a developer, a development team who are scaling up on Heroku? They found product market fit, their user base is growing, they have salespeople they have marketing people and they're needing to scale up their architecture or their system. If you had to give them one piece of advice what would that be?
Shaun: All right. I'll go first. First I would say, integrate the New Relic plugin or an equivalent, hunt down any glaring performance issues, make your endpoints respond rather uniformly, and then set up auto-scaling based off of request queuing.
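Shaun's last step, auto-scaling based on request queuing, can be sketched as a small decision function. This is only an illustration under stated assumptions: the thresholds, the dyno limits, the one-step-at-a-time policy, and the function names are hypothetical, and it does not call any Heroku or New Relic API. It takes recent request queue times (in milliseconds) and returns a target dyno count.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def desired_dynos(current, queue_ms_samples,
                  scale_up_ms=100, scale_down_ms=20,
                  min_dynos=1, max_dynos=10):
    """Pick a dyno count from recent request queue times.

    All thresholds and limits are illustrative assumptions.
    """
    if not queue_ms_samples:
        # No data: hold steady instead of guessing.
        target = current
    else:
        p95 = percentile(queue_ms_samples, 95)
        if p95 > scale_up_ms:
            target = current + 1   # requests are waiting too long
        elif p95 < scale_down_ms:
            target = current - 1   # spare capacity
        else:
            target = current
    return max(min_dynos, min(max_dynos, target))
```

Queue time is a more useful signal when endpoints respond fairly uniformly, which is presumably why Shaun puts that step first: one very slow endpoint can otherwise hold requests in the queue regardless of how many dynos are running.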
Chris: Yeah, so an APM, New Relic or something similar to that, an APM plug-in or add-on, and auto-scaling, okay. What about you Tyler, anything from a team perspective?
Tyler: Yeah, from a team perspective for me, the big thing is getting your iterations short and being able to have predictability in your release cycle. And so for us that's getting pipelines set up with review apps, so people can literally do quality control, you can run tests on review apps, you can have product owners look at stuff on review apps. Making sure you have that really tight integration, the feedback loop between the developer and the stakeholders, so we can really make sure that the thing's getting dialed in and it's not a weeks- or months-long process to release. Just making that cycle shorter and making sure you're releasing solid stuff as you go that's going through all of the checks all the way out of the door.
Chris: Yep. Makes sense. That's actually something, forget who I learned it from, but I think it applies to any infrastructure or platform, and the simplest thing was hey, when you're creating a Heroku app for something that you are going to share publicly or share maybe even just internally to your team or your company, create two apps at the same time.
Chris: So normally, at least I would always just create one app. And then when I decide I need some sort of staging environment or Dev environment then I create another one. But it's nice to do both at the same time because then you can make sure the configuration matches for each of them as you're tweaking it in those early stages. And then out of the box I have staging and then I have production.
Tyler: Yeah, just start with the pipeline. We found it was way easier, especially if we spun up new services, to just start with a pipeline. We do four stages: we have review apps, we have development, we have staging, and we have production. So it kind of allows us to have different parts of our process and different checks happen at all those stages.
Tyler: We've got a release candidate and it goes from development to staging, and then staging to production. And so it just allows, just plan for being in production, plan for doing it right, plan for being awesome at releasing stuff that you know works.
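The four stages Tyler lists, with checks gating each promotion, can be modelled as a small state machine. This is a sketch of the idea only, not Heroku's pipeline API: the `Release` type, the `promote` function, and the named checks are hypothetical, and the transcript does not say which checks run at which stage.

```python
from dataclasses import dataclass, replace

# Stage order as described in the conversation.
STAGES = ("review", "development", "staging", "production")


@dataclass(frozen=True)
class Release:
    version: str
    stage: str


def promote(release, checks):
    """Move a release to the next stage if every check for its current stage passes.

    `checks` maps a stage name to a list of zero-argument callables returning bool.
    """
    index = STAGES.index(release.stage)
    if index == len(STAGES) - 1:
        raise ValueError(f"{release.version} is already in production")
    failed = [c.__name__ for c in checks.get(release.stage, []) if not c()]
    if failed:
        raise RuntimeError(f"{release.version} blocked in {release.stage}: {failed}")
    return replace(release, stage=STAGES[index + 1])


# Hypothetical checks; real ones would run tests or performance budgets.
def tests_pass():
    return True


def perf_budget_met():
    return True


checks = {"development": [tests_pass], "staging": [tests_pass, perf_budget_met]}
rc = Release("rc-1", "development")
rc = promote(rc, checks)  # development -> staging
rc = promote(rc, checks)  # staging -> production
```

Because a release can only move one stage at a time and only when its checks pass, skipping quality or performance checks requires changing the pipeline definition, not just being in a hurry.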
Tyler: And that's a big thing for us. Early on it was just like just ship it, ship it, ship it. Right? And it was like well, we shipped some stuff that wasn't good and we skipped some quality checks, we skipped performance checks and so of course it's working backwards into that. I think every organization goes through that.
Tyler: As we've learned as we've started spinning up new services, let's just plan for that. Let's just get all the things integrated, let's get, you know, our APM hooked up, let's get our monitoring figured out so it dumps into Slack and to PagerDuty. We get that figured out first because we know we're going to need that.
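As a rough picture of monitoring that "dumps into Slack and to PagerDuty", the sketch below builds the JSON bodies for both destinations without sending anything. The payload shapes follow Slack incoming webhooks (`text`) and the PagerDuty Events API v2 (`routing_key`, `event_action`, `payload`) as commonly documented; the routing rule (everything goes to Slack, only `critical` and `error` page someone), the alert fields, and the placeholder key are assumptions, not the Trailhead team's setup.

```python
# Severities that should page a human; an illustrative assumption.
PAGE_SEVERITIES = {"critical", "error"}


def slack_message(alert):
    """Body for a Slack incoming webhook."""
    return {
        "text": f"[{alert['severity'].upper()}] {alert['source']}: {alert['summary']}"
    }


def pagerduty_event(alert, routing_key):
    """Body for a PagerDuty Events API v2 trigger event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": alert["summary"],
            "source": alert["source"],
            "severity": alert["severity"],
        },
    }


def route(alert, routing_key):
    """Return (destination, body) pairs for one alert."""
    deliveries = [("slack", slack_message(alert))]
    if alert["severity"] in PAGE_SEVERITIES:
        deliveries.append(("pagerduty", pagerduty_event(alert, routing_key)))
    return deliveries


alert = {"severity": "critical", "source": "web", "summary": "Error rate above 5%"}
deliveries = route(alert, routing_key="HYPOTHETICAL_KEY")
```

Keeping payload construction separate from sending makes the routing rule easy to test before any webhook or integration key is wired in.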
Tyler: And then let's go, let's get that walking skeleton. We call it a walking skeleton: let's get an app and a pipeline that looks like it works, automate the deployments, let's do all the stuff we know we're going to need, so it's not like oh man, we have product market fit, this thing is blowing up and we're just going to have to put all that stuff to the side.
Tyler: And that's what we dealt with for the first couple of years is trying to get all of those pieces working right and it's way harder to do it later than it is to do it upfront.
Chris: Yeah totally.
Shaun: I actually wanted to touch on review apps. Those are so helpful, and they really were a game changer for our QA process. As a developer, being able to spin up a review app for a pull request and just send the link to stakeholders and get validation, send a link to QA so they can run their external automation against it. Get it approved and then merge up and move on.
Tyler: If there's problems you can fix them right there and right then and they're not on another test.
Shaun: Yeah it keeps you in the flow or focused on that issue or helps you get back into the flow, you don't have to switch branches and redeploy locally or restart up a branch locally. You can see this review app and push changes to it.
Chris: Cool, well thanks for joining us. It was great to have both of you on Code[ish].
Tyler: Yeah, thanks for having us Chris. Appreciate it.
Shaun: Yeah really enjoyed it, thank you!
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Director, Developer Advocacy, Heroku
Chris thrives on simplicity and helping others. He writes code, prototypes hardware, and smiles at strangers, helping developers build more and better
Engineering Director, Salesforce
Tyler likes building awesome things people care about with awesome people. He has been with Salesforce growing Team Trailhead since 2016.
Principal Software Engineer, Salesforce