mongodb-for-python-developers/transcripts/ch8-performance/7.txt

00:01 Now that you've seen how to create indexes in the shell in JavaScript effectively,
00:04 let's go and see how to do this in MongoEngine.
00:07 I think it's preferable to do this in MongoEngine because that means
00:11 simply pushing your code into production will ensure
00:14 that the database has all the right indexes set up to operate correctly.
00:19 You theoretically could end up with too many,
00:21 if you have one in code and then you take it out
00:23 but you can always manage that from the shell,
00:26 this way at least the indexes that are required will be there.
00:29 I dropped all the indexes again, let's go back through our questions here
00:33 and see how we're doing.
00:36 It says how many owners, how many cars,
00:38 this is just based on the natural sort, however it's stored in the database,
00:41 there's really nothing to do here,
00:44 but this one, find the 10 thousandth car by owner, let's look at that;
00:48 that is going to basically be this name, we'll use test,
00:55 it doesn't really matter what we put here
00:57 if we put explain, this should come back as a collection scan or something like that,
01:01 yeah, no indexes, okay, so how long did it take to answer that question?
01:06 Find the 10 thousandth owner by name,
01:12 it didn't say by name, I'll go and add by name,
01:16 well that took 300 milliseconds, well that seems bad
01:21 and look we're actually using sorting,
01:24 we're actually using paging, skip and limit, those types of things here,
01:27 but in order for that to mean anything, we have to sort it,
01:31 it's really the sort that we're running into.
01:34 Maybe I should change this, like so,
01:38 sort like so, we could just put one, I guess it's the way we're sorting it,
01:47 so here you can see down there the sort pattern name is one
01:49 and guess what, we're still doing a collection scan.
01:53 Any time you want to do a filter by, a greater than, an equality,
01:56 or you want to do a sort, you need an index.
01:59 Let's go over to the owner here, this is the owner class
02:04 and let's add the ability to sort it by name or
02:08 equivalently also do a filter like find exactly by name,
02:12 so we're going to come down here
02:14 we're going to add another thing to this meta section,
02:16 and we're going to add indexes,
02:20 and indexes is a list of indexes,
02:25 now these can be simple strings
02:28 or they can be more complex subdictionaries,
02:31 for composite indexes or uniqueness constraints, things like that,
02:34 but for name all we need is name.
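As a rough sketch of the meta section described above: the indexes entry is a list, and a plain string names a single field to index. The real Owner class isn't reproduced in this transcript, so its shape is only hinted at in comments; the field type and the variable name owner_meta are assumptions for illustration.

```python
# Sketch of the meta dictionary on the Owner document class.
# In the real model this sits inside the class body, for example:
#   class Owner(mongoengine.Document):
#       name = mongoengine.StringField()   # field type is an assumption
#       meta = {...}
# Any other meta keys the course code uses are omitted here.
owner_meta = {
    'indexes': [
        'name',  # simple string entry: a single-field index on name
    ],
}
```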
02:38 Let's run this, first of all, let's go over here
02:41 and notice, if I go to owners and refresh, no name,
02:46 let's run this code, find the 10 thousandth owner by name,
02:52 19 milliseconds, that's pretty good,
02:55 let me run it one more time,
02:57 15 yeah okay, so that seems pretty stable,
03:00 and let's go over here and do a refresh, hey look there's one by name;
03:03 we can see it went from what was that,
03:08 something like 300 milliseconds to 15 milliseconds, so that's good.
03:11 How many cars are owned by the 10 thousandth owner,
03:15 so that's 3 milliseconds, but let's go ahead and have a look at this question anyway.
03:19 How many cars are owned by the 10 thousandth owner,
03:22 so here's this function right here that we're calling
03:25 it doesn't quite fit into a lambda expression, so we put it up here
03:28 so we want to go and find the owner by id,
03:30 that should be indexed, right, that should be indexed right there
03:34 because it's the id, the id always has an index,
03:36 and now we are saying the id is in this set,
03:40 so we're doing two queries, but both of them are hitting the id thing,
03:44 so those should both be indexed and 3 milliseconds,
03:47 well that really seems to indicate that that's the case.
03:50 How many owners own the 10 thousandth car, that is right here.
03:54 So we'll go find the car, ask how many owners own it.
03:59 Now this one is interesting, so remember when we're doing this
04:02 basically this in query, let's do a quick print of car id here,
04:11 so if we go back over to this, we say let's go over to the owners
04:17 save your documents, so this is going to be car ids,
04:21 it's going to have an object id of that,
04:26 all right, so run this, zero records, apparently this person owns nothing,
04:33 but notice it's taking 77 milliseconds, we could do our explain again here
04:37 and a collection scan, yet again, not the most amazing.
04:43 So what we want is to have an index on car ids, right,
04:48 because a collection scan is not good,
04:50 I think it's not really telling us in our store example
04:53 but for the find it definitely should be.
04:55 So we can come back to our owner over here,
04:58 let's add also like an index on car_ids,
05:02 If we'd run this once again, just the act of restarting it
05:05 should regenerate the indexes in the database, how long did it take over here,
05:09 a little late now isn't it, because I did the explain,
05:13 I can look at this one, how many cars,
05:16 how many owners does the 10 thousandth car have,
05:19 66 milliseconds, if we look at it now—
05:22 how many owners own the 10 thousandth car, 1.9 milliseconds,
05:29 so 33 times faster by adding that index, excellent,
05:34 find the 50 thousandth owner by name, that's already done.
05:38
05:40 Alright we already have an index on owners name so that goes nice and quick,
05:45 and how is this doing, one millisecond perfect,
05:48 this one is super bad, the cars with expensive service 712 milliseconds,
05:52 alright so here, we're looking at service history
05:56 and then we're navigating that .relationship, that hierarchy,
06:00 with the double underscore, going to the price,
06:02 greater than, less than, equal it doesn't matter,
06:05 we're basically working with this value here, this subdocument.
06:08 Let's go over to the car and make that work,
06:11 now the car doesn't yet have any indexes but it will in a second,
06:14 so what we want to do is represent that here
06:17 and in the raw way of discussing this with MongoDB
06:21 we use . (dot) not double underscore, so . represents the hierarchy here.
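To make the double underscore versus dot mapping concrete, here is a small illustrative helper. It is not MongoEngine's own implementation, just a sketch that handles one trailing comparison operator; the function name to_raw_filter and the field spelling service_history are assumptions.

```python
def to_raw_filter(**kwargs):
    """Turn MongoEngine-style keyword filters into a raw MongoDB filter dict.

    Illustration only: handles a single trailing comparison operator
    and is not how MongoEngine itself is implemented.
    """
    operators = {'gt', 'gte', 'lt', 'lte', 'ne', 'in'}
    raw = {}
    for key, value in kwargs.items():
        parts = key.split('__')
        if parts[-1] in operators:
            # service_history__price__gt -> {'service_history.price': {'$gt': ...}}
            op = '$' + parts.pop()
            raw.setdefault('.'.join(parts), {})[op] = value
        else:
            # No operator suffix means a plain equality match.
            raw['.'.join(parts)] = value
    return raw


print(to_raw_filter(service_history__price__gt=15000))
# {'service_history.price': {'$gt': 15000}}
```

The dotted key on the left is the same string that goes into the indexes list on the document class.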
06:25 Let's run that again, notice expensive service, 712,
06:30 cars with expensive service, instead of 712 we have 2.4 milliseconds,
06:39 now notice that the first time I ran it there was a pause,
06:42 the second time it was like immediate,
06:45 and that's because it basically was recreating that index
06:47 and that pause time was how long that index took to create.
06:51 So here we have cars with expensive service,
06:53 now we're getting into something more interesting, look at this one with spark plugs,
06:58 we're querying on two things, we're querying on the history and the service,
07:04 let's actually put this over in the shell so we can look at it.
07:07
07:19 I've got to convert this over, do the dots there,
07:23 this is going to be the $gt (greater than) operator, colon, like so,
07:30 all right, so we're comparing that service history.price
07:35 and this one, again because you can't put dots in normal json,
07:39 do the dot here and quotes, and this one is just spark plugs,
07:46 alright, let's run this, okay 22 milliseconds,
07:52 how long is it taking over here— 20 milliseconds,
07:56 so that's actually pretty good and the reason I think it's pretty good is
07:59 we already have an index on this half
08:02 and so it has to just basically sort the result, let's find out.
08:05
08:11 Winning plan, index on this one, yes, exactly
08:14 so this one is just going to be crank across there
08:18 but we're going to use at least this index here, this by price
08:22 so that gets part of the query there.
08:25 Now maybe we want to be able to do a query just based on the description
08:30 show me all the spark plugs, well that's a collection scan,
08:33 so let's go back and add over here one for the description.
08:40 Now how do I know what goes in this part,
08:44 see I have a service history here, if we actually look at the service record object
08:49 it has a price and description, right
08:52 so we know that that results in this hierarchy of
08:54 service history.price, service history.description.
08:57 If we'd run this again, it will regenerate those and let's go over here
09:01 and run this, and let's see, now we're doing index scan on price,
09:09 what else do we got, rejected plans, okay so we got this and query
09:18 and it looks like we're still using the— yes, oh my goodness,
09:24 how about that for a mistake, comma, so what did that do
09:28 that created, in Python you can wrap these lines and that just created this,
09:33 and obviously, that's not what we want, that comma is super important there.
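The missing comma mistake is easy to reproduce in plain Python, because adjacent string literals are joined into one string. The list names below are made up for the illustration, and service_history is an assumed spelling of the field.

```python
# Missing comma: Python joins adjacent string literals into one string,
# so this list has a single, nonsensical index name.
broken_indexes = [
    'service_history.price'
    'service_history.description',
]

# With the comma, the list has the two intended entries.
fixed_indexes = [
    'service_history.price',
    'service_history.description',
]

print(broken_indexes)      # ['service_history.priceservice_history.description']
print(len(fixed_indexes))  # 2
```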
09:38 So let me go over here and drop this nonsense thing,
09:41 try this again, I can see it's building index right now,
09:47 okay, once again we can explain this, okay great,
09:51 so now we're using price and actually we use the description this time
09:58 and you can see the rejected plan is the one that would have used the price,
10:04 so we're using description, not price,
10:06 and how long does it take to run that query— 7.9 milliseconds, that's better
10:13 but what would be even better still is if we could do
10:16 the description and price as a single thing. How do we do that?
10:22 This gets to be a little trickier, if we look at the query we're running,
10:25 we're first asking for the price and then the description,
10:30 so we can actually create a composite index here as well,
10:35 and we do that by putting a little dictionary, saying fields
10:39 and putting a list of the names of the fields
10:44 and you can bet those go like this,
10:48 now this turns out to be really important, the order that you put them here
10:52 price and the description versus description price, for sorting,
10:56 not so much for matching, run it one more time,
11:00 alright, expensive cars with spark plugs,
11:04
11:07 here we go, look at that, less than one millisecond,
11:10 so we added one index, it took it from like 66 milliseconds down to 15,
11:16 and then, we added the description one, it turns out that was a better index
11:21 and it took it from 15 to 9, we added the composite index,
11:24 and we took it from 9 to half a millisecond, 0.6 milliseconds, that is really cool.
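A sketch of what the car's indexes list might look like at this point, with two single-field entries and one composite entry written as a dictionary with a fields key. The variable name car_meta and the exact field spellings are assumptions; in the real model this dictionary is the meta attribute of the Car document class.

```python
# Sketch of the Car document's meta dictionary after the composite index is added.
car_meta = {
    'indexes': [
        # Single-field indexes on values inside the embedded service records.
        'service_history.price',
        'service_history.description',
        # Composite index: one index covering both fields, in this order.
        {'fields': ['service_history.price', 'service_history.description']},
    ],
}
```

As in the video, the new index only shows up in the database after the application is run again.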
11:31 Notice over here, this got faster, let's go back and look at what that is.
11:36 Load cars, so this is the one we are optimizing
11:40 and what are we doing here— let me wrap this so you can see,
11:43 we're doing a count, okay, we're doing a count
11:46 and so it's basically having the database do all the work
11:48 but there's zero serialization.
11:52 Now in this one, we're actually calling list
11:55 so we're deserializing, we're actually pulling all of those records back
11:59 and let's just go over here and see how many there are,
12:03
12:08 well that's not super interesting, to have just one, is it,
12:12 alright, that's good, but let's actually make this just this,
12:17
12:23 let's drop this spark plug thing and just see
12:26 how many cars there are with this,
12:30 okay there we go, now we have some data to work with,
12:33 65 thousand cars had 15 thousand dollar service or higher,
12:36 after all, this is a Ferrari dealership, right.
12:39 Now, it turns out it's a really bad idea to pull back that many cars,
12:43 let me stop this, let's limit that to just a thousand here as well.
12:52
12:54 Okay, so we're pulling back a thousand cars because we're limited to this
13:00 and we're pulling back a thousand cars here.
13:03 But notice, this car name and id versus the entire car
13:08 so let's go over here cars with expensive service, car name and id,
13:13 so notice the time, so to pull back and serialize those thousand records
13:17 actually took a while, so it took basically a second,
13:21 and if we don't ask for all the other pieces,
13:25 if we just say give me just the make, the model and the id,
13:29 here we're using the only keyword, which says don't pull back the other things
13:34 just give me these three fields when you create them,
13:37 it makes it basically ten times faster,
13:40 let's turn this down to 100 and see, maybe get a little more realistic set of data.
13:44 Okay, there we go, 100 milliseconds down to 14 milliseconds,
13:49 so it turns out that the deserialization step in MongoEngine is a little bit expensive
13:55 so if you like blast a million cars into that list, it's going to take a little bit.
14:01 If we can express like I only want to pull back these items,
14:05 then it turns out to be quite a bit faster,
14:10 in this case not quite ten times faster, but definitely faster.
14:15 Let's round this out here and finish this up.
14:17 Here we're asking for the highly rated, highly priced cars,
14:20 we're asking like hey for all the people that come and spend a lot of money
14:26 how did they feel about it?
14:29 And then also what cars had a low price and also a low rating,
14:33 so maybe we could have just somehow changed our service
14:37 for these sort of cheaper like oil change type people.
14:39 It turns out that that one is quite fast,
14:42 this one we could do some work and fixing one will really fix the other
14:46 so we have this customer rating thing, we probably want to have an index on,
14:52 and we already have one on the price,
14:54 so I think that that's why it's pretty quick actually.
14:57 Go over here, and we don't yet have one on the price, on the rating rather,
15:03 so we can do that and see if things get better,
15:07 not too much, it didn't really make too much of a difference,
15:12 it's probably better to use the price than it is the rating,
15:16 because we're kind of doing that together, so we're also going to go down here
15:19 and have the price and customer rating,
15:21 one of these composite indexes, once again,
15:24 and maybe if we change price one more time,
15:29 rating and price— it doesn't seem like we're getting much better,
15:36 so down here this is about as fast as we can get, 16 milliseconds
15:40 and this is less than one millisecond, so that's really good.
15:44 The final thing is, we are looking for high mileage cars,
15:47 so let's go down here and say find where the mileage of the car
15:51 is greater than 140 thousand miles, do we have an index on that,
15:55 you can bet the answer is no.
15:58 Now we could go to the shell and see that, but no we don't have one,
16:01 so let's go up here and add one more,
16:04 and this is in fact the only index we have here in this thing
16:07 that is on just a plain field, not one of these nested ones like this;
16:13 so maybe we also want to be able to select by year,
16:16 so we could have one for year as well. I'm going to add those in.
16:21 Now this high mileage car goes from a hundred and something milliseconds
16:26 down to six, maybe one more time just to make sure,
16:28 yep, 5, 6, seems pretty stable around there.
16:32 So we've gone and we've added these indexes
16:34 to our models, our MongoEngine documents by adding indexes
16:40 and we can have flat ones like this, or we have these here,
16:48 and we also can have composite ones or richer things,
16:52 if we create a little dictionary and we have fields and things like that.
16:57 Similarly an owner, we didn't have as many things we were after
17:00 but we did want to find them by their name and by car id,
17:03 so we had those two indexes,
17:05 honestly this is just a simpler document than the cars.
17:08 So with these things added here, we can run this one more time
17:11 and see how we're doing; that code all runs really quick,
17:14 if we kind of scan through here, there's nothing that stands out like super bad,
17:18 5 milliseconds, half, 18, 6, half, 1, 3, 1, let's say,
17:26 this one, I really wish we could do better,
17:29 it just turns out there is like so many records there
17:32 that if we run that here you can see that the whole thing runs in one millisecond,
17:38 super, super fast, we can't make it any faster than that.
17:41 The slowness is basically the allocation,
17:45 assignment, verification of 100 car objects.
17:48 I'd like to see a little better serialization time out of MongoEngine,
17:53 if you have some part of your code that has to load tons of these things
17:56 and it's super performance critical, you could drop down to PyMongo,
18:00 talk to it directly and probably in the case where you're doing that
18:05 you don't need to pull back many, many objects,
18:07 but also you can see that if we limit what we ask for down here,
18:12 that goes back to 14 milliseconds which is really great,
18:15 here we're looking at a lot of events, this is like 16 thousand
18:21 or no, 65 thousand, that's quite a bit, this one is really fast,
18:25 this one is really fast, so I feel like from an index perspective
18:28 we've done quite a good job, how do we know we're done?
18:32 I guess this is the final question, this has been a bit of a long—
18:35 how do we know we're done with this performance bit?
18:39 We know we're done when all of these numbers come by
18:43 and they're all within reason of what we're willing to take.
18:47 Here I have set this up as these are the explicit queries
18:51 we're going to ask and then we'll just time them,
18:54 but your real application does not work that way.
18:56 How do you know what questions your application is asking and how long they're taking?
19:01 So you want to set up profiling, so you can come over here
19:05 and definitely google how to do profiling in MongoDB,
19:08 so we can come over here and let's just say db.setProfilingLevel
19:13 and you can use this function to say I'm looking for slow queries
19:18 and to me slow means 10 milliseconds, 20 milliseconds something like that,
19:23 it will generate a collection called system.profile and you can just go look in there
19:29 and see what queries are slow, clear it out,
19:33 run your app, see what shows up in there
19:35 add a bunch of indexes, make them fast, clear that table,
19:38 then turn around and run your app again,
19:43 and just repeat until stuff stops showing up in there,
19:46 you can basically find the slowest one, make it faster, clear out the profile
19:51 and just iterate on that process, and that will effectively like gather up
19:55 all of the meaningful queries that your app is going to do,
19:59 and then you can go through the same process here
20:01 to figure out what indexes you need to create.
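The profiling loop described above can also be driven from Python. This is a minimal sketch, assuming db is a PyMongo Database object; the function names and the 20 millisecond default are illustrative choices, not something shown in the video.

```python
def enable_slow_query_profiling(db, slow_ms=20):
    """Ask MongoDB to record operations slower than slow_ms milliseconds.

    `db` is assumed to be a pymongo Database; profiling level 1 means
    "record slow operations only".
    """
    return db.command('profile', 1, slowms=slow_ms)


def slowest_operations(db, limit=10):
    """Return the slowest recorded operations from system.profile."""
    cursor = db['system.profile'].find().sort('millis', -1).limit(limit)
    return list(cursor)
```

A typical round would be: enable profiling, exercise the application, look at what slowest_operations returns, add an index for the worst query, and repeat.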