[ Most efficient way to do a SQL 'INSERT IF NOT EXISTS' ]
Which of the following would perform better?
(1) **INSERT IGNORE**
cursor.execute('INSERT IGNORE INTO table VALUES (%s,%s)')
(2) **SELECT or CREATE**
cursor.execute('SELECT 1 FROM table WHERE id=%s')
if not cursor.fetchone():
cursor.execute('INSERT INTO table VALUES (%s,%s)')
I have to do this patter millions of time so I'm looking to find the best performance for this pattern. Which one is preferably? Why?
Answer 1
With most performance issues, the best approach is to try it both ways and measure them to see which is actually faster. Most of the time, there are many small things which affect performance that aren't obvious on the surface. Trying to predict the performance of something ahead of time often takes longer than conducting the test and may even be impossible to do with any accuracy.
It is important, though, to be as careful as possible to simulate your actual production conditions exactly. As I said before, small things can make a big difference in performance, and you'll want to avoid invalidating your test by changing one of them between your test environment and the production environment.
With SQL performance, one of the most relevant items is the content of the database during the test. Queries which perform well with a few rows become very slow with many rows. Or, queries which are fast when all the data is very similar become very slow when it is very diverse. The best approach (if possible) is to create a clone of your production database in which to run your tests. That way, you're sure about not fooling yourself with an inaccurate test environment.
Once you've got your tests running, you may want to run your database's explain plan
equivalent to find out exactly what is going on with each approach. This will often allow you to start tuning both to remove obvious issues. Sometimes, this will make enough difference to change which is faster, or even suggest a third approach which beats both of them.
Answer 2
The insert ignore
is the better method, for several reasons.
In terms of performance, only one query is being compiled and executed, rather than two. This saves the overhead of moving stuff in and out of the database.
In terms of maintenance, only having one query is more maintainable, because the logic is all in one place. If you added a where
clause, for instance, you would be more likely to miss adding it in two separate queries.
In terms of accuracy, only one query should have no (or at least many fewer) opportunities for race conditions. If a row is inserted between the select
and insert
, then you will still get an error.
However, better than insert ignore
is insert . . . on duplicate key update
. The latter only avoids the error for duplication problems. insert ignore
might be ignoring errors that you actually care about.
By the way, you should be checking for errors from the statement anyway.
Answer 3
For a single or couple of entries, I would use the first approach "INSERT IGNORE" without any doubts.
We don't know much details about your case, but in case you have bulk inserts (since you mentioned you need to run this millions of time), then the key to boot up your insert performance is by using 1 insert statement for bulk of entries instead of an insert statement per entry.
This can be achieved either by:
Using INSERT IGNORE.
INSERT IGNORE INTO table VALUES (id1,'val1'), (id2,'val2')....
Or, what you can do is do a single select statement that, for a bulk of entries, gets the existing entries ie: SELECT id FROM table WHERE id in (id1, id2, id3....) Then programmatically, in your code, exclude from the initial list the ones retrieved from db. Then run your INSERT statement:
INSERT INTO table VALUES (id1,'val1'), (id5,'val5')..
Normally, we would expect that INSERT IGNORE Bulk inserts would be optimal since handled by the db engine, but this cannot be guarantee. Therefore, for your solution better do a small validation for both cases using bulk of data.
If you don't want to run a small comparison test to validate, then you can use the INSERT IGNORE bulk inserts, (this is needed in both cases), during your test, in case you noticed slowness you can try the second approach.
Normally, the second approach would be fast since the first select is done on bulk of ids (pk) so the query is fast and much better than running a select per entry.Filtering ids programmatically is also fast.