[ 3 table inner join with django ]

I'm creating a report that lists patient deaths by date and by doctor. There will be thousands of deaths throughout the course of the study and hundreds of doctors, and I want my reports to run quickly (page load under 1 second).

I want to create a queryset object I can use to create the table. I'll do something like this for the table -

for doc in doctors:
    html += "<tr><td>" + str(doc) + "<td>"
    for period in time_periods:
        count = my_new_queryset.filter(gp = doc)
                    .filter(date__gte=period['start_date'])
                    .filter(date__lte=period['end_date'])
        html += "<td>" + str(count) + "</td>"
    html += "</tr>"

In sql the query looks like this -

SELECT  patient.name,
        death.date,
        patient_gp_link.gp
FROM    patient
INNER JOIN death
ON      patient.id = death.patient
INNER JOIN patient_gp_link
ON      patient.id = patient_gp_link.patient
WHERE patient_gp_link.is_main = true;

The (simplified) models look like -

class GP(models.Model):
    #....

class Patient(models.Model):
    #....

class Death(models.Model):
    patient = models.ForeignKey(Patient)
    date = models.DateField()

class PatientGPLink(models.Model):
    gp = models.ForeignKey(GP)
    patient = models.ForeignKey(Patient)
    is_main = models.BooleanField(default = False)

I just can't see how to create the queryset object that corresponds to this sql. Can django do this or should I be resorting to using raw sql?

Answer 1


A QuerySet can do the joins with .select_related(). You will also need to read about lookups that span relationships. my_new_queryset would look like:

Patient.objects.filter(patientgplink__is_main=True).select_related('death')

Your could also try to make the doctors to .annotate() the count per doctor.

Answer 2


Expanding on my comment to your answer, I tried to write a single queryset that would do what you want, but it doesn't seem possible based on your data model.

Your data model doesn't actually describe the date range that a particular doctor treated a particular patient. The query you're writing is going to return the wrong results, regardless of how you write the query. What you're currently saying is the following:

"If a patient dies, then any doctors that have treated this patient that have is_main=True are going to be marked as responsible". (Responsible might not be the right word here, but you should get the idea).

Now, the above is fine if you are only ever assigning a single GP as is_main to a particular patient.. but your data model doesn't enforce this, and could be prone to error. Especially if the is_main changes after a patients death. I would structure the data model as one of the following:

class GP(models.Model):
    name = models.CharField(max_length=64)

class Patient(models.Model):
    name = models.CharField(max_length=64)

class Death(models.Model):
    current_gp = models.ForeignKey(GP)
    patient = models.ForeignKey(Patient)
    date = models.DateField()

class Consultation(models.Model):
    gp = models.ForeignKey(GP)
    patient = models.ForeignKey(Patient)
    start_date = models.DateField()
    end_date = models.DateField(blank=True, null=True)

Or..

class GP(models.Model):
    name = models.CharField(max_length=64)

class Patient(models.Model):
    name = models.CharField(max_length=64)

class Death(models.Model):
    patient = models.ForeignKey(Patient)
    date = models.DateField()

class Consultation(models.Model):
    gp = models.ForeignKey(GP)
    patient = models.ForeignKey(Patient)
    start_date = models.DateField()
    end_date = models.DateField(blank=True, null=True)

The first structure has the benefit of allowing really nice queries that will be extremely performant, at the cost of requiring extra information to be entered at the time of the patients death. However, the Consultation (formerly PatientGPLink) model will have all the information required to deduce this information. You could also make the Death.current_gp be a ManyToManyField to support multiple GPs being responsible for the patient.

The second structure can glean the same information, but requires datetime filtering, which will join another table, and make the query slower and more complicated.

All of this is slightly irrelevant if you're very conscious about maintaining that is_main field, in that the data will be correct. But let me show you how to query the information you want in a (probably) more efficient manner from your view:

def my_view(request):
    doctors = GP.objects.all()
    periods = get_time_periods() # however it is you do this...
    smallest_date = get_smallest_date(time_periods)
    largest_date = get_largest_date(time_periods)
    deaths = Death.objects.select_related(depth=1).filter(date__range=(smallest_date, largest_date))
    # build the results table with initial count of 0 to account for all doctors
    # {period: {doctor: count}}
    results = dict((period,{doctor: 0}) for doctor in doctors for period in periods) 
    for death in deaths:
        for period in time_periods: # I'm assuming this is a small range of values
            if death.date > period['start_date'] and death.date < period['end_date']:
                results[period][death.current_gp] += 1 # add a death to the count

Then, in your template, you have the results table with all the precomputed information:

<table>    
{% for period, lookup in results.items %}
    {% for doctor, deaths in lookup.items %}
        <tr> 
            <td>{{ period }}</td> 
            <td>{{ doctor }}</td> 
            <td>{{ deaths }}</td> 
        </tr>
    {% endfor %}
{% endfor %}
</table>

For a total of 2 SQL queries. There is more manual processing this way, but computing the results should be rather quicker than querying the database num_doctors * num_timeperiods + 1 times which is what you're currently doing.

Edit:

To make it work with your current model structure (if you really can't change the models...), you incorporate your answer with mine, and end up with a view that's very similar to the original one I wrote. I'm eliding all the comments since they'll be just the same as above. I've put comments in where I've changed the original view.

     def my_view(request):
        doctors = GP.objects.all()
        periods = get_time_periods()
        smallest_date = get_smallest_date(time_periods)
        largest_date = get_largest_date(time_periods)
        # we make depth=3 so it spans via the PatientGPLink over to GP
        deaths = Death.objects.select_related(depth=3).filter(date__range=(smallest_date, largest_date)).filter(patient__patientgplink__ismain=True)
        results = dict((period,{doctor: 0}) for doctor in doctors for period in periods) 
        for death in deaths:
            for period in time_periods: 
                if death.date > period['start_date'] and death.date < period['end_date']:
                    # and we change how we access the GP
                    results[period][death.patient.patientgplink.gp] += 1

Still only two queries, though those queries are larger (spanning multiple tables).

Answer 3


Thanks to jpic for info on lookups that span relationships (specifically the 'backwards' functionality) +1 from me. I've included my updated code just in case anyone else comes looking.

for doc in GP.objects.all():
    html += "<tr><td>" + str(doc) + "<td>"
    patients = (Patient.objects.filter(patientgplink__gp=doc)
                            .filter(patientgplink__is_main=True)
                            .select_related(depth=1))
    for period in time_periods:
        count = (patients.filter(death__date__gte=period['start_date'])
                            .filter(death__date__lte=period['end_date'])
                            .count())
        html += "<td>" + str(count) + "</td>"
    html += "</tr>"