

Wednesday, July 11, 2012

Django connection pooling using sqlalchemy connection pool

As you know, Django opens a new database connection for each request. This works well initially, but as the load on the server increases, creating and destroying database connections starts taking a significant amount of time. You will find many questions about connection pooling for Django on sites like StackOverflow, for example, Django persistent database connection.

At BootStrapToday we use SQLAlchemy's connection pooling mechanism with Django to pool database connections. We use a variation of the approach by Igor Katson described in http://dumpz.org/67550/. Igor's approach requires patching Django, which we wanted to avoid, so we wrote a small function that we import in an __init__.py (or models.py), i.e. some file that gets imported early during application startup.


import logging

import sqlalchemy.pool as pool
from django.conf import settings
from django.db.utils import load_backend  # available in Django >= 1.2

pool_initialized = False

def init_pool():
    global pool_initialized
    if not pool_initialized:
        pool_initialized = True
        try:
            backendname = settings.DATABASES['default']['ENGINE']
            backend = load_backend(backendname)

            # Replace the backend's DB-API module with a pooling proxy.
            backend.Database = pool.manage(backend.Database)

            # Re-export the exception classes through the proxy.
            backend.DatabaseError = backend.Database.DatabaseError
            backend.IntegrityError = backend.Database.IntegrityError
            logging.info("Connection Pool initialized")
        except Exception:
            logging.exception("Connection Pool initialization error")

# Now call init_pool() to initialize the connection pool. No change is
# required in the Django code itself.
init_pool()

So far this seems to be working quite well.
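Under the hood, pool.manage() wraps the DB-API module in a proxy whose connect() hands out pooled connections. The pure-Python stand-in below is not SQLAlchemy's actual implementation, just a sketch of the idea (using sqlite3 as the DB-API module); it also shows why attributes like DatabaseError can still be re-exported through the proxy:

```python
import queue
import sqlite3

class PooledModule:
    """Minimal stand-in for sqlalchemy.pool.manage(): wraps a DB-API
    module so that connect() hands out pooled connections."""

    def __init__(self, dbapi_module, pool_size=5):
        self._module = dbapi_module
        self._pool = queue.Queue(maxsize=pool_size)

    def connect(self, *args, **kwargs):
        try:
            # Reuse an idle connection if one is available.
            return self._pool.get_nowait()
        except queue.Empty:
            return self._module.connect(*args, **kwargs)

    def release(self, conn):
        try:
            self._pool.put_nowait(conn)
        except queue.Full:
            conn.close()

    def __getattr__(self, name):
        # Everything else (exception classes, paramstyle, ...) passes
        # through to the real module, which is why the snippet above can
        # still re-export backend.Database.DatabaseError.
        return getattr(self._module, name)

db = PooledModule(sqlite3)
c1 = db.connect(':memory:')
db.release(c1)
c2 = db.connect(':memory:')  # reuses the released connection
```

The real pool.manage() adds thread safety, per-argument pools, and automatic return of connections on close, but the proxying idea is the same.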

Monday, July 04, 2011

Optimizing Django database access : Part 2

Originally published on BootstrapToday Blog 

A few months back we moved to Django 1.2. We immediately realized that our method of 'optimizing' database queries on ForeignKey fields no longer worked. The main reason for the breakage was the changes in Django 1.2 to support multiple databases, specifically the way the 'get' function is called to access the value of a ForeignKey field.

In Django 1.2, ForeignKey fields are accessed through rel_mgr. See the call below.

rel_obj = rel_mgr.using(db).get(**params)

However, manager.using() returns a queryset, so if the manager implements a custom 'get' function, that custom 'get' is never called. Since our database access optimization relies on the custom 'get' function, this change broke our optimization.

Check Django Ticket 16173 for the details.
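The difference is easy to see even without Django. In the plain-Python sketch below, FakeManager and FakeQuerySet are hypothetical stand-ins for Django's classes, not the real ones: using() delegates to the queryset, so an overridden get() on the manager is bypassed, while db_manager() returns the manager itself.

```python
class FakeQuerySet:
    def get(self, **params):
        return 'queryset-get'

class FakeManager:
    def get_queryset(self):
        return FakeQuerySet()

    def using(self, db):
        # using() returns a queryset, so any custom get() defined on the
        # manager is bypassed.
        return self.get_queryset()

    def db_manager(self, db):
        # db_manager() returns a manager bound to `db`, preserving
        # overridden manager methods.
        return self

class CustomManager(FakeManager):
    def get(self, **params):
        return 'custom-get'

mgr = CustomManager()
print(mgr.using('default').get(pk=1))       # custom get() bypassed
print(mgr.db_manager('default').get(pk=1))  # custom get() honored
```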

Also, our original code didn't handle multi-db scenarios; the query 'signature' computation now has to include the db name as well.

from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    if hasattr(connection, 'model_cache'):
        delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching for per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching
        '''
        model_key = (self.db, self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # One of the cache queries missed, so we have to get the object
        # from the database:
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key] = model
        return model

There are minor changes from the first version: the manager class now includes the 'db' name in the cache key.

Along with this change, you have to make one change to the core Django code. In django/db/models/fields/related.py, in the __get__ function of the 'ReverseSingleRelatedObjectDescriptor' class, replace the line
         rel_obj = rel_mgr.using(db).get(**params)

with the line
         rel_obj = rel_mgr.db_manager(db).get(**params)

This ensures that YourModelManager's get function is called when querying foreign keys.

Saturday, February 05, 2011

Optimizing Django database access : some ideas/experiments

Originally published on BootstrapToday Blog  

As we added more features to BootStrapToday, we started facing performance issues. Page access was getting progressively slower. Recently we analyzed page performance using the excellent Django Debug Toolbar and discovered that, in the worst case, a single page made more than 500 database calls. Obviously that was making page display slow. After examining the various calls, analyzing the code and making changes, we were able to bring it down to about 80 calls, dramatically improving performance. Django has an excellent general-purpose caching framework; however, it's better to minimize the calls first and then add caching for the remaining queries. In this article, I am going to show a simple but effective idea for improving database access performance.

In Django, a common source of extra database calls is ForeignKey fields: accessing the attribute representing a foreign key typically results in a database query. The usual solution to this problem is the 'select_related' function of Django's queryset API, which can substantially reduce the number of database calls. However, sometimes it is not possible to use 'select_related' (e.g. you don't want to change how a third-party app works, or changing the queries would require significant changes to the logic).

In our case, we have ForeignKey fields like Priority, Status etc. on Ticket models. These fields exist because we want to later add the ability to 'customize' them. However, the values in these tables rarely change, so we usually end up making multiple queries for the same data, typically 'get' calls. If we add simple caching to 'get' queries for status, priority etc., we can significantly reduce the number of database calls.

In Django, a new connection is opened to handle each request. Hence, if we attach a model instance cache to the 'connection', query results are cached for the duration of a single request; a new request makes the database queries again. With this strategy we don't need complicated logic to clear 'stale' items from the cache.



from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    if hasattr(connection, 'model_cache'):
        delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching for per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching
        '''
        model_key = (self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # One of the cache queries missed, so we have to get the object from the database:
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key] = model
        return model


As a side benefit, since the same model instance is returned for the same query in subsequent calls, the number of duplicate instances is reduced, and hence the memory footprint is reduced as well.
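This effect can be sketched without Django. In the toy example below, Status and get_status are hypothetical stand-ins for a model and its cached manager 'get'; repeated lookups return the identical instance rather than fresh duplicates:

```python
# Per-request cache, analogous to connection.model_cache above.
model_cache = {}

class Status:
    def __init__(self, name):
        self.name = name

def get_status(name):
    key = ('Status', name)
    if key not in model_cache:
        model_cache[key] = Status(name)  # simulated database fetch
    return model_cache[key]

a = get_status('open')
b = get_status('open')
# `a` and `b` are the same object: one instance per distinct query,
# instead of one instance per call.
```

Without the cache, every call would build a new Status object, so a page touching the same row hundreds of times would hold hundreds of copies.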