DataRow and DataTable serialization in .NET
I recently had to resolve a performance problem in one of our Pivotal systems and I thought I should share the findings. Pivotal has what are called server and client tasks which, as their names suggest, constitute logic that runs on the server and client side, respectively. The platform allows objects to be passed between the server and client layers of an application, and also between different layers/tasks on the server side. Internally this is accomplished using serialization. The platform uses this mechanism to pass data (as ADO.NET DataSet, DataTable and DataRow objects) from the server to the client, and from server to server. It turns out that passing a DataRow from one task to another can affect system performance under certain conditions.
The short version is this: when a DataRow is passed from one server task to another, the platform actually serializes the whole DataTable to which the row belongs. The bigger the DataTable, the longer it takes to serialize and reach the other task. What follows is a more detailed explanation of this particular case.
Scenario
Let’s assume that we have a task that needs to work with several data records. The task is designed so the work is split in two subtasks: one subtask reads all the required records from the database, and another subtask takes care of processing a specific record. There are many reasons why this is a good design practice: modularity, separation of concerns, transaction handling, etc. As an example, let’s assume that we need to design an archiving process that runs monthly. Every month there could be thousands of records to archive. If we do the archiving in a single task then we could run into transaction issues (usually the transaction timeout is set to 60 seconds, and if the archiving takes longer than this the transaction is rolled back). To avoid transaction issues we create one task that selects all records that require archiving in a non-transactional context (all we are doing is a read) and then, for each record (or batch of records, if we want more performance), we call another task, executing under a transactional context, that takes care of the archiving.
Processing each record individually might not be the best approach here; processing records in batches would perform better. But for the sake of simplicity, and to illustrate the actual problem, I will explain the scenario using individual record processing. The same concepts apply when batch processing is used instead.
The Implementation
We’ve decided to implement our process using the design mentioned above, with two separate tasks:
- The first task creates a DataTable (using the platform API to retrieve data from the database) containing all the records that need to be processed. It then loops over the rows in the table, passing each row (a DataRow object) to the second task, which takes care of processing it.
- The second task receives a DataRow as input, creates the transaction, and processes the record.
Back to our example: the archiving process queries the database for all the records from the previous month (potentially thousands), puts them in a DataTable, and then, for each DataRow, calls another task that actually does the archiving.
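In code, the naive flow might look like the following sketch. SelectRecordsToArchive and CallServerTask are hypothetical stand-ins for the platform’s data-access and task-invocation APIs, which are not shown here:

```csharp
using System.Data;

class NaiveArchiveProcess
{
    // --- Hypothetical stand-ins for the platform's APIs -----------------
    static DataTable SelectRecordsToArchive() { /* platform query */ return new DataTable(); }
    static void CallServerTask(string task, object parameter) { /* platform call */ }
    // ---------------------------------------------------------------------

    // First task: one database read, then one task call per record.
    static void SelectAndDispatch()
    {
        DataTable records = SelectRecordsToArchive();
        foreach (DataRow row in records.Rows)
        {
            // Each call serializes its parameter. As we will see below,
            // a DataRow cannot travel alone: the whole table goes with it.
            CallServerTask("ArchiveRecord", row);
        }
    }

    static void Main() => SelectAndDispatch();
}
```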
The Performance
We could think that our process is well designed: we do just one read on the database (the first task) and then we process each record individually (the second task). We could also assume that the time to process an individual record is constant, and that only the number of records changes. In other words, if each record is processed in time t, then processing n records will take t*n time.
It turns out that, under this scenario, the time to process each record is proportional to the number of records to process. The time to process an individual record is no longer a constant t, but a function f(n) of the number of records n, and the total time becomes f(n)*n.
Obviously, f(n)*n grows much faster than t*n as n increases (our case). So why is this happening?
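To put some invented numbers on it: if serializing the table costs about 1 ms per row, then f(n) ≈ t + n milliseconds, and the total time t*n + n*n grows quadratically. For 1000 records that is on the order of 1,000,000 ms (about 17 minutes) spent on serialization alone. The per-row cost is made up for illustration; the real figure depends on the schema and payload.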
The Reality
When objects are passed between layers/tasks (in our case, passing a DataRow from the first task to the second) they are serialized. Each task executes on its own thread and, on this platform, if we want to pass parameters between these threads we need to serialize them (using either XML or binary serialization).
The problem is that in order to pass a DataRow as a parameter we have to serialize it, and serializing a DataRow serializes not only the individual record but also the whole DataTable to which the row belongs. This might not be obvious, but it means that if the first task creates a DataTable of 1000 records, then every time we pass a DataRow to the second task we are actually serializing the whole DataTable and sending all 1000 records (and we would do this 1000 times).
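In fact, DataRow is not even marked [Serializable] in .NET; only DataSet and DataTable are, so any mechanism that ships a "row" through a serializer ends up shipping its table. The sketch below illustrates the effect with plain ADO.NET, measuring how the payload for "one row" grows with the table size (the schema and row counts are invented for illustration):

```csharp
using System;
using System.Data;
using System.IO;

class PayloadGrowthDemo
{
    // Serialize a table to XML (schema included) and report the size.
    // This mimics what any serializer must do to move a "row" whose
    // type, DataRow, cannot be serialized on its own.
    static long SerializedSize(DataTable table)
    {
        using (var stream = new MemoryStream())
        {
            table.WriteXml(stream, XmlWriteMode.WriteSchema);
            return stream.Length;
        }
    }

    static void Main()
    {
        foreach (int rowCount in new[] { 10, 100, 1000, 10000 })
        {
            var table = new DataTable("Records");
            table.Columns.Add("Id", typeof(int));
            table.Columns.Add("Payload", typeof(string));
            for (int i = 0; i < rowCount; i++)
                table.Rows.Add(i, new string('x', 100));

            // The payload for passing "one row" grows with the table size.
            Console.WriteLine($"{rowCount,6} rows -> {SerializedSize(table):N0} bytes per call");
        }
    }
}
```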
This explains the formula stated before: if I have to process 1000 records, the whole table is serialized 1000 times. The more records the table has, the longer it takes to serialize, and the longer it takes to process each individual record.
The Correct Implementation
So, what can we do to solve this problem and get back to a processing time of t*n? The right implementation is for the first task to pass the record id (the primary key) to the second task instead of the DataRow. The first task still creates a DataTable, but this table contains only one field: the record id. The second task receives the id, makes a database read to get all the information for that particular record, and then processes it.
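A sketch of the corrected flow, again using hypothetical stand-ins (SelectIdsToArchive, ReadRecordById, CallServerTask and Archive) for the platform’s APIs:

```csharp
using System.Data;

class ArchiveProcess
{
    // --- Hypothetical stand-ins for the platform's APIs -----------------
    static DataTable SelectIdsToArchive() { /* platform query */ return new DataTable(); }
    static DataRow ReadRecordById(int id) { /* platform read */ return null; }
    static void CallServerTask(string task, object parameter) { /* platform call */ }
    static void Archive(DataRow record) { /* the actual archiving work */ }
    // ---------------------------------------------------------------------

    // First task: select only the primary keys (one database read),
    // then hand each id to the transactional task.
    static void SelectAndDispatch()
    {
        DataTable ids = SelectIdsToArchive();   // single column: "Id"
        foreach (DataRow row in ids.Rows)
        {
            // Serializing an int is cheap, and its cost does not depend
            // on how many records are being processed.
            CallServerTask("ArchiveRecord", (int)row["Id"]);
        }
    }

    // Second task: runs in a transactional context, re-reads the full
    // record by id (one extra read per record), then processes it.
    static void ArchiveRecord(int id)
    {
        DataRow record = ReadRecordById(id);
        Archive(record);
    }

    static void Main() => SelectAndDispatch();
}
```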
But with this scenario we are doing n+1 database reads. So why is this more efficient than the previous scenario, where we made only one read? The answer lies in the fact that, for this platform in particular, the time to do a database read is much lower than the time to serialize a big DataTable. So, if each database read takes time d, our total processing time will be d+(t+d)*n, which is a lot lower than d+f(n)*n for big n.
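Using the same invented figures as before: with d = 10 ms, t = 50 ms and n = 1000, the id-based design costs roughly 10+(50+10)*1000 ms, about one minute, while the row-based design, with f(n) ≈ 1050 ms per record, costs roughly 10+1050*1000 ms, over 17 minutes.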