DataRow and DataTable serialization in .NET

November 6, 2013

Posted in:
pivotal
.net

I recently had to resolve a performance problem in one of our Pivotal systems and thought I should share the findings. Pivotal has what are called server and client tasks which, as their names suggest, contain logic that runs on the server and client side, respectively. The platform allows objects to be passed between the server and client layers of an application, and also between different layers/tasks on the server side. Internally this is accomplished using serialization. The platform uses this to pass data (ADO.NET DataSet, DataTable and DataRow objects) from the server to the client, and from server to server. It turns out that passing a DataRow from server to client can hurt system performance under certain conditions.

The short version is the following: when passing a DataRow from a server task to another server task the platform is actually serializing the whole DataTable to which the data row belongs. The bigger the DataTable the longer it takes to serialize and reach the other task. Following is a more detailed explanation of this particular case.

Scenario

Let’s assume that we have a task that needs to work with several data records. The task is designed so the work is split into two subtasks: one subtask reads all the required records from the database, and another subtask takes care of processing a specific record. There are many reasons why this is a good design practice: modularity, separation of concerns, transaction handling, etc. As an example, let’s assume that we need to design an archiving process that is executed monthly. Every month there could be thousands of records to archive. If we do the archiving in a single task then we could have transaction issues (usually the transaction timeout is set to 60 seconds, and if the archiving takes longer than this the transaction will be rolled back). To avoid transaction issues we create one task that selects all records that require archiving in a non-transactional context (all we are doing is a read) and then, for each record (or a batch of records if we want more performance), we call another task, executing under a transactional context, that takes care of the archiving.

Processing each record individually might not be the best approach here; processing records in batches would be better. But for the sake of simplicity, and to illustrate the actual problem, I will explain the scenario using individual record processing. The concepts also apply when batch processing is used instead.

The Implementation

We’ve decided to implement our process using the design mentioned above, two separate tasks, one for selecting records and another to process such records:

The first task creates a DataTable (using the platform API to retrieve data from the database) containing all the records that need to be processed. It then loops over the data rows in the table, passing each row (a DataRow object) to the other task, which takes care of processing the row.
The second task receives a DataRow as input, creates the transaction, and processes the record.

Back to our example, the archiving process does a query on the database retrieving all the records for the previous month (it could be thousands) and puts them in a DataTable, and then for each DataRow calls another task that actually does the archiving.

The Performance

We could think that our process is well designed: we are doing just one read on the database (the first task) and then we are processing each record individually (the second task). We could assume that the time to process an individual record is a constant, and what changes is the number of records to process. In other words, if each record is processed in a time t, then processing n records will take t*n time.

It turns out that, under this scenario, the time to process each record will be proportional to the number of records to process. The time to process each individual record is no longer a constant t, but a function of the number of records to process. We can express the total time as f(n)*n.

Since f(n) grows with n, f(n)*n becomes much bigger than t*n as n increases (our case): if f(n) is proportional to n, the total time grows with the square of n. So why is this happening?

The Reality

When objects are passed between layers/tasks (in our case, passing a DataRow between the first task and the second task) these are serialized. In other words, each task will be executing on its own thread and in the case of this platform, if we want to pass parameters between these threads we need to serialize them (either using XML or binary serialization).

The problem is that in order to pass a DataRow as a parameter we have to serialize the DataRow. This might not be obvious, but serializing the DataRow serializes not only the individual record but also the whole DataTable to which the row belongs. So for example, if the first task creates a DataTable of 1000 records, when we serialize each record and pass the DataRow to the second task, we are actually serializing the whole DataTable and passing all 1000 records to the second task (and we would do this 1000 times).

This explains the formula stated before. If I have to process 1000 records, the platform is actually serializing the whole table 1000 times. The more records the table has, the longer it takes to serialize, and the longer it takes to process each individual record.
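The effect can be reproduced outside Pivotal with any serializer that follows object references. The following is a minimal Python sketch, not the platform's code: the `build_table` helper and the dictionary layout are hypothetical stand-ins for a DataTable and its DataRows, and `pickle` stands in for whatever serializer the platform uses. Because each row keeps a reference to its parent table, serializing one row drags the whole table along, while serializing only the id stays constant in size.

```python
import pickle


def build_table(n_rows):
    """Toy stand-in for a DataTable: a dict that owns a list of row dicts.

    Each row keeps a back-reference to its table, mimicking the way a
    DataRow always belongs to a DataTable. (Hypothetical layout.)
    """
    table = {"name": "Archive", "rows": []}
    for i in range(n_rows):
        table["rows"].append({"id": i, "data": "payload-%d" % i, "table": table})
    return table


def serialized_row_size(n_rows):
    """Bytes produced when serializing ONE row of a table with n_rows rows."""
    table = build_table(n_rows)
    return len(pickle.dumps(table["rows"][0]))


def serialized_id_size(n_rows):
    """Bytes produced when serializing only the id of that same row."""
    table = build_table(n_rows)
    return len(pickle.dumps(table["rows"][0]["id"]))


if __name__ == "__main__":
    # Prints: table size, bytes for one serialized row, bytes for one id.
    for n in (10, 100, 1000):
        print(n, serialized_row_size(n), serialized_id_size(n))
```

Running it shows the size of a single serialized row growing with the table, which is the f(n) in the formula above.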

The Correct Implementation

So, what can we do to solve this problem and get back to a processing time of t*n? In the right implementation, the first task passes the record id (primary key) to the second task instead of the DataRow. The first task still creates the DataTable, but this table contains only one field: the record id. This id is passed to the second task, which has to make a database read to get all the information for that particular record, and then processes it.
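The shape of this design can be sketched independently of the Pivotal API. The Python example below uses an in-memory SQLite database purely for illustration; the table names (`orders`, `orders_archive`), the columns and all function names are hypothetical, and a real Pivotal implementation would use the platform's own data access and task-invocation calls instead.

```python
import sqlite3


def make_demo_db(n_rows):
    """Create a throwaway in-memory database with n_rows records to archive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, "
        "archived INTEGER DEFAULT 0)"
    )
    conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, payload) VALUES (?, ?)",
        [(i, "payload-%d" % i) for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def select_ids_to_archive(conn):
    """First task: a single read that returns only primary keys."""
    rows = conn.execute("SELECT id FROM orders WHERE archived = 0 ORDER BY id")
    return [row[0] for row in rows]


def archive_record(conn, record_id):
    """Second task: re-read one record by id and archive it in its own transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT id, payload FROM orders WHERE id = ?", (record_id,)
        ).fetchone()
        conn.execute("INSERT INTO orders_archive (id, payload) VALUES (?, ?)", row)
        conn.execute("UPDATE orders SET archived = 1 WHERE id = ?", (record_id,))


def run_archiving(conn):
    """Driver: only the id crosses the task boundary, never the whole row."""
    ids = select_ids_to_archive(conn)
    for record_id in ids:
        archive_record(conn, record_id)
    return len(ids)


if __name__ == "__main__":
    print(run_archiving(make_demo_db(1000)), "records archived")
```

The value handed from the first task to the second is a single integer, so whatever serialization happens at that boundary no longer depends on how many records were selected.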

But with this scenario we are doing n+1 database reads. So, why is this more efficient than the previous scenario where we only made 1 read to the database? The answer lies in the fact that, for this platform in particular, the time to do a database read is much lower than the time to serialize a big DataTable. So, if each database read takes d time, our processing time will be d+(t+d)*n, which will be a lot lower than d+f(n)*n, for big n.
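To make the comparison concrete, here is a small Python sketch of both formulas. It assumes a linear model for f(n), namely f(n) = t + s*n, where s is a hypothetical serialization cost per row of the table; neither that model nor the sample numbers come from measurements on a real system.

```python
def total_time_passing_rows(n, d, t, s):
    """d + f(n)*n, with the ASSUMED model f(n) = t + s*n.

    d: time for one database read, t: time to process one record,
    s: assumed serialization cost per row of the table.
    """
    return d + (t + s * n) * n


def total_time_passing_ids(n, d, t):
    """d + (t + d)*n: one initial read, then one extra read per record."""
    return d + (t + d) * n


if __name__ == "__main__":
    d, t, s = 5, 10, 1  # illustrative values in milliseconds, not measured
    for n in (4, 5, 100, 1000):
        rows = total_time_passing_rows(n, d, t, s)
        ids = total_time_passing_ids(n, d, t)
        print(n, rows, ids)
```

Under this model the two approaches break even when s*n = d, that is at n = d/s. Below that, passing rows is cheaper; above it, passing ids wins, and the gap widens quickly as n grows.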

How does Pivotal send emails?

November 4, 2013

Posted in:
pivotal

Pivotal can send emails from either client or server side:

Client side: Pivotal will always use MAPI (Outlook or Lotus Notes) to send emails. You just need to have your email client appropriately configured.
Server side: here we have two options:

Use the email client (Outlook or Lotus Notes)
Use SMTP

In all the projects I’ve participated in, the SMTP option for server side emails has always been used, mostly because SMTP is the preferred choice for IT and having Outlook configured on a server is not. This is why development projects (whether developed on premise or on the client’s infrastructure) always need to use SMTP. Of course, if SMTP is the option we’ll be using, unit and functional tests will need to be run using SMTP as well.

Following is a brief overview of the configuration for having server side emails working on Pivotal.

Server side emails using Outlook

Emails are sent by the PBS (Pivotal Business Server) component. PBS executes under a security context identified by credentials. By default, when PBS is first installed, these credentials will be “Interactive User”. This means that the PBS will be executing with the credentials of the currently logged-in user. It also means that when an email is sent using Outlook, the PBS will use the Outlook profile of the logged-in user. The right thing to do is to configure the PBS to use a specific account. In this case the PBS will use the Outlook profile configured for that account (in the example the “macv607\pbs” user). You will need to log in to the computer with that account and configure Outlook to send/receive emails. Do some tests to make sure that the Outlook client works fine.

Server side emails using SMTP

Emails are sent by the PBS, but these will be sent using the SMTP protocol. The SMTP configuration is made using the Pivotal Configuration Utility. If the “Use SMTP for Sending E-mail” option is checked then the PBS will use SMTP as its first option for sending email. This means that it will not use Outlook. This is the right choice for most projects.
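Before pointing the PBS at an SMTP server, it can help to confirm that the server accepts plain (unencrypted, unauthenticated) SMTP from the PBS machine. The following Python sketch is a standalone check and has nothing to do with how Pivotal itself sends mail; the host, port and addresses are assumptions to be replaced with your own values.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a small plain-text message for a smoke test."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "SMTP smoke test"
    msg.set_content("If you can read this, the server accepted plain SMTP.")
    return msg


def send_test_message(msg, host="localhost", port=25):
    """Send over plain SMTP: no TLS and no authentication.

    host and port are assumed defaults for a local development server.
    """
    with smtplib.SMTP(host, port, timeout=10) as server:
        server.send_message(msg)


if __name__ == "__main__":
    # Hypothetical addresses; use accounts that exist on your test server.
    message = build_test_message("pbs@example.test", "dev@example.test")
    send_test_message(message)
```

If the call fails with a connection or authentication error, the PBS is likely to hit the same problem with the same settings.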

Considerations when using SMTP

Using Office 365: Office 365 enforces strict security (it requires secure SMTP and authentication). The SMTP configuration screen for Pivotal doesn’t allow secure SMTP to be configured. This means that you will not be able to use SMTP with Office 365. If this is the case then you must use Outlook to send server side emails.
For development, the common thing to do is to configure a local SMTP server. You could use any SMTP server, but I recommend hMailServer, a powerful free email server that supports SMTP, POP, IMAP, etc.
You will use Outlook to see the emails sent by the PBS. You will have to configure Outlook to use the SMTP server you set up for development/tests. Here is an example of how to configure Outlook to access the hMailServer: