Beware LINQ Group By custom class

Actually this one is pretty obvious. But if you are focused on implementing some complex logic, it is easy to forget about this requirement. Let’s assume there is following piece of code:

var processedData = 
    from q in rawDataFromDb
    group q by new ProductCodeCentreNumPair 
        { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

public class ProductCodeCentreNumPair
{
    public string ProductCode;
    public string CentreNumber;
}

This will compile and run, however it will not distinguish ProductCodeCentreNumPair instances. I.e. it will not do the grouping, but produce a sequence of IGroupings, each for corresponding source item. The reason is self-evident, if we try to think for a while. This custom class does not have custom equality comparison logic implemented. Default logic is based on ReferenceEquals so, as each separate object resides at different memory address, they all will be recognized as not equal to each other. Even if they contain the same values in their fields (strings behave like value types, although they are reference types). I used the following set of overridden methods to provide my custom equality comparison logic to solve the problem. It is important to note, that GetHashCode is also needed in order for the grouping to work.

public override bool Equals(object other)
{
    if (ReferenceEquals(null, other)) return false;
    if (ReferenceEquals(this, other)) return true;
    if (this.GetType() != other.GetType())
    	return false;
    return Equals((ProductCodeCentreNumPair)other);
}

public override int GetHashCode()
{
    return ProductCode.GetHashCode() + CentreNumber.GetHashCode();
}

public bool Equals(ProductCodeCentreNumPair other)
{
    var result =
    other.CentreNumber == CentreNumber &&
    other.ProductCode == ProductCode
    ;
    return result;
}

Alternatively you can use anonymous types, I mean:

var processedData = 
    from q in rawDataFromDb
    group q by new { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

will just work. This is because instances of anonymous classes have automatically generated equality comparison logic based on values of their fields. Contrary to ReferenceEquals based implementation generated for typical named classes. They are most frequently used in the context of comparisons, so it seems reasonable.

One more alternative is to use a structure instead of a class. But structures should only be used if their fields are value types, because only then you can benefit from binary comparison of their value. And even having structs instead of classes requires implementing custom GetHashCode. By not implementing it, there is a risk that 1) the auto-generated implementation will use reflection or 2) will not be well distributed across int leading to performance problems when adding to HashSet.

When profiling a loop, look at entire execution time of that loop

Today’s piece of advice will not be backed by concrete examples of code. It will be just loose observations of some profiling session and related conclusions. Let’s assume there is a C# code doing some intense manipulations over in-memory stored data. The operations are done in a loop. If there is more data, there are more iterations of the loop. The aim of the operations is to generate kind of summary of data previously retrieved from a database. It has been proved to work well in test environments. Suddenly, in production, my team noticed this piece of code takes about 20 minutes to execute. It turned out there were about 16 thousand iterations of the loop. It was impossible to test such high load with manual testing. The testing only confirmed correctness of the logic itself.

After an investigation and some experiments it turned out that bad data structure was to blame. The loop did some lookup-heavy operations over List<T>. Which are obviously O(n), as list is implemented over array. Substitution of List<T> for Hashset<T> caused dramatic reduction of execution time to a dozen or so of seconds. This is not as surprising as it may seem because Hashset<T> is implemented over hashtable and has O(1) lookup time. These are some data structures fundamentals taught in every decent computer science lecture. The operation in a loop looked innocent and the author did not bother to try to anticipate future load of the logic. A programmer should always keep in mind the estimation of the amount of data to be processed by their implementation and think of appropriate data structure. By appropriate data structure I mean choice between fundamental structures like array (vector), linked list, hashtable, binary tree. This can be the first important conclusion, but I encouraged my colleagues to perform profiling. The results were not so obvious.

Although the measurement of timing of the entire loop with 16 thousand iterations showed clearly that hashtable based implementation performs orders of magnitude better, but when we stepped over individual loop instructions there was almost no difference in their timings. The loop consisted of several .Where calls over entire collection. These calls took something around 10 milliseconds in both List<T> and Hashset<T> implementations. If we had not bothered to measure entire loop execution, it would have been pretty easy to draw conclusion there is no difference! Even performing such step measurement during the development can me misleading, because at first sight, does 10 milliseconds look suspicious? Of course not. Not only does it look unsuspicious but also it works well. At least in test environment with test data.

As I understand it, the timings might have been similar, because we measured only some beginning iterations of the loop. If the data we are searching for are at the beginning of the array, the lookup can obviously be fast. For some edge cases even faster than doing hashing and looking up corresponding list of slots in a hashtable.

For me there are two important lessons learned here:

  • Always think of data structures and amount of data to be processed.
  • When doing performance profiling, measure everything. Concentrate on amortized timings of entire scope of code in question. Do not try to reason about individual subtotals.

My Visual Studio workflow when working on web applications

I use IIS Express for majority of my ASP.NET development. Generally I prefer not to restart IIS Express because it may take some time to load big application again. Unfortunately Visual Studio is eager to close IIS Express after you finish debugging. In past, there were some tips on how to prevent such behavior, for instance Edit and Continue should be disabled, and then IIS Express stayed after having finished debugging. I observed after Visual Studio 2015 Update 2 this no longer works. But there is yet another way to preserve IIS Express. We must not stop debugging, but detach the debugger instead in order to accomplish this. I prefer creating custom toolbar button:

Right click on toolbar Customize Commands tab Toolbar Debug Add command Debug Detach all

An important caveat is also to use Multiple startup projects configured in the solution properties. A web application typically consists of many projects, so it is convenient to run them all at once.

Having IIS Express continuously running in the background I would like to introduce how to quickly start debugging. Here I also prefer custom toolbar button:

Right click on toolbar Customize Commands tab Toolbar Standard Add command Debug Attach to process

Or even better, keyboard binding:

Tools Options Environment Keyboard Debug.AttachtoProcess

And now we face the most difficult part: choosing correct iisexpress.exe process. It is likely there will be a few of them. We can search for the PID manually by right clicking IIS Express icon Show All applications and we can view the PID. But this may drive you crazy as you have to do this many times. I recommend simple cmd script which invokes PowerShell command to query WMI:

@echo off
powershell.exe -Command "Get-WmiObject Win32_Process -Filter """CommandLine LIKE '%%webapi%%' AND Name = 'iisexpress.exe'""" | Select ProcessId | ForEach { $_.ProcessId }"

Here I am searching for the process whose path contains “webapi” string. I have to use triple quotes and double percent signs because that is how cmd’s escaping works. The final pipe to ForEach is not necessary, it serves the purpose of formatting the output to raw number, not a fragment of a table. I always have running cmd window so I put this little script into PATH variable and I can view desired PID instantaneously. By the way, Windows Management Instrumentation is tremendously powerful interface for obtaining information about anything in the operating system.

Knowing the PID you can scroll down the process list by simply pressing i letter and then you can visually distinguish the instance with a relevant PID.

Entity object in EF is partially silently read-only

What this post is all about is the following program written in C# using Entity Framework 6.1.3 throwing at line 25 and not at line 23.

We can see the simplest usage of Entity framework here. There is a Test class and a TestChild class which contains a reference to an instance of Test named Parent. This reference is marked as virtual so that Entity Framework in instructed to load an instance in a lazy manner, i.e. upon first usage of that reference. In DDL model obviously TestId column is a foreign key to Test table.

I create an entity object, save it into database and then I retrieve it at line 21. Because the class uses virtual properties, Entity Framework dynamically creates some custom type in order to be able to implement lazy behavior underneath.

Now let’s suppose I have a need to modify something in an object retrieved from a database. It turns out I can modify Value property, which is of pure string type. What is more, I can also modify Parent property, but… the modification is not preserved!. This program throws at line 25 because an assignment from line 24 is silently ignored by the framework.

I actually have been trapped by this when I was in need of modifying some collection in complicated object graph. I am deeply disappointed the Entity Framework on one hand allows modification of non-virtual properties, but on the other hand it ignores virtual ones. This can make a developer run into a trouble.

Of course I am aware it is not good practice to work on objects of data model classes. I recovered myself from this situation with AutoMapper. But this is kind of a quirk, and a skilled developer has to hesitate to even try to modify something returned by Entity Framework.

using System;
using System.Data.Entity;
using System.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main()
        {
            using (var db = new TestDBContext())
            {
                var t = new Test { Value = "Hello" };
                var c = new TestChild { Value = "Hello from child", Parent = t };
                db.TestChildren.Add(c);
                db.SaveChanges();
            }
            using (var db = new TestDBContext())
            {
                // Type of c is System.Data.Entity.DynamicProxies.TestChild_47042601AE8E209C11CC25521C2746A2B9D93EC625A6F20BA3D60926278A3D21}
                var c = db.TestChildren.First();
                c.Value = string.Empty;
                if (c.Value != string.Empty) throw new Exception();
                c.Parent = null;
                if (c.Parent != null) throw new Exception();
            }
        }
    }

    public class TestDBContext : DbContext
    {
        public TestDBContext() : base("Name=default")
        {
            Database.SetInitializer<TestDBContext>(new CreateDatabaseIfNotExists<TestDBContext>());
        }
        public DbSet<Test> Tests { get; set; }
        public DbSet<TestChild> TestChildren { get; set; }
    }

    public class Test
    {
        public int Id { get; set; }
        public string Value { get; set; }
    }

    public class TestChild
    {
        public int Id { get; set; }
        public int TestId { get; set; }
        public string Value { get; set; }
        public virtual Test Parent { get; set; }
    }

}

AngularJS custom validator, an excellent example of duck typing

A StackOverflow answer usually is not worth writing a blog post, but this time I am sharing something, which 1) is an invaluable idea for developers using AngularJS 2) may serve as a beginning of deeper theoretical considerations regarding types in programming languages.

The link to before-mentioned SO:
http://stackoverflow.com/questions/18900308/angularjs-dynamic-ng-pattern-validation

I have created a codepen of this case:

See the Pen AngularJS ng-pattern — an excellent example of duck typing by Przemyslaw S. (@przemsen) on CodePen.0

In the codepen you can see a JavaScript object set up in ng-pattern directive, which should typically be a regular expression. The point is that AngularJS check for being a regular expression is simply a call to .test() function. So we can just attach such test function to an object and implement whatever custom validation logic we need. In here we can see the beauty of duck typing allowing us to freely customize behavior of the framework.

I said this could be a beginning of a discussion on how to actually understand duck typing. It is not that obvious and programmers tend to understand it intuitively rather than precisely, though it is not a term with any kind of formal definition. I recommend Eric Lippert’s blog post on the subject: https://ericlippert.com/2014/01/02/what-is-duck-typing/.

The horror of JavaScript Date

I have heard about difficulties of using JavaScript date APIs, but it was not until recently when I eventually experienced it myself. I am going to describe one particular phenomenon that can lead to wrong date values sent from client’s browser to a server. When analysing these examples please keep in mind they were executed on machine with UTC+01:00 time zone unless I explicitly tell that an example refers to a different time zone.

Let’s try to parse a JavaScript Date object:

 var c = new Date("2015-03-01");
c.toString();
> Sun Mar 01 2015 01:00:00 GMT+0100

What draws my attention is the time value. It is 01:00 which may look strange. But it is not if we correlate this value with time zone information which is stored along with JavaScript object. The time zone information is an inherent part of the object and it comes from the browser, which obviously derives it from the operating system’s culture settings. It turns out these two pieces of information are essential when making AJAX calls, because then .toJSON() method is called. I am making this assumption based on the example behaviour of ngResource library, but other frameworks or libraries probably do the same, because they must somehow convert JavaScript Date object to a universal text format to be able to send it via HTTP. By the way, .toJSON() returns the same result as to .toISOString().

c.toJSON();
> 2015-03-01T00:00:00.000Z

What we have got here is UTC-normalized date and time value. The time zone offset of actual time value used with time zone information allow us to normalize the date when sending it to a server. The most important thing here is that the values stored in Date object are expressed in local time zone i.e. the browser’s one. This implies some strange consequences like the one of being in UTC-xx:xx time zones. Let’s try the same example after setting time zone to UTC-01:00.

var c = new Date("2015-03-01");
c.toString();
> Sat Feb 28 2015 23:00:00 GMT-0100

The problem here is that we have actually ended up with parsed values which are different from their original textual representation i.e. March 1st versus February 28th. But it is still OK providing that our date processing logic relies on normalized values:

c.toJSON();
> 2015-03-01T00:00:00.000Z

However, it can be misleading when we try to get individual date components. Here we try to get day component.

c.getDate()
> 28

But in general the object still can serve the purpose if we rely on normalized values and call appropriate methods.

The problem is that not all Date constructors behave in the same way. Let’s try this one:

var b = new Date('03/01/2015');
b.toString();
> Sun Mar 01 2015 00:00:00 GMT+0100

Now the parsed date although it still contains time zone information derived from the operating system, but it does not contain time value modified by the corresponding offset. In the first example we have 1am time which corresponds to GMT+01:00, here we have just 00:00 time and, of course, we still do have GMT+01:00 time zone information. This time zone information without correctly shifted time value is actually catastrophic. Look what happens when .toJSON() is called:

b.toJSON()
> 2015-02-28T23:00:00.000Z

The result is wrong date sent to a server. This is not the end of the observation. The same phenomenon can also happen in the other way round, i.e. when we are transferring date values from a server to a client. Now let’s assume the server sent the following date and we are parsing it. Please keep in mind that the actual process of parsing may happen implicitly in some framework’s code, for instance when specifying data source for Kendo grid. So the one who parses it for us can be a framework’s code.

var d = new Date("2015-03-01T00:00:00.000Z");
d.toString();
> Sun Mar 01 2015 01:00:00 GMT+0100

As we see, this constructor results in shifted time value just like the Date(“2015-03-01”) one. But when considering displaying these retrieved values we inevitably have to answer the question, whether we aim at showing local time or the server time. We have to remember that in case when the client’s browser is in GMT-xx:xx time zone and we try to show the parsed value (like in c.getDate() example), not the normalized one, this may result in wrong date displayed in front of the user. I say `may`, because this can really be a desired behaviour depending on the requirements. For example, in Angularwe can enforce displaying normalized value by providing optional time zone parameter to $filter('date').

var c = new Date("2015-03-01");
$filter('date')(c, "yyyy-MM-dd", "+0000");
> 2015-03-01

Here we do not worry about internal, actual component values of object c whose prototype is Date. It internally may store February 28th but it does not matter. $filter is told to output values for UTC time. It is also worth mentioning that the Date constructor also assumes that its argument in specified in UTC time. So we populate the Date object with UTC value and return also UTC value not worrying about internal representation which is local. This approach results in the output date and time being equal to the intended input one.

As a conclusion I should write some recommendations on how to use Date object and what to avoid. But honestly I cannot say I gained enough knowledge in this area to make any kind of guidelines. I can only afford making some general statements. Just pay attention to what your library components like, for instance, a date picker control operates on. Is their output a Date object or string representation of the date? What do you do with this output? Is their input a Date object or a string representation and they do the parsing on their own? Just examine carefully and do not blindly trust your framework. I personally do not accept situation when something works and I do now know why it does. I tend to dig into details and try to find the nitty gritty. I have always believed deep research (even one which takes much time) and understanding of underlying technology is worthwhile and I often recall the post by Scott Hanselman who also appears to follow this principle.