Archive for Quick tip

Beware LINQ Group By custom class

Actually this one is pretty obvious. But if you are focused on implementing some complex logic, it is easy to forget about this requirement. Let’s assume there is following piece of code:

var processedData = 
    from q in rawDataFromDb
    group q by new ProductCodeCentreNumPair 
        { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

public class ProductCodeCentreNumPair
{
    public string ProductCode;
    public string CentreNumber;
}

This will compile and run, however it will not distinguish ProductCodeCentreNumPair instances. I.e. it will not do the grouping, but produce a sequence of IGroupings, each for corresponding source item. The reason is self-evident, if we try to think for a while. This custom class does not have custom equality comparison logic implemented. Default logic is based on ReferenceEquals so, as each separate object resides at different memory address, they all will be recognized as not equal to each other. Even if they contain the same values in their fields (strings behave like value types, although they are reference types). I used the following set of overridden methods to provide my custom equality comparison logic to solve the problem. It is important to note, that GetHashCode is also needed in order for the grouping to work.

public override bool Equals(object other)
{
    if (ReferenceEquals(null, other)) return false;
    if (ReferenceEquals(this, other)) return true;
    if (this.GetType() != other.GetType())
    	return false;
    return Equals((ProductCodeCentreNumPair)other);
}

public override int GetHashCode()
{
    return ProductCode.GetHashCode() + CentreNumber.GetHashCode();
}

public bool Equals(ProductCodeCentreNumPair other)
{
    var result =
    other.CentreNumber == CentreNumber &&
    other.ProductCode == ProductCode
    ;
    return result;
}

Alternatively you can use anonymous types, I mean:

var processedData = 
    from q in rawDataFromDb
    group q by new { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

will just work. This is because instances of anonymous classes have automatically generated equality comparison logic based on values of their fields. Contrary to ReferenceEquals based implementation generated for typical named classes. They are most frequently used in the context of comparisons, so it seems reasonable.

One more alternative is to use a structure instead of a class. But structures should only be used if their fields are value types, because only then you can benefit from binary comparison of their value. And even having structs instead of classes requires implementing custom GetHashCode. By not implementing it, there is a risk that 1) the auto-generated implementation will use reflection or 2) will not be well distributed across int leading to performance problems when adding to HashSet.

When profiling a loop, look at entire execution time of that loop

Today’s piece of advice will not be backed by concrete examples of code. It will be just loose observations of some profiling session and related conclusions. Let’s assume there is a C# code doing some intense manipulations over in-memory stored data. The operations are done in a loop. If there is more data, there are more iterations of the loop. The aim of the operations is to generate kind of summary of data previously retrieved from a database. It has been proved to work well in test environments. Suddenly, in production, my team noticed this piece of code takes about 20 minutes to execute. It turned out there were about 16 thousand iterations of the loop. It was impossible to test such high load with manual testing. The testing only confirmed correctness of the logic itself.

After an investigation and some experiments it turned out that bad data structure was to blame. The loop did some lookup-heavy operations over List<T>. Which are obviously O(n), as list is implemented over array. Substitution of List<T> for Hashset<T> caused dramatic reduction of execution time to a dozen or so of seconds. This is not as surprising as it may seem because Hashset<T> is implemented over hashtable and has O(1) lookup time. These are some data structures fundamentals taught in every decent computer science lecture. The operation in a loop looked innocent and the author did not bother to try to anticipate future load of the logic. A programmer should always keep in mind the estimation of the amount of data to be processed by their implementation and think of appropriate data structure. By appropriate data structure I mean choice between fundamental structures like array (vector), linked list, hashtable, binary tree. This can be the first important conclusion, but I encouraged my colleagues to perform profiling. The results were not so obvious.

Although the measurement of timing of the entire loop with 16 thousand iterations showed clearly that hashtable based implementation performs orders of magnitude better, but when we stepped over individual loop instructions there was almost no difference in their timings. The loop consisted of several .Where calls over entire collection. These calls took something around 10 milliseconds in both List<T> and Hashset<T> implementations. If we had not bothered to measure entire loop execution, it would have been pretty easy to draw conclusion there is no difference! Even performing such step measurement during the development can me misleading, because at first sight, does 10 milliseconds look suspicious? Of course not. Not only does it look unsuspicious but also it works well. At least in test environment with test data.

As I understand it, the timings might have been similar, because we measured only some beginning iterations of the loop. If the data we are searching for are at the beginning of the array, the lookup can obviously be fast. For some edge cases even faster than doing hashing and looking up corresponding list of slots in a hashtable.

For me there are two important lessons learned here:

  • Always think of data structures and amount of data to be processed.
  • When doing performance profiling, measure everything. Concentrate on amortized timings of entire scope of code in question. Do not try to reason about individual subtotals.

Entity object in EF is partially silently read-only

What this post is all about is the following program written in C# using Entity Framework 6.1.3 throwing at line 25 and not at line 23.

We can see the simplest usage of Entity framework here. There is a Test class and a TestChild class which contains a reference to an instance of Test named Parent. This reference is marked as virtual so that Entity Framework in instructed to load an instance in a lazy manner, i.e. upon first usage of that reference. In DDL model obviously TestId column is a foreign key to Test table.

I create an entity object, save it into database and then I retrieve it at line 21. Because the class uses virtual properties, Entity Framework dynamically creates some custom type in order to be able to implement lazy behavior underneath.

Now let’s suppose I have a need to modify something in an object retrieved from a database. It turns out I can modify Value property, which is of pure string type. What is more, I can also modify Parent property, but… the modification is not preserved!. This program throws at line 25 because an assignment from line 24 is silently ignored by the framework.

I actually have been trapped by this when I was in need of modifying some collection in complicated object graph. I am deeply disappointed the Entity Framework on one hand allows modification of non-virtual properties, but on the other hand it ignores virtual ones. This can make a developer run into a trouble.

Of course I am aware it is not good practice to work on objects of data model classes. I recovered myself from this situation with AutoMapper. But this is kind of a quirk, and a skilled developer has to hesitate to even try to modify something returned by Entity Framework.

using System;
using System.Data.Entity;
using System.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main()
        {
            using (var db = new TestDBContext())
            {
                var t = new Test { Value = "Hello" };
                var c = new TestChild { Value = "Hello from child", Parent = t };
                db.TestChildren.Add(c);
                db.SaveChanges();
            }
            using (var db = new TestDBContext())
            {
                // Type of c is System.Data.Entity.DynamicProxies.TestChild_47042601AE8E209C11CC25521C2746A2B9D93EC625A6F20BA3D60926278A3D21}
                var c = db.TestChildren.First();
                c.Value = string.Empty;
                if (c.Value != string.Empty) throw new Exception();
                c.Parent = null;
                if (c.Parent != null) throw new Exception();
            }
        }
    }

    public class TestDBContext : DbContext
    {
        public TestDBContext() : base("Name=default")
        {
            Database.SetInitializer<TestDBContext>(new CreateDatabaseIfNotExists<TestDBContext>());
        }
        public DbSet<Test> Tests { get; set; }
        public DbSet<TestChild> TestChildren { get; set; }
    }

    public class Test
    {
        public int Id { get; set; }
        public string Value { get; set; }
    }

    public class TestChild
    {
        public int Id { get; set; }
        public int TestId { get; set; }
        public string Value { get; set; }
        public virtual Test Parent { get; set; }
    }

}

Kdiff3 as Git mergetool and –auto pitfall

In Git, when merging you can sometimes observe the behavior when a conflict has been somehow auto-magically solved without any user interaction at all and without displaying resolution tool window. I have experienced this with commonly used Kdiff3 tool set up as the default mergetool. It turns out that Git has hardcoded --auto option when invoking Kdiff3. This option instructs Kdiff3 to perform merge automatically and silently and display the window only if it is not able to figure out conflict resolutions itself. My understanding is, that it is intended solely for trivial cases and most of the time the window is displayed anyway. However, I am writing this post obviously because such “feature” once has got me into trouble — the tool fixed conflict, but should not have and, of course, did it wrong. In my opinion it is undoubtedly better to always make the decision on your own, instead of relying on some not well known logic of the tool.

The solution is to define custom tool in .gitconfig with Kdiff3 executable and custom command line parameters. Here is the configuration which I use on Windows:

[merge]
    tool = kdiff3NoAuto
    conflictstyle = diff3

[mergetool "kdiff3NoAuto"]
    cmd = C:/Program\\ Files\\ \\(x86\\)/KDiff3/kdiff3.exe --L1 \"$MERGED (Base)\" --L2 \"$MERGED (Local)\" --L3 \"$MERGED (Remote)\" -o \"$MERGED\" \"$BASE\" \"$LOCAL\" \"$REMOTE\"

One thing cmd.exe is better at than *nix shells (with default configuration)

I know the statement might be considered controversial. I even encourage you to try to prove me wrong, because I wish I knew better solution.

In my every day work I tend to use command prompt a lot. I have both cmd.exe and bash (from Git for Windows) opened all the time. My typical environment comprises numerous directories, I mean more than one hundred. Many of them share parts of their names. The names are long, dozens of characters. I have to move between them over and over again. The problem is that it is not feasible to type longish directory name many times manually.

Now, let’s suppose we have the following (shortened, of course) subdirectories structure of a directory which we are in at the moment:

..
aa
bbaa
ccaadd
eeaaff

When I would like to change the directory to bbaa in cmd.exe I type cd *aa* <Tab> <Tab>, I get the second result of auto completion, I press <Enter> and I have moved to bbaa. If I press <tab> three times, I get ccaadd. And when I press <tab> four times, I get eeaaff. This feature is brilliant. The auto completion works with wildcards and matches not only beginnings of a name. Last, but not least, it allows to cycle through suggestions while in-line editing a command.

The most important part here is: not only beginning of a name (which is, as far as I know, the behavior of a typical Unix shell) AND also ability to have the suggestions inserted in place, not only displayed them below the command prompt.

A Unix shell also does match wildcards, but only displays matched names. It does not offer (or I am not aware of it) a way to instantaneously pass matched name to a command. It only lists relevant suggestions and a user have to then manually re-edit the command so that it has desired argument. cmd.exe is better in that it allows a user to cycle through suggestions while in-line editing command argument. Which is great when it comes to long names of which only some parts can be conveniently memorized by a human.

I propose the following function which could be appended to .bashrc.

function cdg() { ls -d */ | grep -i "$1" | awk "{printf(\"%d : %s\n\", NR, \$0)}"; read choice; if [ "$choice" == "0" ]; then : ; else cd "`ls -d */ | grep -i \"$1\" | awk \"NR==$choice\"`"; fi; }

It is a simplistic function that searches through ls results with grep, parses them with awk and finally picks one of them and calls cd.

Now we can type cdg aa and we get all possible choices:

1 : aa/
2 : bbaa/
3 : ccaadd/
4 : eeaaff/

We simply type the number and we are done being moved to the desired directory. Without the need to manually re-enter the cd command with proper argument. Obviously, in cmd.exe we get this nice auto completion for every command typed in the interpreter, and my solution only solves changing directory use case in bash.

2014.02.02 UPDATE 1: After some deeper research, it turned out that the behavior of cmd.exe can be achieved in bash as well. The following line should be included into .bashrc:

bind '"\t":menu-complete'

However, my solution still serves the purpose as it uses grep -i which makes it case insensitive and thus renders it useful.

2014.02.06 UPDATE 2: I have experienced the second reason my solution is still relevant. It is much faster than pressing <tab> and waiting for the shell to suggest names. This can be observed in an environment with number of directories greater than a few, where MinGW tooling tends to be slow in general.

Python script executed by cron crashes when printing Unicode string

I have written a little Python script for my personal purposes and scheduled it to be run on Raspberry Pi by cron. After some polishing work I was pretty sure it worked well and was successfully tested. What I mean by to be tested is to be executed manually from the shell and to observe the expected results. So far so good.

However when the script was run by cron, it failed at the line where it prints some string containing Unicode characters. The line executes normally when running from the shell. I suspected there is some issue with the standard output of the processes run by the cron, because in such case there is no meaningful notion of standard output.

As it turns out, programs executed by cron have no terminal attached to their standard output. According to this Stack Exchange post the standard output is sent to the system’s mail system and thus delivered to a user this way. One can easily verify this by issuing tty command in the cron and redirecting the output to a file. Something similar to this is not a terminal (message translated directly from my system with Polish locale) should be observed.

The further explanation goes as follows: if there is no terminal attached to the process, the Python interpreter cannot detect the encoding of the terminal (it has nothing to do with the system’s environmental variables describing locale. They are all global and are part of the process’ environment, but they do not affect the non-terminal device being attached to standard output). You can verify this by running a Python script that tries to output terminal’s encoding: print sys.stdout.encoding. The None can be observed. So, the interpreter falls back to Ascii encoding and crashes when printing Unicode characters.

The solution in this case was to enforce the interpreter to use Unicode encoder for standard output.

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

The output is discarded at all, but these lines prevent the interpreter from using default Ascii encoder which is not appropriate for printing Unicode string.