Unix tools always work. Even when Windows ones don’t



In the video above, you can see, that we cannot drag and drop a file onto a PowerShell script. Conversely, we can easily do this with bash script having installed Git for Windows.

I needed to perform some trivial conversion within a SQL script, i.e. replace ROLLBACK with COMMIT. I thought I would implement it with PowerShell. I am not going to comment on the implementation itself, even though it turned out to be not that obvious. Then I realized, it would be nice, if I could drag and drop a file in question to apply the conversions on it.

This does not work with default configuration of PowerShell. I did not have time to hack it somewhere in the registry, as I assume it is doable. I switched to old, good bash shell instead.

It’s a pity I couldn’t do that with Windows native scripting technology. It is very interesting, that MinGW port of shell has been so carefully implemented, that even dragging and dropping works in non-native environment.

I recall the book Pragmatic Programmer: From Journeyman To Master. There is a whole subchapter about the power of Unix tools. The conclusion was, that over time we would come across plenty of distinguished file formats and tools to manipulate data stored with them. Some of them may become forgotten and abandoned years later, making it difficult to extract or modify the data. Some may work only on specific platforms.

But standard Unix tools like shell scripting, Perl, AWK will always exist. I should say: not only will they always exist, but also they will thrive and support almost every platform you can imagine. They work on plain text, which is easy to process. I am a strong proponent of before-mentioned technologies and I have successfully used them many times in everyday work. It sounds particularly strange among .NET developers, but this what it looks like. The PowerShell simply did not do the trick for me. Perl did. As it always does.

For the sake of the future reference I am including the actual scripts:

PowerShell script:

if (!$args[0])
{
  Write-Host "Please provide target file"
  pause
  exit 1
}

(gc -raw $args[0]) -replace "`nROLLBACK;", "`nCOMMIT;" | sc $args[0]

Bash script running Perl:

#!/bin/sh
perl -pi -w -e 's/^ROLLBACK;/COMMIT;/g;' "$1"

Basic example of SQL Server transaction deadlock with serializable isolation level

Today I am demonstrating a deadlock condition which I came across after I had accidentally increased isolation level to serializable. We can replay this condition with the simplest table possible:

CREATE TABLE dbo.test
(
  Id INT,
  Val varchar(8)
);
GO
INSERT INTO dbo.test VALUES (1, 'aa');
GO

Now let’s open SSMS (BTW, since version 2016 there is an installer decoupled from SQL Server available here) with two separate tabs. Please consider, that each tab has its own connection. Then execute the following statement to increase isolation level in both tabs:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
GO

The following code tries to resemble application level function which does a bunch of possibly time consuming things. These are emulated with WAITFOR instruction. But the point is that the transaction does both SELECT and UPDATE on the same table having those time consuming things in between.

BEGIN TRAN;
  SELECT * FROM dbo.test;
  WAITFOR DELAY '00:00:10';
  raiserror ('after wait', 10,1) with nowait;
  UPDATE [dbo].[test] SET [Val] = 'cccccc' WHERE [Id] = 1;
COMMIT TRAN;

Let’s put the code in both tabs and then execute one tab and the second tab. After waiting more than 10 seconds, which is the delay in code, we will observe an error message on the first tab:

Msg 1205, Level 13, State 56, Line 8
Transaction (Process ID 54) was deadlocked on lock resources with another 
process and has been chosen as the deadlock victim. Rerun the transaction.

This situation occurred in a web application where concurrent execution of methods is pretty common. For application developer it is very easy to be tricked into thinking that having set SERIALIZABLE isolation level we magically make sequential execution of our SQL code. But this is wrong. By setting SERIALIZABLE level we do not automatically switch the behavior of the code wrapped with transaction to the behavior of lock statement known from C# (technically lock is a monitor).

I would advise having a closer look at the instructions wrapped in transaction. In real application the execution flow is much more `polluted` with an ORM calls, but my simplified code from above just tries to model common scenario of reads followed by writes. What happens here is that SQL Server takes a reader lock on the table after executing the SELECT. When we execute the code again in another session we have one more reader lock taken on the table. Now when the first session passes waitfor and comes to UPDATE it needs to take a writer lock and waits (I am purposely using generic vocabulary instead of SQL Server specific one — these locks inside database engine all have their names). We observe the first tab waits more than 10 seconds. This is because when the first tab reaches its UPDATE it needs to take writer lock, but it is locked by the SELECT in the second tab. Conversely, the second’s tab UPDATE waits for the lock taken by the SELECT in the first tab. This is deadlock which fortunately is detected by the engine.

The problem is caused by the lock taken witch SELECT instruction having SERIALIZABLE isolation level set. The lock is not taken in this place with READ COMMITED which is the default level.

I am writing about this for the following reasons:

  • This is very simple scenario from the application point of view: to read some data, update the data, do some things and have all of this wrapped with a transaction.
  • It is very easy to make wrong assumption, that SERIALIZABLE level guarantees that our SQL code will be executed sequentially. But it only guarantees, that if the transactions execute, their observable effects will be as if they both executed sequentially i.e. one after another. But it is your job to make them actually execute not run into a deadlock.

‘Differs only in casing’ class of problems

Error message from TypeScript compiler

This one attacked me yesterday by the end of the day. This is a TypeScript project with some instrumentation done with gulp. The command shown attempts to launch TypeScript compiler. The console is standard git bash for Windows. I didn’t realize what was going on there at the first sight. The messages were confusing and mysterious. I made sure I didn’t have any changes in the repository comparing to the HEAD. I opened the folder in Visual Studio Code and everything was fine. It was only the compiler which saw the problems.

The next day I figured out there is a small detail in my path. Just take a look at directory names casing:

TypeScript compiler ran successfully

After changing the directory using names with proper casing everything was fine. What actually happened was that, I somehow opened the git bash providing the current directory name using lower case letters. It is not that difficult to do that. This is a flaw in the MINGW implementation of bash, I think. It just allows you to change the directory using its name written with whatever case. That is not the problem itself, because cmd.exe allows you to do so as well. The problem is that it then stores the path (probably in PWD environment variable). Some cross-platform tools, which are case sensitive on other platforms, when executed from git bash with mismatched working directory casing may begin to treat such path as separate one, different from the original one. Especially tools which process files using their paths relative to the current directory, like TypeScript compiler for instance.

This can possibly be a wider class of problems and I guess there are other development tools which behave like that when launched from git bash under before mentioned conditions.

In C# interface implementations are not inherited

It may be obvious to some readers, however I was a little bit surprised when I discovered that. Actually, I realized this by looking at a non-trivial class hierarchy in real world application. One can easily think that discussion about inheritance is kind of theoretical area and it primarily appears during job interviews, but it is not true. I will explain real use case and real reasoning behind this hierarchy later in this post, now please take a look at the following program. Generally, the point is that 1) we have to use reference of an interface type and we want more than one specialized implementations of the interface 2) we need to have class B inherit from class A. Without the second requirement it would be obvious: it would be sufficient just to write two separate implementations of IActivity and we are done.

using static System.Console;

namespace ConsoleApplication2
{
    public interface IActivity
    {
        void DoActivity();
    }

    public class A : IActivity
    {
        public void DoActivity()
        {
            WriteLine("A does activity");
        }
    }

    public class B : A
    {
        public new void DoActivity()
        {
            WriteLine("B does activity");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            IActivity ia = new B();
            ia.DoActivity();
            ReadKey();
        }
    }
}

It prints A does activity despite ia variable storing reference to an object of type B and also despite explicitly declaring hiding of base method. It was not clear to me why is it so. It is obvious that the type B has its own implementation, so why is it not run here? To overcome this I initially created base class declared as abstract:

using static System.Console;

namespace ConsoleApplication2
{
    public abstract class Base
    {
        public abstract void DoActivity();
    }

    public interface IActivity
    {
        void DoActivity();
    }

    public class A : Base, IActivity
    {
        public override void DoActivity()
        {
            WriteLine("A does activity");
        }
    }

    public class B : A
    {
        public override void DoActivity()
        {
            WriteLine("B does activity");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            IActivity ia = new B();
            ia.DoActivity();
            ReadKey();
        }
    }
}

It prints B does activity, but it also is overcomplicated. Then I came up with simpler solution — it turns out we have to explicitly mark class B as implementing IActivity.

using static System.Console;

namespace ConsoleApplication2
{
    public interface IActivity
    {
        void DoActivity();
    }

    public class A : IActivity
    {
        public void DoActivity()
        {
            WriteLine("A does activity");
        }
    }

    public class B : A, IActivity
    {
        public new void DoActivity()
        {
            WriteLine("B does activity");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            IActivity ia = new B();
            ia.DoActivity();
            ReadKey();
        }
    }
}

It prints B does activity, but it is not perfect. Method hiding is not a good practice. Finally I ended up with more elegant (and the simplest, I guess) solution:

using static System.Console;

namespace ConsoleApplication2
{
    public interface IActivity
    {
        void DoActivity();
    }

    public class A : IActivity
    {
        public virtual void DoActivity()
        {
            WriteLine("A does activity");
        }
    }

    public class B : A, IActivity
    {
        public override void DoActivity()
        {
            WriteLine("B does activity");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            IActivity ia = new B();
            ia.DoActivity();
            ReadKey();
        }
    }
}

In here we are using virtual and override modifiers to clearly specify an intention of specializing the method. It is better than previous one, because just by looking at class A we are already informed that further specialization is going to occur.

The real usage scenario is that we have an interface representing Entity Framework Core context. We have two distinct implementations: one uses real database and the other uses in-memory database for tests. The latter inherits from the former because what inheritance is all about is copying code. We just want to have the same methods for in-memory implementation like for regular one, but with some slight modifications e.g. in methods executing raw SQL. We also have to use an interface to refer to these objects, because this is how dependency injection containers work.

As you can see, what might seem purely theoretical, actually is used in real world line of business application. Although I have been pretty confident I understand the principles of inheritance in object oriented programming since my undergrad computer science lectures, but as I mentioned, I was surprised discovering that we have to explicitly put : IActivity on class B. The implementation has already been there. Anyway, this example encourages me to be always prepared to verify assumptions I make.

Beware LINQ Group By custom class

Actually this one is pretty obvious. But if you are focused on implementing some complex logic, it is easy to forget about this requirement. Let’s assume there is following piece of code:

var processedData = 
    from q in rawDataFromDb
    group q by new ProductCodeCentreNumPair 
        { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

public class ProductCodeCentreNumPair
{
    public string ProductCode;
    public string CentreNumber;
}

This will compile and run, however it will not distinguish ProductCodeCentreNumPair instances. I.e. it will not do the grouping, but produce a sequence of IGroupings, each for corresponding source item. The reason is self-evident, if we try to think for a while. This custom class does not have custom equality comparison logic implemented. Default logic is based on ReferenceEquals so, as each separate object resides at different memory address, they all will be recognized as not equal to each other. Even if they contain the same values in their fields (strings behave like value types, although they are reference types). I used the following set of overridden methods to provide my custom equality comparison logic to solve the problem. It is important to note, that GetHashCode is also needed in order for the grouping to work.

public override bool Equals(object other)
{
    if (ReferenceEquals(null, other)) return false;
    if (ReferenceEquals(this, other)) return true;
    if (this.GetType() != other.GetType())
    	return false;
    return Equals((ProductCodeCentreNumPair)other);
}

public override int GetHashCode()
{
    return ProductCode.GetHashCode() + CentreNumber.GetHashCode();
}

public bool Equals(ProductCodeCentreNumPair other)
{
    var result =
    other.CentreNumber == CentreNumber &&
    other.ProductCode == ProductCode
    ;
    return result;
}

Alternatively you can use anonymous types, I mean:

var processedData = 
    from q in rawDataFromDb
    group q by new { ProductCode = q.Code, CentreNumber = q.CentreNumberIdentifier } into g
    select g
    ;

will just work. This is because instances of anonymous classes have automatically generated equality comparison logic based on values of their fields. Contrary to ReferenceEquals based implementation generated for typical named classes. They are most frequently used in the context of comparisons, so it seems reasonable.

One more alternative is to use a structure instead of a class. But structures should only be used if their fields are value types, because only then you can benefit from binary comparison of their value. And even having structs instead of classes requires implementing custom GetHashCode. By not implementing it, there is a risk that 1) the auto-generated implementation will use reflection or 2) will not be well distributed across int leading to performance problems when adding to HashSet.

When profiling a loop, look at entire execution time of that loop

Today’s piece of advice will not be backed by concrete examples of code. It will be just loose observations of some profiling session and related conclusions. Let’s assume there is a C# code doing some intense manipulations over in-memory stored data. The operations are done in a loop. If there is more data, there are more iterations of the loop. The aim of the operations is to generate kind of summary of data previously retrieved from a database. It has been proved to work well in test environments. Suddenly, in production, my team noticed this piece of code takes about 20 minutes to execute. It turned out there were about 16 thousand iterations of the loop. It was impossible to test such high load with manual testing. The testing only confirmed correctness of the logic itself.

After an investigation and some experiments it turned out that bad data structure was to blame. The loop did some lookup-heavy operations over List<T>. Which are obviously O(n), as list is implemented over array. Substitution of List<T> for Hashset<T> caused dramatic reduction of execution time to a dozen or so of seconds. This is not as surprising as it may seem because Hashset<T> is implemented over hashtable and has O(1) lookup time. These are some data structures fundamentals taught in every decent computer science lecture. The operation in a loop looked innocent and the author did not bother to try to anticipate future load of the logic. A programmer should always keep in mind the estimation of the amount of data to be processed by their implementation and think of appropriate data structure. By appropriate data structure I mean choice between fundamental structures like array (vector), linked list, hashtable, binary tree. This can be the first important conclusion, but I encouraged my colleagues to perform profiling. The results were not so obvious.

Although the measurement of timing of the entire loop with 16 thousand iterations showed clearly that hashtable based implementation performs orders of magnitude better, but when we stepped over individual loop instructions there was almost no difference in their timings. The loop consisted of several .Where calls over entire collection. These calls took something around 10 milliseconds in both List<T> and Hashset<T> implementations. If we had not bothered to measure entire loop execution, it would have been pretty easy to draw conclusion there is no difference! Even performing such step measurement during the development can me misleading, because at first sight, does 10 milliseconds look suspicious? Of course not. Not only does it look unsuspicious but also it works well. At least in test environment with test data.

As I understand it, the timings might have been similar, because we measured only some beginning iterations of the loop. If the data we are searching for are at the beginning of the array, the lookup can obviously be fast. For some edge cases even faster than doing hashing and looking up corresponding list of slots in a hashtable.

For me there are two important lessons learned here:

  • Always think of data structures and amount of data to be processed.
  • When doing performance profiling, measure everything. Concentrate on amortized timings of entire scope of code in question. Do not try to reason about individual subtotals.