Archive for General programming

The Boy Scout rule and little things that matter

There is a rule called The Boy Scout Rule. In essence, it says that whenever you attempt to modify code, you should also try to improve something in the existing code so that you leave it a little bit better than it was before. By following the rule we can gradually and seamlessly get rid of technical debt and prevent deterioration in software systems.

The rule can be addressed at the organizational level in an interesting way. I have come across an idea from a project manager who was responsible for multiple teams dealing with significant amounts of legacy code. They introduced a kind of gamification into the development process. The teams were supposed to register as many code improvements as they could, and the team with the highest number won the game. The prize was some extra budget to spend on a team party. Such an idea may not be applicable in all organizations, but it clearly shows how to methodically approach the problem of technical debt at the management level.

Although I do not immediately recommend the idea of gamification, I certainly recommend creating a static ticket (one not assigned to any sprint) for all such improvements and encouraging developers to make even the smallest refactoring commits under that ticket during their normal development tasks. Below I would like to show some basic indicators that, in my opinion, qualify for being improved as soon as they are discovered.

  1. Improper naming causing an API ambiguity

    I see a few problems here. When I first saw client code calling GetValue, I thought it returned some custom, domain-specific type. I needed to find a method returning a string and I skipped GetValue, because it did not look like it returned a string. It was only later that I realized it actually does return a string. If it returns a string, it should be named appropriately.

    A more general observation here is that we have three ways of converting the type into a string. In my particular case I had 10 usages of GetValue, 45 usages of the operator and 0 usages of ToString in the codebase. When talking to the maintainers, I was told there was a convention not to use the ToString method. That situation clearly shows that some adjustments are needed both at the level of code quality and at the level of the development process. I have nothing against operator overloading; however, it is not used very frequently in business code. Code readability is a top priority in such a case, and being as explicit as possible is in fact beneficial from the perspective of long-term maintenance.

    The unused method should obviously be removed, and the one returning a string should be named ToString. I would keep the overloaded operators, because why not, but I am still a little hesitant about using them in new code. It is a cool language feature when you write code, but it appears not so cool when you have to read it. Even here, I would consider sacrificing the aesthetics of the operator in favor of a plain ToString.

  2. Misused pattern causing an API ambiguity

    This one is very similar to the previous one, as it boils down to the fact that we can instantiate an object in two ways. When I was introducing some modifications to the code, I first had to answer the question: should I use the constructor or the Create method? Of course, it turns out there is indeed a slight difference, because Create returns a result object, which is a common way to model logic in a somewhat functional way. But still, at the level of the API surface we do not see the difference clearly enough.

    The gist of that case is that there is a pattern in tactical (I mean, at the level of the actual source code) Domain-Driven Design to use private constructors and provide static factory methods. The purpose is primarily to prevent default construction of an object, which would leave it in a default state that is not meaningful from the business point of view. Also, factory methods can have more expressive names to indicate specific extra tasks they perform.

    The constructor should be made private and the factory method can be named CreateAsResult, if the wrapper type is prevalent in the code base.
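Both recommendations can be sketched in a few lines. The original snippets were C# and are not reproduced here, so this is only a Python illustration under assumptions: CustomerName, Result and create_as_result are hypothetical names, and the token check merely imitates a private constructor, which Python does not have. The type offers exactly one way to build an instance and exactly one way to turn it into a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """Hypothetical result wrapper: either a value or an error message."""
    value: Optional["CustomerName"] = None
    error: Optional[str] = None

    @property
    def is_success(self) -> bool:
        return self.error is None


class CustomerName:
    """Hypothetical value type with a single construction path."""

    _token = object()  # stands in for a private constructor

    def __init__(self, token, text):
        # Direct construction is rejected so that callers go through the factory
        if token is not CustomerName._token:
            raise TypeError("use CustomerName.create_as_result()")
        self._text = text

    @classmethod
    def create_as_result(cls, text: str) -> Result:
        # The name says what comes back: a Result, not a bare CustomerName
        cleaned = text.strip()
        if not cleaned:
            return Result(error="name must not be empty")
        return Result(value=cls(cls._token, cleaned))

    def __str__(self) -> str:
        # The only string conversion: no GetValue twin, no conversion operator
        return self._text


created = CustomerName.create_as_result("  Alice ")
rejected = CustomerName.create_as_result("   ")
```

With a single factory and a single string conversion, client code no longer has to guess which of several equivalent-looking members to call.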

The ideas behind such improvements can actually be very simple. Some of them have to do with trivial, but extremely relevant conclusions about software engineering. For example:

  • any piece of code that slows down a programmer maintaining the code can potentially be considered not good enough
  • code is written once but read many times, so when writing code we should optimize for ease of reading

A vital part of the mindset of clearly expressing intention is proper naming. I highly recommend watching the excellent presentation CppCon 2019: Kate Gregory “Naming is Hard: Let’s Do Better”. It helps develop a proper way of thinking when writing code.

Where to put condition in SQL?

Let’s suppose I am modeling a business domain with entities A, B and C. These entities have the following properties:

  • An entity A can have an entity B and C
  • An entity A can have only entity B
  • An entity A can exist without B and C
  • An entity B has a non-nullable property Active

I am implementing the domain with the following SQL. I omit foreign key constraints for brevity.

Now let’s suppose my task is to perform a validity check according to special rules. I am given the Id of an entity A as input and I have to check:

  1. If the entity exists and
  2. If it is valid

Existence will be checked simply by looking at whether the corresponding row is present in the result set, and for the validity check I will write a simple CASE expression. These are the rules for my example data:

  • A.1 exists and has active B.10 and has C.100 => exists, correct
  • A.2 exists and has inactive B.20 and has C.200 => exists, incorrect
  • A.3 exists and has active B.30 and has C.300 => exists, correct
  • A.4 exists and has active B.40 and DOES NOT HAVE C => exists, incorrect
  • A.5 exists and HAS NEITHER B NOR C => exists, incorrect
  • A.6 does not exist, incorrect

I write the following query to do the task:

My rules include checking if B.Active is true, so I just put this into WHERE. The result is:

AId  Correct 
---- --------
1    1       
3    1       
4    0       

The problem is, I have been given the exact set of Ids of A to check: 1, 2, 3, 4, 5, 6. But my result does not include 2, 5 and 6. My application logic fails here, because it considers those A records as missing. For 6 this is fine, because it is absent from table A, but 2 and 5 must be present in the results for my validity check. The fix is extremely easy:

Now the result is:

AId  Correct 
---- --------
1    1       
2    0       
3    1       
4    0       
5    0       

It is easy to see that WHERE is applied to filter all the results, no matter what my intention for the JOIN was. When a record from A has no matching B, the LEFT JOIN fills the columns from B with NULL, so the condition in WHERE is not met and the whole row is filtered out. The same happens when B exists but is inactive. But I still need to have the A record in my results. Thus, what I have to do is to include my condition in the JOIN.

It is also very easy to fall into this trap of thoughtlessly writing all intended conditions in the WHERE clause.
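The schema and both queries are not reproduced in this archive, so the following is only a sketch of what the example could look like. It runs on SQLite through Python's sqlite3 module rather than on the original database engine. The table layout (B and C each referencing A through a hypothetical AId column) and the exact CASE expression are assumptions, chosen so that the two queries give the two result sets shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (Id INTEGER PRIMARY KEY);
    CREATE TABLE B (Id INTEGER PRIMARY KEY, AId INTEGER NOT NULL, Active INTEGER NOT NULL);
    CREATE TABLE C (Id INTEGER PRIMARY KEY, AId INTEGER NOT NULL);
    INSERT INTO A VALUES (1), (2), (3), (4), (5);
    INSERT INTO B VALUES (10, 1, 1), (20, 2, 0), (30, 3, 1), (40, 4, 1);
    INSERT INTO C VALUES (100, 1), (200, 2), (300, 3);
""")

# Condition in WHERE: rows where B is inactive or missing are dropped entirely
CONDITION_IN_WHERE = """
    SELECT A.Id AS AId,
           CASE WHEN B.Id IS NOT NULL AND C.Id IS NOT NULL THEN 1 ELSE 0 END AS Correct
    FROM A
    LEFT JOIN B ON B.AId = A.Id
    LEFT JOIN C ON C.AId = A.Id
    WHERE A.Id IN (1, 2, 3, 4, 5, 6) AND B.Active = 1
    ORDER BY A.Id
"""

# Condition in JOIN: an inactive B is treated like a missing B, but A stays in the result
CONDITION_IN_JOIN = """
    SELECT A.Id AS AId,
           CASE WHEN B.Id IS NOT NULL AND C.Id IS NOT NULL THEN 1 ELSE 0 END AS Correct
    FROM A
    LEFT JOIN B ON B.AId = A.Id AND B.Active = 1
    LEFT JOIN C ON C.AId = A.Id
    WHERE A.Id IN (1, 2, 3, 4, 5, 6)
    ORDER BY A.Id
"""

with_where = conn.execute(CONDITION_IN_WHERE).fetchall()
with_join = conn.execute(CONDITION_IN_JOIN).fetchall()

print(with_where)  # [(1, 1), (3, 1), (4, 0)]
print(with_join)   # [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0)]
```

The only difference between the two queries is where B.Active = 1 lives. A.6 is missing from both results, which is the intended signal that the entity does not exist.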

How to run Tmux in GIT Bash on Windows

Tmux running under Git Bash default terminal with two shell processes

I know everyone uses Cmder, but it didn’t work for me. It hung a few times, it has way too many options, and it has issues sending a signal to kill a process. I gave up on using it. I work with a carefully configured default Windows console and, believe it or not, it serves the purpose. I also know you can use Windows Subsystem for Linux under Windows 10, which is truly amazing, but I am just talking about the cases where you need a standard Git for Windows installation.

When I worked with Unix I liked GNU Screen, which is a terminal multiplexer. It gives you a bunch of keyboard shortcuts to create separate shell processes under the same terminal window. The problem is, it is not available under Git Bash. But it turns out that its alternative, Tmux, is.

I did a little research and found that Git Bash uses a MinGW build of GNU tools, and only selected ones. You can install the whole distribution of the tools from https://www.msys2.org/, run a command to install Tmux, and then copy some files to the installation folder of Git. This is what you do:

  1. Install the aforementioned msys2 package and run the bash shell
  2. Install tmux using the following command: pacman -S tmux
  3. Go to the msys2 directory, in my case it is C:\msys64\usr\bin
  4. Copy tmux.exe and msys-event-2-1-4.dll to your Git for Windows directory, mine is C:\Program Files\Git\usr\bin. Please note that in the future this file may have a version number higher than 2-1-4
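Steps 3 and 4 can also be scripted. The sketch below is a hypothetical helper, not part of the original instructions: copy_tmux_files is an invented name, and the glob pattern is an assumption meant to pick up whichever msys-event version pacman installed. Writing to C:\Program Files usually requires an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_tmux_files(msys_bin, git_bin):
    """Copy tmux.exe and the msys-event DLL from the MSYS2 bin directory to Git's."""
    msys_bin, git_bin = Path(msys_bin), Path(git_bin)
    # tmux.exe plus the event library, whatever its version suffix currently is
    sources = [msys_bin / "tmux.exe"] + sorted(msys_bin.glob("msys-event-*.dll"))
    copied = []
    for source in sources:
        shutil.copy2(source, git_bin / source.name)
        copied.append(source.name)
    return copied


# Example call with the directories used in this post:
# copy_tmux_files(r"C:\msys64\usr\bin", r"C:\Program Files\Git\usr\bin")
```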

And you are ready to go. Please note that I do this on 64-bit installations of Git and MSYS. Now, when you run Git Bash, enter tmux. My most frequently used commands are:

  • CTRL+B, (release and then) C: create a new shell within the existing terminal window
  • CTRL+B, N: switch between shells
  • CTRL+B, a digit: switch to the chosen shell by the corresponding number
  • CTRL+B, ": split the current window horizontally into panes, one above the other (panes are inside windows)
  • CTRL+B, o: switch between panes in the current window
  • CTRL+B, x: close the pane

This is everything you need to know to start using it. Simple. There are many other options which you can explore yourself, for example here http://hyperpolyglot.org/multiplexers.

Update 1: Users in the comments report that the method does not always work. If you have any experience with this method, please feel free to comment, so that we can figure out the circumstances under which it works.

Update 2: I managed to run this on Windows 7, Windows 2012 R2 and Windows 10. My Git installation is set up to use the MinTTY console and tmux works only when run from this console, not from the default Windows command line console. I still haven’t figured out what the precise requirements for this trick are.

UPDATE with JOIN subtle bug

I have been diagnosing a very subtle bug in SQL code which led to unexpected results. It happens under rare circumstances, when you do an update with a join and you want to increase some number by one. You just write value = value + 1. The thing is, you want to increase the value by the number of joined rows, and the SQL code kind of expresses that intent. However, what actually happens is that the existing value is read only once. With three joined rows the row is indeed updated 3 times, but each time with the same value, incremented only by one.
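The effect can be reproduced in miniature. The sketch below uses SQLite through Python's sqlite3 module instead of the original database engine, and the Counter and Event tables are hypothetical. SQLite supports UPDATE ... FROM only since version 3.33.0, so the naive statement is skipped on older builds. SQLite also handles multiple matches differently under the hood (it uses only one of the joined rows), but the visible outcome is the same: the value grows by one, not by three. Counting the joined rows explicitly expresses the real intent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Counter (Id INTEGER PRIMARY KEY, Value INTEGER NOT NULL);
    CREATE TABLE Event (Id INTEGER PRIMARY KEY, CounterId INTEGER NOT NULL);
    INSERT INTO Counter VALUES (1, 0);
    INSERT INTO Event VALUES (1, 1), (2, 1), (3, 1);
""")


def counter_value():
    return conn.execute("SELECT Value FROM Counter WHERE Id = 1").fetchone()[0]


# Naive version: three joined rows, but the old value is read only once
naive_value = None
if sqlite3.sqlite_version_info >= (3, 33, 0):
    conn.execute("""
        UPDATE Counter SET Value = Value + 1
        FROM Event
        WHERE Event.CounterId = Counter.Id
    """)
    naive_value = counter_value()
    conn.execute("UPDATE Counter SET Value = 0")  # reset for the second attempt

# Explicit version: add the number of matching rows in a single assignment
conn.execute("""
    UPDATE Counter
    SET Value = Value + (SELECT COUNT(*) FROM Event WHERE Event.CounterId = Counter.Id)
""")
explicit_value = counter_value()

print(naive_value, explicit_value)
```

On a build that supports UPDATE ... FROM, the naive statement leaves the counter at 1, while the explicit one brings it to 3.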

Unix tools always work. Even when Windows ones don’t



In the video above, you can see that we cannot drag and drop a file onto a PowerShell script. Conversely, we can easily do this with a bash script once Git for Windows is installed.

I needed to perform a trivial conversion within a SQL script, i.e. replace ROLLBACK with COMMIT. I thought I would implement it with PowerShell. I am not going to comment on the implementation itself, even though it turned out to be not that obvious. Then I realized it would be nice if I could drag and drop the file in question onto the script to apply the conversion to it.

This does not work with the default configuration of PowerShell. I assume it is doable by hacking something in the registry, but I did not have time for that. I switched to the good old bash shell instead.

It’s a pity I couldn’t do that with the native Windows scripting technology. It is very interesting that the MinGW port of the shell has been implemented so carefully that even dragging and dropping works in a non-native environment.

I recall the book The Pragmatic Programmer: From Journeyman to Master. There is a whole subchapter about the power of Unix tools. The conclusion was that over time we would come across plenty of different file formats and tools to manipulate the data stored in them. Some of them may become forgotten and abandoned years later, making it difficult to extract or modify the data. Some may work only on specific platforms.

But standard Unix tools like shell scripting, Perl and AWK will always exist. I should say: not only will they always exist, but they will also thrive and support almost every platform you can imagine. They work on plain text, which is easy to process. I am a strong proponent of the aforementioned technologies and I have successfully used them many times in everyday work. It sounds particularly strange among .NET developers, but this is what it looks like. PowerShell simply did not do the trick for me. Perl did. As it always does.

For future reference I am including the actual scripts:

PowerShell script:

Bash script running Perl:

Basic example of SQL Server transaction deadlock with serializable isolation level

Today I am demonstrating a deadlock condition which I came across after I had accidentally increased the isolation level to serializable. We can replay this condition with the simplest table possible:

Now let’s open SSMS (BTW, since version 2016 there is an installer decoupled from SQL Server available here) with two separate tabs. Keep in mind that each tab has its own connection. Then execute the following statement to increase the isolation level in both tabs:

The following code tries to resemble an application-level function which does a bunch of possibly time-consuming things. These are emulated with the WAITFOR instruction. The point is that the transaction does both a SELECT and an UPDATE on the same table, with those time-consuming things in between.

Let’s put the code in both tabs and then execute the first tab and then the second. After waiting more than 10 seconds, which is the delay in the code, we will observe an error message on the first tab:

Msg 1205, Level 13, State 56, Line 8
Transaction (Process ID 54) was deadlocked on lock resources with another 
process and has been chosen as the deadlock victim. Rerun the transaction.

This situation occurred in a web application, where concurrent execution of methods is pretty common. For an application developer it is very easy to be tricked into thinking that by setting the SERIALIZABLE isolation level we magically make our SQL code execute sequentially. But this is wrong. By setting the SERIALIZABLE level we do not automatically switch the behavior of the code wrapped in a transaction to the behavior of the lock statement known from C# (technically lock is a monitor).

I would advise having a closer look at the instructions wrapped in the transaction. In a real application the execution flow is much more `polluted` with ORM calls, but my simplified code from above just tries to model the common scenario of reads followed by writes. What happens here is that SQL Server takes a reader lock on the table after executing the SELECT. When we execute the code again in another session, one more reader lock is taken on the table. Now when the first session passes WAITFOR and comes to the UPDATE, it needs to take a writer lock and waits (I am purposely using generic vocabulary instead of the SQL Server specific one; these locks inside the database engine all have their names). We observe that the first tab waits more than 10 seconds. This is because when the first tab reaches its UPDATE it needs to take a writer lock, but the table is still locked by the SELECT in the second tab. Conversely, the second tab’s UPDATE waits for the lock taken by the SELECT in the first tab. This is a deadlock, which fortunately is detected by the engine.

The problem is caused by the lock taken with the SELECT instruction while the SERIALIZABLE isolation level is set, because that lock is held until the end of the transaction. With READ COMMITTED, which is the default level, the lock is not held at this point.
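The mechanics can be modeled without a database. The sketch below is a toy simulation in Python, not SQL Server's actual lock manager: the class, its methods and the hold_until_commit flag are all hypothetical, and the real engine has many more lock modes. It only shows why two sessions that both keep a reader lock and then both ask for a writer lock end up waiting for each other.

```python
class TableLocks:
    """Toy lock table for a single resource with a wait-for graph."""

    def __init__(self):
        self.readers = set()   # sessions holding a reader lock
        self.writer = None     # session holding the writer lock, if any
        self.waits_for = {}    # session -> sessions it is blocked by

    def select(self, session, hold_until_commit):
        # The reader lock survives the SELECT only if the isolation level keeps it
        if hold_until_commit:
            self.readers.add(session)

    def update(self, session):
        # A writer lock conflicts with every lock held by another session
        blockers = self.readers - {session}
        if self.writer not in (None, session):
            blockers = blockers | {self.writer}
        if blockers:
            self.waits_for[session] = blockers
            return "waiting"
        self.writer = session
        return "granted"

    def deadlocked_sessions(self):
        # A session is deadlocked if following wait-for edges leads back to itself
        def waits_on(start, target, seen):
            for other in self.waits_for.get(start, ()):
                if other == target:
                    return True
                if other not in seen:
                    seen.add(other)
                    if waits_on(other, target, seen):
                        return True
            return False

        return sorted(s for s in self.waits_for if waits_on(s, s, set()))


def run_two_tabs(hold_until_commit):
    locks = TableLocks()
    locks.select("tab1", hold_until_commit)
    locks.select("tab2", hold_until_commit)
    first = locks.update("tab1")
    second = locks.update("tab2")
    return first, second, locks.deadlocked_sessions()


serializable = run_two_tabs(hold_until_commit=True)
read_committed = run_two_tabs(hold_until_commit=False)

print(serializable)    # ('waiting', 'waiting', ['tab1', 'tab2'])
print(read_committed)  # ('granted', 'waiting', [])
```

In the second run one tab still waits for the other's writer lock, but there is no cycle, so it simply proceeds once the first transaction finishes.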

I am writing about this for the following reasons:

  • This is a very simple scenario from the application point of view: read some data, update the data, do some things, and have all of this wrapped in a transaction.
  • It is very easy to make the wrong assumption that the SERIALIZABLE level guarantees that our SQL code will be executed sequentially. It only guarantees that if the transactions execute, their observable effects will be as if they had executed sequentially, i.e. one after another. It is still your job to make them actually execute without running into a deadlock.