The bookmarks problem

I have been using Mozilla-based web browsers since 2003. Back in the day the application was called Mozilla Suite; then in 2004 Firefox showed up, using the same engine but with a completely new front end. I have migrated my profile many times over the years, but I have always kept my bookmarks. Some of them surely remember those early days before Firefox (although the majority of the oldest ones are no longer valid, because the sites were shut down). The total number of bookmarks I have gathered over that time is over 1,000. And this is the problem.

I have made several attempts to clean up and organise this huge collection. I have tried removing dead bookmarks and grouping the rest in folders. I have tried using keywords and descriptions to be able to search more effectively. None of it worked. I now have about a dozen folders, but I still find myself in trouble when I need to find a particular piece of information. The problem boils down to this: I remember exactly what the site is about and I am absolutely sure I have it in my collection, but I cannot find it because either it has some strange title or the words in its URL are meaningless (Firefox searches only within titles and URLs, because obviously that is all it has to work with).

I realized I needed a tool that is much more powerful when it comes to searching bookmarks. I could not find anything that satisfied my requirements, so I implemented it myself. Today I am introducing BookmarksBase, an open source tool written in C# that solves this issue.

BookmarksBase.Search

BookmarksBase embraces a concept that may seem ridiculous: why don’t we pull all the textual content from all the sites in our bookmarks? Do you think that is a lot of data? How much would it be? Even if you had to sacrifice a few hundred megabytes in order to be able to search really effectively, wouldn’t it be worth the space?

Well, it turns out it takes much less space than I originally expected, and the tool works surprisingly fast, although it is implemented in managed code without any special optimizations. First we have to run a separate tool to collect the data (BookmarksBase Importer). Downloading and parsing takes maybe a minute or two. The resulting index file, bookmarksbase.xml, contains all the text from all the bookmarked sites and in my case is only 12 MiB (for over 1,000 bookmarks). Then we can run BookmarksBase Search, which performs the actual searching within contents, addresses and titles. Of course, once bookmarksbase.xml has been created you can use whatever serves the purpose, e.g. grep, findstr (on Windows) or any decent text editor that can handle large amounts of text. I crafted the XML so that it is easily readable by a human: there are line breaks, and the text is kept in a neat fixed-width column (thanks to Lynx; see the source for details).
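Because the index is ordinary text, a grep-like search over it takes only a few lines of C#. The following is a minimal sketch, not the actual BookmarksBase Search implementation: the class and method names are made up for illustration, and it treats the file purely as lines of text without assuming anything about the XML schema.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of BookmarksBase.
public static class IndexGrep
{
    // Returns (line number, line text) pairs for every line that
    // contains the term, ignoring case. Line numbers start at 1.
    public static List<KeyValuePair<int, string>> FindMatches(IEnumerable<string> lines, string term)
    {
        var matches = new List<KeyValuePair<int, string>>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(new KeyValuePair<int, string>(lineNumber, line));
            }
        }
        return matches;
    }

    // File.ReadLines streams the file lazily, so the whole index
    // does not have to be loaded into memory at once.
    public static List<KeyValuePair<int, string>> FindInFile(string path, string term)
    {
        return FindMatches(File.ReadLines(path), term);
    }
}
```

A call such as IndexGrep.FindInFile("bookmarksbase.xml", "powershell") would then list every line of the index that mentions the term.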

More details and a download link are available on GitHub.

PowerShell — my points of interest

I had never used PowerShell until quite recently. I had successfully solved problems with a bunch of other scripting languages, e.g. Python, Perl, Bash and AWK. They all served the purpose really well and I did not feel I needed yet another scripting language. Furthermore, PowerShell looks nothing like any of the technologies I am familiar with, so I put off learning it many times.

However, when you work as a .NET developer, chances are that sooner or later you will come across a solution implemented in PowerShell. It could be, for instance, a deployment script that you will have to maintain. This happened to me a while ago. Although the modification I committed was relatively simple and I worked it out rather quickly with a little help from Google, I decided to dig into the subject and check a few more things out. What I found after a bit of random research quite impressed me. I would like to share the three main features I have found so far that I consider valuable in a scripting technology. At the bottom of this post I have also put some code snippets as a quick reference for accomplishing particular tasks.

1. Out-GridView

In PowerShell you can manipulate the format of the output in many ways. You can generate HTML, CSV, whitespace-formatted text tables etc. But there is also an option to view the output of a command in a WPF grid that has a built-in filter. Look at the effect of the Get-Process | Out-GridView command: this is functionality you get out of the box with just a few keystrokes!

2. Embedding C# code

This feature seems quite powerful. If you need more advanced techniques in your script, you can basically implement them inline in C# and then just invoke them.

Add-Type @'
using System;
using System.IO;
using System.Text;
      
public static class Program
{
    public static void Main()
    {
        Console.WriteLine("Hello World!");
    }
}
'@
 
[Program]::Main()

3. XML parsing done simply right

Any time I had to do some XML parsing in scripts written in other languages I felt somewhat confused. This is not the sort of thing you just recall from memory and type in as code. You have to use specific APIs, call them in a specific way, in a specific order etc. I do not mean it is complicated in any way, it is not, but it is cumbersome in many languages. I always had to look things up in a cheat sheet. Not any more 🙂 From now on, I will always lean toward the simplest, and perhaps simply the best, implementation of XML parsing:

$d = [xml] "<a><b>1</b><c>2</c></a>"
$d.a.b

This outputs 1. Yes, it is as simple as that. You basically access member properties whose names match the XML nodes.

I am sharing these features because I did not imagine a scripting language could offer something this powerful. And this is possibly only the tip of the iceberg, as I have just scratched the surface of the PowerShell world. I also suggest checking out a little script I wrote to explore PowerShell functionality: managesites.ps1. It may be useful for ASP.NET developers, as it allows you to delete sites from the IIS Express config file.

Miscellaneous code snippets:

  • if (test-path "c:\deploy"){ "aaa" }
  • $f="\file.txt";(gc $f) -replace "a","d" | out-file $f (this one is particularly important, because the equivalent in-place editing functionality in the MinGW implementations of Perl and sed does not seem to work correctly)
  • foreach ($line in [System.IO.File]::ReadLines($filename)){ }
  • -match regex
  • ( Invoke-WebRequest URL | select content | ft -autosize -wrap | out-string )
  • [reflection.assembly]::LoadWithPartialName("Microsoft.VisualBasic") | Out-Null
    $answer = [Microsoft.VisualBasic.Interaction]::InputBox("Prompt", "Title", "Default", -1, -1)
  • foreach ($file in dir *.vhd) { }
  • Set-ExecutionPolicy unrestricted

You are billed for turned off Azure VMs as well

If you are new to Microsoft Azure you would hardly guess this. When you shut down your virtual machine, the compute hour counter keeps counting just as when it is running, and you have to pay for those hours as well. This “minor” detail is not explained in many of the official introductory materials I have read. I realized it only because I am the kind of person who likes to re-verify things over and over again, which is why I went to my account’s billing details. I had used my VM for just a few days, only a few hours each day, and yet I saw nearly 200 compute hours on my bill.

Indeed, there are reasonable technical reasons why even a powered-off machine is billed. When you create a virtual machine you consume data center resources, e.g. an IP address, CPU cores and storage, and they have to remain allocated to you. It does not matter whether the machine is running, as these resources still must be reserved and ready for you.

The solution to this problem is to use Azure PowerShell to control your virtual machines. By default, the stop command also performs what is called deallocation, and then the billing counter stops.

Below I present a quick reference of the relevant commands.

  1. You need to “log in” to your Azure account from PowerShell. You do this with either the Import-AzurePublishSettingsFile filename command or the Add-AzureAccount command. Use the former if you would like to use the profile settings file downloaded from the portal, and use the latter if you prefer to just type your Microsoft account credentials and have the shell store them for you. In both cases the credentials are stored in C:\Users\**name**\AppData\Roaming\Windows Azure Powershell.
  2. Use Get-AzureSubscription to list your subscriptions.
  3. Use Select-AzureSubscription -SubscriptionName **name** to make the subsequent commands apply to this subscription.
  4. Use Get-AzureVM to list your virtual machines, their names and their states.
  5. Use Stop-AzureVM -ServiceName **name** -Name **name** to shut down and deallocate a virtual machine.
  6. Use Start-AzureVM -ServiceName **name** -Name **name** to power on a virtual machine.

When you close the shell and open it again, you do not have to log in to your Microsoft account again, but before you can control virtual machines you have to select a subscription first.

My configuration files

This post is mostly for my personal reference, as it is useful to have one easily accessible place for quick lookups of configuration files for commonly used tools. There is also a special link to this post: https://blog.pjsen.eu/conf


.gitconfig

https://github.com/przemsen/main/blob/master/configs/.gitconfig (raw)

.bashrc

https://github.com/przemsen/main/blob/master/configs/.bashrc (raw)

.vimrc

https://github.com/przemsen/main/blob/master/configs/.vimrc (raw)

git-prompt.sh — Git Bash for Windows

https://github.com/przemsen/main/blob/master/configs/git-prompt.sh (raw)


Main GitHub repository

https://github.com/przemsen/main

My GitHub + first simple project published

I have finally set up my GitHub account and published some of my code. The URL of the account is:

https://github.com/przemsen.

And the first and very basic project is:

https://github.com/przemsen/WebThermometer.

WebThermometer is a WPF application intended to be used as a desktop gadget. It periodically downloads (the default interval is 5 minutes) the current temperature from an arbitrary web site and displays it. I personally find it useful, as I like to observe current weather conditions right from my computer. I tried to write it so that it can easily be modified for use with other data sources. You can also download a compiled, ready-to-run version from my Polish blog.

My plan is to gradually publish some of my complete projects and some code snippets which in my opinion are, or will be, worth showing. You can freely modify and recompile all of the published code provided that you state it was originally authored by me.

PS. Today the auto-update mechanism of my WordPress failed (apparently this sometimes happens) and I ended up with the entire installation damaged. I restored it from a backup, and I apologize for the loss of a few comments from the last 2 months.

The basics do matter

Recently I spotted the following method in a large C# code base:
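A sketch of what such a method looks like, reconstructed from the description in this post rather than copied from the code base; the class name, parameter names and exact signature are assumptions:

```csharp
using System;

// Hypothetical reconstruction of the exception-based helper.
public static class LegacyConverter
{
    // Converts text to an int. Any conversion failure is swallowed
    // and the caller-supplied default is returned instead.
    public static int ConvToInt(string text, int defaultValue)
    {
        try
        {
            return Convert.ToInt32(text);
        }
        catch (Exception)
        {
            // Every invalid input costs one thrown and caught exception.
            return defaultValue;
        }
    }
}
```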

This code works and does what it is supposed to do. However, I ran into a slight inconvenience while debugging it. I frequently use the Visual Studio DEBUG->Exceptions->CLR Exceptions Thrown option (check it out!), which to me is an invaluable tool for diagnosing the actual source of an exception. The code base relied heavily on this very ConvToInt method, so it generated lots of exceptions and caused Visual Studio to break into the debugger over and over again. I then had to disable CLR Exceptions Thrown to protect myself from being hit by flying exceptions all the time. Having switched it off, I ended up with incomplete diagnostic capabilities. It is bad either way. So what I did was a basically simple refactoring:
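In sketch form (the names and signature are assumed, not taken from the real code base), the refactored version pre-loads the default value and hands the variable to TryParse:

```csharp
// Hypothetical reconstruction of the first TryParse-based refactoring.
public static class TryParseConverter
{
    public static int ConvToInt(string text, int defaultValue)
    {
        // result starts out as the caller's default; TryParse then
        // writes to it through the out parameter. No exception is
        // thrown for invalid input.
        int result = defaultValue;
        int.TryParse(text, out result);
        return result;
    }
}
```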

This method also works. One can even argue that this code performs better, because throwing exceptions is considered slow. And this is also correct. Although performance was not a key factor here (for line-of-business applications it rarely is), I measured it anyway. I ran both methods in a for loop 5 million times in release mode, wrapping them with the appropriate calls to the Stopwatch class. The results are not surprising. For valid string representations of a number, the former method (i.e. the one using System.Convert) gave an average result of

663 milliseconds

and the latter (i.e. the one using TryParse) gave an average result of

642 milliseconds

We can safely assume both methods have the same performance in this case. Then I ran the test with an invalid string representation of a number (i.e. passing “x” as the argument). Now the TryParse version gave an average result of:

546 milliseconds

And the System.Convert version, which indeed repeatedly threw exceptions, gave a result (not an average, I ran this once) of

233739 milliseconds

That is a huge difference of more than two orders of magnitude (roughly 430 times slower). At that point I was fairly convinced that my small and undoubtedly unimpressive refactoring was right and justified. Except that it was not correct. It worked well and was successfully tested. But after a few weeks, when a different set of use cases was being tested, the application called ConvToInt with -1 as the second argument. It turned out that the method returned 0, not -1, for invalid string representations of a number. What I want to convey here is:

TryParse sets its out argument to 0, even if it returns false and did not successfully convert a string value to a number.

I scanned the code base and found this pattern a few more times. Apparently I was not the only programmer who did not know this little fact about the TryParse method. Of course, it is well documented (http://msdn.microsoft.com/en-us/library/f02979c7.aspx). To me the problem with this API is even more serious. 0 is probably the most frequently used default value for a failed string conversion in general. In a construct like the one above, however, the 0 comes from TryParse itself rather than from the default provided by the caller, even though the caller's value is exactly what is expected to be returned in case of failure. As long as the default happens to be 0, nobody notices. One can easily get into trouble when one expects (and passes as an argument) a different default value, e.g. -1, and still receives 0 because TryParse works this way by definition. Obviously the solution here is to add an if statement:
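A minimal sketch of the corrected method, with the same assumed names and signature as before; the only real change is that the return value of TryParse is checked:

```csharp
// Hypothetical reconstruction of the corrected helper.
public static class SafeConverter
{
    public static int ConvToInt(string text, int defaultValue)
    {
        int result;
        if (int.TryParse(text, out result))
        {
            // Parsing succeeded, so result holds the converted number.
            return result;
        }
        // Parsing failed: result has been set to 0 by TryParse,
        // so ignore it and return the caller's default.
        return defaultValue;
    }
}
```

With this version a call such as SafeConverter.ConvToInt("x", -1) gives back -1, and no exception is thrown along the way.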

The performance does not get significantly worse because of this one conditional statement; I measured it and it is roughly the same.
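For anyone who wants to repeat the measurements, a harness along these lines should do. It is a sketch under assumptions rather than the exact code I ran: the class and method names are invented here, and the two conversion methods are simplified stand-ins for the versions discussed in this post.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing harness built around the Stopwatch class.
public static class ConversionBenchmark
{
    // Calls the given conversion `iterations` times with the same
    // input and returns the elapsed wall-clock time in milliseconds.
    public static long Measure(Func<string, int, int> convert, string input, int iterations)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            convert(input, -1);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    // Exception-based variant: throws and catches on invalid input.
    public static int WithConvert(string text, int defaultValue)
    {
        try { return Convert.ToInt32(text); }
        catch (Exception) { return defaultValue; }
    }

    // TryParse-based variant: no exceptions on invalid input.
    public static int WithTryParse(string text, int defaultValue)
    {
        int result;
        return int.TryParse(text, out result) ? result : defaultValue;
    }
}
```

Calling ConversionBenchmark.Measure(ConversionBenchmark.WithConvert, "x", 5000000) and the same with WithTryParse, in a release build and without a debugger attached, corresponds to the invalid-input test; expect the exception-based run to take minutes rather than milliseconds.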

The lessons learned here:

  • Exceptions actually ARE EXPENSIVE. This is NOT a myth.
  • Do not rely on the value of the out variable passed to the TryParse method in case of a failure. Always back yourself up with an if statement and check for the failure.
  • A more general one: learn the APIs, go to the documentation, and do not simply assume you know what a method does, even when it comes to the basics. A descriptive and somewhat verbose method name can still turn evil when run under edge cases. Always be 100% sure what the contract between the API authors and you is, i.e. what the API actually does.