Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Archives

Post Categories

Programming

Innocent looking things can make a huge difference. One of the dark arts a good programmer should master are regular expressions. You can parse complex data with them very quickly but the syntax of Regular Expressions looks like black magic. At MSDN there are a lot of well documented examples online which is useful when you have forgotten how this quantifier or that grouping expression does work. There is also some advice on performance which can be very helpful in specific occasions. When it comes to optimizing regular expressions there is very little material online. One of the better posts deals with Java Regular Expressions which can help to tune .NET Regular expressions as well.

The vivid readers do already know what the topic for today will be: How can I optimize Regular Expressions?

Suppose you have a log file and you want to parse log messages out of it. You create a regular expression read the file in parse it and display the parsed messages. But if the file size exceed a few MB the viewer takes quite some time to load the messages. You investigate and check how fast the file reading is. At my machine I can read a 20 MB file in 0.2s so reading was obviously not the problem. But the parsing did take over 16s! My first thought was to remove the slow regular expressions and rewrite the parser. But after reading some of the material from the above links I thought that my regular expression could cause an awful amount of backtracking. The log file looks like this:

Message: This is a test message
Id: Id of Message
Severity: Warning
Process Name: C:\Test.exe
Process Id: 4093
Properties: xxxxx

And the Regular expression used to parse the messages

 

        static Regex slow = new Regex("Message: (?<message>.*)\r\n" +
                               "Id: (?<id>.*)\r\n" +
                               "Severity: (?<severity>.*)\r\n" +
                               "Process Name: (?<processname>.*)\r\n" +
                               "Process Id: (?<pid>.*)\r\n" +
                               "Properties: (?<props>.*)\r\n", RegexOptions.Singleline | RegexOptions.Compiled);

The well known trick to compile the regular expression did not help much. But the expression did work fine and I could extract the parts of the message with some code like this:

            string pMsg;
            string id;
            string severity;
            string processName;
            string processId;
            string properties;
 
            Match m = rex.Match(msg);
            if( m.Success)
            {
                    pMsg = m.Groups["message"].Value;
                    id = m.Groups["id"].Value;
                    severity = m.Groups["severity"].Value;
                    processName = m.Groups["processname"].Value;
                    processId = m.Groups["pid"].Value;
                    properties = m.Groups["props"].Value;
 
            }

To try out new regular expressions there is an indispensable tool called Regulator where you can paste regular expressions and test strings into it and it will display the matches directly. Another nice feature is that it can translate any cryptic regular expression into plain English. After optimizing the regular expression I got these numbers:

Did parse 20000 messages in 0,59s 33.898 messages/s
Did parse 20000 messages in 0,09s 235.294 messages/s
Fast version is 7 times faster

If you expect that the regular expression does look dramatically different I must disappoint you. It is nearly the same. But as I said the beginning its the innocent looking things that can make a huge difference.

        static Regex fast = new Regex("Message: (?<message>[^\r\n]*)\r\n" +
                               "Id: (?<id>[^\r\n]*)\r\n" +
                               "Severity: (?<severity>[^\r\n]*)\r\n" +
                               "Process Name: (?<processname>[^\r\n]*)\r\n" +
                               "Process Id: (?<pid>[^\r\n]*)\r\n" +
                               "Properties: (?<props>[^\r\n]*)\r\n", RegexOptions.Singleline | RegexOptions.Compiled);

Did you spot the difference? I did only replace the (?<name>.*) with (?<name>[^\r\n]*) to get a 7 times speedup. Not bad. You might ask why this is the case? The deeper reason is that the .* operator is greedy and does try to match as much as possible. If you would apply the regular expression to a file with many messages it would match only one giant message. To do this it has to scan the entire string while trying to match the other .* expressions as well. By adding [^\r\n]* instead of .* I did limit the expression to one line only. In the worst case scenario the the expression will match at most one line which is exactly what I wanted.

After parsing the messages we can create a message object and stuff the parsed message parts into a object. I did check with my trusty YourKit memory profiler the memory consumption after reading a 20 MB log file and was surprised that I did consume over 50 MB of memory! How can this be? The reason is simple: The log file ASCII but .NET and Windows use UTF-16 as storage mechanism for strings. That does mean that every ASCII character which does take 1 byte on the disk will take 2 bytes in memory! That is bad but we can improve it quite a lot by playing dirty tricks. Why else would you read this stuff? It is obvious that if we look at the data we process very much redundant information is kept in memory. Especially log messages will contain many identical fields like process name, process id, host name, …

This is the first time I can use the well known but never used String.Intern method with a real reason to do so:

            string myHostName;
            public string HostName
            {
                get { return myHostName; }
                set
                {
                    if (value != null)
                    {
                        // Save space by using the space for identical strings only once
                        myHostName = String.Intern(value);
                    }
                    else
                    {
                        myHostName = null;
                    }
                }
            }

 

When we fill our message objects we can play smart and intern the fields which tend to hold a lot of identical data. This little trick did shave off about 10 MB of additional memory. Not bad. But there is more. You remember that .NET strings are stored in the wasteful UTF-16 Encoding. But other encodings do exist. Why not use the UTF-8 encoding which uses only half of the memory?

            byte [] myCompleteMessage;
            public string CompleteMesage
            {
                get { return Encoding.UTF8.GetString(myCompleteMessage); }
                set
                {
                    if (value != null)
                        myCompleteMessage = Encoding.UTF8.GetBytes(value);
                    else
                        myCompleteMessage = null;
                }
            }

This makes sense if the field is not accessed very often. I did keep the complete unparsed message around for a special display mode. By using this trick I could squeeze out another 20 MB. After applying these optimization the application did consume after parsing a 20 MB log file a little less than 20 MB additional CLR memory. That is much better and it did cost me only changing a few property getter and setters. Making the parsing multi threaded is left as an exercise for the reader. Besides that I am running out of good ideas how to make it much faster.

Let me ask you dear reader what was your best performance optimization you are still proud of?

posted on Tuesday, June 22, 2010 10:24 AM

Feedback

# re: Slow – Faster - Really Fast 6/23/2010 3:37 PM Michael Schall
What about trying the "lazy *"?

instead of switching to (?<id>[^\r\n]*)
use (?<id>.*?)

http://msdn.microsoft.com/en-us/library/az24scfc.aspx


# re: Slow – Faster - Really Fast 6/24/2010 7:01 AM Alois Kraus
I tried the non greedy version but it is still two times slower than the greedy version. I have read somewhere that .*? is normally slower than .* because it needs to check if the other expressions could match as well if it tries to match less. This does cost some time.

Here are the results:

Did parse 20000 messages in 0,15s 129.870 messages/s (non greedy)
Did parse 20000 messages in 0,09s 229.885 messages/s
Fast version is 2 times faster


Yours,
Alois Kraus


Post A Comment
Title:
Name:
Email:
Comment:
Verification: