« The Biggest Obstacle| Main | The Other End of the Telescope »

F# Versus Microsoft's Regex. A Lesson in Types

| | Comments (9)

The more I play with F# the more fun it gets. Take for instance this problem I was having last week.

I was doing some text processing. As part of that, I set up a few RegExes. As I continued coding, I realized that I was following a pattern: apply the regex, check the match count, and if there were a bunch of matches, either get the first or last item. So the code looked like this.


#light

 

open System

open System.Text.RegularExpressions

 

 

let regFindNumbersInParens = new Regex(@"\([0-9]{1,4}\)")

 

let doSomeStringStuff (s:string) =

    let parenNumMatches = regFindNumbersInParens.Matches(s)

    if parenNumMatches.Count > 0

        then parenNumMatches.[parenNumMatches.Count - 1].Value

        else ""

    // more stuff here



The more I got to thinking about it, the more I realized there were a lot of situations where I wanted the first or last item in a bag of stuff. Of course, this is already available for a bunch of stuff, but what about things that only had an IEnumerable, like Regex Matches? What I'd like is something like Matches.Last or Matches.First

The RegEx collection implements ICollection and IEnumerable, so we can get an enumerator and walk it. (This was actually the first thing I tried). But poking around, it gets easier still.

The generic case is very easy to do. If you want to get the first item, it looks like this:

let firstEnum (e:IEnumerable) = e |> Seq.cast |> Seq.nth 0

Note the use of "Seq.cast" -- this is because if all you have is an IEnumerable, you can have a bunch of anything. You really don't know. Hopefully every time you get a new item from the collection you get one of the same things, but it's not strictly required. So F# adds "Seq.cast" to take a bunch of IEnumerables and fake-out a strongly-typed collection. It's not perfect, but it's good enough for government work.

The last item gets a little trickier, since sequences aren't necessarily all "there" when you compute them. I mean heck, you could have a RegEx match that gives you a million results. Ideally you wouldn't walk the collection unless you had to. But, if you're going to the end, you're walking the list. So to find the last item, you have to convert to an array first:

let lastEnum (e:IEnumerable) = 
    e |> Seq.cast |> Seq.toArray |> (fun x->x.[x.Length - 1])


Note that I'm not checking to make sure the collection has anything in it. That's simple enough to do, but let's stick with the plot.

So I throw my Regex MatchCollection at this wonderful hunk of code -- and it barfs.

I throw a lot more collections that implement IEnumerable, and it works fine.

WTF?

After a LOT of googling around, I came across this thread about RegEx.Matches. In it, Tomas Petrice explains that the RegEx collection is one of the "old" ones -- it implements an IEnumerable, but not an IEnumerable<'a>

If all of this is sounding complex, the problem is that RegEx was coded a long time before generic collections, and nobody has updated the code. So while it looks like a collection, smells like a collection, and may even taste like a collection, it's not the same kind of collection as all the other ones.

Thanks Microsoft!

So let's fix this so it will stop annoying folks.

The first thing I tried to do was to write one function that would take either an IEnumerable OR an IEnumerable<'a>

I failed miserably.

I still believe this is possible, but hell if I can get it to work. Perhaps by wrapping the code in a type and using method overloads, but I digress. What I did find out with this adventure is that programmers have tremendous flexibility with types in F#. For instance, I can write a function that takes a type with certain conditions.


let inline getNameValue xItem = 
  let name =  (^a : (member get_Name  : unit -> XName ) xItem)
  let value = (^a : (member get_Value : unit -> string) xItem)
  name, value


There's a heckuva lot going on here, and I don't want to get into a detailed F# discussion. Suffice it to say that this method only takes an argument that has accessors for Name and Value, not that implement some kind of interface. So I could decide I would write a method that worked on any object that implemented a method called "Foo" and you could pass me any object that implemented such a method -- no matter where you got it. It's called "duck typing", and it's an entirely different thing from what most .NET developers are used to.

I also found that I could extend built-in types with new stuff. For instance, say I wanted a function that would give me "ForEach" capability on anything that implemented an IEnumerable. Sure, I could write a stand-alone function, but why not just make it part of anything that has an IEnumerable. Whiz-bang, easy enough to do:

    type System.Collections.Generic.IEnumerable<'a> with 
        member this.ForEach(f:'a -> unit) = 
            for item in this do 
               f item


Hell, a body could go ape-wild with all this power, adding on dozens of methods to each of the primary system types.

Maybe not a good thing to do!

Finally I found out I could write a function that took any number -- float, int, bigint, whatever -- and work with it.

module NumericLiteralG = begin
    let inline FromZero() = LanguagePrimitives.GenericZero
    let inline FromOne() = LanguagePrimitives.GenericOne
end

let inline factorize n = 
  let rec factorize n j flist =  
    if n = 1G then flist 
    elif n % j = 0G then factorize (n/j) j (j::flist) 
    else factorize n (j + 1G) (flist) 
  factorize n (1G + 1G) [] 


There's a whole nest of goodness in there. How can you write a strongly-typed function which takes any type of number? But that's a story for a later day

So what was my answer? Here's the dump:


#light

open System
open System.Text.RegularExpressions
open System.Collections
open System.Collections.Generic

type System.Collections.Generic.IEnumerable<'T> with
    member this.getLast = 
        this |> Seq.toArray |> (fun x->x.[x.Length - 1])
    member this.getFirst = 
        this |> Seq.nth 0

type System.Collections.IEnumerable with
    member this.getLast  = 
        this |> Seq.cast |> Seq.toArray |> (fun x->x.[x.Length - 1])
    member this.getFirst = 
        this |> Seq.cast |> Seq.nth 0

type System.Text.RegularExpressions.MatchCollection with
    member this.getLast =
        this.getLast :> Match
    member this.getFirst =
        this.getFirst :> Match
    member this.toSeq =
        seq {for i = 0 to this.Count - 1 do yield this.[i]}
    member this.toArray =
        [|for i = 0 to this.Count - 1 do yield this.[i] |]
    

let regFindNumbersInParens = new Regex(@"\([0-9]{1,4}\)")
        
let doSomeStringStuff (s:string) = 
    let lastitem = regFindNumbersInParens.Matches(s).getLast
    // more stuff here


This code allows me to throw a "getFirst" or a "getLast" on any kind of collection I'm given -- whether it's a shiny new-fangled IEnumerable<'T> collection or just an old-hat .NET 1.0 RegEx Collection.

What I wanted initially was to be able to throw any untyped collection at the function and have it give me a strongly-typed last element, but my F# Kung-Fu has failed me. Clocking in at around 18 lines of code, this seems a little long to me, but it's always that way.

In either case, works good enough for now. And perhaps somebody else will run into this RegEx problem and be able to find some help here.

The cool thing here is the degree of freedom you have with playing around under the hood. This is not always a good thing, but like everything else, a little bit will go far.

UPDATE: Fred Leitz wrote in and pointed out that my function finding the last element in seqences is a bit whacked. No point in converting to an array just to go to the end. So it looks much better like this:


let lastSeq<'T> = Seq.reduce (fun a b -> b)


Couple interesting points here.

First, converting to the array and checking for "Length" walks the structure anyway -- which is what our new code does. So the old way copied it all over to a new structure, walked the array to get the length, then walked it again to get the last item. The new one just picks up 2 items from the sequence at a time, returning the second one. When you're done, the second item will be the last item in the sequence. No copying, no double-work.

Second, if you're going to be walking the sequence a bunch, finding the length, getting individual items, etc, you should go ahead and make it an array. But, as in this case, you make it, find the last element, then throw it away, there's no point in using up memory and CPU cycles.

One of the things about Functional Programming that strikes me as completely different from OOP-world is the degree to which you are always inspecting your variable structures and deciding which forms of structures, collections, and functions work best with the data you have. Many times in OOP we have abstracted away so much of the "guts" of the problem that the programmer is dealing in abstractions far removed from the physical implementation of the data and programming. This makes OOP easier to understand, design, code, and maintain, but it also makes it more prone to bone-headed mistakes.

Thanks Fred!

9 Comments

MatchCollection implements ICollection which contains Count. If your Last function checked for common list/collection types first, it wouldn't have to walk the whole thing.

It does indeed. Thanks Andy!

However I wanted the more generic IEnumerable, not ICollection.

Having made that decision, I'm stuck walking the list.

But the other way works great too. And it looks like RegEx.Matches gets the entire collection at once (probably a mistake, but that's a blog for another day) So there's nothing wrong with your approach.

Another thing that I could have done is override all the IEnumerable and ICollection methods inside the Matches Collection and 'fake out' having an IEnumerable under the hood. This was actually one of my choices, but I gave up on it when I realized that there were probably other IEnumerables lurking around in .NET someplace. Better to be a little more generic. It was a design decision.

One of the neat things about programming is how many different solutions you can have for the same problem.

"So I throw my Regex MatchCollection at this wonderful hunk of code -- and it barfs."

Huh? :)

The following works for me:

let a = (new Regex ".*").Matches("foo")

let lastEnum (e:IEnumerable) =
e |> Seq.cast |> Seq.toArray |> (fun x->x.[x.Length - 1])

lastEnum a

Hold on - that should have explicit type parameters in the call to lastEnum (lastEnum Match) and if you want to avoid a compiler warning, also in the lastEnum definition.

(should've used preview...sorry)

By "works for me" do you mean that it compiles? Or that you can do with it what you think you can? ;)

Try putting this next:

let p = (lastEnum a)

Console.WriteLine p.Value

Should get the Value property of the last Match, right?

But it doesn't. And there began the blog entry. :)

Yes. I know!

But if you explicitly cast the arguments, it's not much of a generic function anymore, is it?

So, assuming I just wanted the one function for RegEx.Matches, I could have lived with that. But the goal was to write a generic function that found the last item of a bag of stuff that exposed IEnumerable (the old one, not the new one)

The result of my code in F# interactive is:
val it : Match = {Captures = seq [...];
Groups = seq [...];
Index = 3;
Length = 0;
Success = true;
Value = "";}

The lastEnum function is perfectly generic:
val lastEnum : IEnumerable -> 'a

You just need to give type inference a hand by specifying the actual type of 'a, in this case Match.
If you use it in a pipeline, F# might be able to figure this out for itself.

If you just want the last element of an old IEnumerable, then why use Seq.cast? Instead iterate over it and return the last element. But it's going to be of type obj.

There is going to be a cast in there somewhere whichever way you go. Either after you return obj from the IEnumerable, or in the Seq.cast. I'd favour the solution with Seq.cast.

Actually, I was thinking you could still target IEnumerable, but check for ICollection and others. I don't know how to do it in F#, but in C# you'd do:

if(list is ICollection)
return ((ICollection)list).Count;

Just a note, for those mentioning testing against "ICollection vs. IEnumerable vs. IEnumerable":

The "match" keyword sounds like it should work in this case. Think of it sort of like a switch statement, except you can have all sorts of complex cases, including matches on the type/interface of the value. It's often used with the Some(x)/None Option, which is an F# way of nicely handling NULL.

Granted, extending the class/types/generics with the new functions also looks elegant, making it look more standardized than in the normal CLR offerings.

Leave a comment

About this Entry

This page contains a single entry by DanielBMarkham published on July 10, 2010 2:09 PM.

The Biggest Obstacle was the previous entry in this blog.

The Other End of the Telescope is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Social Widgets





Share Bookmark this on Delicious

Recent Comments

  • Gary M.: Just a note, for those mentioning testing against "ICollection vs. read more
  • Andy Edinborough: Actually, I was thinking you could still target IEnumerable, but read more
  • Kurt: The result of my code in F# interactive is: val read more
  • DanielBMarkham: Yes. I know! But if you explicitly cast the arguments, read more
  • DanielBMarkham: By "works for me" do you mean that it compiles? read more
  • Kurt: Hold on - that should have explicit type parameters in read more
  • Kurt Schelfthout: "So I throw my Regex MatchCollection at this wonderful hunk read more
  • DanielBMarkham: It does indeed. Thanks Andy! However I wanted the more read more
  • Andy Edinborough: MatchCollection implements ICollection which contains Count. If your Last function read more

Information you might find handy
(other sites I have worked on)





Recently I created a list of books that hackers recommend to each other -- what are the books super hackers use to help guide them form their own startups and make millions? hn-books might be a site you'd like to check out.
On the low-end of the spectrum, I realized that a lot of people have problems logging into Facebook, of all things. So I created a micro-site to help folks learn how to log-in correctly, and to share various funny pictures and such that folks might like to share with their friends. It's called (appropriately enough) facebook login help