Monday, August 30, 2010

On Benchmarking Skills

A common topic on Stack Overflow (and possibly many other places) is developers looking for some measure of their skills.  Many ask what they should learn next (or learn first), some ask what they should do for a job search or an interview, some simply and directly ask how they can know how good they are at what they do.  (My personal favorite, and I'm currently having no luck finding a link, was when somebody asked if it was possible to completely master a programming language.  Most of the responses were similar to what I am about to write, but one was simply the word "Yes" linked to Jon Skeet's profile.)

It's a natural question, really.  Humans have a psychological need for positive reinforcement.  We want to know how good we are at something, most specifically in relation to how good other people are at that something.  We want to be weighed and measured in the hopes that we can pin a badge on ourselves demonstrating that we're experts at this and that.  Hell, there's an entire industry of certifications and courses built around this.  And I've certainly met a fair number of people who cling to those certifications as ironclad proof of their superiority in the field.  (After all, if Microsoft says you're good at something, then you must be, right?  Why else would they be willing to sell you the course materials over and over again every time they release a new version of something that they also sell you?)

But it's relative.  We all know the old saying: "The more you know, the more you know you don't know."  It's a popular saying because it's essentially true.  So true, in fact, that the most common answers I see to such questions as above are that as long as you're striving to improve, you're good.  It's only when you believe that you've mastered something and that you have no room for improvement that you should worry.  No matter how good you get at something, you should always be aware of your limitations.  It's not entirely necessary that you have the ambition to overcome those limitations, so much as it's necessary that you be aware of them.

We've met the people who thought they were the top dog.  Hell, I've been that person.  Back when I worked at a state job, I was the man in terms of software development.  I knew my stuff, I was up to date on technologies, I was the hot shot codeslinger.  I had what I now refer to as "big fish in a small pond syndrome."  It was a state government job; it presented no challenges and no opportunities for growth.  There was no evidence that my skills were lacking and no reason to improve.  The job I took after that corrected this syndrome.  The pond, the size of which I measure by the overall skill and talent of my fellow developers, grew and grew and grew.  (It eventually tapered off and I found myself needing to expand and take upward flight under my own strength, which led to my seeking a new job again, but that's another story.  Still seeing how that's playing out.)

I've discussed this with Tony a few times as well.  He's mentioned that the team at his current job has its own big fish in a small pond: a developer who is "the senior developer" for no other reason than that he's unopposed.  He's not really skilled, but until Tony got there he had no means to measure his skills against anybody else.  Recognizing this in the professional world, Tony now finds himself wanting to be the smaller fish in the bigger pond.  This is because our skills are relative.  You only know how good you are at something when standing next to someone who's better.  (American Idol auditions notwithstanding.)

So the real answer to the question of "how do I know how good I am at something?" is to find someone who's better at it.  Learn from them.  There is no certification or online test that can measure you quite so well as you can measure yourself when you work with someone more knowledgeable or more experienced.  Keep in mind that the business model of certification courses isn't to make better developers, it's to sell certification courses.  Online tests (the kind I loathe when required by a hiring manager) don't actually test your ability to perform on the job, they test only your ability to take online tests.  Unless you're interviewing to be an online test taker, they're not particularly applicable.  (Though all too often they're a necessary evil to get past the first line of defense in an interview process.  Even though the code seen in such tests is generally the kind of code a qualified candidate would run from screaming rather than choose to support as a career.)

If you believe yourself to know all there is to know on a subject, that's a bad sign.  The more you know, the more you know you don't know.  Or, as I once saw it comically expressed online somewhere:  "When I graduated high school I thought I knew everything.  When I graduated college I realized I didn't know everything.  When I received my Master's Degree I realized I didn't know anything.  And when I received my PhD I realized that it's alright because nobody else does either."

Wednesday, August 25, 2010

Someone Doesn't Like Me

I must have offended somebody's delicate sensibilities on Stack Overflow:

Some quick research on meta.stackoverflow.com revealed that they have a daily script that checks for this sort of thing and remedies it.  (They have a number of daily scripts that look for a number of things, actually.  It is the internet, after all.)  So we'll see if that does anything.

It's just 10 points, so no big deal either way.  I just hope whoever this is doesn't continue to have nothing better to do.  People are just strange.

Tuesday, August 17, 2010

Intro to Programming with Python

About a month or two ago I did a video for SummeryPyGames to teach entry-level programmers (or those very new to it) about programming in Python. I just recently uploaded the video to my Vimeo account so I figured I would share it. It was my first screencast and I thought it came out pretty well.


I have thought about doing more, maybe delving into some details about different parts of the language, or slowly building up a library to teach someone Python, moving on to classes and basic scripting after the video above. I'm sure I would learn a lot in the process.

On Business Priorities

Raise your hand if you've been told "this application is very important and critical to the business and absolutely must work."  Now raise your hand if you've ever called them out on that bunk.  Maybe I'm just a little more cynical or blunt about these things, or maybe I'm a bit of an ass, but I think it's important that business users be made aware of the level of bunk often found in this statement when it comes to their applications.

For all the times I've heard similar statements made, they can all fit neatly into two distinct categories of "highest priority" (notwithstanding the fact that constantly pushing the panic button does not build a constructive sense of urgency):
  1. Real business priority
  2. "Squeaky wheel" priority
The former is simple.  There are systems that are critical to a business' core functions.  At my company, for example, our core database is of the utmost importance.  Millions of dollars have been spent, teams of people surround and maintain it at all times, there are detailed policies and procedures that have been created by experts and have stood against rigorous testing, etc.  It's a business priority, and the business knows it.

The latter, however, is nonsense.  And it's nonsense that affects the business' bottom line often without the business even knowing it.  It wastes everyone's time and effort (and, ultimately, the business' money) all because someone somewhere is being a "squeaky wheel" and demanding attention.  More often than not, we as support personnel are simply told to just go along with the demands.  That is, after all, part of our job.  (And that's fine, just don't complain to me when the real work suffers as a result.)

But let's talk for a minute on what this "priority" really boils down to.  This business user has an application that was created for a specific business purpose.  For this user (and this user alone) this application is critical.  They need it in order to do their job.  And their job relates to other business functions which are also critical and so on and so on (which this user will gladly explain in an email, since justifying their job is clearly more important than describing the actual problem).

But if this application is so critical to the business, why then was it not given any design or architectural attention?  Well, that's "just how we do things here."  Our development group isn't so much a "team" as it is an assortment of people who work individually on their own projects and support their own projects until they eventually leave and everyone else inherits their mess.  But that's another story for another time.  If the application is important to the business, then why was the business not willing to put any effort into creating it?  It was given to the cheapest "resource" (person) and done as quickly as possible.  That sounds pretty low-priority to me.

But the user thinks otherwise.  The fact that the business as a whole consciously decided that this application was not important enough to merit any effort and that the resulting application was designed to fail doesn't matter.  Evidence and historical data are unimportant if that wheel can get squeaky enough.  Now we're into a political battle, not software support.  Now we get to play a game that managers and children alike play.  Complain loud enough and you will be heard.  (It's amazing what business professionals have in common with my two small children.)

It gets especially interesting when the user begins invoking names and business terms as part of their squeaking.  Toss around the term "HR" or cc: an executive and the squeaking gets louder.  Sorry, but I don't believe in magic.  Stringing together words of power to form an incantation that will rouse and frighten others into doing your bidding is, I'm sorry to say, witchcraft.  Voodoo.  And it has no place in a professional environment.

So what ends up happening?  Amid all my ranting and complaining (sorry about that) it all comes down to one simple truth.  All we're going to do for that user is placate them.  We're going to apply some quick fix to get them up and running again.  Not even get them up and running, but just get them to shut up.  No effort, no thought, no expertise at all.  Just do whatever it takes to stop the wheel from squeaking.  Again, does this sound like a business priority?

So why are we maintaining this charade?  This charade costs money.  It directly affects our bottom line.  We know that user will be back, we know that application will fail again.  It all comes down to a very simple business decision...

Is this application a priority or not?

While this post was a whole lot whinier and rant-ier than my last, the message is essentially the same.  Care about your software.  As a business (either a whole company or just a department or even a single user, whatever entity "owns" a particular application), if a piece of software is critical to your business then put forth the effort required to make sure it meets your needs.

I've said it before and I'll say it again... If the person who actually cares about this application doesn't really care about it, why should we?

Monday, August 16, 2010

Interconnecting The Tubes

I was bored at work today, so I figured I'd add Stack Overflow flair to the side of the blog.  I also started a personal blog (since too much personal stuff would be noise here) and added my Twitter feed to that.  There's also a Twitter list for this blog's contributors and peeps, but it seems that the Blogger widgets don't have a way to add that as a feed as well.  At least, not one that I've found.

Thoughts?

Friday, August 13, 2010

On Quality Software

Sean sent us this link today to an interview with Dave Thomas and Joe Armstrong on the subject of software quality and craftsmanship.  (I highly recommend downloading the MP3 available on the link and keeping it in one's archives in case the link ever goes stale.  The interview is full of good points and insights that should be retained for the ages.)  It really got me thinking about software quality in the enterprise environment.

It's a conversation a lot of us have had with our managers, our fellow developers, etc.  And it occurs to me that the response is often the same.  It comes across differently, but the message is clear.  Writing quality software in "the real world" is "too hard."  Sure, it's nice for open source projects or academia or anywhere else that money doesn't matter, but in a real business where costs are important it's just not worth all the extra effort vs. just doing some good old rapid application development and whipping up some application to get the job done.

I find that this point of view is based on two core assumptions:

  1. Bad software costs less than good software.
  2. Good software is more difficult to write than bad software.
The former of the two assumptions confuses me, downright baffles me.  We've been in this business for a long time now.  Some of us are a little newer to the game than others, but the industry as a whole has been around for some time.  And as such, there are certain lessons I would think we'd have learned by now.  After all, we're talking about businesses here.  Groups of professionals whose sole purpose in the office day in and day out is to concern themselves with "the bottom line."  Money.  Costs.

Can anybody in any position of management over a software development team truly claim with a straight face that supporting bad software is in any way cheap?  That it can just be written off as a justified cost and pushed aside and forgotten?  Teams of highly-paid developers spend a significant portion of their time troubleshooting legacy software that was rapidly whipped up to just get the job done.  We as an industry have enough of this old software laying around that we know full well at this point how difficult and, at the bottom line, expensive it is to support.

Now, what's done is done.  I understand that.  This software has already been written and invested in and it's "business critical" so we need to support it.  But it's not like we're done.  We're still writing software.  We're still creating new features and new applications and new projects.  So why is "the way we've always done things" still the status quo?  You know how much it's going to cost in the long run, that math has already been done.  Hasn't it?

You're either going to pay for it now or pay for it later.  Either way, it's going to cost you.  But there's a critical difference between "now" and "later" in this case.  "Now" is brief.  It's measurable.  You know when it's over.  "Later," on the other hand, is open-ended.  You don't know how long "later" is going to last.  You don't know how much it's going to cost, other than potentially a lot.  Let's express this with a little math...

  • 100 * 3 = 300
  • 50 * n = 50n
So, assume for a moment that creating the "quick and easy" software costs half as much up front as creating the "good" software: 100 for good, 50 for quick, with the other factor being how long you keep paying to support the thing.  (Seriously, remember that this is hypothetical, just for the sake of simple math here... we'll really get to this assumption in a minute.)  Now, can you tell me which final amount is higher or lower?  Quick, make a business decision.  300 or 50n?  Keep in mind that n is open-ended; the moment it creeps past 6, the "cheap" option becomes the more expensive one.

Now the latter assumption is something towards which I can be more sympathetic, mostly because the people who make that assumption don't know any better.  If you haven't written both bad software and good software over the years and really grasped an understanding of the two, then I can't really expect you to truly understand the difference.  Whether you're a manager who just isn't really into the technical details or an inexperienced developer who hasn't really learned yet, the assumption is the same.

But the thing you need to understand about this assumption is that it's just patently untrue.  Writing good software doesn't cost a lot more than writing bad software.  It doesn't even cost a little more.  Maybe this assumption is based on the generality that in most industries good costs more than bad.  Yes, this is true.  But it's true for a different reason than you think.

Manufacturing quality goods costs more than manufacturing cheap goods because the manufacturing is mechanical and commoditized.  It's an industry of tools, not of people, and better tools cost more.  But software isn't manufacturing.  It isn't plugging together a series of pre-defined widgets on an assembly line.  (Microsoft would have you drinking their Kool-Aid and believing that it's all about the tools, but that's because they happen to sell the tools.  But we'll cover that another time.)  It's a craft, and it's performed by craftsmen, not assembly line workers.

Does good art cost more than bad art?  Sure, the classic good art costs a fortune.  But that's because it's rare and can't be truly reproduced.  Software doesn't really have that problem.  But for art being created right now, the quality has nothing to do with the financing of the work.  Good or bad is a measure of the craftsman, not of the financial backing.

And a good software craftsman can often be found at roughly the same cost as a bad one.  This is because that cost is usually based on how long the craftsman has been doing the craft, not how good he or she happens to be at it.  We've all met "senior developers" who are only in that position because they happen to have been writing mediocre software for a long time, or because they're over a certain age, or because they've been with that company for a long time in various capacities such that when they got into the software gig they were considered "senior level."

But ask any craftsman in any industry.  Ask a seasoned carpenter or electrician or anybody.  Not taking into account the tools, is the actual act of doing something well inherently more difficult or, at the bottom line, more expensive than doing it poorly?  No, absolutely not.  It seems that way to laymen because the professionals "make it look easy."  But the fact is that, for those who know what they're doing, it is easy.  It actually takes a professional more effort to do something in an amateur way than to just do it the way they know how to do it.  It's not right, it's not natural, it forces them to stop and un-think every once in a while.

Do yourself a favor, do your company a favor, do your bottom line a favor and actually care about the software you manage.  Your developers care about it, because craftsmen like to produce good things.  (If they don't care about it then perhaps you should re-think your team.)  Produce better software so that, in the long run, your support efforts can be lower and you can have the time and resources to produce even more good software.  Your business end users may not know the difference, they may not know you've done anything at all.  They shouldn't have to, their software should just work.

Saturday, August 7, 2010

Diving into the Ruby world

The Code Dojo had its first meeting earlier today. It was only the two of us, but that might have been to our advantage. We spent the first couple of hours getting environments set up and eating lunch, but once that was out of the way we were able to start writing some code.

We did an exercise I found on this blog post about exercises to learn a language (yeah, it was too perfect not to check out). The exercise we did was the one to write a function to calculate the Haar wavelet of an array of numbers. We had to sit and discuss the problem at hand for a bit to make sure we understood it, but after that we were off and stumbling through, using RSpec.

Our first impressions of RSpec are that it seems to allow pretty expressive, concise tests. We noticed very quickly how RSpec extends a lot of the core language to offer you the ability to do things like this:

it "an odd length array" do
lambda { @haar.calculate([8,5,1]) }.should raise_exception(ArgumentError)
end

I think you can get the gist of the test just from reading it (which is great) and I'm not going to go into Ruby syntax here. I probably could have come up with a description that flowed better, but whatever. I should mention that at first we just tried this:

it "an odd length array" do
@haar.calculate([8,5,1]).should raise_exception(ArgumentError)
end

This test failed and it took a little bit before we realized what we were doing wrong. Without the lambda wrapping the code to be exercised, the exception was being thrown before RSpec knew what was supposed to happen. The calculate() method would be executed and throw the exception we were expecting, but since the should() method never got executed to set everything up, the test would fail. We had to wrap the call to calculate() in a lambda so that its execution was delayed. I'm not very familiar with RSpec, but I'm going to guess that the should() call has an interesting implementation.
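
Just to convince ourselves why the lambda matters, here's a rough guess at the kind of thing a raise_exception-style matcher has to do internally. This is only a sketch of the idea, not RSpec's actual implementation, and expect_to_raise is a name I made up:

# Hypothetical sketch of a raise_exception-style matcher, only to show why
# the code under test has to be wrapped in a lambda/block. NOT RSpec's code.
def expect_to_raise(expected_class, &block)
  block.call            # the matcher decides when the code actually runs...
  false                 # ...and if no exception happens, the expectation fails
rescue expected_class
  true                  # the expected exception gets rescued right here
end

# The block delays execution until the matcher is ready to rescue the error.
puts expect_to_raise(ArgumentError) { Integer("not a number") }   # => true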

As for our actual implementation code, we noticed that Ruby has some pretty powerful built-in functions, and we were able to avoid looping structures and arrowhead code after some googling. We came across some interesting tidbits about the language. There is a method to split an array into slices of a given size. Since the calculation required us to split up the numbers into pairs, this was a huge savings for us. The method is called each_slice() and our code looked like this:

pairs = []
current.each_slice(2) { |a| pairs.push a }

Our first attempt left out the parentheses around that 2 and we kept getting syntax errors. So it seems that Ruby only requires parentheses sometimes, but in this case (from our understanding) the block (an anonymous function defined in the { }) needs to attach to the each_slice() method call. Leaving out the parentheses results in Ruby trying to attach that block to the 2 instead, which you can't do. I'm not completely sure of this explanation and I haven't had time to research it, but it's worth mentioning.
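
For what it's worth, the do...end form seems to sidestep the issue, since do/end binds less tightly than { }. Here's a quick sketch with a made-up array, based on our current understanding:

current = [8, 5, 1, 3]

# With a { } block, the parentheses around the argument are required:
current.each_slice(2) { |pair| p pair }    # => [8, 5] then [1, 3]

# current.each_slice 2 { |pair| p pair }   # syntax error without the parens

# The do...end form binds loosely enough that the parentheses can be dropped:
current.each_slice 2 do |pair|
  p pair
end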

The last quirk that I'll mention tonight is the result of a block, and what happens when you use a return statement inside one. We were doing a map of an array to produce an array of pairs, but in the block to be executed we first tried a return statement at the end. This caused the entire enclosing function to return, not just the block, so we ended up having to do this:

avg_co = pairs.map { |pair|
  wavelet_pair = []
  wavelet_pair[0] = (pair[0] + pair[1]) / 2.0
  wavelet_pair[1] = pair[0] - wavelet_pair[0]
  wavelet_pair
}.transpose

The last line inside the block is just the variable (which is scoped how you would expect, existing only inside the block) whose value we want. A block's result is simply its last expression, so a line with only that variable on it gives us the return value we wanted. That last line was originally 'return wavelet_pair' but, like I said, that returned out of the whole enclosing function, which was quite unexpected.
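
Here is a tiny standalone example of that behavior (the method names are just made up for illustration):

# 'return' inside a { } block returns from the enclosing method, not the block.
def double_with_return(numbers)
  numbers.map { |n| return n * 2 }   # bails out of the whole method on the first element
end

# Letting the value be the block's last expression is what we actually wanted.
def double_with_last_expression(numbers)
  numbers.map { |n| n * 2 }
end

p double_with_return([1, 2, 3])            # => 2
p double_with_last_expression([1, 2, 3])   # => [2, 4, 6]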

I think that pretty much wraps up all the interesting things we picked up while we stumbled through this code. It felt like a great learning experience for us and I look forward to more. All the code for our Code Dojo should make its way up to the GitHub organization page. We also have a Google Group page now. Feel free to take a look at our code on GitHub and discuss any ideas or details of Ruby on our group page.

Friday, August 6, 2010

Is concurrency hard?

I have a great interest in really grokking Erlang. The language is fascinating to me. It was originally built to solve problems of reliability and scalability, I believe, but what they ended up with was a functional language with concurrency as one of its basic ideas. That is very interesting.

Now, the majority of my coding experience is with object-oriented languages. We have classes and methods and polymorphism. We tie state and actions together, which allows us to reason about and work with things in a more natural way. A Dog barks, a cat annoys you, a MacBook Pro with an i7 burns us. Now what if we want that dog to bark and run? We do it in two separate steps, done in sequence in the same thread. Do it in two threads and now you need to start thinking about locking, and I don't think any of us want to do that. We have made concurrency hard for ourselves because of how we reason in this paradigm. Perhaps that wasn't a big deal before, but multi-core CPUs are everywhere now while single-core speeds have stopped improving.

Now enter Erlang, where instead of an object the core concept is a process (a lightweight process, not an OS process). So instead of thinking in terms of a sequence of events that happen to objects, you can think about the coordination of events between many processes. These processes take messages and perform actions based on those messages, possibly sending out messages of their own. The processes don't share state, so any data has to be sent in the message. Sending a message is a one-way trip, and if you are expecting a reply, the process you sent to doesn't have to be the one that replies. You can also do other things while waiting for a reply, or just sit and wait right away.

Now this seems to work a little more like the real world. I think an easy example is just how we interact with each other with e-mail. I have an idea and I want to share it with you. You are pooping. I send you my idea and afterwards I decide to play Metal Slug instead of just waiting for a reply. You finish pooping to find a new message. Reading my message, containing my idea, you find it terrible and send a reply. I stop playing Metal Slug and check my mail which has your reply in it. I'm sad now and decide to post flame bait in forums I have no respect for. We are processes, our e-mail is our mailbox of messages, and everything else are the actions we decided to perform.
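
Since this blog has been in Ruby mode lately, here is a very rough analogue of that e-mail exchange using a thread with a queue as its mailbox. To be clear, this is plain Ruby, not Erlang, and Ruby threads are far heavier than Erlang processes; it's just meant to show the shape of mailbox-style message passing:

require 'thread'

my_mailbox   = Queue.new   # each "process" owns a mailbox...
your_mailbox = Queue.new   # ...and communication only happens through them

you = Thread.new do
  message = your_mailbox.pop                       # finish up, then check the mail
  reply   = "#{message[:idea]}? That's terrible."
  message[:reply_to] << { :reply => reply }        # reply to whatever mailbox was included
end

your_mailbox << { :idea => "my brilliant idea", :reply_to => my_mailbox }  # send it...
# ...and go play Metal Slug instead of waiting.
puts my_mailbox.pop[:reply]   # eventually check the mailbox for the reply

you.join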

Erlang gives us an entirely different approach to structuring an application, one that lets us code in a fashion similar to how we handle things in the real world. It seems to make concurrency easier to think about by eliminating shared state and providing these process and messaging abilities as part of the core. At a lower level, Erlang is certainly a functional language, so state is generally disconnected from actions, but an interesting thought is that Erlang at the higher level is actually object-oriented.

I got that from an interview with Joe Armstrong (a creator of Erlang) that I watched. The idea is that you still have encapsulation inside your process, you still have state and actions tied together (the process holds onto the state, and the messages it handles are the actions), and you have polymorphism in the fact that Erlang is dynamically typed and you don't know which process will actually handle your message in the end, nor which one you will be sending it to in the beginning (since you can pass around process IDs or registered names).

I hope that one day I can return to Erlang and try to accomplish something in it. I think there is a lot to gain from a language that presents you with a different perspective. In this post I haven't even touched on topics like how Erlang provides extra reliability by allowing you to have watcher and worker processes, or how it allows you to get almost linear performance gains from adding CPU cores. This language certainly stands out in the crowd of technology, I think.

Class Explosion and the dynamic world

I just read a blog post from Ayende titled Data access is contextual, a generic approach will fail, and it has me thinking about the class explosion that can occur in static languages. Now, I'm not a fan of trying to use a single anemic model everywhere, or of just having a model and then a serializable version of it. I actually like (and this is how I originally learned and thought about DTOs) having many DTOs and contextual action/task-based APIs. It certainly leads to a lot of code, but hopefully (as Ayende mentioned) a lot of simple code.

Now why do we do this explosion of interfaces, classes, mappers, etc.? We generally do it to try to stay flexible. I think we all know that direct coupling of implementations is generally a bad idea, so we use interfaces. Now we have the interface, the implementation, and the wiring code to keep track of. We know that DTOs are good for sending information about a model across a boundary. Now we have the model, the DTOs, and the mapping code. This is certainly a lot of code, but I personally find great value in these kinds of separations. You'll find yourself less frequently in a situation where you have to turn down an idea because the code is too hard to manipulate (or at least you'll delay that point for longer).

Thinking about all of this reminded me of how you usually don't see this kind of separation in code written in a dynamic language. I think that is because (and I'm quoting a friend here) "you are doing interface-less interface programming". At any point you can perform a mapping or wiring. You don't have to create a new class or refactor to an interface. The code can go in any place at any time. You don't need an IOC container; the language has already kept your code flexible. Now of course I'm not currently sure how you keep all of this straight in the dynamic world, nor how it plays out in large, sophisticated applications.
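
As a tiny, contrived Ruby illustration of what I mean by the dynamic style (the class and method names here are made up, not from any real project):

# No IRepository, no PersonDto class, no mapper class. Any object that responds
# to the right methods will do, and the "DTO" is just a hash built where needed.
class Person
  attr_reader :name, :email, :password_hash
  def initialize(name, email, password_hash)
    @name, @email, @password_hash = name, email, password_hash
  end
end

# Contextual mapping done inline, right at the boundary that needs it,
# leaving the sensitive bits behind.
def to_public_view(person)
  { :name => person.name, :email => person.email }
end

p to_public_view(Person.new("Tony", "tony@example.com", "not-a-real-hash"))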

I have been playing around with Python for a little over a year now and I have to say I see things a little differently. I start seeing things in a manner of "oh I'll just map that over to this here" or "ah just tack this on and pass it through that". I find myself sometimes frustrated in C# because of the lack of flexibility (of course sometimes I feel scared because of the lack of enforced structure in Python but that is starting to fade).

One area where I have heard that dynamic languages greatly help is writing and maintaining test code, specifically when doing Test Driven Development. Generally I find that people who have done testing in both static and dynamic languages say that the static world is a pain in the ass. Perhaps all this enforced structure is why only the more disciplined among us write and keep up with tests, sophisticated models, or always try to use an IOC container and interfaces.

Well, I intend to try to get into the habit of TDD and understand its benefits. As Dave has mentioned before, we have started a Code Dojo, and its first goals are to learn Ruby together, use TDD, and complete a project collaboratively in Ruby.

I write this post to see if anyone has any thoughts about why we do the things we do in the static world and also to say that I'm hoping to write more posts about my thoughts on TDD in Ruby. I can compare it to writing tests in C# and I hope to transfer my experiences in Ruby over to C# in hopes of trying to improve my habits and coding abilities in the static world.

LINQ to DB2

Some of you (assuming there are even more than one of you) may know that I've been looking for some way to use LINQ against the massive DB2 core database we have here at work.  And, if any of you have worked with DB2 before (this is my second time around), you know that it's a bit of a pain in the ass.  Nothing supports it, it has all kinds of special proprietary ways of doing anything and everything, etc.  And, to make matters worse, IBM's website is like a scavenger hunt through hell when you're trying to find a download for something.  I get the feeling they don't want you to be able to do anything without going through one of their sales reps and dedicated client engineers and installing their full suite of nonsense.  [take a deep breath, count to 10... ok]

Anyway, I've been searching online for a while now and trying out various tools and everything always comes down to fitting into one of three categories:
  1. Yes, you can do this! All you need is a driver from IBM that there's no download link for!
  2. Here's a handy tutorial for generating a great ORM for your business objects that creates a database from scratch. (Caveats: Useless against a massive legacy database, probably doesn't grok DB2)
  3. Download some tool and generate your ORM. (Tool doesn't support DB2, or claims to but fails when I try.)
I've even contacted the oracle of Stack Overflow a few times on the subject, phrasing the question differently or approaching it from different angles.  And, recently, it paid off (at the cost of 100 reputation points, which was a bargain). A helpful answer pointed me to an open source (MIT License) library I hadn't found before called DB_Linq.  Now, as usual, no DB2 support.  But that's okay, because it's extensible.  So I set about the task of adding a DB2 "vendor" to the code.  I figured it would mostly be a matter of overriding some methods that generate the SQL syntax to support DB2's own flavor, and initially that's all it was.  But my quest yielded a few more roadblocks.

I was hoping to contribute my DB2 support back to the project, but it turns out that I had to spend less time on the openness of it and more time fine-tuning the whole thing for our specific environment.  You'll see what I mean.

First things first: I wanted to make the whole thing read-only.  We don't need to be running experimental code against our core database, even in test.  So most of the overrides for generating the SQL just throw a NotImplementedException.  If anybody tries to generate anything other than a SELECT statement, the app will fail.  Good.  All we want is SELECT, at least for now.  The SELECTs will get more and more complex as I add functionality to the provider, but for my tests so far I've kept it simple.

So, the code has been extended and the DB2 support has been added.  Let's generate the data context and table classes!  ...  Man, this is taking a long time.  Well, we have a big database, so it should take a little while.  ...  Ok, that code file is getting big.  ...  Yay!  We're done!  And all we have is, um, 5.2 million lines of code.  Visual Studio doesn't like that.  I don't like it either.

Thus, the next step was to modify the engine of the code generator (this is where it starts to fork off a bit too much to contribute back, that and the DB2 support is minimal and not very robust yet) to generate separate files for each table.  Luckily, it was pretty easy.  It already creates everything as partial classes so that it can all be extended, in the proper LINQ data context way.  So I just had to muck a bit with the loop that iterates over the tables and generates the code, moving that loop outside of the single StreamWriter and having it create its own writer with each iteration.  And, since it's all partial classes, each table class file also extends the data context with its own table property.  Nifty.

Ok, generate the code again.  Now we can see how big that database really is.  Well, as it turns out, the code generated 8,206 table class files.  Assuming my unit conversion is correct, that's just over 4.1 metric fuck-tons of tables.  Holy Hell, Batman, I would have considered a few hundred tables to be excessive.  But, it is what it is.  And now we have code.  Of course, Visual Studio still really doesn't like it.  So let's wrap up the code generation and compilation to a DLL in a script and just re-generate it any time we need it.  (<joke class="inside">I think I'll name it BGC.Entities.DataAccessLayer.dll</joke>)

Doesn't compile.  Shit.  Ok, let's take a look.  A bunch of the table classes have repeated members?  (Oh, and just so you know what I'm looking at, in what appears to be classic DB2 style the tables are named things like "TFX002AF" and the columns are named things like "WBRCH1".)  Well, as it turns out, we have tables with columns like "WTBK$" and "WTBK%" and such.  I know it hurts, but hopefully it'll build character.  So the code generator is interpreting both of those as "WTBK_" in the code.  Well, that sucks.  My first approach to this, just to get it to compile so I can see if it even works before I put too much effort into it, was to just loop through the members of each table when generating the code and, if it's a repeat, append another underscore.  So we'll have "WTBK_" and "WTBK__" on the table.  I'll need to go back later and either make something prettier (replace with names of known special characters instead of always an underscore?) or decorate it with enough intellisense that the developers can at least discern which column they're accessing.

A little more tweaking on the multiple files thing and it finally compiles.  Sweet.  Now to run it through some tests.  So I coded up a simple little app that grabs some data from LINQ a few times and then the same data from vanilla ADO.  The LINQ code is definitely sleeker and sexier and, of course, strongly friggin' typed.  I'm a big fan of that last bit, because it moves a certain subset of potential errors from run time to compile time.  It's also much less prone to SQL injection.  Well, I don't need to sell you on the benefits of LINQ.  So I run the test.

The LINQ code is slow.  Really slow.  It takes several seconds to run a query whereas the ADO DataAdapter fills a DataSet in the blink of an eye.  After a little tinkering, it's back to Stack Overflow.  The code compiled fine into a DLL, but it's a 36 MB DLL with over 8,000 classes in a given namespace.  Is that a problem?  Jon Skeet says "no" and, well, he is a bit of an oracle in the C# world.  Is it the generated SQL?  Took a bit of research to figure out how to get that out of the debugger, but that ends up not being the problem.  No, these simple SELECT statements are pretty straight-forward and run fine against the database.  I do notice in my testing, however, that if I restrict the code generation down to a subset of tables (say, 200 of them) then it runs as fast as expected.

Well, if the number of classes and the size of the DLL don't matter outside of compile-time, then that leaves the syntax-sugaring code that generates the SQL.  I didn't change that when I added the DB2 support, it's using the same stuff that the open-source library uses for everything else.  But, since it's open-source, I can debug against the code.  Let's step into the LINQ statement and see where it takes us.

It didn't take long to find the rate-determining step.  Now, this is where we get into the trenches with LINQ and start hitting against some internals with which I am unfamiliar.  Anyway, I ended up stepping into a GetTables() method that loops through every member of the data context class and does a little bit of reflection to figure stuff out.  Does the official Microsoft implementation do this?  I'll have to find out someday.  But this implementation does it, and I guess the developer(s) didn't expect to ever come across a database with 8,206 tables.  The fool(s)!

So what does this loop do?  Damned if I know.  Yet.  Well, at a high level, it iterates over all of the members of the data context class, looks for meta information about each table, and adds it to a list if it doesn't find it already.  Does it need to?  The only thing that calls this method just grabs from it the one table for which it's looking.  So why do I need meta information on all of them?  It doesn't appear to retain this information, this loop runs every time I build a LINQ query.  So screw it, don't run the loop.  You know the table you're looking for, get its meta information and add it to that list you're using.  So maybe now you'll have to re-find the meta information every time I write a query, but it seems like you were doing that anyway.  Or maybe now you'll have repeated meta information in that list.  Well, so far that hasn't posed any problems.  And if it does I'll address that when I need to, on a much smaller and more manageable list.

Ok, let's run the test again.  Yay!  A few more tests and I now have a perfectly cromulent LINQ to DB2 provider for .NET, neatly wrapped up in a 36 MB DLL.  Take that, ADO.