But it actually wasn't quite what I thought would happen! I had 3 big surprises in this Challenge.
1. SearchResearchers fairly quickly agreed on using Fusion Tables as the main tool for pulling all the data together. In the discussion group (which was quite active, and very fun to read), the consensus emerged early. That makes sense, and it's the way I wrote my solution, following along with what everyone was doing.
What I thought was going to happen was that you'd create a MySQL database in the cloud, load up the data, and then run your query there!
In other words, I tried to write the Challenge so that a "regular" database was needed. I didn't think that Fusion Tables would be a good solution--but I was wrong. Obviously, it's quite possible to search a Fusion Table for "All lakes above 8000 feet that have been planted with trout in Northern California." It was just a matter of applying filters to the data table once it was created by fusing different pieces together.
I didn't expect that. But it's a great solution. Nicely done, team.
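To see why the Fusion Tables approach works, it helps to see that filtering a fused table is equivalent to a simple database query. Here's a minimal sketch using Python's built-in sqlite3 in place of a cloud MySQL instance; the table, column names, and lake rows are all made up for illustration:

```python
import sqlite3

# Build a tiny stand-in for the fused lakes table (all rows hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lakes (
        name TEXT,
        elevation_ft INTEGER,
        planted_with_trout INTEGER,  -- 1 = yes, 0 = no
        region TEXT
    )
""")
conn.executemany(
    "INSERT INTO lakes VALUES (?, ?, ?, ?)",
    [
        ("Example Lake A", 8400, 1, "Northern California"),
        ("Example Lake B", 7200, 1, "Northern California"),
        ("Example Lake C", 9100, 0, "Northern California"),
    ],
)

# The Challenge question, expressed as a query -- the same three
# filters you'd apply to a Fusion Table, just written in SQL.
rows = conn.execute("""
    SELECT name FROM lakes
    WHERE elevation_ft > 8000
      AND planted_with_trout = 1
      AND region = 'Northern California'
""").fetchall()
print([r[0] for r in rows])  # -> ['Example Lake A']
```

Whether you do it with filters in Fusion Tables or a WHERE clause in SQL, it's the same operation underneath.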
2. The second thing I didn't expect was that there wouldn't BE ANY lakes that fit all of the criteria. What do you know?! (I mean, I knew there were a lot of lakes way high in the California Sierras, and I know that many of them were planted, and when I scanned the list I thought I saw some that I thought fit the bill. Turns out I misread the dataset when I was creating the problem!)
As I wrote in the previous blog post, that's part of the reason you want to use complete data sets (and then put them into Fusion Tables or MySQL so you can query them with those tools). This is exactly the kind of thing that search engines are NOT very good at--the very fine-grained analysis of data. A search engine can help you find the data, but then you have to process it a bit yourself, with your own tools. Metaphorically speaking, the search engine can find you the cow, but you have to make your own sausage.
3. I didn't expect all of the back-and-forth steps during my solution. I realize that my writeup (in three parts) was long and complicated, but I hope you took away one lasting lesson from this: Even experts have to do a lot of iteration to get the data right.
In effect, a lot of what I wrote down were all of the "I forgot to include this, let me back up and do it again with this new data" steps. Normally in classes, the teachers don't show you these steps because they're slightly boring and show what an idiot you are. (Remember doing proofs in your high school math class? I realized after a while that nobody really does proofs like that. Real mathematicians take a lot of forward-and-back steps to figure it out. Everyone goofs.)
But when I wrote up my solution, I wanted to document all of those intermediate steps as well. Real data scientists do this all the time, which is why I wrote it all down: to show you the inner steps that I wish my teachers had shown me.
Overall this was a toughie, no doubt about it. But I hope the search lessons are clear. If I were to summarize them, I'd say:
A. Keep track of your data sources; keep your metadata with the data. With all of the updates and recasting of the data, it was essential to know where a particular set of data originated. Keep track of that stuff! (In Fusion Tables it's easy--there's a spot for it in the header. In spreadsheets, I always add a comment to cell A1 with the metadata.)
B. When something takes multiple days to solve, leave yourself a note at the end of the day so you know where you are and what's next. That's why I summarized the key questions in each of my posts--it's basically the note I kept to track the whole process.
C. Check your data. Check your data. Check your data. As you saw, a couple of times I found errors in transcription, or data getting clobbered by accident. (Such as when the lat/longs on Horseshoe Lake were wrong.) I like to try to view the data in a different way--such as plotting the locations on a map--to see what I can spot. Spreadsheet computations are often a source of error, so constantly check to make sure that each time you touch the data, you're not accidentally messing it up.
D. Keep trying. This was really a multiple step problem. Sometimes you just have to stick with it.
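Lesson A can even be baked into your files. A minimal sketch, assuming you're saving your working data as CSV: put a provenance line at the top, playing the same role as the Fusion Tables description field or a note in cell A1 (the source string and rows below are made up):

```python
import csv
import io

# Store provenance as a comment line at the top of the data file,
# so the metadata travels with the data. The source is hypothetical.
METADATA = "# source: hypothetical lake-stocking list, retrieved 2013-05-01"

buf = io.StringIO()  # stands in for a real file on disk
buf.write(METADATA + "\n")
csv.writer(buf).writerows([
    ["name", "elevation_ft"],
    ["Example Lake", "8400"],
])

# Later, anyone opening the file sees where the data came from first.
buf.seek(0)
provenance = buf.readline().strip()
rows = list(csv.reader(buf))
print(provenance)
print(rows)
```

Months later, when you've forgotten which download this was, the note is right there with the numbers.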
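Lesson C can be partly automated. A minimal sketch of the kind of coordinate sanity check that would have caught the Horseshoe Lake lat/long error; the bounding box is a rough approximation of California, and the sample rows are hypothetical:

```python
# Sanity-check lat/longs before plotting them on a map: flag any row
# whose coordinates fall outside a rough California bounding box.
CA_LAT = (32.5, 42.0)      # approximate latitude range for California
CA_LON = (-124.5, -114.0)  # approximate longitude range for California

lakes = [
    {"name": "Example Lake A", "lat": 38.2, "lon": -119.9},
    {"name": "Example Lake B", "lat": 38.2, "lon": 119.9},  # sign flipped!
]

def looks_wrong(lake):
    """True if the coordinates fall outside the expected bounding box."""
    return not (CA_LAT[0] <= lake["lat"] <= CA_LAT[1]
                and CA_LON[0] <= lake["lon"] <= CA_LON[1])

suspects = [lake["name"] for lake in lakes if looks_wrong(lake)]
print(suspects)  # -> ['Example Lake B']
```

It's the same idea as plotting the points and eyeballing the map, just done mechanically, so it catches the errors you'd skim past.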
Thanks again to everyone for sticking with the problem. It was great to see everyone pitching in and contributing.