In Scala this is as simple as creating a function like this:
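A minimal sketch of such a function (the argument name and type follow the discussion below; the exact body is an assumption):

```scala
def time(execution: () => Unit): Long = {
  val start = System.currentTimeMillis()
  execution() // invoke the function that was passed in
  val elapsed = System.currentTimeMillis() - start
  println("Execution took " + elapsed + " ms")
  elapsed
}
```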
This is a higher-order function that times the execution of whatever function is passed to it. If you are coming from Java it should look pretty straightforward; the only major difference is the function argument declaration `execution: () => Unit`, which declares the function argument `execution` as type `() => Unit`. In Scala the type declaration comes after the variable name, and the two are separated by a `:`. The type declaration in this case defines a function that takes zero arguments (`()` is syntactic sugar for this) and returns nothing (`Unit` here is similar to `void` in Java).
Below are some examples of using this function:
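For example (the timed workloads here are just placeholders):

```scala
// time a function literal directly
time(() => Thread.sleep(500))

// or pass a block wrapped in a zero-argument function
time { () =>
  var sum = 0L
  for (i <- 1 to 1000000) sum += i
  println("sum = " + sum)
}
```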
Writing code like this is only possible thanks to the ability to treat functions as objects. While this won’t revolutionize the way you code, it does allow you to start removing the boilerplate code that tends to build up in Java.
This is only the tip of the iceberg in functional programming; if you want to learn more, check out the free Scala by Example book provided on the Scala website.
Enough of the high-level speak though; what I really want to talk about is the pain I experienced just trying to get data in and out of HDFS. Most of the pain was self-inflicted: my mental model going into the problem was shaped by over a year of working with Cassandra, which is a much simpler system for storing data, even though it does not provide as good a foundation for storing raw data in a lambda-architecture-style design. In Cassandra you have the cluster and you have the client, where the client is your application and it speaks to the cluster over the network in a fairly typical client-server model.
I quickly discovered that my map was not the territory when I started writing some simple code for sending data into HDFS:
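Reconstructed in spirit, the pattern looked roughly like this (the `Event` message class, hostname, and path are placeholders, and the Elephant Bird pieces are my best guess at the classes involved, not the original code):

```scala
import java.net.URI

import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.SnappyCodec

object HdfsWriter {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Connect to the namenode with the usual hdfs:// URI (RPC port 8020)
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)

    // Wrap the HDFS output stream in Hadoop's Snappy codec
    // (requires the native snappy libraries to be available)
    val codec = new SnappyCodec()
    codec.setConf(conf)
    val out = codec.createOutputStream(fs.create(new Path("/data/events.pb.snappy")))

    // Elephant Bird's block writer handles the protobuf framing;
    // Event stands in for whatever protoc-generated message class you use
    val writer = new ProtobufBlockWriter[Event](out, classOf[Event])
    writer.write(Event.newBuilder().setName("example").build())
    writer.finish()
    writer.close()

    fs.close()
  }
}
```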
As far as I could tell this pattern for writing was consistent with most of the tutorials I found via Google; the only new thing I added was the use of Twitter’s elephantbird library to write Snappy-compressed Protocol Buffer data. So I was surprised when I saw (and kept seeing) the following errors:
The client threw a long stack of vague exceptions, and the server side (hdfs datanode log) logged errors that were just as unhelpful.
After about 8 hours of googling and of making sure all client and server jars were the same version, my surprise turned to frustration, especially since the error messages were so vague.
Finally, after a few circuits at the gym, I started from scratch and read through all the documentation I could find about typical data-loading strategies for HDFS (something I should have done to begin with). This led me to the realization that my client-server mental model was flawed in the HDFS context, since HDFS makes no assumption about where the data is being written from (in fact it seems to assume that the client is local to the cluster).
Some quick exploration of the `org.apache.hadoop.fs.FileSystem` class hierarchy showed that there are a variety of ways of writing to HDFS, and only some of them go over TCP/IP. So with a little refactoring to use the `org.apache.hadoop.hdfs.web.WebHdfsFileSystem` implementation, my code works just fine:
Note the new `webhdfs://` protocol in the URI and the new port, 50070. There seems to be a tight coupling of protocol to `FileSystem` implementation, as well as a port mapping, but I have not yet found good documentation on what exactly this coupling is.
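The only material change from the sketch above is how the `FileSystem` is obtained; everything else in the write path stays the same:

```scala
// The webhdfs:// scheme makes FileSystem.get hand back a WebHdfsFileSystem,
// and 50070 is the namenode's HTTP port rather than its RPC port
val fs = FileSystem.get(new URI("webhdfs://namenode:50070"), conf)
```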
The file then shows up in HDFS:
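Using the placeholder path from the sketch above, it can be listed with the HDFS shell:

```
$ hadoop fs -ls /data
```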
I’m still not sure this is the right method for writing large amounts of data to HDFS, but more dots are starting to connect in my head about how the component parts of the ecosystem fit together. Lots more to learn though.
My second talk was at the Datastax Cassandra SF Meetup hosted at Disqus. This talk (slides here) was a bit more low-level and focused on how we have been using Cassandra at Lithium for the past six months as we move to a more service-oriented architecture internally. This talk was primarily focused on our use case, data model, and all of the issues we dealt with getting Cassandra into production. I also covered the strategy we used for migrating data from MySQL to Cassandra with zero downtime. Our migration strategy was heavily influenced by a Netflix blog post covering a migration from SimpleDB to Cassandra.
The following steps outline how to set up a fresh Octopress install that connects to an existing Github Pages repository (this builds on the zerosharp post with some updates based on recent changes in Octopress). These steps assume that the Github repository is fully up to date with all the latest changes.
First you need to make sure that your `source` directory actually contains the `source/_posts` and the `stylesheets` folders. You also need to make sure `.gitignore` is not ignoring any of these (if there is a reason these should not be committed, please let me know; I could not think of one).
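If these were missing, something along these lines (run on the original machine, assuming `origin` points at your Github Pages repository) gets them into the `source` branch:

```
git add source/_posts stylesheets
git commit -m "track posts and stylesheets"
git push origin source
```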
The remainder of the steps happen on your second machine. First, clone the `source` branch:
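With `username` as a placeholder for your Github account, something like:

```
git clone -b source git@github.com:username/username.github.io.git octopress
cd octopress
```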
Next, install Octopress:
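A sketch based on the standard install steps; the last command is `rake setup_github_pages`, which is the one the next paragraph refers to:

```
gem install bundler
bundle install
rake setup_github_pages
```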
The last command deleted the `_deploy` directory and re-added it. We don’t want this, because we want the latest changes in `_deploy` so we don’t run into any nasty `[rejected] master -> master (non-fast-forward)` git errors caused by an out-of-date branch.
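That is, replace the freshly created `_deploy` with a clone of the live `master` branch (same placeholder repository as above):

```
rm -rf _deploy
git clone git@github.com:username/username.github.io.git _deploy
```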
Octopress should now be set up, and the `source` dir should contain your up-to-date markdown. To test things out, make a change to a post (or make a new post), then regenerate and deploy:
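```
rake generate
rake deploy
```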
Your new changes should appear on your site. The only downside of this approach is that when you go back to your original machine and try to deploy, git will yell at you (i.e., a non-fast-forward error) because the `_deploy` dir on that machine is now out of date. The best way to fix this is to remove it and re-clone it (see above). The other option is to edit the octopress `Rakefile` and change the line `system "git push origin #{deploy_branch}"` to `system "git push origin +#{deploy_branch}"` to force the deployment despite the version mismatch (be sure to undo this change immediately afterwards).
This initial post is both a how-to for setting up a blog with Octopress/Github and a quick cheat-sheet so I don’t forget how I did things.
The Octopress setup docs are incredibly helpful, so I’m not going to duplicate content explaining what everything means. Assuming ruby is correctly installed, the setup commands are as follows:
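A sketch following the official docs (the `rbenv rehash` line applies only if you manage ruby with rbenv; `rake install` installs the default theme):

```
git clone git://github.com/imathis/octopress.git octopress
cd octopress
gem install bundler
rbenv rehash
bundle install
rake install
rake setup_github_pages
```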
The above commands will leave your `/octopress` dir in a state where you are ready to begin blogging. You can think of the `/octopress` dir as a container for all your blog content, as well as the library for all the commands you need to push content to Github.
The following files and directories comprise the essential building blocks for an Octopress site:
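In a stock install these are, roughly:

- `_config.yml`: the main site configuration
- `Rakefile`: defines the generate/preview/deploy tasks
- `source/`: your markdown posts (under `source/_posts`) and pages
- `sass/`: the site stylesheets
- `public/`: the generated static site
- `_deploy/`: a clone of the branch that Github Pages actually serves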
The following commands are the most useful for doing basic things with Octopress:
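```
rake new_post["post-title"]   # create a new markdown file under source/_posts
rake generate                 # build the static site into public/
rake preview                  # serve the site locally on port 4000
rake watch                    # regenerate automatically as files change
rake deploy                   # push the generated site to Github Pages
rake gen_deploy               # generate and deploy in one step
```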
That is it for now; I still need to document how to perform development from multiple machines and how to rebuild your local development workstation if something goes wrong.