Writing files to Hadoop HDFS using Scala

If you’ve been wondering whether storing files in Hadoop HDFS programmatically is difficult, I have good news – it’s not.

For this example I'll be using my recent favorite language: Scala.


Here’s what you need to do:

  1. Start a new SBT project in IntelliJ
  2. Add the “hadoop-client” dependency (Important: the client version must match the version of the Hadoop server you’ll be writing files to)
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % "2.7.0"
    )
  3. Check the value of the “fs.defaultFS” property (or its deprecated alias “fs.default.name”) in the Hadoop configuration (/etc/hadoop/core-site.xml). This is the URI you’ll need in order to point the application code at your Hadoop cluster
  4. Write a few lines of code
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    object Hdfs extends App {
      def write(uri: String, filePath: String, data: Array[Byte]) = {
        System.setProperty("HADOOP_USER_NAME", "Mariusz")
        val path = new Path(filePath)
        val conf = new Configuration()
        conf.set("fs.defaultFS", uri)
        val fs = FileSystem.get(conf)
        val os = fs.create(path)   // open an output stream for the new file
        os.write(data)
        os.close()
      }
    }
  5. Use the code written above
      Hdfs.write("hdfs://", "test.txt", "Hello World".getBytes)
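To confirm the write actually landed in HDFS, you can for example cat the file back with the HDFS shell (the relative path above resolves to the /user/&lt;HADOOP_USER_NAME&gt; home directory, i.e. /user/Mariusz in this case):

      hdfs dfs -cat /user/Mariusz/test.txt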


That’s all there is to it, really.


Installing Hadoop on Windows 8 or 8.1

I was recently installing Hadoop 2.7.0 on Windows (8.1) and thought I’d document the steps, as the procedure isn’t obvious (the existing documentation on how to do it is outdated in a few places).


Basic info:

  • Official Apache Hadoop releases do not include Windows binaries, so you have to download sources and build a Windows package yourself.
  • Do not run the installation from within Cygwin. Cygwin is no longer required/supported.
  • I assume you have a JDK already installed (ver. 1.7+)
  • I assume you also have Unix command-line tools (like: sh, mkdir, rm, cp, tar, gzip) installed. These tools must be present on your PATH. They come with the Windows Git package that can be downloaded from here, or you can use win-bash (here) or GnuWin32.
  • If using Visual Studio, it must be Visual Studio 2010 Professional (not 2012).
  • Do not use Visual Studio Express (it does not support compiling for 64-bit)
  • Google’s Protocol Buffers must be installed in exactly version 2.5.0 (not newer; this is a hard-coded dependency… weird)
  • Several tests executed while building the Hadoop Windows package require the “Create Symbolic Links” privilege. Therefore, the ‘mvn package’ command must be executed from the Command Line in “Administrator mode”.



  1. Download Hadoop sources tarball from here.
  2. Make sure you have JAVA_HOME in your “Environment Variables” set up properly (in my case it was “c:\Program Files\Java\jdk1.8.0_40”)
  3. Download Maven binaries from here.
  4. Add ‘bin’ folder of maven to your path (in “Environment Variables”)
  5. Download Google’s Protocol Buffers in version 2.5.0 (no other version, including 2.6.1, will work) from here.
  6. Download and install CMake (Windows Installer) from here.
  7. Download and install “Visual Studio 2010 Professional” (Trial is enough) from here (Web Installer) or here (ISO Image)
  8. Alternatively (to the step no 7 above), you can install “Windows SDK 8.1” from here.
  9. Add the location of the newly installed MSBuild.exe (c:\Windows\Microsoft.NET\Framework64\v4.0.30319) to your system path (in “Environment Variables”).
  10. Because you’ll be running the Maven ‘package’ goal from the Command Line (cmd.exe) in “Administrator mode” (aka “Elevated mode”), it is important that in steps 4 and 9 above you update the “PATH” in the “System variables” section, not in the “User variables for logged-in user” section.
  11. Run cmd in “Administrator Mode” and execute: “set Platform=x64” (assuming you want 64-bit version, otherwise use “set Platform=Win32”)
  12. Now, while still in cmd, execute:
    mvn package -Pdist,native-win -DskipTests -Dtar
  13. After the build is complete, you should find hadoop-2.7.0.tar.gz file in “hadoop-2.7.0-src\hadoop-dist\target\” directory.
  14. Extract the newly created Hadoop Windows package to the directory of choice (eg. c:\hdp\)
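To sanity-check the extracted package you can, for example, run the standard version command, which should print the 2.7.0 build info:

      c:\hdp\bin\hadoop version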



  1. We’ll be configuring Hadoop for a Single Node (pseudo-distributed) Cluster.
  2. As part of configuring HDFS, update the files:
    1. near the end of “\hdp\etc\hadoop\hadoop-env.cmd” add following lines:
        set HADOOP_PREFIX=c:\hdp
        set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
        set PATH=%PATH%;%HADOOP_PREFIX%\bin
    2. modify “\hdp\etc\hadoop\core-site.xml” (see the core-site.xml sketch after this list)
    3. modify “\hdp\etc\hadoop\hdfs-site.xml” (see the hdfs-site.xml sketch after this list)
    4. make sure “\hdp\etc\hadoop\slaves” contains the single entry “localhost”
    5. and create the c:\tmp directory, as the default configuration puts HDFS metadata and data files under \tmp on the current drive
  3. As part of configuring YARN, update files:
    1. add the following entries to “\hdp\etc\hadoop\mapred-site.xml”, replacing %USERNAME% with your Windows user name (see the mapred-site.xml sketch after this list)
    2. modify “\hdp\etc\hadoop\yarn-site.xml” (see the yarn-site.xml sketch after this list)
  4. Because Hadoop doesn’t recognize JAVA_HOME from “Environment Variables” (and has problems with spaces in pathnames):
    1. copy your JDK to some dir (eg. “c:\hdp\java\jdk1.8.0_40”)
    2. edit “\hdp\etc\hadoop\hadoop-env.cmd” and update
        set JAVA_HOME=c:\hdp\java\jdk1.8.0_40
    3. initialize Environment Variables by running cmd in “Administrator Mode” and executing: “c:\hdp\etc\hadoop\hadoop-env.cmd”
  5. Format the FileSystem
      c:\hdp\bin\hdfs namenode -format
  6. Start the HDFS daemons
      c:\hdp\sbin\start-dfs.cmd
  7. Start the YARN daemons
      c:\hdp\sbin\start-yarn.cmd
  8. Run an example YARN job
      c:\hdp\bin\yarn jar c:\hdp\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.0.jar wordcount c:\hdp\LICENSE.txt /out
  9. Check the following pages in your browser:
      Resource Manager:  http://localhost:8088
      NameNode web UI:  http://localhost:50070
      NodeManager web UI:  http://localhost:8042
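For reference, a minimal single-node configuration along the lines of the official “Hadoop 2 on Windows” guide could look like the sketches below; the ports and values are just the common defaults, so adjust them to your environment:

    core-site.xml:
        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://0.0.0.0:19000</value>
          </property>
        </configuration>

    hdfs-site.xml:
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>

    mapred-site.xml (inside the <configuration> element):
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
        <property>
          <name>mapreduce.job.user.name</name>
          <value>%USERNAME%</value>
        </property>

    yarn-site.xml (inside the <configuration> element):
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>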






Connecting remote JVM over JMX using VisualVM or JConsole

There are many posts on the Internet about how to do this right, but unfortunately none worked for me (Debian behind a firewall on the server side, reached over VPN from my local Mac). Therefore, I’m sharing the solution that worked for me below.


1. Check the server IP

hostname -i


2. Use these JVM params:

-Dcom.sun.management.jmxremote.port=[jmx port]
-Djava.rmi.server.hostname=[server ip from step 1]
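In practice a few more of the standard jmxremote switches are usually required before JConsole/VisualVM can connect; for a quick test over a trusted network (no SSL, no authentication; do not use this on a public interface) the full set could look like this:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=[jmx port]
-Dcom.sun.management.jmxremote.rmi.port=[jmx port]
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=[server ip from step 1]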


3. Run application


4. Find the PID of the running Java process
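For example, jps (shipped with the JDK) lists running Java processes along with their PIDs:

jps -l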


5. Check all ports used by JMX/RMI

netstat -lp | grep [pid from step 4]


6. Open all ports from step 5 on the firewall


Connecting remote JVM over JMX using VisualVM or JConsole



SSH Linux login without password

Below is probably the quickest way to achieve this:


1. Generate SSH key (if you don’t have one already)

ssh-keygen -t rsa


2. Use SSH to create a remote directory ~/.ssh

ssh username@dev.company.com mkdir -p .ssh


3. Append your public key to .ssh/authorized_keys on remote host

cat ~/.ssh/id_rsa.pub | ssh username@dev.company.com 'cat >> .ssh/authorized_keys'
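Alternatively, if your system provides it, ssh-copy-id does steps 2 and 3 in one go; either way, a plain “ssh username@dev.company.com” should now log you in without a password prompt:

ssh-copy-id username@dev.company.com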



That’s it!

SSH Key Authentication with GitLab

Every time I start building a product for a new company, one of the first steps is creating a repository and uploading an SSH key. Instead of browsing the web looking for a reminder on how to do it, I decided I’d post the quickest solution here.


1. Enter the following command in the Terminal window (Mac OS X)

ssh-keygen -t rsa


2. Accept default location and leave password blank (or not, up to you)


3. The key will get generated

Your identification has been saved in /Users/mariuszprzydatek/.ssh/id_rsa.
Your public key has been saved in /Users/mariuszprzydatek/.ssh/id_rsa.pub.
The key fingerprint is:
ce:80:76:66:5b:5d:d2:29:3d:64:66:65:e8:d3:aa:5e mariuszprzydatek@Mariuszs-MacBook-Pro.local
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|         .       |
|        E .      |
|   .   . o       |
|  o . . S .      |
| + + o . +       |
|. + o = o +      |
| o...o * o       |
|.  oo.o .        |


4. The private key (id_rsa) is saved in the .ssh directory and used to verify the public key. The public key (id_rsa.pub) is the key you’ll be uploading to your GitLab account.


5. Copy your public key to the clipboard

pbcopy < ~/.ssh/id_rsa.pub


6. Paste the key to GitLab
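Once the key is added in GitLab (Profile Settings > SSH Keys), you can verify the setup from the terminal; assuming your repositories live on gitlab.com (replace the host with your own GitLab server otherwise), a successful test prints a short welcome message:

ssh -T git@gitlab.com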


GitLab SSH Key Authentication




Git branch name in zsh terminal

Ever wondered how nice it would be to always know which Git branch you’re currently on in a given directory? If so, then I encourage you to give Prezto (“Instantly Awesome Zsh”) a try.


Git branch name in zsh terminal


You’ll find it here (along with instructions on how to install it):



Prezto integrates nicely with (among others):

iTerm2, SSH, Ruby, Git, various editors, etc.



Iterative Dichotomiser 3 (ID3) algorithm – Decision Trees – Machine Learning

ID3 is the first of a series of algorithms created by Ross Quinlan to generate decision trees.



  • ID3 does not guarantee an optimal solution; it can get stuck in local optima
  • It uses a greedy approach, selecting the best attribute to split the dataset on at each iteration (one possible improvement is to use backtracking during the search for the optimal decision tree)
  • ID3 can overfit the training data (to avoid overfitting, smaller decision trees should be preferred over larger ones)
  • This algorithm usually produces small trees, but it does not always produce the smallest possible tree
  • ID3 is harder to use on continuous data (if the values of a given attribute are continuous, there are many more places to split the data on that attribute, and searching for the best split value can be time consuming).



  • ID3 is a precursor to both the C4.5 and the C5.0 algorithms.
  • C4.5 improvements over ID3:
    • discrete and continuous attributes,
    • missing attribute values,
    • attributes with differing costs,
    • pruning trees (replacing irrelevant branches with leaf nodes)
  • C5.0 improvements over C4.5:
    • several orders of magnitude faster,
    • memory efficiency,
    • smaller decision trees,
    • boosting (more accuracy),
    • ability to weight different attributes,
    • winnowing (reducing noise)
  • J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool
  • C5.0 is sold commercially (a single-threaded version is distributed under the terms of the GNU General Public License) under the following names: C5.0 (Unix/Linux), See5 (Windows)



  • The ID3 algorithm is used by training it on a data set S to produce a decision tree, which is stored in memory.
  • At runtime, the decision tree is used to classify new, unseen test cases by walking down the tree nodes, using the values of a given test case, until a terminal node is reached that tells you which class the test case belongs to.



  • Entropy H(S) – measures the amount of uncertainty in the (data) set S
  • Information gain IG(A) – measures how much the uncertainty in S was reduced after splitting the (data) set S on attribute A
  • You’ll find more details on both entropy and information gain here.
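For reference (with C being the set of classes in S, and S_v the subset of S for which attribute A has value v):

    H(S)  = − Σ (c ∈ C)  p(c) · log2 p(c)
    IG(A) = H(S) − Σ (v ∈ values(A))  (|S_v| / |S|) · H(S_v)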


High-level inner workings:

  • Calculate the entropy of every attribute using the data set S
  • Split the set S into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum)
  • Make a decision tree node containing that attribute
  • Recurse on subsets using remaining attributes


Detailed algorithm steps:

  1. We begin with the original data set S as the root node
  2. In each iteration the algorithm iterates through every unused attribute of the data set S and calculates the entropy H(S) (or information gain IG(A)) of that attribute
  3. Next it selects the attribute which has the smallest entropy (or largest information gain) value
  4. The data set S is then split by the selected attribute (e.g. age < 50, 50 <= age < 100, age >= 100) to produce subsets of the data
  5. The algorithm continues to recurse on each subset, considering only attributes never selected before
  6. Recursion on a subset may stop in one of these cases:
    • every element in the subset belongs to the same class (+ or -), then the node is turned into a leaf and labelled with the class of the examples
    • there are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -), then the node is turned into a leaf and labelled with the most common class of the examples in the subset
    • there are no examples in the subset, this happens when no example in the parent set was found to be matching a specific value of the selected attribute, for example if there was no example with age >= 100. Then a leaf is created, and labelled with the most common class of the examples in the parent set
  7. Throughout the algorithm, the decision tree is constructed with each non-terminal node representing the selected attribute on which the data was split, and terminal nodes representing the class label of the final subset of this branch


Python implementation:

  1. Create a new python file called id3_example.py
  2. Import logarithmic capabilities from math lib as well as the operator library
        from math import log
        import operator
  3. Add a function to calculate the entropy of a data set
    def entropy(data):
        entries = len(data)
        labels = {}
        for feat in data:
            label = feat[-1]
            if label not in labels.keys():
                labels[label] = 0
            labels[label] += 1
        entropy = 0.0
        for key in labels:
            probability = float(labels[key])/entries
            entropy -= probability * log(probability,2)
        return entropy
  4. Add a function to split the data set on a given feature
    def split(data, axis, val):
        newData = []
        for feat in data:
            if feat[axis] == val:
                reducedFeat = feat[:axis]
                reducedFeat.extend(feat[axis+1:])
                newData.append(reducedFeat)
        return newData
  5. Add a function to choose the best feature to split on
    def choose(data):
        features = len(data[0]) - 1
        baseEntropy = entropy(data)
        bestInfoGain = 0.0
        bestFeat = -1
        for i in range(features):
            featList = [ex[i] for ex in data]
            uniqueVals = set(featList)
            newEntropy = 0.0
            for value in uniqueVals:
                newData = split(data, i, value)
                probability = len(newData)/float(len(data))
                newEntropy += probability * entropy(newData)
            infoGain = baseEntropy - newEntropy
            if (infoGain > bestInfoGain):
                bestInfoGain = infoGain
                bestFeat = i
        return bestFeat
  6. According to step 6 of the “Detailed algorithm steps” section above, there are certain cases in which the recursion may stop. When we run out of attributes but the examples still don’t all belong to the same class, the small function below lets us label the leaf with the most common (majority) class:
    def majority(classList):
        classCount = {}
        for vote in classList:
            if vote not in classCount.keys(): classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
  7. Finally add the main function to generate the decision tree
    def tree(data,labels):
        classList = [ex[-1] for ex in data]
        if classList.count(classList[0]) == len(classList):
            return classList[0]
        if len(data[0]) == 1:
            return majority(classList)
        bestFeat = choose(data)
        bestFeatLabel = labels[bestFeat]
        theTree = {bestFeatLabel:{}}
        # remove the chosen attribute so it is not reused deeper in the tree
        del(labels[bestFeat])
        featValues = [ex[bestFeat] for ex in data]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]
            theTree[bestFeatLabel][value] = tree(split(data, bestFeat, value), subLabels)
        return theTree
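To try it out end-to-end, you can feed the functions a tiny, made-up data set (the attribute names and values below are purely illustrative):

    # each row: [attribute values..., class label]
    data = [[1, 1, 'yes'],
            [1, 1, 'yes'],
            [1, 0, 'no'],
            [0, 1, 'no'],
            [0, 1, 'no']]
    labels = ['has fur', 'lays eggs']  # hypothetical attribute names
    print(tree(data, labels))
    # expected result (key order may vary):
    # {'has fur': {0: 'no', 1: {'lays eggs': {0: 'no', 1: 'yes'}}}}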