
Writing files to Hadoop HDFS using Scala

If you’ve been wondering whether storing files in Hadoop HDFS programmatically is difficult, I have good news – it’s not.

For this example I’ll be using my recent favorite language, Scala.

 

Here’s what you need to do:

  1. Start a new SBT project in IntelliJ
  2. Add the “hadoop-client” dependency (Important: the client version must match the version of the Hadoop server you’ll be writing files to)
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % "2.7.0"
    )
    
  3. Check the value of the “fs.default.name” property in your Hadoop configuration (/etc/hadoop/core-site.xml). This is the URI you’ll need in order to point the application code at your Hadoop cluster
  4. Write a few lines of code
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    object Hdfs extends App {
    
      def write(uri: String, filePath: String, data: Array[Byte]): Unit = {
        // The HDFS user the client acts as (use a user with write permissions on your cluster)
        System.setProperty("HADOOP_USER_NAME", "Mariusz")
        val path = new Path(filePath)
        val conf = new Configuration()
        // Point the client at the cluster (the URI from "fs.default.name" / "fs.defaultFS")
        conf.set("fs.defaultFS", uri)
        val fs = FileSystem.get(conf)
        val os = fs.create(path)
        try os.write(data)
        finally {
          // Close the stream to flush the data, then release the FileSystem handle
          os.close()
          fs.close()
        }
      }
    }
    
  5. Use the code written above (an optional read-back check is sketched after this list)
      Hdfs.write("hdfs://0.0.0.0:19000", "test.txt", "Hello World".getBytes)
    
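Optionally, you can confirm the write actually landed by reading the file back. Here’s a minimal sketch, assuming the same URI and file name as in step 5 (the HdfsCheck object name is just illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    object HdfsCheck extends App {
      // Same cluster URI and file name as used in step 5 above (adjust to your setup)
      val conf = new Configuration()
      conf.set("fs.defaultFS", "hdfs://0.0.0.0:19000")
      val fs = FileSystem.get(conf)
      val path = new Path("test.txt")

      // Check the file exists, then read it back and print its contents
      if (fs.exists(path)) {
        val in = fs.open(path)
        try println(Source.fromInputStream(in).mkString)
        finally in.close()
      }
      fs.close()
    }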

 

That’s all there is to it, really.

Cheers 🙂

Installing Hadoop on Windows 8 or 8.1

I was installing Hadoop 2.7.0 recently on a Windows platform (8.1) and thought I’d document the steps, as the procedure isn’t that obvious (the existing documentation on how to do it is outdated in a few places).

 

Basic info:

  • Official Apache Hadoop releases do not include Windows binaries, so you have to download sources and build a Windows package yourself.
  • Do not run the installation from within Cygwin. Cygwin is no longer required or supported.
  • I assume you have a JDK already installed (ver. 1.7+)
  • I assume you have Unix command-line tools (like sh, mkdir, rm, cp, tar, gzip) installed as well. These tools must be present on your PATH. They come with the Git for Windows package that can be downloaded from here, or you can use win-bash (here) or GnuWin32.
  • If using Visual Studio, it must be Visual Studio 2010 Professional (not 2012).
  • Do not use Visual Studio Express (it does not support compiling for 64-bit).
  • Google’s Protocol Buffers must be installed in exactly version 2.5.0 (not newer; this is a hard-coded dependency… weird). A quick version check is sketched right after this list.
  • Several of the tests executed while building the Hadoop Windows package require the “Create Symbolic Links” privilege. Therefore, the ‘mvn package’ command must be executed from the Command Line in “Administrator mode”.
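Before starting the build, it’s worth confirming that the right protoc ended up on your PATH. A quick sanity check (assuming protoc.exe is already on your PATH):

    protoc --version
    rem expected output: libprotoc 2.5.0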

 

Installation:

  1. Download Hadoop sources tarball from here.
  2. Make sure JAVA_HOME is set up properly in your “Environment Variables” (in my case it was “c:\Program Files\Java\jdk1.8.0_40”)
  3. Download Maven binaries from here.
  4. Add ‘bin’ folder of maven to your path (in “Environment Variables”)
  5. Download Google’s Protocol Buffers in version 2.5.0 (no other version will work, not even 2.6.1) from here.
  6. Download and install CMake (Windows Installer) from here.
  7. Download and install “Visual Studio 2010 Professional” (Trial is enough) from here (Web Installer) or here (ISO Image)
  8. Alternatively to step 7 above, you can install the “Windows SDK 8.1” from here.
  9. Add the location of MSBuild.exe (c:\Windows\Microsoft.NET\Framework64\v4.0.30319) to your system path (in “Environment Variables”).
  10. Because you’ll be running the Maven ‘package’ goal from the Command Line (cmd.exe) in “Administrator mode” (a.k.a. “Elevated mode”), it is important that in steps 4 and 9 above you update the “PATH” in the “System variables” section, not in the “User variables for logged-in user” section.
  11. Run cmd in “Administrator Mode” and execute: “set Platform=x64” (assuming you want 64-bit version, otherwise use “set Platform=Win32”)
  12. Now, while still in cmd, execute:
    mvn package -Pdist,native-win -DskipTests -Dtar
    
  13. After the build is complete, you should find hadoop-2.7.0.tar.gz file in “hadoop-2.7.0-src\hadoop-dist\target\” directory.
  14. Extract the newly built Hadoop Windows package to a directory of your choice (e.g. c:\hdp\); one way to do it is sketched right after this list.
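For reference, here is one way to do the extraction using the Unix tar that is already required on your PATH. This is only a sketch; the paths assume the defaults used in this post (the tarball in hadoop-2.7.0-src\hadoop-dist\target\ and c:\hdp as the target directory):

    rem run from hadoop-2.7.0-src\hadoop-dist\target, where the build left the tarball
    tar -xzf hadoop-2.7.0.tar.gz
    rem copy the extracted tree to c:\hdp (so that c:\hdp\bin, c:\hdp\etc, etc. exist)
    xcopy /E /I hadoop-2.7.0 c:\hdp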

 

Testing:

  1. We’ll be configuring Hadoop for a Single Node (pseudo-distributed) Cluster.
  2. As part of configuring HDFS, update the files:
    1. near the end of “\hdp\etc\hadoop\hadoop-env.cmd” add the following lines:
        set HADOOP_PREFIX=c:\hdp
        set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
        set YARN_CONF_DIR=%HADOOP_CONF_DIR%
        set PATH=%PATH%;%HADOOP_PREFIX%\bin
      
    2. modify “\hdp\etc\hadoop\core-site.xml” with the following:
      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://0.0.0.0:19000</value>
        </property>
      </configuration>
      
    3. modify “\hdp\etc\hadoop\hdfs-site.xml” with:
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
      </configuration>
      
    4. Finally, make sure “\hdp\etc\hadoop\slaves” has the following entry:

        localhost
      
    5. and create the c:\tmp directory, as the default configuration puts HDFS metadata and data files under \tmp on the current drive
  3. As part of configuring YARN, update the files:
    1. add the following entries to “\hdp\etc\hadoop\mapred-site.xml”, replacing %USERNAME% with your Windows user name:
      <configuration>
        <property>
          <name>mapreduce.job.user.name</name>
          <value>%USERNAME%</value>
        </property>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
        <property>
          <name>yarn.apps.stagingDir</name>
          <value>/user/%USERNAME%/staging</value>
        </property>
        <property>
          <name>mapreduce.jobtracker.address</name>
          <value>local</value>
        </property>
      </configuration>
      
    2. modify “\hdp\etc\hadoop\yarn-site.xml” with:
      <configuration>
        <property>
          <name>yarn.server.resourcemanager.address</name>
          <value>0.0.0.0:8020</value>
        </property>
        <property>
          <name>yarn.server.resourcemanager.application.expiry.interval</name>
          <value>60000</value>
        </property>
        <property>
          <name>yarn.server.nodemanager.address</name>
          <value>0.0.0.0:45454</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.server.nodemanager.remote-app-log-dir</name>
          <value>/app-logs</value>
        </property>
        <property>
          <name>yarn.nodemanager.log-dirs</name>
          <value>/dep/logs/userlogs</value>
        </property>
        <property>
          <name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
          <value>0.0.0.0</value>
        </property>
        <property>
          <name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
          <value>0.0.0.0</value>
        </property>
        <property>
          <name>yarn.log-aggregation-enable</name>
          <value>true</value>
        </property>
        <property>
          <name>yarn.log-aggregation.retain-seconds</name>
          <value>-1</value>
        </property>
        <property>
          <name>yarn.application.classpath</name>
          <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
        </property>
      </configuration>
      
  4. Because Hadoop doesn’t pick up JAVA_HOME from “Environment Variables” (and has problems with spaces in pathnames):
    1. copy your JDK to some directory (e.g. “c:\hdp\java\jdk1.8.0_40”)
    2. edit “\hdp\etc\hadoop\hadoop-env.cmd” and update:
        set JAVA_HOME=c:\hdp\java\jdk1.8.0_40
      
    3. initialize Environment Variables by running cmd in “Administrator Mode” and executing: “c:\hdp\etc\hadoop\hadoop-env.cmd”
  5. Format the FileSystem
      c:\hdp\bin\hdfs namenode -format
    
  6. Start HDFS Daemons
      c:\hdp\sbin\start-dfs.cmd
    
  7. Start YARN Daemons
      c:\hdp\sbin\start-yarn.cmd
    
  8. Run an example YARN job (commands to inspect its output are sketched after this list)
      c:\hdp\bin\yarn jar c:\hdp\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.0.jar wordcount c:\hdp\LICENSE.txt /out
    
  9. Check the following pages in your browser:
      Resource Manager web UI:  http://localhost:8088
      NameNode web UI:  http://localhost:50070
      NodeManager web UI:  http://localhost:8042
    
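If the wordcount job finished successfully, you can take a quick look at its result in HDFS. A small sketch, assuming the job wrote its output to /out as in step 8 (the reducer output file is typically named part-r-00000):

    c:\hdp\bin\hdfs dfs -ls /out
    c:\hdp\bin\hdfs dfs -cat /out/part-r-00000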

 

Voilà.

 

 

Resources: