mariuszprzydatek.com

Writing files to Hadoop HDFS using Scala

Big Data, Hadoop, Scala May 10, 2015 1 Comment

If you’ve been wondering whether storing files in Hadoop HDFS programmatically is difficult, I have good news – it’s not.

For this example I’ll be using my recent favorite language: Scala.

Here’s what you need to do:

Start a new SBT project in IntelliJ
Add the “hadoop-client” dependency (Important: the client version must match the version of the Hadoop server you’ll be writing files to)
```
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "2.7.0"
)
```
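Check the value of the “fs.default.name” property in your Hadoop configuration (/etc/hadoop/core-site.xml). Newer Hadoop releases call this property “fs.defaultFS” and keep “fs.default.name” as a deprecated alias. Its value is the URI you need in order to point the app code at your Hadoop cluster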
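
If you’d rather not dig through the file by hand, the lookup can be sketched in Scala using only the XML parser that ships with the JDK. This is a minimal sketch, not part of the original setup: the `CoreSite` object and `defaultFsUri` method are names I made up for illustration, and it assumes the file has the usual `<configuration>/<property>/<name>/<value>` layout.

```scala
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper (not part of Hadoop): reads the default file system URI
// straight from a core-site.xml file.
object CoreSite {
  // Current property name first, then the deprecated alias.
  private val Keys = Seq("fs.defaultFS", "fs.default.name")

  // Returns the URI declared in the given core-site.xml, if any.
  def defaultFsUri(coreSite: File): Option[String] = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(coreSite)
    val nodes = doc.getElementsByTagName("property")
    // Collect every <property> as a name -> value pair.
    val props = (0 until nodes.getLength).map { i =>
      val p = nodes.item(i).asInstanceOf[Element]
      def text(tag: String): String = {
        val children = p.getElementsByTagName(tag)
        if (children.getLength > 0) children.item(0).getTextContent.trim else ""
      }
      text("name") -> text("value")
    }.toMap
    Keys.flatMap(key => props.get(key)).headOption
  }
}
```

Calling `CoreSite.defaultFsUri(new File(pathToCoreSite))` gives back `Some(uri)` when either property is present and `None` otherwise.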

Write a few lines of code

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object Hdfs extends App {

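  // Writes `data` to `filePath` on the HDFS instance at `uri`.
  def write(uri: String, filePath: String, data: Array[Byte]): Unit = {
    // User name the client presents to HDFS; replace with your own.
    System.setProperty("HADOOP_USER_NAME", "Mariusz")
    val path = new Path(filePath)
    val conf = new Configuration()
    conf.set("fs.defaultFS", uri)
    val fs = FileSystem.get(conf)
    try {
      val os = fs.create(path)
      // Close the stream explicitly so the data is flushed before the file system is closed.
      try os.write(data) finally os.close()
    } finally fs.close()
  }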
}

Use the code written above

  Hdfs.write("hdfs://0.0.0.0:19000", "test.txt", "Hello World".getBytes)
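
To confirm the write went through, the file can be read back the same way. The following is only a sketch under the same assumptions as the code above (same “hadoop-client” dependency, same URI); `HdfsRead` and `read` are names I picked for illustration, while `FileSystem.open` and `IOUtils.copyBytes` are the regular Hadoop client calls.

```scala
import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical counterpart to Hdfs.write: reads a whole file into memory.
object HdfsRead {
  def read(uri: String, filePath: String): Array[Byte] = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", uri)
    val fs = FileSystem.get(conf)
    try {
      val in = fs.open(new Path(filePath))
      try {
        val out = new ByteArrayOutputStream()
        // Copy in 4 KB chunks; `false` leaves closing the streams to us.
        IOUtils.copyBytes(in, out, 4096, false)
        out.toByteArray
      } finally in.close()
    } finally fs.close()
  }
}
```

Something like `new String(HdfsRead.read("hdfs://0.0.0.0:19000", "test.txt"))` should then give back the text that was written. Reading the whole file into a byte array is fine for a small test file but not something to do with large files.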

That’s all there is to it, really

Cheers 🙂

Installing Hadoop on Windows 8 or 8.1

Big Data, Hadoop May 10, 2015 Comments: 6

I recently installed Hadoop 2.7.0 on Windows 8.1 and thought I’d document the steps, as the procedure isn’t that obvious (the existing documentation on how to do it is outdated in a few places).

Basic info:

Official Apache Hadoop releases do not include Windows binaries, so you have to download sources and build a Windows package yourself.
Do not run the installation from within Cygwin. Cygwin is not required/supported anymore
I assume you have a JDK already installed (ver. 1.7+)
I assume you have Unix command-line tools (like sh, mkdir, rm, cp, tar, gzip) installed as well. These tools must be present on your PATH. They come with the Git package for Windows; you can also use win-bash or GnuWin32 (links for Git and win-bash are in the Resources section).
If using Visual Studio, it must be Visual Studio 2010 Professional (not 2012).
Do not use Visual Studio Express (It does not support compiling for 64-bit)
Google’s Protocol Buffers must be installed in exactly version 2.5.0 (not newer; this is a hard-coded dependency, oddly enough)
Several tests executed while building the Hadoop Windows package require the user to have the “Create Symbolic Links” privilege. Therefore, the ‘mvn package’ command must be run from the Command Line in “Administrator mode”.
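
Before kicking off a long build, it can save time to check that the Unix tools really are on the PATH. Here is a rough sketch of such a check in Scala; the `Prerequisites` object and its method names are my own invention, and the list of file extensions it tries is an assumption about how the tools are packaged on Windows.

```scala
import java.io.File

// Hypothetical pre-build check: are the Unix tools from the list above on the PATH?
object Prerequisites {
  val RequiredTools = Seq("sh", "mkdir", "rm", "cp", "tar", "gzip")

  // System.getenv(name) ignores case on Windows, where the variable is often spelled "Path".
  private def currentPath: String = Option(System.getenv("PATH")).getOrElse("")

  // Looks for `tool` in each directory of `pathValue`, trying each extension in turn.
  def findOnPath(tool: String,
                 pathValue: String = currentPath,
                 extensions: Seq[String] = Seq("", ".exe", ".cmd", ".bat")): Option[File] = {
    val candidates = for {
      dir <- pathValue.split(File.pathSeparator).toSeq if dir.nonEmpty
      ext <- extensions
    } yield new File(dir, tool + ext)
    candidates.find(_.isFile)
  }

  // Names of the required tools that could not be found.
  def missingTools(pathValue: String = currentPath): Seq[String] =
    RequiredTools.filter(tool => findOnPath(tool, pathValue).isEmpty)
}
```

An empty result from `Prerequisites.missingTools()` means every tool was found in at least one PATH directory. It says nothing about whether the versions are suitable.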

Installation:

1. Download the Hadoop sources tarball from the Apache Hadoop project site.
2. Make sure JAVA_HOME is set up properly in your “Environment Variables” (in my case it was “c:\Program Files\Java\jdk1.8.0_40”).
3. Download the Maven binaries from the Maven homepage.
4. Add Maven’s ‘bin’ folder to your path (in “Environment Variables”).
5. Download Google’s Protocol Buffers in version 2.5.0 (no other version, including 2.6.1, will work).
6. Download and install CMake (Windows Installer).
7. Download and install “Visual Studio 2010 Professional” (the trial is enough; it is available as a Web Installer or an ISO image).
8. Alternatively (to step 7 above), you can install “Windows SDK 8.1”.
9. Add the location of the newly installed MSBuild.exe (c:\Windows\Microsoft.NET\Framework64\v4.0.30319) to your system path (in “Environment Variables”).
10. Because you’ll be running the Maven ‘package’ goal from the Command Line (cmd.exe) in “Administrator mode” (aka “Elevated mode”), it is important that in steps 4 and 9 above you update “PATH” in the “System variables” section, and not in the “User variables for logged-in user” section.
11. Run cmd in “Administrator Mode” and execute “set Platform=x64” (assuming you want the 64-bit version; otherwise use “set Platform=Win32”).

Now, while still in cmd, execute:

mvn package -Pdist,native-win -DskipTests -Dtar

After the build is complete, you should find the hadoop-2.7.0.tar.gz file in the “hadoop-2.7.0-src\hadoop-dist\target\” directory.
Extract the newly created Hadoop Windows package to a directory of your choice (e.g. c:\hdp\)

Testing:

We’ll be configuring Hadoop for a Single Node (pseudo-distributed) Cluster.

As part of configuring HDFS, update the files:

Near the end of “\hdp\etc\hadoop\hadoop-env.cmd”, add the following lines:

  set HADOOP_PREFIX=c:\hdp
  set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
  set YARN_CONF_DIR=%HADOOP_CONF_DIR%
  set PATH=%PATH%;%HADOOP_PREFIX%\bin

Modify “\hdp\etc\hadoop\core-site.xml” with the following:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>

Modify “\hdp\etc\hadoop\hdfs-site.xml” with:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Finally, make sure “\hdp\etc\hadoop\slaves” has the following entry:
```
  localhost
```
and create the c:\tmp directory, as the default configuration puts HDFS metadata and data files under \tmp on the current drive.

As part of configuring YARN, update the files:

Add the following entries to “\hdp\etc\hadoop\mapred-site.xml”, replacing %USERNAME% with your Windows user name:

<configuration>
  <property>
    <name>mapreduce.job.user.name</name>
    <value>%USERNAME%</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.apps.stagingDir</name>
    <value>/user/%USERNAME%/staging</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
  </property>
</configuration>

Modify “\hdp\etc\hadoop\yarn-site.xml” with:

<configuration>
  <property>
    <name>yarn.server.resourcemanager.address</name>
    <value>0.0.0.0:8020</value>
  </property>
  <property>
    <name>yarn.server.resourcemanager.application.expiry.interval</name>
    <value>60000</value>
  </property>
  <property>
    <name>yarn.server.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.server.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/dep/logs/userlogs</value>
  </property>
  <property>
    <name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>-1</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
  </property>
</configuration>

Because Hadoop doesn’t recognize JAVA_HOME from “Environment Variables” (and has problems with spaces in pathnames):
1. Copy your JDK to some directory (e.g. “c:\hdp\java\jdk1.8.0_40”)
2. Edit “\hdp\etc\hadoop\hadoop-env.cmd” and update:
```
  set JAVA_HOME=c:\hdp\java\jdk1.8.0_40
```
3. Initialize the environment variables by running cmd in “Administrator Mode” and executing “c:\hdp\etc\hadoop\hadoop-env.cmd”
Format the FileSystem
```
  c:\hdp\bin\hdfs namenode -format
```
Start HDFS Daemons
```
  c:\hdp\sbin\start-dfs.cmd
```
Start YARN Daemons
```
  c:\hdp\sbin\start-yarn.cmd
```

Run an example YARN job

  c:\hdp\bin\yarn jar c:\hdp\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.0.jar wordcount c:\hdp\LICENSE.txt /out

Check the following pages in your browser:

  Resource Manager:  http://localhost:8088
  Web UI of the NameNode daemon:  http://localhost:50070
  Node Manager:  http://localhost:8042
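
If a page doesn’t load, it helps to know whether the daemon is listening at all. Below is a small sketch that simply tries to open a TCP connection to each port used in this post. The `PortCheck` object and `isOpen` method are names made up for the example, and the one-second timeout is an arbitrary choice.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper: checks whether the daemons are listening on their ports.
object PortCheck {
  // 19000 comes from core-site.xml above; the rest are the web UI ports listed above.
  val Ports = Seq(19000, 8088, 50070, 8042)

  // True if a TCP connection to host:port succeeds within timeoutMs.
  def isOpen(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false // refused, unreachable or timed out
    } finally socket.close()
  }

  def main(args: Array[String]): Unit =
    Ports.foreach { port =>
      val state = if (isOpen("localhost", port)) "listening" else "not reachable"
      println(s"localhost:$port is $state")
    }
}
```

A port that is reported as listening only tells you that something accepted the connection, not that the daemon behind it is healthy.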

Voilà.

Resources:

Apache Hadoop project (https://hadoop.apache.org/)
Maven Homepage (https://maven.apache.org/)
Protocol Buffers (https://developers.google.com/protocol-buffers/)
CMake (http://www.cmake.org/)
Git Homepage (http://git-scm.com/)
win-bash on Sourceforge (http://sourceforge.net/projects/win-bash/)

Connecting remote JVM over JMX using VisualVM or JConsole

Java February 11, 2015

There are many posts on the Internet on how to do it right, but unfortunately none worked for me (Debian behind a firewall on the server side, reached over VPN from my local Mac). So I’m sharing below the solution that did.

1. Check the server IP

hostname -i

2. Use these JVM params:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=[jmx port]
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=[server ip from step 1]

3. Run application

4. Find the PID of the running Java process

5. Check all ports used by JMX/RMI

netstat -lp | grep [pid from step 4]

6. Open all ports from step 5 on the firewall
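
Once the ports are open, the connection can also be tested without a GUI. The sketch below uses the standard `javax.management.remote` classes from the JDK; the `JmxCheck` object and its method names are invented for this example, and the host and port you pass in are whatever you used in steps 1 and 2.

```scala
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Hypothetical helper: connects to a remote JVM over JMX from plain code.
object JmxCheck {
  // RMI-based JMX service URL for the given host and JMX port.
  def serviceUrl(host: String, port: Int): JMXServiceURL =
    new JMXServiceURL(s"service:jmx:rmi:///jndi/rmi://$host:$port/jmxrmi")

  // Connects without credentials (matching authenticate=false above)
  // and returns the number of MBeans registered in the remote JVM.
  def mbeanCount(host: String, port: Int): Int = {
    val connector = JMXConnectorFactory.connect(serviceUrl(host, port))
    try connector.getMBeanServerConnection.getMBeanCount.intValue
    finally connector.close()
  }
}
```

If `JmxCheck.mbeanCount(host, port)` returns a number instead of throwing an exception, VisualVM or JConsole pointed at the same host and port should be able to connect as well.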

Cheers!

Mariusz Przydatek