Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data playing field and is sponsored by the Apache Software Foundation.

Hadoop is comprised of four main layers:

  • Hadoop Common is the collection of utilities and libraries that support the other Hadoop modules.
  • HDFS, which stands for Hadoop Distributed File System, is responsible for persisting data to disk.
  • YARN, short for Yet Another Resource Negotiator, is the “operating system” for HDFS.
  • MapReduce is the original processing model for Hadoop clusters. It distributes work within the cluster or map, then organizes and reduces the results from the nodes into a response to a query. Many other processing models are available for the 3.x version of Hadoop.
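As a rough analogy only (this is not Hadoop itself), the map-shuffle-reduce flow can be sketched with an ordinary Unix pipeline: a map step emits one key per line, a sort groups identical keys together, and a reduce step aggregates each group:

```shell
# Toy word count illustrating the map -> shuffle -> reduce flow.
#   map:     split the input into one token per line (tr)
#   shuffle: bring identical keys together (sort)
#   reduce:  aggregate each group of identical keys (uniq -c)
printf 'to be or not to be\n' | tr ' ' '\n' | sort | uniq -c
```

On a real cluster, the map and reduce stages run in parallel across nodes and the shuffle moves data between them; the pipeline above only mirrors the shape of the computation.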

Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode which is suitable for learning about Hadoop, performing simple operations, and debugging.

In this tutorial, we’ll install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.


To follow this tutorial, you will need:

Once you’ve completed this prerequisite, you’re ready to install Hadoop and its dependencies.

Before you begin, you might also like to take a look at An Introduction to Big Data Concepts and Terminology or An Introduction to Hadoop

Step 1 — Installing Java

To get started, we’ll update our package list:

  • sudo apt update

Next, we’ll install OpenJDK, the default Java Development Kit on Ubuntu 18.04:

  • sudo apt install default-jdk

Once the installation is complete, let’s check the version:

  • java -version


openjdk 10.0.1 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

This output verifies that OpenJDK has been successfully installed.

Step 2 — Installing Hadoop

With Java in place, we’ll visit the Apache Hadoop Releases page to find the most recent stable release.

Navigate to binary for the release you’d like to install. In this guide, we’ll install Hadoop 3.0.3.

Screenshot of the Hadoop releases page highlighting the link to the latest stable binary

On the next page, right-click and copy the link to the release binary.

Screenshot of the Hadoop mirror page

On the server, we’ll use wget to fetch it:

  • wget

Note: The Apache website will direct you to the best mirror dynamically, so your URL may not match the URL above.

In order to make sure that the file we downloaded hasn’t been altered, we’ll do a quick check using SHA-256. Return to the releases page, then right-click and copy the link to the checksum file for the release binary you downloaded:

Screenshot highlighting the .mds file

Again, we’ll use wget on our server to download the file:

  • wget

Then run the verification:

  • shasum -a 256 hadoop-3.0.3.tar.gz


db96e2c0d0d5352d8984892dfac4e27c0e682d98a497b7e04ee97c3e2019277a hadoop-3.0.3.tar.gz

Compare this value with the SHA-256 value in the .mds file:

  • cat hadoop-3.0.3.tar.gz.mds


SHA256 = DB96E2C0 D0D5352D 8984892D FAC4E27C 0E682D98 A497B7E0 4EE97C3E 2019277A

You can safely ignore the difference in case and the spaces. The output of the command we ran against the file we downloaded from the mirror should match the value in the file we downloaded from the releases page.
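If you would rather not compare the two values by eye, you can normalize the case and spacing and let the shell do the comparison. This is only a sketch under the filenames used in this guide, and the sed/tr handling assumes the `SHA256 = XXXX XXXX ...` layout shown above:

```shell
# Hedged helper: compare the computed SHA-256 with the published one,
# ignoring case and spacing. Filenames match this tutorial; the .mds
# parsing assumes a single "SHA256 = ..." line as shown above.
computed=$(shasum -a 256 hadoop-3.0.3.tar.gz | awk '{print $1}')
published=$(grep -i 'SHA256' hadoop-3.0.3.tar.gz.mds \
  | sed 's/.*=//' | tr -d ' \t' | tr 'A-F' 'a-f')
if [ "$computed" = "$published" ]; then
  echo "Checksums match"
else
  echo "Checksum MISMATCH" >&2
fi
```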

Now that we’ve verified that the file wasn’t corrupted or changed, we’ll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we’re extracting from a file. Use tab-completion or substitute the correct version number in the command below:

  • tar -xzvf hadoop-3.0.3.tar.gz

Finally, we’ll move the extracted files into /usr/local, the appropriate place for locally installed software. Change the version number, if needed, to match the version you downloaded.

  • sudo mv hadoop-3.0.3 /usr/local/hadoop

With the software in place, we’re ready to configure its environment.

Step 3 — Configuring Hadoop’s Java Home

Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, we’ll use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.

To find the default Java path, run:

  • readlink -f /usr/bin/java | sed "s:bin/java::"



You can copy this output to set Hadoop’s Java home to this specific version, which ensures that if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.
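If you are copying the static value, it may help to see exactly what the sed expression trims. The path below is only an illustration; use the output of the readlink command on your own system:

```shell
# sed "s:bin/java::" deletes the trailing bin/java, leaving the
# JAVA_HOME-style directory prefix. Sample path for illustration only.
echo '/usr/lib/jvm/java-11-openjdk-amd64/bin/java' | sed "s:bin/java::"
# prints: /usr/lib/jvm/java-11-openjdk-amd64/
```

The colons are just an alternate delimiter for sed’s s command, which avoids escaping the slashes in the path.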

To begin, open the Hadoop configuration file:

  • sudo nano /usr/local/hadoop/etc/hadoop/

Then, choose one of the following options:

Option 1: Set a Static Value

/usr/local/hadoop/etc/hadoop/

 . . .
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
 . . . 

Option 2: Use Readlink to Set the Value Dynamically

/usr/local/hadoop/etc/hadoop/

 . . .
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
 . . . 

Note: With respect to Hadoop, the value of JAVA_HOME in this file overrides any values that are set in the environment by /etc/profile or in a user’s profile.

Step 4 — Running Hadoop

Now we should be able to run Hadoop:

  • /usr/local/hadoop/bin/hadoop


Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
buildpaths                       attempt to add class files from build tree
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:
. . .

The help output means we’ve successfully configured Hadoop to run in stand-alone mode. We’ll ensure that it is functioning properly by running the example MapReduce program it ships with. To do so, create a directory called input in our home directory and copy Hadoop’s configuration files into it to use those files as our data.

  • mkdir ~/input
  • cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Next, we can use the following command to run the MapReduce hadoop-mapreduce-examples program, a Java archive with several options. We’ll invoke its grep program, one of the many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory grep_example. The MapReduce grep program will count the matches of a literal word or regular expression. Finally, we’ll supply the regular expression allowed[.]* to find occurrences of the word allowed within or at the end of a declarative sentence. The expression is case-sensitive, so we wouldn’t find the word if it were capitalized at the beginning of a sentence:

  • /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep ~/input ~/grep_example 'allowed[.]*'
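If you want to preview what that pattern matches while the job runs, plain grep interprets allowed[.]* the same way on sample text. The [.] character class makes the period literal and * means zero or more, so a bare allowed also matches, while a capitalized Allowed does not:

```shell
# Preview the pattern with ordinary grep; -o prints each match on its
# own line. Sample text is illustrative only.
printf 'access is allowed.\nalso allowed\nAllowed?\n' | grep -o 'allowed[.]*'
# prints:
# allowed.
# allowed
```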

When the task completes, it provides a summary of what has been processed and errors it has encountered, but this doesn’t contain the actual results.


. . .
        File System Counters
                FILE: Number of bytes read=1330690
                FILE: Number of bytes written=3128841
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Map output bytes=33
                Map output materialized bytes=43
                Input split bytes=115
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=43
                Reduce input records=2
                Reduce output records=2
                Spilled Records=4
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=3
                Total committed heap usage (bytes)=478150656
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=147
        File Output Format Counters
                Bytes Written=34

Note: If the output directory already exists, the program will fail, and rather than seeing the summary, the output will look something like:


. . .
	at java.base/java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.RunJar.main(
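The MapReduce framework deliberately refuses to overwrite existing output. To rerun the job, remove the old output directory first or pass a different one; for example, using the directory name from this guide:

```shell
# Remove the previous run's output so the job can be rerun.
# The directory name matches the one used earlier in this tutorial.
rm -rf ~/grep_example
```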

Results are stored in the output directory and can be checked by running cat on the output directory:

  • cat ~/grep_example/*


19  allowed.
1   allowed

The MapReduce task found 19 occurrences of the word allowed followed by a period and one occurrence where it was not. Running the example program has verified that our stand-alone installation is working properly and that non-privileged users on the system can run Hadoop for exploration or debugging.
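As a sanity check on those numbers, you can reproduce the same tallies with ordinary shell tools on the same input files (paths as used above; -o prints each match on its own line and -h suppresses filename prefixes):

```shell
# Reproduce the example's counts outside Hadoop: find every match of
# the pattern in the copied config files, then count each distinct match.
grep -oh 'allowed[.]*' ~/input/*.xml | sort | uniq -c
```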


In this tutorial, we’ve installed Hadoop in stand-alone mode and verified it by running an example program it provided. To learn how to write your own MapReduce programs, you may want to visit Apache Hadoop’s MapReduce tutorial, which walks through the code behind the example. When you’re ready to set up a cluster, see the Apache Foundation Hadoop Cluster Setup guide.