Objective
This Hadoop tutorial covers the important differences between Hadoop 2.x and Hadoop 3.x. It answers three questions: which features are new in Hadoop 3, whether Hadoop 2 programs are compatible with Hadoop 3, and how the two versions differ feature by feature.
Hadoop 2.x and Hadoop 3.x Important Differences
This section lists 22 important differences between Hadoop 2.x and Hadoop 3.x, one feature at a time.
License
Hadoop 2.x – Apache 2.0, open source
Hadoop 3.x – Apache 2.0, open source
Minimum supported version of Java
Hadoop 2.x – The minimum supported version of Java is Java 7.
Hadoop 3.x – The minimum supported version of Java is Java 8.
Fault Tolerance
Hadoop 2.x – Fault tolerance is handled through replication, which consumes extra storage space for every copy.
Hadoop 3.x – Fault tolerance can also be handled through erasure coding.
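To make the idea of parity-based recovery concrete, here is a minimal sketch in Python. It is a toy illustration only: it uses simple XOR parity, which can rebuild a single lost block, rather than the Reed-Solomon codes that erasure coding typically relies on. The block contents and the helper name xor_blocks are hypothetical and are not part of any Hadoop API.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three made-up data blocks of equal length.
data_blocks = [b"hdfs", b"yarn", b"mapr"]

# One parity block is stored alongside the data instead of full copies.
parity = xor_blocks(data_blocks)

# Simulate losing the second block: XOR-ing the survivors with the
# parity block cancels them out and leaves the missing block.
surviving = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(surviving)
```

In this sketch one extra block protects three data blocks, whereas replication would need a full copy of each block to tolerate the same single loss.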
Data Balancing
Hadoop 2.x – The HDFS Balancer is used to balance data across DataNodes.
Hadoop 3.x – Data can also be balanced across the disks within a single DataNode using the intra-DataNode balancer, which is invoked through the HDFS disk balancer CLI.
Storage Scheme
Hadoop 2.x – Uses a 3x replication scheme.
Hadoop 3.x – Adds support for erasure coding in HDFS.
Storage Overhead
Hadoop 2.x – HDFS with 3x replication has a storage overhead of 200%.
Hadoop 3.x – With erasure coding, the storage overhead is only 50%.
Storage Overhead Example
Hadoop 2.x – 6 blocks of data occupy 18 blocks of storage space under 3x replication.
Hadoop 3.x – 6 blocks of data occupy 9 blocks of storage space: 6 data blocks and 3 parity blocks.
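The arithmetic behind these numbers can be sketched in a few lines of Python. The function names are hypothetical, and the erasure coding case assumes a layout of 6 data blocks plus 3 parity blocks per group, as in the example above.

```python
def replication_footprint(data_blocks, replicas=3):
    """Return (total blocks stored, overhead in percent) under replication."""
    total = data_blocks * replicas
    overhead_pct = 100 * (total - data_blocks) / data_blocks
    return total, overhead_pct

def erasure_coding_footprint(data_blocks, data_units=6, parity_units=3):
    """Return (total blocks stored, overhead in percent) under erasure coding.

    Assumes parity is added per group of `data_units` blocks; a partially
    filled group still gets a full set of parity blocks in this sketch.
    """
    groups = -(-data_blocks // data_units)  # ceiling division
    total = data_blocks + groups * parity_units
    overhead_pct = 100 * (total - data_blocks) / data_blocks
    return total, overhead_pct

print(replication_footprint(6))     # (18, 200.0)
print(erasure_coding_footprint(6))  # (9, 50.0)
```

The overhead is the extra storage beyond the original data, so 18 blocks for 6 blocks of data is 200% and 9 blocks for 6 blocks of data is 50%.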
YARN
Hadoop 2.x – Uses the original Timeline Service, which has scalability issues.
Hadoop 3.x – Introduces Timeline Service v2, which improves the scalability and reliability of the Timeline Service.
Standard Port Range
Hadoop 2.x – Some default service ports fall within the Linux ephemeral port range, so a service can fail to bind at startup if another process has already been assigned that port.
Hadoop 3.x – These default ports have been moved out of the ephemeral range.
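As a rough illustration, the sketch below checks a handful of default ports against the Linux ephemeral range. Two assumptions are baked in: the range 32768 to 60999 is the usual Linux default (the actual range on a host is set by net.ipv4.ip_local_port_range), and the port pairs are the defaults listed in the Hadoop 3.0 release notes, so verify them against your own configuration. The names used here are hypothetical.

```python
# Assumed Linux default ephemeral range: 32768..60999 inclusive.
EPHEMERAL_RANGE = range(32768, 61000)

# (Hadoop 2.x default, Hadoop 3.x default) per service, per the
# Hadoop 3.0 release notes; treat as assumptions for your cluster.
DEFAULT_PORTS = {
    "NameNode HTTP": (50070, 9870),
    "NameNode HTTPS": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP": (50075, 9864),
}

def in_ephemeral_range(port):
    """True if the OS may hand this port to any outgoing connection."""
    return port in EPHEMERAL_RANGE

for service, (old_port, new_port) in DEFAULT_PORTS.items():
    print(f"{service}: {old_port} ephemeral={in_ephemeral_range(old_port)}, "
          f"{new_port} ephemeral={in_ephemeral_range(new_port)}")
```

Every Hadoop 2.x port in this table lands inside the assumed ephemeral range, and every Hadoop 3.x port lands outside it.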
Tools
Hadoop 2.x – Works with Hive, Pig, Tez, Hama, Giraph, and other Hadoop tools.
Hadoop 3.x – The same tools (Hive, Pig, Tez, Hama, Giraph, and others) are available.
Compatible File System
Hadoop 2.x – HDFS (the default file system), the FTP file system (which stores all its data on remotely accessible FTP servers), the Amazon Simple Storage Service (S3) file system, and the Windows Azure Storage Blobs (WASB) file system.
Hadoop 3.x – Supports all of the above, as well as the Microsoft Azure Data Lake file system.
Datanode
Hadoop 2.x – DataNode resources are not dedicated to MapReduce; they can be used by other applications as well.
Hadoop 3.x – Here too, DataNode resources can be used by other applications.
MR API
Hadoop 2.x – The MR API is compatible with Hadoop 1.x programs, so they can run on Hadoop 2.x.
Hadoop 3.x – The MR API likewise remains compatible, so Hadoop 1.x programs can run on Hadoop 3.x.
Support for Microsoft Windows
Hadoop 2.x – Can also be deployed on Microsoft Windows.
Hadoop 3.x – Can also be deployed on Microsoft Windows.
Slots / Container
Hadoop 2.x – Hadoop 1 works on the concept of slots, whereas Hadoop 2.x works on the concept of containers. A container can run any generic task, not only MapReduce tasks.
Hadoop 3.x – Also works on the concept of containers.
Failure points
Hadoop 2.x – Has features to overcome the single point of failure (SPOF): when the NameNode fails, it is recovered automatically.
Hadoop 3.x – Also has features to overcome the SPOF: when the NameNode fails, it is recovered automatically, with no need for manual intervention.
HDFS Federation
Hadoop 2.x – Hadoop 1.0 has only a single NameNode to manage the entire namespace, whereas Hadoop 2.0 supports multiple NameNodes for multiple namespaces.
Hadoop 3.x – Also supports multiple NameNodes for multiple namespaces.
Scalability
Hadoop 2.x – Scales up to 10,000 nodes per cluster.
Hadoop 3.x – Better scalability: scales to more than 10,000 nodes per cluster.
Faster access to data
Hadoop 2.x – DataNode caching provides fast access to data.
Hadoop 3.x – DataNode caching provides fast access to data here as well.
HDFS Snapshot
Hadoop 2.x – Adds support for snapshots, which provide disaster recovery and protection against user error.
Hadoop 3.x – Also supports the snapshot feature.
Data Analysis
Hadoop 2.x – Can serve as a platform for a wide variety of data analytics, including event processing, streaming, and real-time operations.
Hadoop 3.x – Event processing, streaming, and real-time operations can likewise run on top of YARN.
Cluster Resource Management
Hadoop 2.x – Uses YARN for cluster resource management, which improves scalability, high availability, and multi-tenancy.
Hadoop 3.x – Also uses YARN, with its full feature set, for cluster resource management.
Conclusion
Having gone through these 22 key differences between Hadoop 2.x and Hadoop 3.x, you can now decide which version is the better one to install. For your convenience, we also offer guides for installing the former on Ubuntu and the latter on Ubuntu.