Friday, 9 December 2022

Is Hadoop Still in Demand in 2023

When we look at predictions for the Big Data industry, the trend does not seem to be slowing down any time soon. Learning skills such as Hadoop, Spark, and Kafka can land promising Big Data jobs. The global Hadoop market is said to grow at a CAGR of 33% between 2019 and 2024.

Does Hadoop have a future?

Since it all started, the number of open-source projects and startups in the Big Data world has kept increasing year after year (just take a look at the 2021 landscape to see how huge it has become). I remember that around 2012 some people were predicting that the new SQL wars would end and true victors would eventually emerge. That has not happened yet. How all of this will evolve in the future is very difficult to predict; it will take a few more years for the dust to settle. But if I had to take some wild guesses, I would make the following predictions.

What is the future scope of Hadoop?

What is the future of Hadoop?

Is Hadoop outdated?

The Hadoop market is expected to reach $340.35 billion by 2027, growing at a CAGR of 37.5% from 2020 to 2027.

Hadoop in 2023

As others have already noted, the main existing data platforms (Databricks, Snowflake, BigQuery, Azure Synapse) will keep improving and adding new features to close the gaps between one another. I expect to see more and more connectivity between components, and also between data languages like SQL and Python.

We might see a slowdown in the number of new projects and companies over the next couple of years, although this would come more from a lack of funding after the burst of a new dot-com bubble (if that ever happens) than from a lack of will or ideas.

Since the beginning, the main scarce resource has been a skilled workforce. This means that for most companies it has been simpler to throw more money at performance problems, or to migrate to more cost-effective solutions, than to spend more time optimizing their workloads, especially now that storage costs in the main distributed warehouses have become so cheap. But perhaps at some point the price competition between vendors will become harder for them to sustain, and prices will go up. Even if prices don't go up, the volume of data stored by businesses keeps increasing year after year, and the cost of inefficiency grows with it. Perhaps at some point we will see a new trend where people start looking for new, cheaper open-source alternatives, and a new Hadoop-like cycle will start again.

In the long term, I believe the real winners will be the cloud providers: Google, Amazon, and Microsoft. All they have to do is wait and see which way the wind blows, bide their time, then acquire (or simply reproduce) the technologies that work best. Each tool that gets integrated into their clouds makes things much easier and more seamless for users, especially when it comes to security, governance, access control, and cost management.

Is Hadoop still in demand in 2023?

Is Hadoop worth learning in 2023?

Yes.

What will replace Hadoop?

Top Hadoop HDFS Alternatives (2022)

Google Cloud BigQuery.

Databricks Lakehouse Platform.

Cloudera.

Hortonworks Data Platform.

Snowflake.

Google Cloud Dataproc.

Microsoft SQL Server.

Vertica.

Saturday, 21 May 2022

What is Hadoop as a service (HaaS)?

 Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop. Users do not have to invest in or install additional infrastructure on premises when using the technology, as HaaS is provided and managed by a third-party vendor.

Hadoop or Spark?

  1. Performance: Spark is faster because it keeps intermediate data in random access memory (RAM) instead of reading and writing it to disk between steps. Hadoop stores data across multiple sources and processes it in batches via MapReduce (see the sketch after this list).
  2. Cost: Hadoop runs at a lower cost since it relies on ordinary disk storage for data processing. Spark runs at a higher cost because its in-memory computations for real-time data processing require large amounts of RAM to spin up nodes.
  3. Processing: Though both platforms process data in a distributed environment, Hadoop is ideal for batch and linear data processing, while Spark is ideal for real-time processing and for processing live, unstructured data streams.
  4. Scalability: When data volume grows rapidly, Hadoop quickly scales to accommodate the demand via the Hadoop Distributed File System (HDFS). Spark, in turn, relies on the fault-tolerant HDFS for large volumes of data.
  5. Security: Spark provides authentication via a shared secret and supports event logging, whereas Hadoop offers multiple authentication and access-control methods. Though Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher security level.
  6. Machine learning (ML): Spark is the superior platform in this category because it includes MLlib, which performs iterative in-memory ML computations. It also includes tools for regression, classification, persistence, pipeline construction, evaluation, etc.
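
To make the processing-model difference concrete, here is a minimal Spark word-count sketch in Java (a hypothetical example, not code from this post): the whole pipeline runs as in-memory RDD transformations, whereas the equivalent Hadoop MapReduce job would write intermediate results to disk between the map and reduce phases. The HDFS input and output paths are placeholders.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read from HDFS once; intermediate results stay in memory as RDDs.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///data/output"); // placeholder path
        }
    }
}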

Wednesday, 20 April 2022

UnsupportedFileSystemException No FileSystem for scheme "hdfs"

org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "hdfs"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java)

Solution 

This error is due to the unavailability of the required library during FileSystem object creation.

Add the hadoop-hdfs and hadoop-hdfs-client jars as runtime dependencies to your project.

POM:

.....
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs-client</artifactId>
    <version>3.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.0.0</version>
</dependency>
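
For context, here is a minimal sketch (hypothetical, not from the original post) of the kind of FileSystem object creation that fails with this exception when the HDFS jars are missing from the runtime classpath; hdfs://namenode:8020 is a placeholder for your cluster's NameNode URI.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Without hadoop-hdfs / hadoop-hdfs-client on the classpath, this call throws
        // UnsupportedFileSystemException: No FileSystem for scheme "hdfs".
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) { // placeholder URI
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}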

-------------------------------------------------------------------------------------------------------

Saturday, 9 April 2022

Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

Error while running Hadoop on Windows

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:737)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:272)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:288)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:777)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:522)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:562)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:561)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
    at com.nokia.cemod.ice.rest.controller.HDFSDemo.createDir(HDFSDemo.java:42)
    at com.nokia.cemod.ice.rest.controller.HDFSDemo.main(HDFSDemo.java:24)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:549)

Solution

Hadoop requires native libraries to work properly on Windows. That includes accessing the file:// filesystem, where Hadoop uses some Windows APIs to implement POSIX-like file access permissions.

This is implemented in HADOOP.DLL and WINUTILS.EXE.

In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable.

If it is not, Hadoop or an application built on top of Hadoop will fail.

How to fix a missing WINUTILS.EXE

You can fix this problem in two ways:

  1. Install a full native Windows Hadoop version. The ASF does not currently (September 2015) release such a version; releases are available externally.
  2. Or: get the WINUTILS.EXE binary from a Hadoop redistribution. There is a repository of these for some Hadoop versions on GitHub.

Then

  1. Set the environment variable %HADOOP_HOME% to point to the directory above the BIN dir containing WINUTILS.EXE.
  2. Or: run the Java process with the system property hadoop.home.dir set to the Hadoop home directory.
  3. In the Eclipse/Studio Job configuration, open the Run > Advanced settings tab. In the JVM Setting section, select the Use specific JVM arguments check box, click the New button, and add an argument like this: -Dhadoop.home.dir=C:\hadoop (assuming WINUTILS.EXE is in C:\hadoop\bin).
  4. Also, in the development environment, set HADOOP_HOME=C:\hadoop.
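
As a quick check, the same hadoop.home.dir property can also be set programmatically before any Hadoop class touches the local filesystem. This is a minimal sketch, assuming WINUTILS.EXE is installed under C:\hadoop\bin; the class name and target directory are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WinUtilsDemo {
    public static void main(String[] args) throws Exception {
        // Equivalent to passing -Dhadoop.home.dir=C:\hadoop on the command line.
        // Assumes WINUTILS.EXE lives in C:\hadoop\bin; adjust the path for your setup.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        // With winutils resolvable, mkdirs on the local filesystem no longer fails
        // with "HADOOP_HOME and hadoop.home.dir are unset".
        localFs.mkdirs(new Path("file:///C:/tmp/hadoop-demo")); // placeholder directory
        System.out.println("Created C:/tmp/hadoop-demo");
    }
}

Note that the property must be set before the first Hadoop class that calls into Shell is loaded, so do it at the very top of main (or pass it as a JVM argument as described above).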