Saturday, 18 December 2021

What is Big data as a service (BDaaS)

What is Big Data as a Service

BDaaS encompasses the software, data warehousing, infrastructure and platform service models in order to deliver advanced analysis of large data sets, generally through a cloud-based network.

Big data as a service is the delivery of data platforms and tools by a cloud provider to help organizations process, manage and analyze large data sets so they can generate insights in order to improve business operations and gain a competitive advantage.

BDaaS = DaaS + HaaS + data analytics as a service.

Benefits of BDaaS

Initially, most big data systems were installed in on-premises data centers, primarily by large enterprises that combined various open source technologies to fit their particular big data applications and use cases. But deployments have shifted more to the cloud because of its potential advantages. In particular, big data as a service offers the following benefits to users:

Reduced complexity. Because of their customized nature, big data environments are complicated to design, deploy and manage. Using cloud infrastructure and managed services can simplify the process by eliminating much of the hands-on work that organizations need to do.
Easier scalability. In many environments, data processing workloads aren't consistent. For example, big data analytics applications often run intermittently or just once. BDaaS makes it easy to scale up systems when processing needs increase and to scale them down again after jobs are completed.
Increased flexibility. In addition to scaling systems up or down as needed, BDaaS users can more easily add or remove platforms, technologies and tools to meet evolving business requirements than typically is possible in on-premises big data architectures.
Potential cost savings. Using the cloud may reduce IT costs by enabling businesses to avoid the need to buy new hardware and software and to hire workers with big data management skills. But pay-as-you-go cloud services must be monitored to prevent unnecessary processing expenses from driving up their cost.
Stronger security. Concerns about data security kept many organizations from adopting the cloud at first, particularly in regulated industries. In many cases, though, cloud vendors and service providers are able to invest in better security protections than individual companies can.

Large enterprises lead big data as a service investment

The SMB market does not account for the largest share of the Big-Data-as-a-Service market: small and medium-sized businesses accounted for only around a quarter of the BDaaS market's USD 5,356.8 million value in 2018. However, the small and medium-sized business segment is expected to grow fastest during the forecast period.

What is Data as a Service (DaaS)?

Data as a Service (DaaS)

Data as a service, or DaaS, is a term used to describe cloud-based software tools used for working with data, such as managing data in a data warehouse or analyzing data with business intelligence.

Data as a Service (DaaS) is one of the most ambiguous offerings in the "as a service" family. Yet, in today's world, data and analytics are the keys to building a competitive advantage. We're clearing up the confusion around DaaS and helping your company understand when and how to tap into this service.

Data as a service (DaaS) is a data management strategy that uses the cloud to deliver data storage, integration, processing, and/or analytics services via a network connection.

What are the benefits of data as a service?

DaaS speeds up access to the necessary data by exposing the data in a flexible but simple way. Users can quickly take action without needing a comprehensive understanding of where the data is stored or how it is indexed.

Compared to on-premises data storage and management, DaaS provides several key advantages with regard to speed, reliability, and performance. They include:

Minimal setup time: Organizations can begin storing and processing data almost immediately using a DaaS solution.
Improved functionality: Cloud infrastructure is less likely to fail, making DaaS workloads less prone to downtime or disruptions.
Greater flexibility: DaaS is more scalable and flexible than the on-premises alternative, since more resources can be allocated to cloud workloads instantaneously.
Cost savings: Data management and processing costs are easier to optimize with a DaaS solution. Companies can allocate just the right amount of resources to their data workloads in the cloud and increase or decrease those allocations as needs change.
Automated maintenance: The tools and services on DaaS platforms are automatically managed and kept up-to-date by the DaaS provider, eliminating the need for end-users to manage the tools themselves.
Smaller staff requirements: When using a DaaS platform, organizations do not need to maintain in-house staff who specialize in data tool set up and management. These tasks are handled by the DaaS provider.

Data as a Service is one of 3 categories of big data business models based on their value propositions and customers:

Answers as a Service;
Information as a Service;
Data as a Service.

Friday, 17 December 2021

Hadoop as a Service (HaaS)

What is Hadoop as a Service (HaaS) ?

While the world is busy with SaaS, PaaS and CaaS, a newer term, HaaS, is also attracting interest.

Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop. Users do not have to invest in or install additional infrastructure on premises when using the technology, as HaaS is provided and managed by a third-party vendor.

Definition of HaaS

HaaS (commonly referred to as Hadoop in the cloud) is a big data analytics framework that stores and analyzes data in the cloud using Hadoop. Using HaaS requires no installation of, or investment in, extra on-premises infrastructure. The technology is offered and managed by a third party. In other words, HaaS describes virtual data analysis and storage in the cloud, and it arose as an alternative to on-premises Hadoop.

Features

HaaS providers offer a variety of features and support, including:
Hadoop framework deployment support.
Hadoop cluster management.
Alternative programming languages.
Data transfer between clusters.
Customizable and user-friendly dashboards and data manipulation.
Security features.

Why HaaS As A Cloud Computing Solution?

Apache Hadoop as a Service, provided as a cloud computing solution, aims to make medium- and large-scale data processing easier, faster, more accessible and more cost-effective. To help a business focus on growth, HaaS eliminates the operational challenges that emerge while running Hadoop.

With features like unlimited scalability and on-demand access to storage capacity and computing, cloud computing blends well with this big data processing technology. Compared with on-premises solutions, Hadoop as a Service providers offer several distinct advantages, given below:

1. Fully Integrated Big Data Software

Hadoop as a Service comes fully powered with the Hadoop ecosystem comprising Hive, Pig, MapReduce, Presto, Oozie, Spark and Sqoop. The HaaS also offers connectors for integration of data and creating data pipelines that coordinate with the working of existing data pipelines.

2. On-Demand Elastic Cluster

In accordance with the changes in the data processing requirements, the Hadoop clusters in the cloud scale up and down, thus providing more operational efficiency in comparison to static clusters deployed on-premises. Moreover, performance is improved as nodes get automatically added or removed from the clusters depending upon the size of the data.

3. Cluster Management Made Easier

Opting for cloud-based HaaS provides a fully configured Hadoop cluster, relieving users of the need to invest extra time and resources in setting up clusters, scaling infrastructure and managing nodes.

4. Cost Economical

One of the major reasons why Hadoop in the cloud is becoming popular is its cost-effectiveness. Businesses are not required to invest in on-site infrastructure and IT support. On-demand instances are said to yield savings of as much as 90 percent, and with auto-scaling clusters payment is made only for the capacity actually used.

Monday, 22 November 2021

CDH Troubleshooting Upgrades

Cluster hosts do not appear

Some cluster hosts do not appear when you click Find Hosts in the install or upgrade wizard.

Possible Reasons

You might have network connectivity problems.

Possible Solutions

Make sure all cluster hosts have SSH port 22 open.

Check other common causes of loss of connectivity such as firewalls and interference from SELinux.

Cannot start services after upgrade

You have upgraded the Cloudera Manager Server, but now cannot start services.

Possible Reasons

You might have mismatched versions of the Cloudera Manager Server and Agents.

Possible Solutions

Make sure you have upgraded the Cloudera Manager Agents on all hosts. (The previous version of the Agents will heartbeat with the new version of the Server, but you cannot start HDFS and MapReduce with this combination.)

HDFS DataNodes fail to start

After upgrading, HDFS DataNodes fail to start with exception:

Exception in secureMain
java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 4294967296 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.

Possible Reasons

HDFS caching, which is enabled by default in CDH 5 and higher, requires new memlock functionality from Cloudera Manager Agents.
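
The two numbers in the exception can be compared directly. The following is a minimal sketch in Python, assuming a Linux host; the helper name memlock_is_sufficient is made up for illustration and is not part of Hadoop or Cloudera Manager.

```python
import resource

def memlock_is_sufficient(configured_bytes, rlimit_bytes):
    """Return True if the limit allows locking at least configured_bytes of memory."""
    # RLIM_INFINITY means the process has no locked-memory limit at all.
    if rlimit_bytes == resource.RLIM_INFINITY:
        return True
    return configured_bytes <= rlimit_bytes

# Values copied from the exception message above.
configured = 4294967296   # dfs.datanode.max.locked.memory (4 GiB)
available = 65536         # RLIMIT_MEMLOCK seen by the DataNode (64 KiB)
print(memlock_is_sufficient(configured, available))  # False

# Soft limit of the current Python process, for comparison (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("current soft RLIMIT_MEMLOCK:", soft)
```

The check fails because 4 GiB is far larger than 64 KiB, which is why the DataNode refuses to start until the Agents are restarted with the new memlock handling.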

Possible Solutions:

Do the following:

Stop all CDH and managed services.

On all hosts with Cloudera Manager Agents, hard-restart the Agents. Before performing this step, ensure you understand the semantics of the hard_restart command by reading Cloudera Manager Agents.

RHEL 7, SLES 12, Ubuntu 18.04 and higher

sudo systemctl stop cloudera-scm-supervisord.service

sudo systemctl start cloudera-scm-agent

RHEL 5 or 6, SLES 11, Debian 6 or 7, Ubuntu 12.04 or 14.04

sudo service cloudera-scm-agent hard_restart

Start all services.

Cloudera services fail to start

Possible Reasons

Java might not be installed or might be installed at a custom location.

Possible Solutions

See Configuring a Custom Java Home Location for more information on resolving this issue.

Host Inspector Fails

If you see the following message in the Host Inspector:

There are mismatched versions across the system, which will cause failures. See below for details on which hosts are running what versions of components.

When looking at the results, some hosts report Supervisord vX.X.X, while others report X.X.X-cmY.Y.Y (where X and Y are version numbers). During the upgrade, an old file on the hosts may cause the Host Inspector to indicate mismatched Supervisord versions.

This issue occurs because these hosts have a file on them at /var/run/cloudera-scm-agent/supervisor/__STARTING_CM_VERSION__ that contains a string for the older version of Cloudera Manager.

To resolve this issue:

Remove or rename the /var/run/cloudera-scm-agent/supervisor/__STARTING_CM_VERSION__ file

Perform a hard restart of the agents:

sudo systemctl stop cloudera-scm-supervisord.service

sudo systemctl start cloudera-scm-agent

Run the Host Inspector again. It should pass without the warning.

Saturday, 10 July 2021

Hadoop Tutorial: What exactly is Hadoop? What is Hadoop used for?

What is Hadoop?

What exactly is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.

What is Hadoop and Big Data?

Big data is a term used to describe a collection of data that is huge in size.

Is Hadoop a programming language?

No. Hadoop is itself a framework, written mostly in the Java programming language, with some native code in C and command-line utilities written as shell scripts.

Is Hadoop a database?

Hadoop is not a traditional database; it is a software ecosystem that allows for massively parallel computing. There are also NoSQL distributed databases (such as HBase) that are part of the Hadoop ecosystem.

Is Hadoop Dead Now?

No, Hadoop is not dead. A number of core projects from the Hadoop ecosystem continue to live on in the Cloudera Data Platform, a product that is very much alive.

These questions are answered and discussed in detail below:

Introduction

Apache Hadoop is an open-source software framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of servers, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. It helps in handling larger volumes of data with minimal failures.

History

Hadoop was initially conceived to fix a scalability issue in Nutch, an open source crawler and search engine. At that time Google had published papers about the Google File System (GFS) and MapReduce, a computational framework for parallel processing. Development started in the Apache Nutch project with the successful implementation of these papers. In 2006 this work was moved out of Apache Nutch into the new Hadoop subproject. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

Hadoop

Hadoop is a distributed master-slave architecture that consists of the following primary components:

Storage unit: HDFS (NameNode, DataNode)
Processing framework: YARN (ResourceManager, NodeManager)
Hadoop Distributed File System (HDFS) for data storage.
Yet Another Resource Negotiator (YARN), a general purpose scheduler and resource manager.
MapReduce, a batch-based computational engine. MapReduce is implemented as a YARN application.

HDFS

HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after the Google File System (GFS) paper. HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O).
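
To make the effect of large blocks concrete, here is a small sketch in Python that counts how many blocks a file occupies. The 128 MiB default used here is an assumption for illustration (the text above does not state a block size; the actual value depends on the dfs.blocksize setting of your cluster), and block_count is a hypothetical helper name.

```python
MIB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes=128 * MIB):
    """Number of fixed-size blocks needed to hold a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

one_gib = 1024 * MIB
print(block_count(one_gib))             # 8 blocks with 128 MiB blocks
print(block_count(one_gib, 4 * 1024))   # 262144 blocks with 4 KiB blocks
```

Fewer, larger blocks mean fewer block records to track for the same amount of data, which is one reason large files suit HDFS better than many small ones.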

Scalability and availability are also key traits of HDFS, achieved in part due to data replication and fault tolerance. HDFS has two kinds of nodes:

NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS, such as block locations and replication factors.
DataNode: DataNodes are the slave nodes, which are responsible for storing data in HDFS. NameNode manages all the DataNodes.

Hadoop 2 introduced two significant new features for HDFS, Federation and High Availability (HA):

Federation allows HDFS metadata to be shared across multiple NameNode hosts, which aids HDFS scalability and also provides data isolation, allowing different applications or teams to run their own NameNodes without fear of impacting other NameNodes on the same cluster.
High Availability in HDFS removes the single point of failure that existed in Hadoop 1, wherein a NameNode disaster would result in a cluster outage. HDFS HA also offers the ability for failover (the process by which a standby NameNode takes over work from a failed primary NameNode) to be automated.

HDFS Commands

See the hadoop commands cheat sheet post for common HDFS commands.

YARN

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

YARN’s architecture is simple because its primary role is to schedule and manage resources in a Hadoop cluster. The core components in YARN are the ResourceManager and the NodeManager. YARN separates resource management from the processing components.

Cluster resource management means managing the resources of the Hadoop cluster, such as memory and CPU. YARN took over this task from MapReduce, leaving MapReduce streamlined to do only what it does best: data processing.

YARN has a central ResourceManager component that manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and all of them share common resource management.

ResourceManager: It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
NodeManager: NodeManager is installed on every DataNode and it is responsible for the execution of the task on every single DataNode.

MAPREDUCE

MapReduce is a batch-based, distributed computing framework modeled after Google’s paper on MapReduce. It allows you to parallelize work over a large amount of raw data. The MapReduce model simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software. With this abstraction, MapReduce allows the programmer to focus on addressing business needs rather than getting tangled up in distributed system complications.

MapReduce consists of two distinct tasks – Map and Reduce.
As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input to the Reducer.
The reducer receives the key-value pair from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or key-value pairs which is the final output.
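
The map and reduce steps listed above can be sketched in plain Python with the classic word-count example. This is an in-memory illustration of the model only, not Hadoop's API; the function names map_phase, shuffle and reduce_phase are made up for this sketch.

```python
from collections import defaultdict

def map_phase(block):
    """Map: read one block of text and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into one output pair."""
    return {key: sum(values) for key, values in grouped.items()}

blocks = ["big data big cluster", "data node data"]
intermediate = []
for block in blocks:                 # each block would be a separate map task
    intermediate.extend(map_phase(block))

result = reduce_phase(shuffle(intermediate))
print(result)  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

In a real cluster the map tasks run in parallel on different nodes and the grouping step moves data across the network; here everything runs in one process to keep the data flow visible.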

Hadoop distributions

Hadoop is an Apache open source project, and regular releases of the software are available for download directly from the Apache project’s website (http://hadoop.apache.org/releases.html#Download). You can either download and install Hadoop from the website or use a commercial distribution of Hadoop, which will give you the added benefits of enterprise administration software and a support team to consult.

Apache

Apache is the organization that maintains the core Hadoop code and distribution. The challenge with the Apache distributions has been that support is limited to the goodwill of the open source community, and there’s no guarantee that your issue will be investigated and fixed. Having said that, the Hadoop community is a very supportive one, and responses to problems are usually rapid.

Cloudera

CDH (Cloudera Distribution Including Apache Hadoop) is the most tenured Hadoop distribution, and it employs a large number of Hadoop (and Hadoop ecosystem) committers. Doug Cutting, who along with Mike Cafarella originally created Hadoop, is the chief architect at Cloudera. In aggregate, this means that bug fixes and feature requests have a better chance of being addressed in Cloudera compared to Hadoop distributions with fewer committers.

Hortonworks

Hortonworks Data Platform (HDP) is also made up of a large number of Hadoop committers, and it offers the same advantages as Cloudera in terms of the ability to quickly address problems and feature requests in core Hadoop and its ecosystem projects. Hortonworks is also the main driver behind the next-generation YARN platform, which is a key strategic piece keeping Hadoop relevant.

Cloudera Hortonworks Merger

On January 3, 2019, Cloudera, the enterprise data cloud company, announced completion of its merger with Hortonworks. ... Knowing this, there must have been a strong driver that forced Cloudera and Hortonworks together.

Thursday, 24 June 2021

HBase errors, issues and solutions

1) ERROR: KeeperErrorCode = NoNode for /hbase/master

check

hbase(main):001:0> list

TABLE

ERROR: KeeperErrorCode = NoNode for /hbase/master

For usage try 'help "list"'

Took 8.2629 seconds

hbase(main):002:0> list

TABLE

ERROR: Call id=15, waitTime=60008, rpcTimeout=60000

For usage try 'help "list"'

Took 488.5749 seconds

Dead region server

hbase(main):002:0> status

1 active master, 2 backup masters, x servers, x dead, 0.0000 average load

Took 0.0711 seconds

HMASTER UI SHOWING DEAD REGION SERVER

hbase:meta,,1 is not online on

Solution

In progress

How to delete a directory from a Hadoop cluster when its name contains commas?

># hdfs dfs -rm -r /hbase/WALs/wrker-02.xyz.com,16020,1623662453275-splitting

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/hadoop/tmp

{"type":"log","host":"host_name","category":"YARN-yarn-GATEWAY-BASE","level":"WARN","system":"n05dlkcluster","time": "21/06/23 06:13:57","logger":"util.NativeCodeLoader","timezone":"UTC","log":{"message":"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"}}

Deleted /hbase/WALs/wrker-02.xyz.com,16020,1623662453275-splitting

How to clear Dead Region Servers in HBase UI?

Microsoft’s Windows 11 feature: download Android apps from the Amazon Appstore

Microsoft’s Windows 11 Launched....

Android apps coming to Windows 11 as well

The next version of Windows, Windows 11, is here with a complete design overhaul.

Teams integration is being added to Windows 11. Windows 11 brings a fresh interface and a centrally placed Start menu. Called the “next generation” of Windows, it comes with a massive redesign over its predecessor, from an all-new boot screen and startup sound to a centrally placed Start menu and upgraded widgets.

Windows 11 also removes elements including the annoying “Hi Cortana” welcome screen and Live Tiles

Windows 11 is a major release of the Windows NT operating system, announced on June 24, 2021, and developed by Microsoft.

Developer: Microsoft
Written in: C, C++, C#, assembly language
OS family: Microsoft Windows
Source model: Closed-source; source-available (through Shared Source Initiative); some components open source
Marketing target: Personal computing
Available in: 110 languages
Update method: Windows Update, Microsoft Store, Windows Server Update Services (WSUS)
Platforms: x86-64, ARM64
Kernel type: Hybrid (Windows NT kernel)
Userland: Windows API, .NET Framework, Universal Windows Platform, Windows Subsystem for Linux, Android
Default user interface: Windows shell (graphical)
Preceded by: Windows 10 (2015)
Official website: windows.com

Wednesday, 23 June 2021

hbase commands cheat sheet

HBase commands cheat sheet, organized by group

COMMAND GROUPS

Group name: general

Commands: processlist, status, table_help, version, whoami

Group name: ddl

Commands: alter, alter_async, alter_status, clone_table_schema, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters

Group name: namespace

Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

Group name: dml

Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

Group name: tools

Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, cleaner_chore_enabled, cleaner_chore_run, cleaner_chore_switch, clear_block_cache, clear_compaction_queues, clear_deadservers, close_region, compact, compact_rs, compaction_state, flush, is_in_maintenance_mode, list_deadservers, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, splitormerge_enabled, splitormerge_switch, stop_master, stop_regionserver, trace, unassign, wal_roll, zk_dump

Group name: replication

Commands: add_peer, append_peer_exclude_namespaces, append_peer_exclude_tableCFs, append_peer_namespaces, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replicated_tables, remove_peer, remove_peer_exclude_namespaces, remove_peer_exclude_tableCFs, remove_peer_namespaces, remove_peer_tableCFs, set_peer_bandwidth, set_peer_exclude_namespaces, set_peer_exclude_tableCFs, set_peer_namespaces, set_peer_replicate_all, set_peer_serial, set_peer_tableCFs, show_peer_tableCFs, update_peer_config

Group name: snapshots

Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot

Group name: configuration

Commands: update_all_config, update_config

Group name: quotas

Commands: list_quota_snapshots, list_quota_table_sizes, list_quotas, list_snapshot_sizes, set_quota

Group name: security

Commands: grant, list_security_capabilities, revoke, user_permission

Group name: procedures

Commands: list_locks, list_procedures

Group name: visibility labels

Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility

Group name: rsgroup

Commands: add_rsgroup, balance_rsgroup, get_rsgroup, get_server_rsgroup, get_table_rsgroup, list_rsgroups, move_namespaces_rsgroup, move_servers_namespaces_rsgroup, move_servers_rsgroup, move_servers_tables_rsgroup, move_tables_rsgroup, remove_rsgroup, remove_servers_rsgroup

Saturday, 19 June 2021

Hbase quickly count number of rows

There are two ways to quickly count the rows of an HBase table.

Scenario #1

If the HBase table is small, log in to the HBase shell as a valid user and execute:

hbase shell

>count '<tablename>'

Example

>count 'employee'

6 row(s) in 0.1110 seconds

Use RowCounter in HBase. RowCounter is a built-in MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns about metadata inconsistency. It runs the MapReduce job in a single process, but it runs faster if you have a MapReduce cluster in place for it to exploit. It is very helpful when the HBase table stores a huge amount of data.

Scenario #2
If the HBase table is large, run the built-in RowCounter MapReduce job. Log in to a Hadoop machine as a valid user and execute:
$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<tablename>'
Example:
$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'employee'

     ....
     ....
     ....
     Virtual memory (bytes) snapshot=22594633728
                Total committed heap usage (bytes)=5093457920
        org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
                ROWS=6
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
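
The row count sits in the ROWS counter of the job output shown above. As a rough sketch, assuming the job output has already been captured into a string, the value can be pulled out with a regular expression in Python; parse_row_count is a hypothetical helper name, not an HBase utility.

```python
import re

def parse_row_count(job_output):
    """Return the integer after 'ROWS=' in RowCounter output, or None if absent."""
    match = re.search(r"^\s*ROWS=(\d+)\s*$", job_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

# A fragment of the counter section from the example run above.
sample = """\
        org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
                ROWS=6
        File Input Format Counters
                Bytes Read=0
"""
print(parse_row_count(sample))  # 6
```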

Saturday, 12 June 2021

hadoop commands cheat sheet

Hadoop commands cheat sheet | HDFS commands cheat sheet

There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here. The commands can be run with either hadoop fs or hdfs dfs.

hadoop fs -ls <path> list files in the path of the file system

hadoop fs -chmod <arg> <file-or-dir> alters the permissions of a file where <arg> is the octal mode, e.g. 777

hadoop fs -chown <owner>:<group> <file-or-dir> change the owner of a file

hadoop fs -mkdir <path> make a directory on the file system

hadoop fs -put <local-origin> <destination> copy a file from the local storage onto file system

hadoop fs -get <origin> <local-destination> copy a file to the local storage from the file system

hadoop fs -copyFromLocal <local-origin> <destination> similar to the put command but the source is restricted to a local file reference

hadoop fs -copyToLocal <origin> <local-destination> similar to the get command but the destination is restricted to a local file reference

hadoop fs -touchz <path> create an empty file on the file system

hadoop fs -cat <file> copy files to stdout

-------------------------------------------------------------------------

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.

"<file>" means any filename.

----------------------------------------------------------------

-ls <path>

Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

-lsr <path>

Behaves like -ls, but recursively displays entries in all subdirectories of path.

-du <path>

Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.

-dus <path>

Like -du, but prints a summary of disk usage of all files/directories in the path.

-mv <src> <dest>

Moves the file or directory indicated by src to dest, within HDFS.

-cp <src> <dest>

Copies the file or directory identified by src to dest, within HDFS.

-rm <path>

Removes the file or empty directory identified by path.

-rmr <path>

Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).

-put <localSrc> <dest>

Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

-copyFromLocal <localSrc> <dest>

Identical to -put.

-moveFromLocal <localSrc> <dest>

Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.

-cat <filename>

Displays the contents of filename on stdout.

-mkdir <path>

Creates a directory named path in HDFS.

Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).

-setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)

-touchz <path>

Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.

-test -[ezd] <path>

Returns 0 if path exists (-e), has zero length (-z), or is a directory (-d); returns 1 otherwise.

-stat [format] <path>

Prints information about path. Format is a string which accepts file size in bytes (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

-chmod [-R] mode,mode,... <path>...

Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
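
For readers unsure how a 3-digit octal mode maps to permissions, the following Python sketch decodes a few common modes with the standard stat module. It only illustrates the octal notation; describe_mode is a made-up helper and nothing here talks to HDFS.

```python
import stat

def describe_mode(octal_text):
    """Turn a 3-digit octal mode such as '755' into an rwx string."""
    mode = int(octal_text, 8)
    # stat.filemode prefixes a file-type character; drop it to keep only
    # the nine permission characters for owner, group and others.
    return stat.filemode(mode)[1:]

for mode in ("755", "644", "777"):
    print(mode, describe_mode(mode))
# 755 rwxr-xr-x
# 644 rw-r--r--
# 777 rwxrwxrwx
```

Each octal digit covers one scope (owner, group, others) and is the sum of read (4), write (2) and execute (1).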

-chown [-R] [owner][:[group]] <path>...

Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.

-chgrp [-R] group <path>...

Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.

LIST FILES

hdfs dfs -ls / ==> Lists all the files and directories for the given HDFS destination path.

hdfs dfs -ls -d /hadoop ==> Lists directories as plain files. In this case, the command lists the details of the hadoop folder.

hdfs dfs -ls -h /data ==> Formats file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).

hdfs dfs -ls -R /hadoop ==> Recursively lists all files in the hadoop directory and all of its subdirectories.

hdfs dfs -ls /hadoop/dat* ==> Lists all the files matching the pattern. In this case, it lists all the files inside the hadoop directory whose names start with 'dat'.
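One practical detail with patterns such as /hadoop/dat*: if the local file system happens to contain matching paths, an unquoted pattern can be expanded by the local shell before HDFS ever sees it. The sketch below quotes the pattern so that it is passed through unchanged. The wrapper name list_matching is hypothetical, and the combined -R -h call is just one possible way of mixing the options described above.

```shell
#!/usr/bin/env bash
# list_matching is a hypothetical wrapper used only for this illustration.
list_matching() {
  # "$1" is quoted, so the local shell passes the pattern through unchanged.
  hdfs dfs -ls "$1"
}

if command -v hdfs >/dev/null 2>&1; then
  list_matching '/hadoop/dat*'     # HDFS matches the pattern
  hdfs dfs -ls -R -h /hadoop       # options combined: recursive, readable sizes
else
  echo "hdfs client not found; skipping"
fi
```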

OWNERSHIP

hdfs dfs -checksum /hadoop/file1 ==> Dumps checksum information for files that match the file pattern <src> to stdout.

hdfs dfs -chmod 755 /hadoop/file1 ==> Changes permissions of the file.

hdfs dfs -chmod -R 755 /hadoop ==> Changes permissions of the files recursively.

hdfs dfs -chown myuser:mygroup /hadoop ==> Changes owner of the file. In this command, myuser is the owner and mygroup is the group.

hdfs dfs -chown -R hadoop:hadoop /hadoop ==> Changes owner of the files recursively.

hdfs dfs -chgrp ubuntu /hadoop ==> Changes group association of the file.

hdfs dfs -chgrp -R ubuntu /hadoop ==> Changes group association of the files recursively.
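The octal modes passed to -chmod, such as 755, encode one digit each for owner, group, and others, where read counts 4, write counts 2, and execute counts 1. The following is a minimal sketch of that arithmetic in plain shell. The helpers octal_digit and octal_mode are hypothetical names for illustration and are not part of HDFS.

```shell
#!/usr/bin/env bash
# octal_digit and octal_mode are hypothetical helpers, not HDFS commands.
# octal_digit turns one permission string such as "r-x" into its digit.
octal_digit() {
  value=0
  case "$1" in *r*) value=$((value + 4)) ;; esac   # read
  case "$1" in *w*) value=$((value + 2)) ;; esac   # write
  case "$1" in *x*) value=$((value + 1)) ;; esac   # execute
  printf '%s' "$value"
}

# octal_mode builds the 3-digit mode from owner, group, and other strings.
octal_mode() {
  printf '%s%s%s\n' "$(octal_digit "$1")" "$(octal_digit "$2")" "$(octal_digit "$3")"
}

octal_mode rwx r-x r-x   # prints 755
octal_mode rw- r-- r--   # prints 644
```

So hdfs dfs -chmod 755 /hadoop/file1 grants the owner full access and gives group and others read and execute access.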

Saturday, 22 May 2021

How to check the complete list of Kubernetes objects?

The following command lists all Kubernetes object (resource) types available in the cluster:

kubectl api-resources

Example

[root@hsk-controller ~]# kubectl api-resources

NAME                              SHORTNAMES   KIND
bindings                                       Binding
componentstatuses                 cs           ComponentStatus
configmaps                        cm           ConfigMap
endpoints                         ep           Endpoints
events                            ev           Event
limitranges                       limits       LimitRange
namespaces                        ns           Namespace
nodes                             no           Node
persistentvolumeclaims            pvc          PersistentVolumeClaim
persistentvolumes                 pv           PersistentVolume
pods                              po           Pod
podtemplates                                   PodTemplate
replicationcontrollers            rc           ReplicationController
resourcequotas                    quota        ResourceQuota
secrets                                        Secret
serviceaccounts                   sa           ServiceAccount
services                          svc          Service
initializerconfigurations                      InitializerConfiguration
mutatingwebhookconfigurations                  MutatingWebhookConfiguration
validatingwebhookconfigurations                ValidatingWebhookConfiguration
customresourcedefinitions         crd,crds     CustomResourceDefinition
apiservices                                    APIService
controllerrevisions                            ControllerRevision
daemonsets                        ds           DaemonSet
deployments                       deploy       Deployment
replicasets                       rs           ReplicaSet
statefulsets                      sts          StatefulSet
tokenreviews                                   TokenReview
localsubjectaccessreviews                      LocalSubjectAccessReview
selfsubjectaccessreviews                       SelfSubjectAccessReview
selfsubjectrulesreviews                        SelfSubjectRulesReview
subjectaccessreviews                           SubjectAccessReview
horizontalpodautoscalers          hpa          HorizontalPodAutoscaler
cronjobs                          cj           CronJob
jobs                                           Job
brpolices                         br,bp        BrPolicy
clusters                          rcc          Cluster
filesystems                       rcfs         Filesystem
objectstores                      rco          ObjectStore
pools                             rcp          Pool
certificatesigningrequests        csr          CertificateSigningRequest
leases                                         Lease
events                            ev           Event
daemonsets                        ds           DaemonSet
deployments                       deploy       Deployment
ingresses                         ing          Ingress
networkpolicies                   netpol       NetworkPolicy
podsecuritypolicies               psp          PodSecurityPolicy
replicasets                       rs           ReplicaSet
nodes                                          NodeMetrics
pods                                           PodMetrics
networkpolicies                   netpol       NetworkPolicy
poddisruptionbudgets              pdb          PodDisruptionBudget
podsecuritypolicies               psp          PodSecurityPolicy
clusterrolebindings                            ClusterRoleBinding
clusterroles                                   ClusterRole
rolebindings                                   RoleBinding
roles                                          Role
volumes                           rv           Volume
priorityclasses                   pc           PriorityClass
storageclasses                    sc           StorageClass
volumeattachments                              VolumeAttachment

Note: The Kubernetes version is v1.12*
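The SHORTNAMES column is what makes commands like kubectl get po work in place of kubectl get pods. As a small sketch, the helper below looks up the KIND behind a short name. It assumes the trimmed three-column listing shown in this post (the full kubectl output carries additional columns), and kind_for_shortname is a hypothetical name, not a kubectl subcommand.

```shell
#!/usr/bin/env bash
# kind_for_shortname is a hypothetical helper, not a kubectl subcommand.
# It reads "NAME SHORTNAMES KIND" rows on stdin and prints the KIND
# of every row whose SHORTNAMES column contains the short name in $1.
kind_for_shortname() {
  awk -v want="$1" '
    NF == 3 {                          # rows without a short name have 2 fields
      n = split($2, names, ",")        # a row may list several short names
      for (i = 1; i <= n; i++)
        if (names[i] == want) print $3
    }'
}

# Sample rows copied from the listing above.
printf '%s\n' \
  'pods po Pod' \
  'secrets Secret' \
  'customresourcedefinitions crd,crds CustomResourceDefinition' |
  kind_for_shortname crds   # prints CustomResourceDefinition
```

A short name that appears in more than one row, such as ev in the listing above, prints one KIND per matching row.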

Tuesday, 2 March 2021

How To Install Helm 3 on Windows 10

Introduction

Helm is a tool that streamlines installing and managing Kubernetes applications. Think of it like apt/yum/homebrew for Kubernetes.

- Helm renders your templates and communicates with the Kubernetes API

- Helm runs on your laptop, CI/CD, or wherever you want it to run.

- Helm 3 brings many more capabilities and features.

Deploying applications to Kubernetes is a complicated process. Many tools simplify this process, and one of them is Helm.

Prerequisites

A system running Windows 10

Access to a command line/terminal

A Kubernetes cluster installed and configured

To download Helm on Windows:

1. Follow the link below to download the latest Helm version.
https://github.com/helm/helm/releases

2. Locate the Windows amd64 download link from the Installation platform list and click on it to download.


3. Next, extract the windows-amd64 zip to the preferred location, for example:
C:\Users\hss\Downloads\helm-v3.5.2-windows-amd64\windows-amd64

4. Add the folder that contains helm.exe to the PATH environment variable.

5. You're done. Open a command line and start executing helm commands.

C:\Users\>helm version

version.BuildInfo{Version:"v3.4.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}
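If helm version reports that the command is not recognized, the PATH entry is usually the cause. The check below is a minimal sketch that assumes a Bash-compatible shell on Windows, such as Git Bash. The helper name check_helm is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# check_helm is a hypothetical helper, not part of Helm.
# It prints the short Helm version when helm can be found on PATH.
check_helm() {
  if command -v helm >/dev/null 2>&1; then
    helm version --short
  else
    echo "helm not found on PATH"
    return 1
  fi
}

check_helm || echo "Add the folder that contains helm.exe to PATH, then open a new terminal."
```

A new terminal is needed after changing PATH because shells that are already open keep the old value.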

Happy helming!
