HBASE INTERVIEW QUESTIONS
Top Answers to HBase Interview Questions
(https://intellipaat.com/interview-question/hbase-interview-questions/)
1. Compare HBase & Cassandra
Criteria | HBase | Cassandra
Basis for the cluster | Hadoop | Peer-to-peer
Best suited for | Batch jobs | Data writes
The API | REST/Thrift | Thrift
2. What is Apache HBase?
It is a column-oriented database used to store sparse data sets. It runs on top of the Hadoop Distributed File System (HDFS). Apache HBase is a database that runs on a Hadoop cluster. Clients can access HBase data through either a native Java API or through a Thrift or REST gateway, making it accessible from any language. Some of the key properties of HBase include:
- NoSQL: HBase is not a traditional relational database (RDBMS). HBase relaxes the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
- Wide-Column: HBase stores data in a table-like format with the ability to store billions of rows with millions of columns. Columns can be grouped together in “column families” which allows physical distribution of row values onto different cluster nodes.
- Distributed and Scalable: HBase groups rows into “regions” which define how table data is split over multiple nodes in a cluster. If a region gets too large, it is automatically split to share the load across more servers.
- Consistent: HBase is architected to have “strongly-consistent” reads and writes, as opposed to other NoSQL databases that are “eventually consistent”. This means that once a write has been performed, all read requests for that data will return the same value.
3. Give the name of the key components of HBase
The key components of HBase are Zookeeper, RegionServer, Region, Catalog Tables and HBase Master.
4. What is S3?
S3 stands for Simple Storage Service. It is one of the file systems that HBase can use.
5. What is the use of get() method?
The get() method is used to read data from a table.
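A minimal sketch of a read using the classic (0.9x-era) client API that this document uses elsewhere, assuming a hypothetical table named users with a column family info:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table
        Get get = new Get(Bytes.toBytes("row1"));   // row key to read
        Result result = table.get(get);             // fetch the row
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}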
6. What is the reason of using HBase?
HBase is used because it provides random read and write operations, and it can perform a large number of operations per second on large data sets.
7. In how many modes can HBase run?
There are two run modes of HBase: standalone and distributed.
8. Define the difference between hive and HBase?
HBase supports record-level operations, whereas Hive does not.
9. Define column families?
A column family is a collection of columns, whereas a row is a collection of column families.
10. Define standalone mode in HBase?
It is the default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
11. What are decorating filters?
A decorating filter modifies, or extends, the behavior of another filter to gain additional control over the returned data. Examples include SkipFilter and WhileMatchFilter.
12. What is the full form of YCSB?
YCSB stands for Yahoo! Cloud Serving Benchmark.
13. What is the use of YCSB?
It can be used to run comparable workloads against different storage systems.
14. Which operating system is supported by HBase?
HBase runs on any operating system that supports Java, such as Windows and Linux.
15. What is the most common file system of HBase?
The most common file system of HBase is HDFS, i.e., the Hadoop Distributed File System.
16. Define Pseudodistributed mode?
Pseudodistributed mode is simply a distributed mode that runs on a single host.
17. What is regionserver?
The regionservers file lists the known region server names. A region server itself is the process that serves a group of regions to clients.
18. Define MapReduce.
MapReduce as a process was designed to solve the problem of processing in excess of terabytes of data in a scalable way.
19. What are the operational commands of HBase?
Operational commands of HBase are Get, Delete, Put, Increment, and Scan.
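For instance, Increment atomically adds to a counter cell. A minimal sketch using the classic client API, with a hypothetical counters table and stats column family:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "counters");           // hypothetical table
        Increment inc = new Increment(Bytes.toBytes("page1")); // row key
        inc.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("hits"), 1L); // add 1
        table.increment(inc);
        table.close();
    }
}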
20. Which code is used to open a connection in HBase?
The following code is used to open a connection:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
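The HTable constructor above belongs to the classic (pre-1.0) client API. In HBase 1.0 and later, a connection is typically obtained through ConnectionFactory instead; a minimal sketch, again assuming a hypothetical users table:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ConnectionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table usersTable = connection.getTable(TableName.valueOf("users"))) {
            // use usersTable for get/put/scan operations
        }
    }
}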
21. Which command is used to show the version?
The version command is used to show the version of HBase.
Syntax – hbase> version
22. What is the use of the tools command?
This command is used to list the HBase surgery tools.
23. What is the use of shutdown command?
It is used to shut down the cluster.
24. What is the use of truncate command?
It is used to disable, drop, and recreate the specified tables.
25. Which command is used to run HBase Shell?
$ ./bin/hbase shell command is used to run the HBase shell.
26. Which command is used to show the current HBase user?
The whoami command is used to show the current HBase user.
27. How to delete the table with the shell?
To delete a table, first disable it and then drop it.
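For example, in the HBase shell (table name hypothetical):
hbase> disable 'mytable'
hbase> drop 'mytable'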
28. What is the use of InputFormat in the MapReduce process?
InputFormat splits the input data and returns a RecordReader instance that defines the classes of the key and value objects, and provides a next() method that is used to iterate over each input record.
29. What is the full form of MSLAB?
MSLAB stands for Memstore-Local Allocation Buffer.
30. Define LZO?
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed, and is written in ANSI C.
31. What is HBaseFsck?
HBase comes with a tool called hbck which is implemented by the HBaseFsck class. It provides various command-line switches that influence its behavior.
32. What is REST?
REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
33. Define Thrift?
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
34. What are the fundamental key structures of HBase?
The fundamental key structures of HBase are row key and column key.
35. What is JMX?
The Java Management Extensions technology is the standard for Java applications to export their status.
36. What is Nagios?
Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
37. What is the syntax of describe Command?
The syntax of describe command is –
hbase> describe 'tablename'
38. What is the use of the exists command?
The exists command is used to check whether the specified table exists.
39. What is the use of MasterServer?
The MasterServer is used to assign regions to region servers, and it also handles load balancing.
40. What is HBase Shell?
The HBase shell is a JRuby-based command-line interface by which we communicate with HBase.
41. What is the use of ZooKeeper?
ZooKeeper is used to maintain the configuration information and the communication between region servers and clients. It also provides distributed synchronization.
42. Define catalog tables in HBase?
Catalog tables are used to maintain the metadata information.
43. Define cell in HBase?
A cell is the smallest unit of an HBase table; it stores data in the form of a tuple.
44. Define compaction in HBase?
Compaction is the process of merging HFiles into one file; after the merged file is created, the old files are deleted. There are different types of tombstone markers which make cells invisible, and these tombstone markers are removed during major compaction.
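Compactions can also be requested manually from the HBase shell; compact asks for a minor compaction and major_compact for a major one (table name hypothetical):
hbase> compact 'mytable'
hbase> major_compact 'mytable'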
45. What is the use of HColumnDescriptor class?
HColumnDescriptor stores the information about a column family, such as compression settings, number of versions, etc.
46. What is the function of HMaster?
It is the MasterServer, which is responsible for monitoring all RegionServer instances in a cluster.
47. How many compaction types are in HBase?
There are two types of Compaction i.e. Minor Compaction and Major Compaction.
48. Define HRegionServer in HBase
It is a RegionServer implementation which is responsible for managing and serving regions.
49. Which filter accepts the pagesize as the parameter in HBase?
PageFilter accepts the pagesize as the parameter.
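A minimal sketch of a paged scan using the classic client API, assuming a hypothetical users table. Note that because the filter is applied separately on each region server, the client may receive more rows than the requested page size:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class PageFilterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");  // hypothetical table
        Scan scan = new Scan();
        scan.setFilter(new PageFilter(10));        // limit results to ~10 rows
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(result);
        }
        scanner.close();
        table.close();
    }
}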
50. Which method is used to access HFile directly without using HBase?
The HFile.main() method is used to access an HFile directly, without using HBase.
51. Which type of data can HBase store?
HBase can store any type of data that can be converted into bytes.
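The conversions are typically done with the Bytes utility class; a minimal sketch:
import org.apache.hadoop.hbase.util.Bytes;

public class BytesExample {
    public static void main(String[] args) {
        byte[] s = Bytes.toBytes("hello");      // String to bytes
        byte[] l = Bytes.toBytes(42L);          // long to bytes
        byte[] d = Bytes.toBytes(3.14);         // double to bytes
        System.out.println(Bytes.toString(s));  // back to String
        System.out.println(Bytes.toLong(l));    // back to long
        System.out.println(Bytes.toDouble(d));  // back to double
    }
}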
52. What is the use of Apache HBase?
Apache HBase is used when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
53. What are the features of Apache HBase?
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access.
- Block cache and Bloom filters for real-time queries.
- Query predicate push-down via server-side filters.
- Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
54. How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?
In HBase 0.96, the project moved to a modular structure. Adjust your project’s dependencies to rely upon the hbase-client module or another module as appropriate, rather than a single JAR. You can model your Maven dependency on one of the following, depending on your targeted version of HBase. See Section 3.5, “Upgrading from 0.94.x to 0.96.x” or Section 3.3, “Upgrading from 0.96.x to 0.98.x” for more information.
- Maven Dependency for HBase 0.98
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>0.98.5-hadoop2</version>
</dependency>
- Maven Dependency for HBase 0.96
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>0.96.2-hadoop2</version>
</dependency>
- Maven Dependency for HBase 0.94
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.94.3</version>
</dependency>
55. How should I design my schema in HBase?
HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.
Tables must be disabled when making ColumnFamily modifications, for example:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";
admin.disableTable(table);
HColumnDescriptor cf1 = new HColumnDescriptor("cf1"); // hypothetical new column family
admin.addColumn(table, cf1); // adding new ColumnFamily
HColumnDescriptor cf2 = new HColumnDescriptor("cf2"); // hypothetical existing column family
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
admin.enableTable(table);
56. What is the Hierarchy of Tables in Apache HBase?
The hierarchy for tables in HBase is as follows:
- Tables
  - Column Families
    - Rows
      - Columns
        - Cells
When a table is created, one or more column families are defined as high-level categories for storing data corresponding to an entry in the table. As is suggested by HBase being “column-oriented”, column family data for all table entries, or rows, are stored together. For a given (row, column family) combination, multiple columns can be written at the time the data is written. Therefore, two rows in an HBase table need not necessarily share the same columns, only column families. For each (row, column-family, column) combination HBase can store multiple cells, with each cell associated with a version, or timestamp corresponding to when the data was written. HBase clients can choose to only read the most recent version of a given cell, or read all versions.
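A minimal sketch of how these levels appear in a write, using the classic client API with a hypothetical users table and info column family:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // table
        Put put = new Put(Bytes.toBytes("row1"));   // row key
        // column family "info", qualifiers "name" and "email" -> two cells
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.org"));
        table.put(put);   // each cell is stamped with a version (timestamp)
        table.close();
    }
}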
57. How can I troubleshoot my HBase cluster?
Always start with the master log. Normally it’s just printing the same lines over and over again. If not, then there’s an issue. Google or search-hadoop.com should return some hits for the exceptions you’re seeing.
An error rarely comes alone in Apache HBase; usually, when something goes wrong, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log back to where it all began. For example, one trick with RegionServers is that they will print some metrics when aborting, so grepping for “Dump” should get you close to the start of the problem.
RegionServer suicides are ‘normal’, as this is what they do when something goes wrong. For example, if ulimit and the maximum number of transfer threads (the two most important initial settings; see [ulimit] and dfs.datanode.max.transfer.threads) aren’t changed, at some point it will become impossible for DataNodes to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database was suddenly unable to access files on your local file system; it’s the same with HBase and HDFS.
Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the three-part blog post by Todd Lipcon and the section on long GC pauses above.
58. Compare HBase with Cassandra?
Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.
Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or — even better — billions of rows. Anything less, and you’re advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is, therefore, important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition — such as a column that stores the state field of a customer’s mailing address.
HBase lacks built-in support for secondary indexes but offers a number of mechanisms that provide secondary index functionality. These are described in HBase’s online reference guide and on HBase community blogs.
59. Compare HBase with Hive?
Hive can help the SQL-savvy run MapReduce jobs. Since it is JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries can take a while since they go over all of the data in the table by default. Nonetheless, the amount of data can be limited via Hive’s partitioning feature. Partitioning allows a filter query to run over data stored in separate folders and read only the data that matches the query. It could be used, for example, to process only files created between certain dates, if the files include the date as part of their names.
HBase works by storing data as key/value pairs. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns, or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, but not for columns, and it includes increment/counter functionality.
Hive and HBase are two different Hadoop-based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But hey, why not use them both? Just like Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
60. What version of Hadoop do I need to run HBase?
Different versions of HBase require different versions of Hadoop. Consult the table below to find which version of Hadoop you will need:
HBase Release Number | Hadoop Release Number
0.1.x | 0.16.x
0.2.x | 0.17.x
0.18.x | 0.18.x
0.19.x | 0.19.x
0.20.x | 0.20.x
0.90.4 (current stable) | ???
Releases of Hadoop can be found here. We recommend using the most recent version of Hadoop possible, as it will contain the most bug fixes. Note that HBase-0.2.x can be made to work on Hadoop-0.18.x. HBase-0.2.x ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with the jars from Hadoop-0.18.x.
Also note that after HBase-0.2.x, the HBase release numbering schema will change to align with the Hadoop release number on which it depends.
-------------------------------------------------------------
https://www.edureka.co/blog/interview-questions/hbase-interview-questions/
Apache HBase Interview Questions
Looking out for Apache HBase interview questions that are frequently asked by employers? Here is the blog on Apache HBase interview questions in the Hadoop Interview Questions series. I hope you have not missed the earlier blogs of our Hadoop Interview Questions series.
After going through the HBase interview questions, you will get an in-depth knowledge of questions that are frequently asked by employers in Hadoop interviews related to HBase.
In case you have attended any HBase interview previously, we encourage you to add your questions in the comments tab. We will be happy to answer them, and spread the word to the community of fellow job seekers.
Important points to remember about Apache HBase:
- Apache HBase is a NoSQL column oriented database which is used to store the sparse data sets. It runs on top of the Hadoop distributed file system (HDFS) and it can store any kind of data.
- Clients can access HBase data through either a native Java API, or through a Thrift or REST gateway, making it accessible from any language.
♣ Tip: Before going through these Apache HBase interview questions, I would suggest you go through the Apache HBase Tutorial and HBase Architecture to revise your HBase concepts.
Now moving on, let us look at the Apache HBase interview questions.
1. What are the key components of HBase?
The key components of HBase are Zookeeper, RegionServer and HBase Master.
- Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
- HMaster: It coordinates and manages the Region Servers (similar as NameNode manages DataNodes in HDFS).
- ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
2. When would you use HBase?
HBase is used in cases where we need random read and write operations, and it can perform a large number of operations per second on large data sets. HBase gives strong data consistency. It can handle very large tables with billions of rows and millions of columns on top of a commodity hardware cluster.
3. What is the use of get() method?
get() method is used to read the data from the table.
4. Define the difference between Hive and HBase?
Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), which is a SQL-like language that gets translated into MapReduce jobs. Hive performs batch processing on Hadoop.
Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase partitions the tables, and the tables are further split into column families.
Hive and HBase are two different Hadoop-based technologies: Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of Hadoop. We can use them together: Hive can be used for analytical queries, while HBase is used for real-time querying. Data can even be read and written from HBase to Hive and vice versa.
5. Explain the data model of HBase.
HBase comprises:
- Set of tables.
- Each table consists of column families and rows.
- Row key acts as a Primary key in HBase.
- Any access to HBase tables uses this Primary Key.
- Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.
6. Define column families?
Column Family is a collection of columns, whereas row is a collection of column families.
7. Define standalone mode in HBase?
It is the default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
8. What are decorating filters?
It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data. These types of filters are known as decorating filters; they include SkipFilter and WhileMatchFilter.
9. What is RegionServer?
A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
10. What are the data manipulation commands of HBase?
Data Manipulation commands of HBase are:
- put – Puts a cell value at a specified column in a specified row in a particular table.
- get – Fetches the contents of a row or a cell.
- delete – Deletes a cell value in a table.
- deleteall – Deletes all the cells in a given row.
- scan – Scans and returns the table data.
- count – Counts and returns the number of rows in a table.
- truncate – Disables, drops, and recreates a specified table.
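A short shell session illustrating these commands (table, column family, and values hypothetical):
hbase> put 'employee', 'row1', 'personal:name', 'Alice'
hbase> get 'employee', 'row1'
hbase> scan 'employee'
hbase> count 'employee'
hbase> deleteall 'employee', 'row1'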
11. Which code is used to open a connection in HBase?
Following code is used to open an HBase connection; here, users is the HBase table:
Configuration myConf = HBaseConfiguration.create();
HTable table = new HTable(myConf, "users");
12. What is the use of truncate command?
It is used to disable, drop and recreate the specified tables.
♣ Tip: To delete table first disable it, then delete it.
13. What happens when you issue a delete command in HBase?
Once you issue a delete command in HBase for a cell, column, or column family, it is not deleted instantly. Instead, a tombstone marker is inserted. A tombstone is a specified piece of data which is stored along with the standard data. This tombstone hides all the deleted data.
The actual data is deleted at the time of major compaction. In major compaction, HBase merges and recommits the smaller HFiles of a region to a new HFile. In this process, the same column families are placed together in the new HFile. It drops deleted and expired cells in this process. Scans and gets filter out the deleted cells from their results.
14. What are different tombstone markers in HBase?
There are three types of tombstone markers in HBase:
- Version Marker: Marks only one version of a column for deletion.
- Column Marker: Marks the whole column (i.e., all versions) for deletion.
- Family Marker: Marks the whole column family (i.e., all the columns in the column family) for deletion.
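In the classic (0.9x-era) Java client API, the three markers roughly correspond to the three Delete methods below; a minimal sketch with hypothetical table and column names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");  // hypothetical table
        Delete delete = new Delete(Bytes.toBytes("row1"));
        delete.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));   // latest version -> version marker
        delete.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email")); // all versions -> column marker
        delete.deleteFamily(Bytes.toBytes("stats"));                         // whole family -> family marker
        table.delete(delete);
        table.close();
    }
}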
15. HBase blocksize is configured on which level?
The blocksize is configured per column family and the default value is 64 KB. This value can be changed as per requirements.
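As a minimal sketch, the block size can be set through HColumnDescriptor when defining a column family (family name hypothetical):
import org.apache.hadoop.hbase.HColumnDescriptor;

public class BlocksizeExample {
    public static void main(String[] args) {
        HColumnDescriptor cf = new HColumnDescriptor("data"); // hypothetical family
        cf.setBlocksize(128 * 1024);   // 128 KB instead of the 64 KB default
        System.out.println(cf.getBlocksize());
    }
}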
16. Which command is used to run HBase Shell?
./bin/hbase shell command is used to run the HBase shell. Execute this command in HBase directory.
17. Which command is used to show the current HBase user?
whoami command is used to show HBase user.
18. What is the full form of MSLAB?
MSLAB stands for MemStore-Local Allocation Buffer. Whenever a request thread needs to insert data into a MemStore, it doesn’t allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region.
19. Define LZO?
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that focuses on decompression speed.
20. What is HBase Fsck?
HBase comes with a tool called hbck, which is implemented by the HBaseFsck class. HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. It works in two basic modes: a read-only inconsistency-identifying mode and a multi-phase read-write repair mode.
21. What is REST?
REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
22. What is Thrift?
Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
23. What is Nagios?
Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
24. What is the use of ZooKeeper?
ZooKeeper is used to maintain the configuration information and the communication between region servers and clients. It also provides distributed synchronization. It helps in maintaining server state inside the cluster by communicating through sessions.
Every region server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available. It also provides server failure notifications so that recovery measures can be executed.
25. Define catalog tables in HBase?
Catalog tables are used to maintain the metadata information.
26. Define compaction in HBase?
HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for a read. This process is called compaction. Compaction chooses some HFiles from a region and combines them. There are two types of compactions:
- Minor Compaction: HBase automatically picks smaller HFiles and recommits them to bigger HFiles.
- Major Compaction: In Major compaction, HBase merges and recommits the smaller HFiles of a region to a new HFile.
27. What is the use of HColumnDescriptor class?
HColumnDescriptor stores the information about a column family, like compression settings, number of versions, etc. It is used as input when creating a table or adding a column.
28. Which filter accepts the pagesize as the parameter in HBase?
PageFilter accepts the pagesize as the parameter. It is an implementation of the Filter interface that limits results to a specific page size. It terminates scanning once the number of filter-passed rows is greater than the given page size.
Syntax: PageFilter (<page_size>)
29. How will you design or modify schema in HBase programmatically?
HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.
Creating table schema:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config); // execute commands through admin
// Instantiating the table descriptor class
HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));
// Adding column families to t1
t1.addFamily(new HColumnDescriptor("professional"));
t1.addFamily(new HColumnDescriptor("personal"));
// Create the table through admin
admin.createTable(t1);
♣ Tip: Tables must be disabled when making ColumnFamily modifications.
For modification:
String table = "myTable";
admin.disableTable(table);
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
admin.enableTable(table);
30. What filters are available in Apache HBase?
The filters that are supported by HBase are:
- ColumnPrefixFilter: takes a single argument, a column prefix. It returns only those key-values present in a column that starts with the specified column prefix.
- TimestampsFilter: takes a list of timestamps. It returns those key-values whose timestamps match any of the specified timestamps.
- PageFilter: takes one argument, a page size. It returns at most page-size rows from the table.
- MultipleColumnPrefixFilter: takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes.
- ColumnPaginationFilter: takes two arguments, a limit and an offset. It returns the limit number of columns after the offset number of columns, and does this for all rows.
- SingleColumnValueFilter: takes a column family, a qualifier, a comparison operator and a comparator. If the specified column is not found, all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted.
- RowFilter: takes a comparison operator and a comparator. It compares each row key with the comparator using the comparison operator and, if the comparison returns true, it returns all the key-values in that row.
- QualifierFilter: takes a comparison operator and a comparator. It compares each qualifier name with the comparator using the comparison operator and, if the comparison returns true, it returns all the key-values in that column.
- ColumnRangeFilter: takes either minColumn, maxColumn, or both. It returns only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not. If you don't want to set the minColumn or the maxColumn, you can pass in an empty argument.
- ValueFilter: takes a comparison operator and a comparator. It compares each value with the comparator using the comparison operator and, if the comparison returns true, it returns that key-value.
- PrefixFilter: takes a single argument, a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix.
- SingleColumnValueExcludeFilter: takes the same arguments and behaves the same as SingleColumnValueFilter. However, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.
- ColumnCountGetFilter: takes one argument, a limit. It returns the first limit number of columns in the table.
- InclusiveStopFilter: takes one argument, a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.
- DependentColumnFilter: takes two required arguments, a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp.
- FirstKeyOnlyFilter: takes no arguments. Returns the key portion of the first key-value pair.
- KeyOnlyFilter: takes no arguments. Returns the key portion of each key-value pair.
- FamilyFilter: takes a comparison operator and a comparator. It compares each family name with the comparator using the comparison operator and, if the comparison returns true, it returns all the key-values in that family.
- CustomFilter: you can create a custom filter by extending the Filter class.
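Filters can also be combined. Here is a hedged sketch using FilterList with AND semantics; the family, qualifier, value and row prefix are hypothetical:
Filter cityFilter = new SingleColumnValueFilter(
        Bytes.toBytes("personal"), Bytes.toBytes("city"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("London"));
Filter rowPrefix = new PrefixFilter(Bytes.toBytes("emp_"));
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        Arrays.asList(cityFilter, rowPrefix));
Scan scan = new Scan();
scan.setFilter(filters); // a row is returned only if it passes both filters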
31. How do we back up an HBase cluster?
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has benefits and limitations.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example, if it is being used as a back-end process and not serving front-end web pages.
- Stop HBase: stop the HBase services first.
- Distcp: distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster.
- Restore: the backup of the HBase directory is copied onto the 'real' HBase directory via distcp. The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required: the restore (via distcp) covers a specific HDFS directory (i.e., the HBase part), not the entire HDFS filesystem.
Live Cluster Backup
Environments that cannot tolerate downtime use a live cluster backup; example invocations of both utilities are sketched below.
- CopyTable: the CopyTable utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.
- Export: the Export approach dumps the content of a table to HDFS on the same cluster.
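As a rough illustration, the live-cluster utilities are invoked from the command line; the table names, the peer ZooKeeper address and the HDFS path below are placeholders:
# CopyTable: copy 'mytable' to a peer cluster identified by its ZooKeeper quorum
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=backup-zk:2181:/hbase mytable
# Export: dump the contents of 'mytable' to an HDFS directory on the same cluster
hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backups/mytable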
32. How does HBase handle write failures?
Failures are common in large distributed systems, and HBase is no exception. If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted would be lost. HBase safeguards against that by writing to the WAL before the write completes. Every server that is part of the HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn't considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed File System (HDFS). If HBase goes down, the data that was not yet flushed from the MemStore to an HFile can be recovered by replaying the WAL.
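For illustration, WAL usage is on by default but can be disabled per mutation in the 0.9x-era client API (newer versions express the same thing via setDurability); doing so trades durability for write throughput:
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("London"));
put.setWriteToWAL(false); // skip the WAL: this edit is lost if the server crashes before a flush
table.put(put);           // 'table' is an already opened HTable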
33. While reading data from HBase, from which three places data will be reconciled before returning the value?
The read process goes through the following steps sequentially:
- To read the data, the scanner first looks for the row cell in the block cache, where all the recently read key-value pairs are stored.
- If the scanner fails to find the required result, it moves to the MemStore, the write cache, and searches for the most recently written data that has not yet been flushed to an HFile.
- Finally, it uses Bloom filters and the block cache to load the data from the HFiles on disk.
34. Can you explain data versioning?
In addition to being a schema-less database, HBase is also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying and deleting a cell are all treated identically; they all create new versions. When a cell exceeds the maximum number of versions, the extra records are dropped during the major compaction.
Instead of deleting an entire cell, you can operate on a specific version within that cell. Values within a cell are versioned, and a version is identified by its timestamp. If a version is not specified, the current timestamp is used to retrieve the version. The default number of cell versions is three.
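A small sketch of working with versions through the 0.9x-era client API; the row key and the maximum shown are illustrative:
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);          // return up to 3 versions of each cell, newest first
Result result = table.get(get); // 'table' is an already opened HTable
// get.setTimeStamp(ts) would instead fetch the version written at one exact timestamp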
35. What is a Bloom filter and how does it help in searching rows?
HBase supports Bloom filters to improve the overall throughput of the cluster. An HBase Bloom filter is a space-efficient mechanism to test whether an HFile contains a specific row or row-column cell.
Without a Bloom filter, the only way to decide whether a row key is present in an HFile is to check the HFile's block index, which stores the start row key of each block in the HFile. Many rows fall between two start keys, so HBase has to load the block and scan the block's keys to figure out whether that row key actually exists.
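Bloom filters are enabled per column family. A hedged sketch in the 0.9x-era API (the BloomType enum's package moved in later HBase versions); the family name is illustrative:
HColumnDescriptor cf = new HColumnDescriptor("personal");
cf.setBloomFilterType(StoreFile.BloomType.ROW); // ROW checks row keys; ROWCOL checks row+column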
-----------------------------------------------------------
Interview Questions for HBase
http://www.bigdatatrunk.com/top-50-hbase-interview-questions/
Q1 What are the different types of tombstone markers in HBase for deletion?
Answer: There are three different types of tombstone markers in HBase for deletion:
1) Family Delete Marker: this marker marks all the columns of a column family.
2) Version Delete Marker: this marker marks a single version of a column.
3) Column Delete Marker: this marker marks all the versions of a column.
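These three markers map directly onto the 0.9x-era Delete API. A hedged sketch; the row, family and qualifier names are hypothetical:
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteFamily(Bytes.toBytes("personal"));                         // family delete marker
delete.deleteColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));  // version delete marker (latest version only)
delete.deleteColumns(Bytes.toBytes("personal"), Bytes.toBytes("city")); // column delete marker (all versions)
table.delete(delete); // 'table' is an already opened HTable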
Q2 When should you use HBase and what are the key components of HBase?
Answer: HBase should be used when the big data application has:
1) a variable schema,
2) data stored in the form of collections,
3) a need for key-based access to data while retrieving.
Key components of HBase are:
Region: this component contains an in-memory data store and HFiles.
Region Server: this monitors the regions.
HBase Master: it is responsible for monitoring the region servers.
Zookeeper: it takes care of the coordination between the HBase Master component and the client.
Catalog Tables: the two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
Q3 Explain the difference between HBase and Hive.
Answer: HBase and Hive are completely different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
Q4 What is Row Key?
Answer: Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells that have the same RowKey are co-located on the same server. The RowKey is internally regarded as a byte array.
Q5 Explain the difference between RDBMS data model and HBase data model.
Answer: RDBMS is a schema-based database, whereas HBase has a schema-less data model.
RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data.
Q6 What are the different operational commands in HBase at record level and table level?
Answer: Record-level operational commands in HBase are put, get, increment, scan and delete.
Table-level operational commands in HBase are describe, list, drop, disable and scan.
Q7 Explain the different catalog tables in HBase.
Answer: The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
Q8 Explain the process of row deletion in HBase.
Answer: On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
Q9 What are column families? What happens if you alter the block size of a ColumnFamily on an already populated database?
Answer: The logical division of data is represented by a key known as the column family. Column families consist of the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data will remain within the old block size whereas the new data that comes in will take the new block size. When compaction takes place, the old data will take the new block size so that the existing data is read correctly.
Q10 Explain HLog and WAL in HBase.
Answer: All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions performed by that particular region server. WAL stands for Write Ahead Log; all the HLog edits are written to it immediately. WAL edits remain in memory until the flush period in the case of deferred log flush.
Q11 What is NoSQL?
Answer: Apache HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
Q12 What is a region server?
Answer: A region server is the process that hosts and serves regions. The related conf/regionservers file is a file which lists the known region server hosts.
Q13 Give the name of the key components of HBase
Answer: The key components of HBase are Zookeeper, RegionServer, Region, Catalog Tables and HBase Master.
Q14 What is the reason for using HBase?
Answer: HBase is used because it provides random read and write operations, and it can perform a large number of operations per second on large data sets.
Q15 Define standalone mode in HBase?
Answer: It is the default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
Q16 Which operating system is supported by HBase?
Answer: HBase supports any OS that supports Java, such as Windows or Linux.
Q17 What are the main features of Apache HBase?
Answer: Apache HBase has many features. It supports both linear and modular scaling; HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows (automatic sharding). HBase also supports a block cache and Bloom filters for high-volume query optimization.
Q18 What is the difference between HDFS/Hadoop and HBase?
Answer: HDFS doesn't provide fast lookups of records in a file, whereas HBase provides fast lookups of records in large tables.
Q19 What are the data model operations in HBase?
Answer: 1) Get: returns attributes for a specified row. Gets are executed via HTable.get.
2) Put: either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists). Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer).
3) Scan: allows iteration over multiple rows for specified attributes.
4) Delete: removes a row from a table. Deletes are executed via HTable.delete.
HBase does not modify data in place, so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compaction. A short sketch of all four operations follows.
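A compact sketch of the four operations in the 0.9x-era API; the table, family and qualifier names are hypothetical:
HTable table = new HTable(config, "employee");
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("London"));
table.put(put);                                         // write a cell
Result row = table.get(new Get(Bytes.toBytes("row1"))); // read one row
ResultScanner scanner = table.getScanner(new Scan());   // iterate over rows
for (Result r : scanner) { /* process each row */ }
scanner.close();
table.delete(new Delete(Bytes.toBytes("row1")));        // write a tombstone
table.close();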
Q20 How many filters are available in Apache HBase?
Answer: In total, HBase supports 18 filters:
ColumnPrefixFilter
TimestampsFilter
PageFilter
MultipleColumnPrefixFilter
FamilyFilter
ColumnPaginationFilter
SingleColumnValueFilter
RowFilter
QualifierFilter
ColumnRangeFilter
ValueFilter
PrefixFilter
SingleColumnValueExcludeFilter
ColumnCountGetFilter
InclusiveStopFilter
DependentColumnFilter
FirstKeyOnlyFilter
KeyOnlyFilter
Q21 Does HBase support SQL?
Answer: Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. Apache Phoenix can also be used to retrieve data from HBase using SQL queries.
Q22 Is there any difference between HBase data model and RDBMS data model?
Answer: In HBase, data is stored as a table (with rows and columns) similar to an RDBMS, but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
Q23 What is Apache HBase?
Answer: Apache HBase is a sub-project of Apache Hadoop; it is a NoSQL database (the Hadoop database) and a big data store that is distributed and scalable. Use Apache HBase when you need random, real-time read/write access to your big data: tables of billions of rows by millions of columns, atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modelled after Google's Bigtable, and it provides Bigtable-like capabilities on top of Hadoop and HDFS.
Q24 What is the use of the shutdown command?
Answer: It is used to shut down the cluster.
Q25 How do you delete a table with the shell?
Answer: To delete a table, first disable it, then drop it, as shown below.
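The usual shell sequence looks like this (the table name is a placeholder):
hbase> disable 'mytable'
hbase> drop 'mytable'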
Q26 What is the full form of MSLAB?
Answer: MSLAB stands for Memstore-Local Allocation Buffer.
Q27 What is REST?
Answer: REST stands for Representational State Transfer. It defines the semantics so that the protocol can be used in a generic way to address remote resources, and it provides support for different message formats, offering many choices for a client application to communicate with the server.
Q28 What is the difference between HBase and Hadoop/HDFS?
Answer: HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general-purpose file system and does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
Q29 How many operational commands are there in HBase?
Answer: There are five main commands in HBase.
- Get
- Put
- Delete
- Scan
- Increment
Q30 Why can't I iterate through the rows of a table in reverse order?
Answer: Because of the way HFile works: for efficiency, column values are put on disk with the length of the value written first and the bytes of the actual value written second. To navigate through these values in reverse order, these length values would need to be stored twice (at the end as well) or in a side file. A robust secondary index implementation is the likely solution here to ensure the primary use case remains fast.
Q31 Explain what HBase is.
Answer: HBase is a column-oriented database management system which runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.
In HBase, a master node regulates the cluster, and region servers store portions of the tables and operate on the data.
Q32 How to connect to HBase?
Answer: A connection to HBase is established through the HBase shell, or programmatically through the Java client API.
Q33 Why do we describe HBase as schema-less?
Answer: Other than the column family name, HBase doesn't require you to tell it anything about your data ahead of time. That's why HBase is often described as a schema-less database.
Q34 What is an HFile?
Answer: All columns in a column family are stored together in the same low-level storage file, called an HFile.
Q35 How is data written into HBase?
Answer: When data is updated, it is first written to a commit log, called a write-ahead log (WAL) in HBase, and then stored in the in-memory MemStore. Once the data in memory has exceeded a given maximum value, it is flushed as an HFile to disk. After the flush, the commit logs can be discarded up to the last unflushed modification.
Q36 How is data read back from HBase?
Answer: Reading data back involves a merge of what is stored in the MemStores, that is, the data that has not been written to disk, and the on-disk store files. Note that the WAL is never used during data retrieval; it is used solely for recovery purposes when a server has crashed before writing the in-memory data to disk.
Q37 What is the role of ZooKeeper in HBase?
Answer: ZooKeeper maintains configuration information, provides distributed synchronization, and also maintains the communication between clients and region servers.
Q38 What are the different types of filters used in HBase?
Answer: Filters are used to get specific data from an HBase table rather than all the records.
They are of the following types:
Column value filters
Column value comparators
KeyValue metadata filters
RowKey filters
Q39 How does HBase provide high availability?
Answer: HBase uses a feature called region replication. With this feature, for each region of a table there are multiple replicas that are opened in different region servers. The load balancer ensures that the region replicas are not co-hosted in the same region servers.
Q40 Explain what the row key is.
Answer: The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
Q41 What are the different compaction types in HBase?
Answer: There are two types of compaction: major and minor. In minor compaction, adjacent small HFiles are merged to create a single HFile without removing the deleted values; files to be merged are chosen by HBase automatically.
In major compaction, all the HFiles of a column family are merged and a single HFile is created. The deleted values are discarded, and major compaction is generally triggered manually.
Q42 What is TTL (Time To Live) in HBase?
Answer: TTL is a data retention technique with which the version of a cell can be preserved until a specific time period. Once that timestamp is reached, the specific version will be removed.
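TTL is set per column family, in seconds. A minimal sketch in the 0.9x-era API; the family name and period are illustrative:
HColumnDescriptor cf = new HColumnDescriptor("personal");
cf.setTimeToLive(24 * 60 * 60); // cells become invisible one day after their timestamp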
Q43 In HBase, what is log splitting?
Answer: When a region is edited, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting.
Q45 Why is MultiWAL needed?
Answer: With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck. MultiWAL allows a RegionServer to write multiple WAL streams in parallel.
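MultiWAL is enabled through configuration. A hedged hbase-site.xml sketch, using the property name documented for newer HBase versions:
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value> <!-- RegionGroupingProvider: multiple WALs per RegionServer -->
</property>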
Q46 What are the different block caches in HBase?
Answer: HBase provides two different BlockCache implementations: the default on-heap LruBlockCache and the BucketCache, which is (usually) off-heap.
Q47 Can you create an HBase table without assigning a column family?
Answer: No. The column family also impacts how the data is stored physically in the HDFS file system, hence there is a mandate that you should always have at least one column family. Column families can also be altered once the table is created.
Q48 What is an HFile?
Answer: The HFile is the underlying storage format for HBase. HFiles belong to a column family, and a column family can have multiple HFiles. But a single HFile can't have data for multiple column families.