Sunday, March 9, 2025

Partitioning

Hive organizes the tables into partitions, and partitions are subdivided into buckets.

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)

ALTER TABLE employee
ADD PARTITION (year=’2012’)
location '/2012/part2012';

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

ALTER TABLE employee PARTITION (year=’1203’)

RENAME TO PARTITION (Yoj=’1203’);

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...;

ALTER TABLE employee DROP [IF EXISTS]

PARTITION (year=’1203’);


Create, Alter, Drop Databases and Tables

CREATE DATABASE [IF NOT EXISTS] userdb;

DROP DATABASE IF EXISTS userdb;

IF NOT EXISTS : This is a optional clause we are using to remove the table if it exist.

SHOW DATABASE : This command is used to list the databases.


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

(OR)

CREATE TABLE IF NOT EXISTS employee (
eid int,
name String,
salary String,
destination String )

COMMENT ‘Employee details’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;

Brief explanation of the above syntax:
DELIMITED is a Row Format
‘\t’ is a Field terminator
‘\n’ is a Lines terminator
TEXTFILE is a File type.

Generally, we use to insert data into sql and oracle databases using INSERT statments but here we insert data into hive using LOAD DATA statement. 

LOAD DATA statement is used to store bulk records. 

We have two options to load data
- Local file system
- Hadoop file system

SYNTAX:
-------------
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 

[PARTITION (partcol1=val1, partcol2=val2 ...)]


QUERY:
-----------
LOAD DATA LOCAL INPATH '/home/user/sample.txt'

OVERWRITE INTO TABLE employee;

PARTITION is optional to use in the query.

ALTER TABLE name RENAME TO new_name

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])

ALTER TABLE name DROP [COLUMN] column_name

ALTER TABLE name CHANGE column_name new_name new_type

ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

DROP TABLE IF EXISTS employee;

Data Types

Data types in Hadoop are classified into 4 types
- Column Types
- Literals
- Null Values
- Complex Types

When the data range exceeds the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

String type data types can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR

Timestamp (“YYYY-MM-DD HH:MM:SS.fffffffff”)

Dates (YYYY-MM-DD)

Decimals (10,0)

Union Types : It is a collection of heterogeneous data types.

Null Value : Missing values are referred here as null values.

Arrays : These are used just like in java.

Maps : These are similar to java maps

Structs : Here complex data is used with comments.


Hadoop/Hive

The term bigdata is used to refer large volumes of data. Using the traditional way to handle this data is very difficult. So, the Apache software has created a framework called hadoop to handle these problems.

Hadoop is the open-source framework to stored and process bigdata. It contains two modules Hadoop Distributed file system (HDFS) and MapReduce  

MapReduce: It is used to process the large volumes or structured, semi-structured and unstructured data.

HDFS : It is used to store the data.

There are certain tools like Sqoop, Pig, and Hive helps hadoop modules.

Hive is not a Relational data base. It is query engine.

Import Command is used to import data from DBMS to hive and Exports command is used to export data into specific location.



Partitioning

Hive organizes the tables into partitions, and partitions are subdivided into buckets. ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION ...