Looking for a Tutor Near You?

Post Learning Requirement »
x

Choose Country Code

x

Direction

x

Ask a Question

x

x
x
x
Hire a Tutor

Hadoop & Big Data

Loading...

Published in: Big Data & Hadoop
2,280 Views

Notes On Basics of Hadoop.

Priyashree B / Mumbai

35 years of teaching experience

Qualification: M.Tech (RGPV BHOPAL, MP - 2016)

Teaches: Mental Maths, All Subjects, EVS, Mathematics, School Level Computer, Science, Social Studies

Contact this Tutor
  1. What is Big Data? 1. Volume: Data in TB or PB or EB etc. 2. Velocity: So many users creating data in high speed, your application is receiving data in high speed. 3. Variety: Structured, text, image, video, audio. How does SQL based solution fail? 1. Scaling: Vertical scaling in SQL. Horizontal scaling in Hadoop. 2. Unknown format of data: Sometimes we dont know data is coming in which datatype. Hadoop works in read time schema. SQL solutions work in write time schema. 3. Unstructured data. HORIZONTAL Increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit. based on partitioning of the data i.e. each node contains only part of the data easier to scale dynamically by adding more machines into the existing pool Cassandra , MongoDB What is write time schema and read time schema? What is Schema on Write? VERTICAL scale by adding more power (CPU, RAM) to an existing machine. data resides on a single node and scaling is done through multi-core i.e. spreading the load between the CPU and RAM resources of that machine. limited to the capacity of a single machine, scaling beyond that capacity often involves downtime and comes with an upper limit MySQL - Amazon RDS (The cloud version of MySQL). It provides an easy way to scale vertically by switching from small to bigger machines. This process often involves downtime. Schema on write has been the standard for many years in relational databases. Before any data is written in the database, the structure of that data is strictly defined, and that metadata stored and tracked. Irrelevant data is discarded, data types, lengths and positions are all delineated. The schema; the columns, rows, tables and relationships are all defined first for the specific purpose that database will serve.
  2. Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed and made to fit in that structure before it can be stored in a process generally referred to as E TL (Extract Transform Load). That is why it is called "schema on write" because the data structure is already defined when the data is written and stored. What is Schema on Read? Schema on read is the revolutionary concept that you don't have to know what you're going to do with your data before you store it. Data of many types, sizes, shapes and structures can all be thrown willy nilly into the Hadoop Distributed File System, and other Hadoop data storage systems. While some metadata, data about that data, needs to be stored, so that you know what's in there, you don't yet know how it will be structured. It is entirely possible that data stored for one purpose might even be used for a completely different purpose than originally intended. The data is stored without first deciding what piece of information will be important, what should be used as a unique identifier, or what part of the data needs to be summed and aggregated to be useful. Therefore, the data is stored in its original granular form, with nothing thrown away because it is unimportant, nothing consolidated into a composite, and nothing defined as key information. In fact, no structural information is defined at all when the data is stored. When someone is ready to use that data, then, at that time, they define what pieces are essential to their purpose. They define where to find those pieces of information that matter for that purpose, and which pieces of the data set to ignore. This is why it is called "schema on read" since the schema is defined at the time the data is read and used, not at the time that it is written and stored. Advantages of Schema on Write PRECISION AND QUERY SPEED. Because you define your data structure ahead of time, when you query, you know exactly where your data is. The structure is generally optimized for the fastest possible return of data for the types of questions the data store was designed to answer. This means you write very simple SQL and get back very fast answers. In addition, before data is stored in a database, the data must go through a rigorous process to make sure it matches the structure exactly, and will serve the purpose of the database as it is meant to. The data's quality is checked and enhanced or scrubbed. Duplicates are found and resolved. The data is checked against business rules to make certain it is valid and useful for the purpose defined. This means that the answers you get from querying this data are sharply defined, precise and trustworthy, with little margin for error if your E TL processes and your validation checking have done their job. Advantages of Schema on Read Flexibility in purpose and query power. Because your data is stored in its original form, nothing is discarded, or altered for a specific purpose. This means that your query capabilities are very flexible. You can ask any question that the original data set might hold answers for, not just the type of questions a data store was originally created to answer. You have the flexibility to ask things you hadn't even thought of when the data was stored. Also, different types of data generated by different sources can be stored in the same place. This allows you to query multiple data stores and types at once. If the answer you need isn't in the data you originally thought it would be in, perhaps it could be found if you combined it with other data sources. This power of this ability cannot be underestimated. This is what makes the Hadoop data lake concept which puts all your available data sets in their original form in a single location such a potent one. Disadvantages of Schema on Write The main disadvantages of schema on write are query limitations and inflexible purpose.
  3. The dark side of the tightly controlled precision of a schema on write data store is that the data has been altered and structured specifically to serve a specific purpose. Chances are high that, if another purpose is found for that data, the data store will not suit it well. All the speed that you got from customizing the data structure to match a specific problem set will cost you if you try to use it for a different problem set. And there's no guarantee that the altered version of the data will even be useful at all for the new, unanticipated need. There's no ability to query the data in its original form, and certainly no ability to query any other data set that isn't in the structured format. Also, to fit the data into the structure, E TL processes and validation rules needed to clean, de- dupe, check and transform that data. Those processes take time to build, time to execute, and time to alter if you need to change it to suit a different purpose. There is always a time cost to imposing a schema on data. In schema on write strategies, that time cost is paid in the data loading stage. Disadvantages of Schema on Read The main disadvantages of schema on read are inaccuracies and slow query speed. Since the data is not subjected to rigorous ETL and data cleansing processes, nor does it pass through any validation, that data may be riddled with missing or invalid data, duplicates and a bunch of other problems that may lead to inaccurate or incomplete query results. In addition, since the structure must be defined when the data is queried, the SQL queries tend to be very complex. They take time to write, and even more time to execute. As I said before, there is always a time cost to imposing schema. In schema on read strategies, that time cost is paid when you query the data. Unlike databases, Hadoop does not know what kind of data will be stored as files in HDFS. It does not know whether that data has certain fields in it, or whether its structure is opaque. It is only during the processing step that some kind of a structure will be imposed on those raw files. This is known as the schema-on-read approach to data management:you just dump whatever raw data comes your way into files in HDFS and you do not thinkabout it until the processing step. The repository of unstructured data is called a data/ake. Compare it with the traditional database design, where data is neatly put into tables with fixed columns. This is known as schema-on-write: you actually have to structure your data upfront, before you can store it. This allows traditional relational databases to implement things like indexing and upfront optimizations of SQL queries. With Hadoop, none of it comes for free, but you do get a much higher scalability and fault tolerance. Advantages of Hadoop: 1. Scalability: Horizontal scaling. No upper limit of data that it can handle. 2. Cost: Based on commodity hardware. 3. Can handle unstructured data.
  4. 4. Fault tolerance. 5. High throughput : Number of records that it can process per second. 6. Best for OLAP due to big volume of data handling and high throughput. Disadvantages of Hadoop: 1. Low latency: Time to output first record is slow. 2. Poor in OLTP: Due to low latency it's not good for OLTP. Solution is NoSQL like HBase. Hadoop Core components: Hadoop is combined name for following three apache projects. 1. HDFS 2. MapReduce 3. YARN Hadoop Ecosystem: 1. HDFS 2. MapReduce 3. YARN 4. HBase 5. Sqoop 6. Hive 7. Pig 8. Flume 9. Mahout
  5. Latest additions: Kafka, Spark, Nify, Flink and what not.. Data Storage Architecture: Name Node(Master) and Data Node(Slave): - Hadoop data storage works on Master and Slave architecture. - Every hadoop cluster has a Name Node(master) and several Data Nodes(slaves). - Name Node maintains a registry of Data that is present in the cluster and Data Nodes actually host the Data. Here's what happens when client (means Human Being) tries to store a file in HDFS: Few Very Important Concepts: Block Size: - Every file that is present in hdfs, is divided into one or multiple blocks depending upon the file size. The default block size in Hadoop 2.x is 128MB. It was 64MB by default in Hadoop 0 and 1. - So, suppose if I have a file of 400MB then this file will be stored in HDFS as 3 blocks of size 384MB and 4th block of size 16MB. Hence there are total 4 blocks for one file of size 400MB.
  6. - These 4 blocks may or may not reside on same node. It can be like : node I - block 0,2 node 2 - block 1 node 3 - block 3 or any other combination is also possible. - You can find and see block information and location of blocks of each file in NN as NN contains metadata for each file. Que: Since 4th block contains only 16MB out of 128MB block size, does that extra space gets wasted? No those spaces will be used by some other file to store its data. Replication Factor: Fault Tolerance - remember the file hdfs-site.xml Rack Awareness: - Its a general practice that suggests that two replications of a file should reside in same rack and third in other rack of cluster. Data Processing Architecture: MapReduce:
  7. Job Tracker: for resource allocation and job scheduling and Task Tracker: to actually run the process. - Job Tracker and TT were part of Hadoop0 and 1, and now in Hadoop2 they are replaced by YARN. Name Node failure and HA: - A NN failure is the one and only failure that brings down the entire cluster down. Hence it is a serious situation and must be taken care of. - NN failure is handled by using two ways: 1. Secondary Name Node(SNN): It maintains a copy of NN and on failure starts up. 2. High Availability: It involves Quoram Journal Manager(QJM), ZooKeeper and StandBy NN. It does not need secondary name node. HIGH In Hadoop 2.0, NN was SINGLE POINT OF FAILURE. Entire cluster would become unavailable if NN failed» even maintenanace events(s/w & h/w upgrades) on NN lead to periods of cluster downtime. Solution» HA== it enables a cluster to run redundant NN in an ACTIVE/STANDBY Configuration. Ways to configuring NN HA- 1) WEB UI 2) Manually editing the config. Files and starting/restarting the necessary daemons. YARN: Yet Another Resource Manager/Negotiator
  8. - There are four major components in YARN: 1. Resource Manager (Like Job Tracker) 2. Node Manager (Like Tracker) 3. Container 4. Application Master - How YARN funtions: Client #1 NodeManager Container Resource Manager o eM na er App Master SPARK YARN NodeManager Container 1. Client(edge node) Requests RM for resources to run a job RM contacts any one NM and launches a service called as Application Master and assigns few other NMs to this AM to run the jobs This AM launches Container services in these other NMs --> AM then disconnects with RM and directly talks to cleint and sends logs etc. 2. If client launches new application now, then RM initiates another AM in other NM. 3. Even if RM fails, AM continues to run the job as it is in direct contact with client. 4. However, if AM needs more nodes now after RM failure, it can't get.
  9. YARN Client #1 Client #2 NodeManage App Master e Container Hadoop Commands: 1. hdfs dfs -Is /filepath or hadoop fs -Is /filepath Resource Manager Scheduler Apps Master N eManager App Master Container I'm HA via ZooKeeper NodeManager Container Container 2. FEW MORE EXPLANATION FOR DOUBTS: Actual path is : hadoop fs -Is scheme://namenode/filepath eg: for hdfs: hadoop fs -Is hdfs://namenodeaddress/filepath for unix file system: hadoop fs -Is file://filepath for s3: hadoop fs -Is s3://filepath 3. Then many commands like:
  10. hadoop fs -put localPath /hdfsPath hadoop fs -copyFromLocal /hdfsPath hadoop fs -cp or -mv or -rm -r -skipTrash etc. HA using QJM (Quorum journal manager) QJM JN-I Write editLog ti Client Block Report Heartbeat 4. More list here: IMP POINTS+ JN-2 JN-3 Read editLog Standby NN Block Report Heartbeat Top use case families that have emerged around Hadoop are: Data Discovery, Single View, Predictive Analytics, Active Archive, ETL Onboard, Data Enrichment, etc. Hadoop is an OLAP data platform and it evolved as part of the database industry. Hadoop is: - a scalable, fault tolerant, open source framework for the distributed storing (HDFS) and processing (YARN) of large sets of data on commodity hardware - a clean-room implementation of two papers from Google - optimized for schema-on-read use cases - the basis of the enterprise "data lake" architecturE Apache Bigtop is to Hadoop what Debian is to Linux
  11. • ODPi helps ASF with harmonizing industry expectations around the Hadoop ecosystem.