Cloudera CCA Spark and Hadoop Developer - CCA175 FREE EXAM DUMPS QUESTIONS & ANSWERS

CORRECT TEXT
Problem Scenario 79 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of products table : (product_id | product_category_id | product_name | product_description | product_price | product_image)
Please accomplish following activities.
1 . Copy "retail_db.products" table to hdfs in a directory p93_products
2 . Filter out all the empty prices
3 . Sort all the products based on price in both ascending as well as descending order.
4 . Sort all the products based on price as well as product_id in descending order.
5 . Use the below functions to do data ordering or ranking and fetch top 10 elements: top(), takeOrdered(), sortByKey()
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Import Single table .
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products -m 1
Note : Please check that there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to hdfs.
Step 2 : Read the data from one of the partitions created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3 : Load this directory as RDD using Spark and Python (Open pyspark terminal and do following).
productsRDD = sc.textFile("p93_products")
Step 4 : Filter empty prices, if exists
#filter out empty prices lines
nonempty_lines = productsRDD.filter(lambda x: len(x.split(",")[4]) > 0)
Step 5 : Now sort data based on product_price in ascending order.
sortedPriceProducts=nonempty_lines.map(lambda line: (float(line.split(",")[4]),line.split(",")[2])).sortByKey()
for line in sortedPriceProducts.collect(): print(line)
Step 6 : Now sort data based on product_price in descending order.
sortedPriceProducts=nonempty_lines.map(lambda line:
(float(line.split(",")[4]),line.split(",")[2])).sortByKey(False)
for line in sortedPriceProducts.collect(): print(line)
Step 7 : Get highest price products name.
sortedPriceProducts=nonempty_lines.map(lambda line : (float(line.split(",")[4]),line.split(",")[2])).sortByKey(False).take(1)
print(sortedPriceProducts)
Step 8 : Now sort data based on product_price as well as product_id in descending order.
#Dont forget to cast string
#Tuple as key ((price,id),name)
sortedPriceProducts=nonempty_lines.map(lambda line : ((float(line.split(",")[4]),int(line.split(",")[0])),line.split(",")[2])).sortByKey(False).take(10)
print(sortedPriceProducts)
Step 9 : Now sort data based on product_price as well as product_id in descending order, using top() function.
#Dont forget to cast string
#Tuple as key ((price,id),name)
sortedPriceProducts=nonempty_lines.map(lambda line: ((float(line.split(",")[4]),int(line.split(",")[0])),line.split(",")[2])).top(10)
print(sortedPriceProducts)
Step 10 : Now sort data based on product_price as ascending and product_id in ascending order, using takeOrdered() function.
#Dont forget to cast string
#Tuple as key ((price,id),name)
sortedPriceProducts=nonempty_lines.map(lambda line: ((float(line.split(",")[4]),int(line.split(",")[0])),line.split(",")[2])).takeOrdered(10, lambda tuple : (tuple[0][0],tuple[0][1]))
Step 11 : Now sort data based on product_price as descending and product_id in ascending order, using takeOrdered() function.
# Dont forget to cast string
# Tuple as key ((price,id),name)
# Negating the key with minus (-) gives descending order; this works only for numeric values.
sortedPriceProducts=nonempty_lines.map(lambda line: ((float(line.split(",")[4]),int(line.split(",")[0])),line.split(",")[2])).takeOrdered(10, lambda tuple : (-tuple[0][0],tuple[0][1]))
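The ordering logic of Steps 8 to 11 can be checked without a cluster. What follows is a minimal plain-Python sketch, not PySpark: the sample lines are made up for illustration, and heapq.nsmallest / heapq.nlargest stand in for takeOrdered() and top() on the assumption that those return the n smallest and n largest elements by key.

```python
import heapq

# Hypothetical sample lines in the products layout:
# product_id,product_category_id,product_name,product_description,product_price,product_image
sample_lines = [
    "1,2,Glove,,59.98,img1",
    "2,2,Helmet,,129.99,img2",
    "3,2,Cleat,,,img3",  # empty price, must be filtered out
    "4,2,Bat,,129.99,img4",
    "5,2,Ball,,9.99,img5",
]

# Same filter as Step 4: keep lines whose price field is not empty.
nonempty_lines = [l for l in sample_lines if len(l.split(",")[4]) > 0]

# Same record shape as Steps 8 to 11: ((price, id), name).
records = [
    ((float(l.split(",")[4]), int(l.split(",")[0])), l.split(",")[2])
    for l in nonempty_lines
]

# Price ascending, id ascending (Step 10).
asc = heapq.nsmallest(3, records, key=lambda t: (t[0][0], t[0][1]))

# Price descending, id ascending (Step 11): negate the numeric price.
desc_price = heapq.nsmallest(3, records, key=lambda t: (-t[0][0], t[0][1]))

# Price descending, id descending (Steps 8 and 9): largest keys first.
desc_both = heapq.nlargest(3, records, key=lambda t: t[0])

print([name for _, name in asc])
print([name for _, name in desc_price])
print([name for _, name in desc_both])
```

The two products priced 129.99 show the difference between the last two orderings: negating only the price keeps the ids ascending, while taking the largest composite keys reverses both fields.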
CORRECT TEXT
Problem Scenario 20 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Write a Sqoop Job which will import "retail_db.categories" table to hdfs, in a directory name "categories_target_job".
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Connecting to existing MySQL Database
mysql --user=retail_dba --password=cloudera retail_db
Step 2 : Show all the available tables
show tables;
Step 3 : Below is the command to create Sqoop Job (Please note that the space in "-- import" is mandatory)
sqoop job --create sqoopjob \
-- import \
--connect "jdbc:mysql://quickstart:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table categories \
--target-dir categories_target_job \
--fields-terminated-by '|' \
--lines-terminated-by '\n'
Step 4 : List all the Sqoop Jobs
sqoop job --list
Step 5 : Show details of the Sqoop Job
sqoop job --show sqoopjob
Step 6 : Execute the sqoopjob
sqoop job --exec sqoopjob
Step 7 : Check the output of import job
hdfs dfs -ls categories_target_job
hdfs dfs -cat categories_target_job/part*
CORRECT TEXT
Problem Scenario 44 : You have been given 4 files , with the content as given below:
spark11/file1.txt
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework
spark11/file2.txt
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.
spark11/file3.txt
This approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking
spark11/file4.txt
Apache Storm is focused on stream processing or what some call complex event processing. Storm implements a fault tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format
(spark11/file1.txt)
(spark11/file2.txt)
(spark11/file3.txt)
(spark11/file4.txt)
Write a Spark program which finds the most frequently occurring word in each file and outputs it together with the file name.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create all 4 files first using Hue in hdfs.
Step 2 : Load all files as RDDs
val file1 = sc.textFile("spark11/file1.txt")
val file2 = sc.textFile("spark11/file2.txt")
val file3 = sc.textFile("spark11/file3.txt")
val file4 = sc.textFile("spark11/file4.txt")
Step 3 : Now do the word count for each file and sort in reverse order of count.
val content1 = file1.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content2 = file2.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content3 = file3.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content4 = file4.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
Step 4 : For each file, create an RDD holding the file name and its most frequent word with its count.
// An RDD cannot be indexed with (0); first() returns the top (word, count) pair of the sorted RDD.
val file1word = sc.makeRDD(Array(file1.name+"->"+content1.first()._1+"-"+content1.first()._2))
val file2word = sc.makeRDD(Array(file2.name+"->"+content2.first()._1+"-"+content2.first()._2))
val file3word = sc.makeRDD(Array(file3.name+"->"+content3.first()._1+"-"+content3.first()._2))
val file4word = sc.makeRDD(Array(file4.name+"->"+content4.first()._1+"-"+content4.first()._2))
Step 5: Union all the RDDS
val unionRDDs = file1word.union(file2word).union(file3word).union(file4word)
Step 6 : Save the results in a text file as below.
unionRDDs.repartition(1).saveAsTextFile("spark11/union.txt")
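The per-file word count above can be sanity-checked outside Spark. The following is a small plain-Python sketch of the same idea (count words, take the most frequent one, format it as name->word-count). The two file texts are shortened, made-up stand-ins for the HDFS files, and collections.Counter replaces the reduceByKey / sortByKey pipeline.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for two of the HDFS files (shortened text).
files = {
    "spark11/file1.txt": "hadoop stores data and hadoop processes data with hadoop",
    "spark11/file2.txt": "the core of hadoop is the storage part and the processing part",
}


def top_word(text):
    """Return (word, count) for the most frequent word in text."""
    counts = Counter(text.split(" "))
    # most_common(1) returns a one-element list of (word, count).
    return counts.most_common(1)[0]


results = []
for name, text in files.items():
    word, count = top_word(text)
    # Same output shape as Step 4: fileName->word-count
    results.append(name + "->" + word + "-" + str(count))

for line in results:
    print(line)
```

Splitting on a single space mirrors the solution; punctuation stays attached to words, so "framework." and "framework" would be counted separately in both versions.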
CORRECT TEXT
Problem Scenario 54 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"))
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 which will produce desired output, shown below.
Array[(Int, String)] = Array((4,lion), (7,panther), (3,dogcat), (5,tigereagle))
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.foldByKey("")(_ + _).collect
foldByKey [Pair]
Very similar to fold, but performs the folding separately for each key of the RDD. This function is only available if the RDD consists of two-component tuples.
Listing Variants
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
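To see why the empty string is a suitable zero value here, the fold can be imitated in plain Python. This is only a single-machine sketch under stated assumptions: fold_by_key is a hypothetical helper, not a Spark API, and it applies the zero value once per key, whereas Spark may apply it once per partition (harmless for "" because concatenating an empty string changes nothing).

```python
def fold_by_key(pairs, zero_value, func):
    """Fold the values of each key, starting from zero_value (hypothetical helper)."""
    folded = {}
    for key, value in pairs:
        # Start from zero_value the first time a key is seen.
        folded[key] = func(folded.get(key, zero_value), value)
    return folded


a = ["dog", "tiger", "lion", "cat", "panther", "eagle"]
b = [(len(x), x) for x in a]  # same (length, word) pairs as in the scenario

result = fold_by_key(b, "", lambda acc, v: acc + v)
print(result)
```

Words of the same length are concatenated in input order here; on a cluster the order within a key can depend on partitioning.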
CORRECT TEXT
Problem Scenario 33 : You have been given the files below.
spark5/EmployeeName.csv (id,name)
spark5/EmployeeSalary.csv (id,salary)
Data is given below:
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000
Now write Spark code in Scala which will load these two files from hdfs, join them, and produce the (name,salary) values.
Then save the data in multiple files grouped by salary (each file will hold the names of employees with the same salary). Make sure the file name includes the salary as well.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create both files in hdfs (we will do this using Hue). Alternatively, you can first create them in the local filesystem and then upload them to hdfs.
Step 2 : Load EmployeeName.csv file from hdfs and create PairRDDs
val name = sc.textFile("spark5/EmployeeName.csv")
val namePairRDD = name.map(x=> (x.split(",")(0),x.split(",")(1)))
Step 3 : Load EmployeeSalary.csv file from hdfs and create PairRDDs
val salary = sc.textFile("spark5/EmployeeSalary.csv")
val salaryPairRDD = salary.map(x=> (x.split(",")(0),x.split(",")(1)))
Step 4 : Join all pairRDDS
val joined = namePairRDD.join(salaryPairRDD)
Step 5 : Remove the id key from the RDD, keeping the (name, salary) values.
val keyRemoved = joined.values
Step 6 : Now swap each pair so that salary becomes the key.
val swapped = keyRemoved.map(item => item.swap)
Step 7 : Now groupBy keys (It will generate key and value array)
val grpByKey = swapped.groupByKey().collect()
Step 8 : Now create RDD for values collection
val rddByKey = grpByKey.map{case (k,v) => k->sc.makeRDD(v.toSeq)}
Step 9 : Save the output as a Text file.
rddByKey.foreach{ case (k,rdd) => rdd.saveAsTextFile("spark5/Employee"+k)}
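The join-then-group logic of Steps 2 to 9 can be traced on a few rows in plain Python. This is an illustrative sketch only: it uses a subset of the rows given above, dictionaries stand in for the pair RDDs, and the output_files mapping simply models which names would land under which "spark5/Employee" + salary path rather than writing anything.

```python
from collections import defaultdict

# A subset of the rows given in the scenario.
names = ["E01,Lokesh", "E02,Bhupesh", "E03,Amit", "E08,Sheela"]
salaries = ["E01,50000", "E02,50000", "E03,45000", "E08,10000"]

# (id -> name) and (id -> salary), the equivalent of the two pair RDDs.
name_by_id = dict(line.split(",") for line in names)
salary_by_id = dict(line.split(",") for line in salaries)

# Inner join on id, giving id -> (name, salary).
joined = {
    emp_id: (name_by_id[emp_id], salary_by_id[emp_id])
    for emp_id in name_by_id
    if emp_id in salary_by_id
}

# Drop the id, use salary as the key, and group the names per salary.
grouped = defaultdict(list)
for name, salary in joined.values():
    grouped[salary].append(name)

# Model of the output layout: one target path per salary.
output_files = {"spark5/Employee" + s: n for s, n in grouped.items()}
for path, employees in output_files.items():
    print(path, employees)
```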
CORRECT TEXT
Problem Scenario 74 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id , order_date , order_customer_id, order_status)
Columns of order_items table : (order_item_id , order_item_order_id , order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish following activities.
1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory p89_orders and p89_order_items .
2. Join these data using order_id in Spark and Python
3. Now fetch selected columns from joined data: OrderId, Order date and amount collected on this order.
4. Calculate the total orders placed for each date, and produce the output sorted by date.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1 : Import both tables.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p89_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p89_order_items -m 1
Note : Please check that there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to hdfs.
Step 2 : Read the data from one of the partitions created by the above commands.
hadoop fs -cat p89_orders/part-m-00000
hadoop fs -cat p89_order_items/part-m-00000
Step 3 : Load the above two directories as RDDs using Spark and Python (Open pyspark terminal and do following).
orders = sc.textFile("p89_orders")
orderItems = sc.textFile("p89_order_items")
Step 4 : Convert RDD into key value as (order_id as a key and rest of the values as a value)
#First value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
#Second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5 : Join both the RDD using order_id
joinedData = orderItemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect():
    print(line)
Format of joinedData as below.
(OrderId, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))
Step 6 : Now fetch selected values: OrderId, Order date and amount collected on this order.
revenuePerOrderPerDay = joinedData.map(lambda row: (row[0], row[1][1].split(",")[1], float(row[1][0].split(",")[4])))
#print the result
for line in revenuePerOrderPerDay.collect():
    print(line)
Step 7 : Select distinct order ids for each date.
#distinct(date,order_id)
distinctOrdersDate = joinedData.map(lambda row: row[1][1].split(",")[1] + "," + str(row[0])).distinct()
for line in distinctOrdersDate.collect():
    print(line)
Step 8 : Similar to word count, generate (date, 1) record for each row.
newLineTuple = distinctOrdersDate.map(lambda line: (line.split(",")[0], 1))
Step 9 : Do the count for each key(date), to get total order per date.
totalOrdersPerDate = newLineTuple.reduceByKey(lambda a, b: a + b)
#print results
for line in totalOrdersPerDate.collect():
    print(line)
Step 10 : Sort the results by date
sortedData=totalOrdersPerDate.sortByKey().collect()
#print results
for line in sortedData:
    print(line)
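The reason Step 7 takes distinct (date, order_id) pairs before counting is that an order with several items appears once per item after the join. A plain-Python sketch of Steps 4 to 10 makes this visible; the rows below are made up for illustration and only follow the column layout given in the scenario.

```python
from collections import Counter

# Hypothetical sample rows (made up for illustration).
# orders: order_id,order_date,order_customer_id,order_status
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]
# order_items: order_item_id,order_item_order_id,order_item_product_id,
#              order_item_quantity,order_item_subtotal,order_item_product_price
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",  # second item of order 2
    "4,3,403,1,129.99,129.99",
]

# Key each orders line by order_id (first field).
orders_by_id = {int(line.split(",")[0]): line for line in orders}

# Inner join: one (order_id, (item_line, order_line)) tuple per matching item.
joined = []
for item in order_items:
    order_id = int(item.split(",")[1])
    if order_id in orders_by_id:
        joined.append((order_id, (item, orders_by_id[order_id])))

# (order_id, order_date, subtotal) per joined row, as in Step 6.
revenue = [(k, v[1].split(",")[1], float(v[0].split(",")[4])) for k, v in joined]

# Distinct (date, order_id) pairs, then one count per date, sorted by date.
distinct_pairs = {(v[1].split(",")[1], k) for k, v in joined}
orders_per_date = sorted(Counter(date for date, _ in distinct_pairs).items())

for row in orders_per_date:
    print(row)
```

Order 2 contributes two joined rows but is counted once for 2013-07-25, which is the effect distinct() has in the Spark version.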
CORRECT TEXT
Problem Scenario 27 : You need to implement near real time solutions for collecting information when submitted in file with below information.
Data
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt
After few mins
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt
Requirements:
You have been given below directory location (if not available then create it) /tmp/spooldir .
You have a financial subscription for getting stock prices from Bloomberg as well as Reuters, and using ftp you download new files every hour from their respective ftp sites into directories /tmp/spooldir/bb and /tmp/spooldir/dr respectively.
As soon as a file is committed in one of these directories, it needs to be available in hdfs in the /tmp/flume/finance location, in a single directory.
Write a flume configuration file named flume7.conf and use it to load data in hdfs with following additional properties .
1 . Spool /tmp/spooldir/bb and /tmp/spooldir/dr
2 . File prefix in hdfs should be events
3 . File suffix should be .log
4 . If file is not committed and in use then it should have _ as prefix.
5 . Data should be written as text to hdfs
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create directory
mkdir -p /tmp/spooldir/bb
mkdir -p /tmp/spooldir/dr
Step 2 : Create flume configuration file, with below configuration
agent1.sources = source1 source2
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sources.source2.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir/bb
agent1.sources.source2.type = spooldir
agent1.sources.source2.spoolDir = /tmp/spooldir/dr
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume/finance
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel1.type = file
Step 4 : Run below command which will use this configuration file and append data in hdfs.
Start flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume7.conf --name agent1
Step 5 : Open another terminal and create a file in /tmp/spooldir/
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt
After few mins
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt
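The echo / mv sequence writes to a dot-prefixed file first and renames it only when all lines are in place, so the file shows up under its final name in one step. A minimal Python sketch of that write-then-rename pattern follows. It assumes, as the scenario suggests, that a file should become visible under its final name only once complete; commit_to_spool is a hypothetical helper and a temporary directory stands in for /tmp/spooldir/bb.

```python
import os
import tempfile


def commit_to_spool(spool_dir, final_name, lines):
    """Write lines to a hidden file, then rename it to its final name."""
    hidden_path = os.path.join(spool_dir, "." + final_name)
    final_path = os.path.join(spool_dir, final_name)
    # Append mode mirrors the ">>" redirection used in the shell commands.
    with open(hidden_path, "a") as f:
        for line in lines:
            f.write(line + "\n")
    # The rename is the "commit": the complete file appears in one step.
    os.rename(hidden_path, final_path)
    return final_path


# A temporary directory stands in for /tmp/spooldir/bb in this sketch.
spool_dir = tempfile.mkdtemp()
path = commit_to_spool(spool_dir, "bb.txt", ["IBM,100,20160104", "IBM,103,20160105"])

with open(path) as f:
    content = f.read()
print(content)
```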
CORRECT TEXT
Problem Scenario 81 : You have been given MySQL DB with following details. You have been given following product.csv file
product.csv
productID,productCode,name,quantity,price
1001,PEN,Pen Red,5000,1.23
1002,PEN,Pen Blue,8000,1.25
1003,PEN,Pen Black,2000,1.25
1004,PEC,Pencil 2B,10000,0.48
1005,PEC,Pencil 2H,8000,0.49
1006,PEC,Pencil HB,0,9999.99
Now accomplish following activities.
1 . Create a Hive ORC table using SparkSql
2 . Load this data in Hive table.
3 . Create a Hive parquet table using SparkSQL and load data in it.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create this file in HDFS under following directory (Without header)
/user/cloudera/he/exam/task1/product.csv
Step 2 : Now using Spark-shell read the file as RDD
// load the data into a new RDD
val products = sc.textFile("/user/cloudera/he/exam/task1/product.csv")
// Return the first element in this RDD
products.first()
Step 3 : Now define the schema using a case class
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float)
Step 4 : create an RDD of Product objects
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt,p(1),p(2),p(3).toInt,p(4).toFloat))
prdRDD.first()
prdRDD.count()
Step 5 : Now create data frame
val prdDF = prdRDD.toDF()
Step 6 : Now store data in hive warehouse directory. (However, table will not be created)
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("product_orc_table")
Step 7 : Now create table using data stored in warehouse directory. With the help of hive.
hive
show tables;
CREATE EXTERNAL TABLE products (productid int, code string, name string, quantity int, price float)
STORED AS orc
LOCATION '/user/hive/warehouse/product_orc_table';
Step 8 : Now create a parquet table
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("product_parquet_table")
Step 9 : Now create table using this
CREATE EXTERNAL TABLE products_parquet (productid int, code string, name string, quantity int, price float)
STORED AS parquet
LOCATION '/user/hive/warehouse/product_parquet_table';
Step 10 : Check data has been loaded or not.
Select * from products;
Select * from products_parquet;
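The typed-record step (Steps 3 and 4) is where malformed rows would fail, so it can help to try the parsing on a couple of lines first. The following plain-Python sketch mirrors the case class with a dataclass. It uses two rows from the product.csv above; parse_product is a hypothetical helper name, and Python's float stands in for Scala's Float.

```python
from dataclasses import dataclass


@dataclass
class Product:
    """Plain-Python counterpart of the Scala case class in Step 3."""
    productid: int
    code: str
    name: str
    quantity: int
    price: float


def parse_product(line):
    """Split one header-less CSV line and cast each field to its type."""
    p = line.split(",")
    return Product(int(p[0]), p[1], p[2], int(p[3]), float(p[4]))


# Two rows taken from product.csv in the scenario.
lines = ["1001,PEN,Pen Red,5000,1.23", "1006,PEC,Pencil HB,0,9999.99"]
products = [parse_product(l) for l in lines]

for product in products:
    print(product)
```

If the header line were left in the file, int("productID") would raise an error, which is why Step 1 asks for the file without a header.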
CORRECT TEXT
Problem Scenario 52 : You have been given below code snippet.
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
Operation_xyz
Write a correct code snippet for Operation_xyz which will produce below output.
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.countByValue
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]
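The expected counts can be verified by hand or with a short plain-Python sketch. Here collections.Counter plays the role of countByValue on a local list; it is a stand-in for illustration, not the Spark call itself.

```python
from collections import Counter

# Same values as in the scenario's List.
b = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1]

# Count how often each distinct value occurs.
counts = dict(Counter(b))
print(counts)
```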
CORRECT TEXT
Problem Scenario 88 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in solution.
1. It is possible that the same product is supplied by multiple suppliers. Now find each product and its price according to each supplier.
2. Find the names of all suppliers who are supplying 'Pencil 3B'
3. Find all the products which are supplied by ABC Traders.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : It is possible that the same product is supplied by multiple suppliers. Now find each product and its price according to each supplier.
val results = sqlContext.sql("""SELECT products.name AS `Product Name`, price, suppliers.name AS `Supplier Name`
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierID""")
results.show()
Step 2 : Find the names of all suppliers who are supplying 'Pencil 3B'
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name`
FROM products_suppliers AS ps
JOIN products AS p ON ps.productID = p.productID
JOIN suppliers AS s ON ps.supplierID = s.supplierID
WHERE p.name = 'Pencil 3B'""")
results.show()
Step 3 : Find all the products which are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name`
FROM products AS p, products_suppliers AS ps, suppliers AS s
WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID
AND s.name = 'ABC Traders'""")
results.show()
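The three joins can be tried on the sample data without Spark. The sketch below loads the relevant rows from the scenario into an in-memory SQLite database using Python's sqlite3 module. It is an illustration under stated assumptions: SQLite stands in for sqlContext, only the products that appear in products_suppliers.csv are loaded, and plain column aliases are used instead of the quoted ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table layouts follow the CSV headers given in the scenario.
cur.execute("CREATE TABLE products (productID INTEGER, productCode TEXT, "
            "name TEXT, quantity INTEGER, price REAL, supplierid INTEGER)")
cur.execute("CREATE TABLE suppliers (supplierID INTEGER, name TEXT, phone TEXT)")
cur.execute("CREATE TABLE products_suppliers (productID INTEGER, supplierID INTEGER)")

cur.executemany("INSERT INTO products VALUES (?,?,?,?,?,?)", [
    (2001, "PEC", "Pencil 3B", 500, 0.52, 501),
    (2002, "PEC", "Pencil 4B", 200, 0.62, 501),
    (2003, "PEC", "Pencil 5B", 100, 0.73, 501),
    (2004, "PEC", "Pencil 6B", 500, 0.47, 502),
])
cur.executemany("INSERT INTO suppliers VALUES (?,?,?)", [
    (501, "ABC Traders", "88881111"),
    (502, "XYZ Company", "88882222"),
    (503, "QQ Corp", "88883333"),
])
cur.executemany("INSERT INTO products_suppliers VALUES (?,?)", [
    (2001, 501), (2002, 501), (2003, 501), (2004, 502), (2001, 503),
])

# Query 1: each product with its price, per supplier.
q1 = sorted(cur.execute(
    "SELECT p.name, p.price, s.name FROM products_suppliers AS ps "
    "JOIN products AS p ON ps.productID = p.productID "
    "JOIN suppliers AS s ON ps.supplierID = s.supplierID").fetchall())

# Query 2: suppliers of 'Pencil 3B'.
q2 = sorted(cur.execute(
    "SELECT p.name, s.name FROM products_suppliers AS ps "
    "JOIN products AS p ON ps.productID = p.productID "
    "JOIN suppliers AS s ON ps.supplierID = s.supplierID "
    "WHERE p.name = 'Pencil 3B'").fetchall())

# Query 3: products supplied by ABC Traders (implicit-join form, as in Step 3).
q3 = sorted(row[0] for row in cur.execute(
    "SELECT p.name FROM products AS p, products_suppliers AS ps, suppliers AS s "
    "WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID "
    "AND s.name = 'ABC Traders'").fetchall())

print(q1)
print(q2)
print(q3)
conn.close()
```

The results are sorted in Python only to make the printed order predictable; the queries themselves carry no ORDER BY, as in the solution above.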