Read ORC data from Spark

Reading ORC-formatted Hive tables is not well supported in Spark 2.4.x: queries against such tables can silently return no rows. If you want to access ORC data from Spark, use the workaround below.

Accessing ORC data

>>> spark.sql("select score_date,count(*) from schema.table group by score_date").show()

+----------+--------+
|score_date|count(1)|
+----------+--------+
+----------+--------+


The query prints no rows even though the table contains data, because Spark's native reader fails to read the ORC-backed Hive table.
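
To confirm that the table is in fact ORC-backed, you can inspect its storage metadata first; in the output, look for OrcSerde or OrcInputFormat in the SerDe/InputFormat rows (schema.table stands in for your own database and table names):

>>> spark.sql("describe formatted schema.table").show(50, truncate=False)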


Workaround

Setting spark.sql.hive.convertMetastoreOrc to false makes Spark read the table through Hive's ORC SerDe instead of converting it to Spark's native ORC data source, which avoids the problem above.


>>> spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")

DataFrame[key: string, value: string]


>>> spark.sql("select score_date,count(*) from schema.table group by score_date").show()

+----------+--------+
|score_date|count(1)|
+----------+--------+
|x         |       y|
|x         |       y|
+----------+--------+
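
If you do not want to issue the SET statement in every session, the same configuration can be applied when building the SparkSession. A minimal sketch, assuming a Hive-enabled Spark 2.4.x environment (the app name is just an example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-orc")  # hypothetical app name
         .config("spark.sql.hive.convertMetastoreOrc", "false")  # read via Hive's ORC SerDe
         .enableHiveSupport()  # required to query Hive metastore tables
         .getOrCreate())

The same setting can also be passed on the command line, e.g. spark-submit --conf spark.sql.hive.convertMetastoreOrc=false.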




You can also use the Hive Warehouse Connector; you can find more details here:

https://docs.cloudera.com/runtime/7.2.0/integrating-hive-and-bi/topics/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
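
For reference, a minimal PySpark sketch of reading through the Hive Warehouse Connector, assuming the connector JAR and its Python package (pyspark_llap) are already deployed on the cluster as described in the Cloudera documentation:

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession
hive = HiveWarehouseSession.session(spark).build()

# Run the query through Hive rather than Spark's native ORC reader
df = hive.executeQuery("select score_date, count(*) from schema.table group by score_date")
df.show()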

