Reading and Writing Parquet Files in Spark

Parquet is one of the most common file formats used in data engineering because of its performance. It is an open source, column-oriented file format in which data is stored by column rather than by row. Parquet files are also compressed by default when written from Spark, which reduces file size and speeds up reads.

In this blog post we will see code examples of how to read, write, partition & manage schemas for Parquet files using Spark (Scala).


Reading and Writing Parquet files in Spark (Scala)

Before we start reading or writing, the very first thing we need to do is create a Spark session, as below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetExample")
  .master("local[*]")
  .getOrCreate()

Reading a Parquet File:

Note: Spark does not need to infer the schema when reading a Parquet file. The schema is stored in the file's metadata and Spark picks it up automatically.

val df = spark.read.parquet("data/input/users.parquet")
df.show()
df.printSchema()

Reading Specific Columns:

val df = spark.read.parquet("data/input/users.parquet")
  .select("id", "name")
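Because Parquet is column-oriented, selecting a subset of columns lets Spark skip the rest of the data. The sketch below is one way to check this locally. The sample rows, the temporary directory and the name idAndName are assumptions made for illustration and are not part of the original example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetColumnSelection")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, written to a temporary directory
// so that the example does not depend on an existing file.
val usersPath = Files.createTempDirectory("users_parquet").resolve("users").toString
Seq((1, "Asha", "IN"), (2, "Ben", "US"))
  .toDF("id", "name", "country")
  .write.parquet(usersPath)

// Only the selected columns appear in the result.
val idAndName = spark.read.parquet(usersPath).select("id", "name")
idAndName.printSchema()

// The physical plan shows which columns are requested from the Parquet reader.
idAndName.explain()
```

The explain() output is a convenient place to confirm that the country column is not part of the scan.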

Writing Parquet:

df.write
  .mode("overwrite")   // overwrite | append | ignore | errorIfExists
  .parquet("data/output/users_parquet")
  

  • Modes:
    • overwrite: If data already exists at the target path, it is deleted and replaced with the new data.
    • append: If data already exists at the target path, the new data is added to it.
    • ignore: If data already exists at the target path, the write is skipped and the existing data is left unchanged.
    • errorIfExists: Throws an error if data already exists at the target path. This is the default mode.
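The modes can also be passed as SaveMode values instead of strings, which avoids typos in the mode name. The following is a minimal sketch of how each mode behaves against the same path. The sample rows, the temporary directory and the variable names are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("ParquetSaveModes")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical two-row batch and a fresh target path inside a temporary directory.
val outPath = Files.createTempDirectory("save_modes").resolve("users").toString
val batch = Seq((1, "Asha"), (2, "Ben")).toDF("id", "name")

// The path does not exist yet, so ErrorIfExists succeeds.
batch.write.mode(SaveMode.ErrorIfExists).parquet(outPath)

// Append adds the same two rows again.
batch.write.mode(SaveMode.Append).parquet(outPath)
val afterAppend = spark.read.parquet(outPath).count()

// Ignore skips the write because data already exists.
batch.write.mode(SaveMode.Ignore).parquet(outPath)
val afterIgnore = spark.read.parquet(outPath).count()

// Overwrite replaces everything with the two rows of this batch.
batch.write.mode(SaveMode.Overwrite).parquet(outPath)
val afterOverwrite = spark.read.parquet(outPath).count()

println(s"append=$afterAppend ignore=$afterIgnore overwrite=$afterOverwrite")
```

Running ErrorIfExists a second time against the same path would fail, because the path now contains data.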

Writing Parquet with Partitioning:

df.write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("data/output/users_by_country")
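To see what partitionBy produces, the sketch below writes a small dataset, lists the directories that were created, and reads the data back with a filter on the partition column. The sample rows, the temporary directory and the variable names are assumptions made for illustration.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetPartitioning")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with two distinct country values.
val users = Seq((1, "Asha", "IN"), (2, "Ben", "US"), (3, "Chen", "US"))
  .toDF("id", "name", "country")

val outPath = Files.createTempDirectory("partitioned").resolve("users_by_country").toString
users.write.mode("overwrite").partitionBy("country").parquet(outPath)

// Each distinct value of the partition column gets its own sub-directory,
// named <column>=<value>.
val partitionDirs = new File(outPath).listFiles()
  .filter(_.isDirectory)
  .map(_.getName)
  .sorted
partitionDirs.foreach(println)

// A filter on the partition column lets Spark read only the matching directories.
val usUsers = spark.read.parquet(outPath).filter($"country" === "US")
usUsers.show()
```

Partitioning works best on columns with a modest number of distinct values, since every value becomes a directory.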

Merge Schema while Reading:

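Schema merging is useful when the schema of your dataset changes over time (for example, when new columns are added) and you don't want to rewrite the older files. With mergeSchema enabled, Spark combines the schemas of all the files into one. It is turned off by default because it is a relatively expensive operation.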

spark.read
  .option("mergeSchema", "true")
  .parquet("path/to/parquet")
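As a rough end-to-end illustration, the sketch below appends two batches with different columns to the same location and then reads them back with mergeSchema enabled. The sample rows, the temporary directory and the variable names are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetMergeSchema")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val basePath = Files.createTempDirectory("merge_schema").resolve("users").toString

// Older batch: no country column yet.
Seq((1, "Asha")).toDF("id", "name")
  .write.mode("append").parquet(basePath)

// Newer batch: a country column has been added.
Seq((2, "Ben", "US")).toDF("id", "name", "country")
  .write.mode("append").parquet(basePath)

// With mergeSchema, the result contains the union of the columns.
// Rows from the older files get null for the columns they lack.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet(basePath)

merged.printSchema()
merged.show()
```

Without the option, Spark does not reconcile the schemas of all the files, so the country column may be missing from the result.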
