Text – RCFiles – Parquet – ORC
According to one study, the storage sizes of these formats relative to plain text were:
Text    : original size (baseline)
RCFile  : 14% smaller
Parquet : 62% smaller
ORC     : 78% smaller
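To make the percentages concrete, here is a minimal sketch that converts them into estimated on-disk sizes. The 1000 GB input and the function name `estimated_size_gb` are assumptions for illustration only. Real savings depend on the data and on the compression codec.

```python
# Percent size reductions quoted above, relative to plain text.
REDUCTION_PCT = {"Text": 0, "RCFile": 14, "Parquet": 62, "ORC": 78}


def estimated_size_gb(text_size_gb: float, fmt: str) -> float:
    """Estimate the stored size of a dataset in the given format.

    Hypothetical helper: it simply applies the quoted reduction to the
    plain-text size.
    """
    return text_size_gb * (100 - REDUCTION_PCT[fmt]) / 100


if __name__ == "__main__":
    # Assumed example: a dataset that takes 1000 GB as plain text.
    for fmt in REDUCTION_PCT:
        print(f"{fmt:8s} {estimated_size_gb(1000, fmt):7.1f} GB")
```

With these numbers, the same 1000 GB of text would take roughly 860 GB as RCFile, 380 GB as Parquet, and 220 GB as ORC.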
Reasons to prefer ORC over Parquet:
1. ORC keeps a block-level index for each column. This makes I/O more efficient, because Hive can skip entire blocks of data when the index shows that the values a predicate is looking for are not present there. The Cost Based Optimizer can also use the column-level metadata stored in ORC files to generate the most efficient execution plan.
2. In Hive, full ACID transactions (row-level UPDATE and DELETE) are only possible when ORC is the file format.
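The block skipping described in point 1 can be sketched in a few lines. This is a simplified illustration, not ORC's actual on-disk layout or API: the `Block` class, `build_blocks`, and `scan_equals` are hypothetical names, and the index here holds only a minimum and maximum per block.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A hypothetical block of one column plus its min/max index entry."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Evaluate `col = target`, returning matches and blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The index proves the target cannot be in this block: skip it
        # without touching the block's values.
        if target < block.min_value or target > block.max_value:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


if __name__ == "__main__":
    column = list(range(1, 10001))          # assumed sorted sample data
    blocks = build_blocks(column, block_size=1000)
    matches, blocks_read = scan_equals(blocks, 4242)
    print(f"matches={matches}, read {blocks_read} of {len(blocks)} blocks")
```

The benefit depends on how the data is laid out. With sorted or clustered values, most blocks can be ruled out from the index alone. With randomly distributed values, nearly every block's min/max range may contain the target and little is skipped.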