How to build and use parquet-tools to read parquet files

Goal:

How to build and use parquet-tools to read parquet files.

Solution:

1. Download and install Maven.

Follow the link below:
http://maven.apache.org/download.cgi

2. Download the parquet source code

git clone https://github.com/Parquet/parquet-mr.git

3. Build parquet-tools.

cd parquet-mr/parquet-tools/
mvn clean package -Plocal

The resulting jar is written to the target/ directory, for example target/parquet-tools-1.6.1-SNAPSHOT.jar, which is the name used in the steps below.

Note: you may hit an error such as the following:
Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>

4. Show help manual

cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help

5. Dump the schema

Take the sample file nation.parquet as an example:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}

6. Read the data


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f
 
(... ...)

7. Read first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f
 
N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly
 
N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
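
The cat and head output above is plain "name = value" text with a blank line between records, so it is easy to post-process. Below is a minimal Python sketch, assuming a flat schema like nation.parquet; the function name parse_records and the embedded sample are illustrative and not part of parquet-tools, and nested or repeated fields would need more handling.

```python
def parse_records(text):
    """Turn `cat`/`head` style output into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank (or whitespace-only) line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first " = " only, so a value may itself contain "=".
        key, sep, value = line.partition(" = ")
        if sep:
            current[key.strip()] = value
    if current:
        records.append(current)
    return records


# Two records copied from the output above (hypothetical captured text).
sample = """N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly
"""

for record in parse_records(sample):
    print(record["N_NATIONKEY"], record["N_NAME"])
```

All values come back as strings, including the integer keys, so convert them as needed.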

8. Show meta info


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file:        file:/tmp/nation.parquet
creator:     parquet-mr
 
file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0
 
row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
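
To make the row group lines easier to read: RC is the row count, VC the value count, and the SZ field is, as far as the parquet-tools documentation describes it, compressed size / uncompressed size / ratio in bytes. The Python sketch below checks that reading against the numbers shown above; the names META_LINES, SZ_PATTERN, and column_sizes are illustrative only.

```python
import re

# Column-chunk lines copied from the `meta` output above.
META_LINES = """\
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
"""

# Capture the column name and the three parts of SZ:<a>/<b>/<c>.
SZ_PATTERN = re.compile(r"^(\w+):\s.*\bSZ:(\d+)/(\d+)/([\d.]+)")


def column_sizes(text):
    """Map column name -> (compressed bytes, uncompressed bytes, reported ratio)."""
    sizes = {}
    for line in text.splitlines():
        match = SZ_PATTERN.match(line)
        if match:
            name, compressed, uncompressed, ratio = match.groups()
            sizes[name] = (int(compressed), int(uncompressed), float(ratio))
    return sizes


sizes = column_sizes(META_LINES)
for name, (compressed, uncompressed, ratio) in sizes.items():
    # Recompute uncompressed / compressed and show it next to the reported ratio.
    print(name, round(uncompressed / compressed, 2), ratio)

# Sum of the uncompressed sizes, to compare with TS on the row group line.
print(sum(uncompressed for _, uncompressed, _ in sizes.values()))
```

In this sample the recomputed ratios agree with the reported ones, and the uncompressed sizes add up to 1352, the TS value on the row group line.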

9. Dump all data

Note: Values are in column format.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta  /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2
(...)
 
BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
(...)
 
INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:1
(...)
 
BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V: haggle. carefully f
value 2:  R:0 D:0 V:al foxes promise sly
value 3:  R:0 D:0 V:y alongside of the p
(...)
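
Because dump prints one column at a time, rebuilding rows means lining up the i-th value of every column. The Python sketch below does that for two of the columns above. It assumes a flat schema where every value has R:0 D:0, as in this file; the names DUMP, HEADER, VALUE, and columns_from_dump are illustrative, and the header pattern only covers the two types seen in this sample.

```python
import re

# Excerpt of the `dump --disable-meta` output above (first three values only).
DUMP = """\
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
"""

HEADER = re.compile(r"^(INT64|BINARY) (\w+)$")          # e.g. "INT64 N_NATIONKEY"
VALUE = re.compile(r"^value \d+:\s+R:\d+ D:\d+ V:(.*)$")  # e.g. "value 1:  R:0 D:0 V:0"


def columns_from_dump(text):
    """Map column name -> list of values (as strings), in file order."""
    columns, current = {}, None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = columns.setdefault(header.group(2), [])
            continue
        value = VALUE.match(line)
        if value and current is not None:
            current.append(value.group(1))
    return columns


cols = columns_from_dump(DUMP)
# With flat, required columns the i-th value of each column belongs to row i.
rows = list(zip(*cols.values()))
for row in rows:
    print(row)
```

For everyday row-oriented reading, the cat and head commands shown earlier are simpler; this is only meant to illustrate how the column layout maps back to rows.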
