How to build and use parquet-tools to read parquet files

May 24, 2017

Goal:

How to build and use parquet-tools to read parquet files.

Solution:

1. Download and install Maven.

Follow the link below:
http://maven.apache.org/download.cgi
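
Before moving on, it can help to confirm that the tools used in the following steps are on the PATH. The snippet below is only a sketch: require_cmds is a hypothetical helper name, and the list of tools (java, mvn, git) is an assumption based on the commands used in this post.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 only when every command is found.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tool list is an assumption based on the steps in this post):
#   require_cmds java mvn git && mvn -version
```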

2. Download the Parquet source code.

git clone https://github.com/Parquet/parquet-mr.git

3. Build parquet-tools.

cd parquet-mr/parquet-tools/

mvn clean package -Plocal

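The resulting jar is written to the target/ directory. Its name includes the project version, for example target/parquet-tools-1.6.1-SNAPSHOT.jar.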

Note: you may hit an error like the one below:

Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository

This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>
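
The edit can be done by hand, or scripted. Below is a minimal sketch of the scripted variant, assuming the version appears in the pom files as a plain <version>...</version> element. set_pom_version is a hypothetical helper name; it keeps a .bak copy of the original file, and it rewrites every matching <version> element in the file, so review the result before building.

```shell
# Replace one version string with another in a pom.xml, keeping a backup.
# Arguments: path to pom.xml, old version, new version.
set_pom_version() {
  pom="$1"; old="$2"; new="$3"
  cp "$pom" "$pom.bak"
  # Escape dots so sed matches them literally instead of as wildcards.
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  sed "s#<version>${old_re}</version>#<version>${new}</version>#g" "$pom.bak" > "$pom"
}

# Usage, run from the directory that contains parquet-mr:
#   set_pom_version parquet-mr/pom.xml 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT
#   set_pom_version parquet-mr/parquet-tools/pom.xml 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT
```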

4. Show help manual

cd target

java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
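
Since every command in the following steps starts with the same java -jar prefix, a small shell wrapper can save typing. This is only a convenience sketch: pq and the PARQUET_TOOLS_JAR and JAVA variables are hypothetical names, not part of parquet-tools, and the default jar name is the one used in this post.

```shell
# Hypothetical helper: pq is not part of parquet-tools.
# PARQUET_TOOLS_JAR defaults to the jar name used in this post.
: "${PARQUET_TOOLS_JAR:=parquet-tools-1.6.1-SNAPSHOT.jar}"

pq() {
  # JAVA can be overridden to point at a specific java binary.
  "${JAVA:-java}" -jar "$PARQUET_TOOLS_JAR" "$@"
}

# Usage:
#   pq --help
#   pq schema /tmp/nation.parquet
```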

5. Dump the schema

Take the sample nation.parquet file as an example.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet

message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}

6. Read the data

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet

N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

(... ...)
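
The "name = value" output is easy to read but awkward to feed into other tools. Below is a rough sketch that turns it into one tab-separated line per record. It assumes a flat schema like the one in this sample and treats the first repeated field name as the start of a new record. records_to_tsv is a hypothetical helper name.

```shell
# Convert "name = value" lines on stdin into tab-separated rows.
records_to_tsv() {
  awk '
    # Print the buffered record as one tab-separated line, then reset.
    function emit(   i, line, k) {
      if (n == 0) return
      line = vals[1]
      for (i = 2; i <= n; i++) line = line "\t" vals[i]
      print line
      n = 0
      for (k in seen) delete seen[k]
    }
    {
      idx = index($0, " = ")
      if (idx == 0) next            # skip blank or unrecognized lines
      key = substr($0, 1, idx - 1)
      if (key in seen) emit()       # a repeated key starts a new record
      seen[key] = 1
      vals[++n] = substr($0, idx + 3)
    }
    END { emit() }
  '
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet | records_to_tsv
```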

7. Read first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet

N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
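
A quick way to check how many records came back is to count the lines that start with the first field of the schema. This is a small sketch assuming the "name = value" output format shown above. count_records is a hypothetical helper name, and the field name is passed in by the caller.

```shell
# Count records on stdin by counting lines that start with the given field name.
count_records() {
  first_field="$1"
  grep -c "^${first_field} = "
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | count_records N_NATIONKEY
```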

8. Show meta info

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet

file:        file:/tmp/nation.parquet
creator:     parquet-mr
file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
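
Judging by the numbers in this sample, each SZ field reads as compressed size / uncompressed size / ratio (for N_NATIONKEY, 219 / 130 is about 1.68), and the row group's TS value matches the sum of the uncompressed sizes (219 + 296 + 218 + 619 = 1352). The sketch below adds up both sizes from the meta output so the totals can be compared. sum_chunk_sizes is a hypothetical helper name, and the interpretation of SZ is inferred from this sample rather than taken from the tool's documentation.

```shell
# Sum the first two numbers of every SZ:a/b/c field found on stdin.
sum_chunk_sizes() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^SZ:[0-9]+\/[0-9]+\//) {
          split(substr($i, 4), parts, "/")   # drop the "SZ:" prefix
          compressed += parts[1]
          uncompressed += parts[2]
        }
      }
    }
    END { printf "compressed=%d uncompressed=%d\n", compressed, uncompressed }
  '
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_chunk_sizes
```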

9. Dump all data

Note: values are printed column by column, not record by record.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet

INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V: haggle. carefully f
value 2:  R:0 D:0 V:al foxes promise sly
value 3:  R:0 D:0 V:y alongside of the p
(...)
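
Because the dump is organized by column, pulling out the values of a single column is a simple filter. Below is a minimal sketch assuming the layout shown above: a two-word header such as "BINARY N_NAME", followed by "value N: R:... D:... V:..." lines. dump_column_values is a hypothetical helper name.

```shell
# Print the values of one column from "dump" output read on stdin.
# Argument: the column name, for example N_NAME.
dump_column_values() {
  awk -v col="$1" '
    # A two-word line that is not a value line is a column header.
    NF == 2 && $1 != "value" { current = $2; next }
    $1 == "value" && current == col {
      idx = index($0, " V:")
      if (idx > 0) print substr($0, idx + 3)
    }
  '
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet | dump_column_values N_NAME
```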
