How to build and use parquet-tools to read parquet files
Goal:
How to build and use parquet-tools to read parquet files.
Solution:
1. Download and install Maven.
Follow the link below: http://maven.apache.org/download.cgi
2. Download the parquet source code
    git clone https://github.com/Parquet/parquet-mr.git
3. Build the parquet-tools.
    cd parquet-mr/parquet-tools/
    mvn clean package -Plocal
Note: you may hit an error such as the following:
    Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, but that version does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
    <version>1.6.1-SNAPSHOT</version>
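The version edit can also be scripted. The following is a minimal sketch, assuming both pom.xml files contain the literal element <version>1.6.0rc3-SNAPSHOT</version>. The function name replace_pom_version is hypothetical and not part of parquet-mr. It rewrites every matching <version> element in the files it is given, so review the result before building.

```shell
#!/bin/sh
# Hypothetical helper (not part of parquet-mr): replace one version string
# with another inside <version>...</version> in each file given.
replace_pom_version() {
    old="$1"; new="$2"; shift 2
    # Escape dots so sed matches the old version literally.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    for pom in "$@"; do
        # Write to a temporary file first; "sed -i" differs between platforms.
        sed "s|<version>${old_re}</version>|<version>${new}</version>|g" "$pom" > "$pom.tmp" &&
            mv "$pom.tmp" "$pom"
    done
}

# Example usage, run from the directory that contains parquet-mr:
# replace_pom_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT \
#     parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml
```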
4. Show help manual
    cd target
    java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
5. Dump the schema
Take the sample file nation.parquet as an example.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
    message root {
      required int64 N_NATIONKEY;
      required binary N_NAME (UTF8);
      required int64 N_REGIONKEY;
      required binary N_COMMENT (UTF8);
    }
6. Read the data
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    (... ...)
7. Read first n records
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    N_NATIONKEY = 1
    N_NAME = ARGENTINA
    N_REGIONKEY = 1
    N_COMMENT = al foxes promise sly

    N_NATIONKEY = 2
    N_NAME = BRAZIL
    N_REGIONKEY = 1
    N_COMMENT = y alongside of the p
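Because cat and head print each field as a "name = value" line, the output is easy to filter with standard tools. Here is a minimal sketch, assuming a flat schema like the sample file. extract_field is a hypothetical helper, not a parquet-tools command.

```shell
#!/bin/sh
# Hypothetical helper: read "name = value" lines on stdin and print
# the values of the field whose name is given as the first argument.
extract_field() {
    awk -v key="$1" '
        {
            # Split on the first " = " only, so values may contain "=".
            sep = index($0, " = ")
            if (sep > 0 && substr($0, 1, sep - 1) == key)
                print substr($0, sep + 3)
        }'
}

# Example usage:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | extract_field N_NAME
```

For the three records shown above, this would print ALGERIA, ARGENTINA and BRAZIL, one per line.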
8. Show meta info
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
    file: file:/tmp/nation.parquet
    creator: parquet-mr

    file schema: root
    --------------------------------------------------------------------------------
    N_NATIONKEY: REQUIRED INT64 R:0 D:0
    N_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
    N_REGIONKEY: REQUIRED INT64 R:0 D:0
    N_COMMENT: REQUIRED BINARY O:UTF8 R:0 D:0

    row group 1: RC:25 TS:1352 OFFSET:4
    --------------------------------------------------------------------------------
    N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
    N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
    N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
    N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
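In the sample above, the SZ field appears to read as compressed size/uncompressed size/ratio (for N_NATIONKEY, 219/130 is about 1.68), and the uncompressed sizes add up to the row group's TS value. The sketch below totals both sizes from meta output under that reading. sum_chunk_sizes is a hypothetical helper, not part of parquet-tools.

```shell
#!/bin/sh
# Hypothetical helper: read "meta" output on stdin and print the total
# compressed and uncompressed bytes across all column chunks.
# Assumes each chunk line carries a field of the form SZ:<comp>/<uncomp>/<ratio>.
sum_chunk_sizes() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^SZ:[0-9]+\/[0-9]+\//) {
                    # Drop the "SZ:" prefix, then split the three parts.
                    split(substr($i, 4), parts, "/")
                    compressed += parts[1]
                    uncompressed += parts[2]
                }
            }
        }
        END { printf "%d %d\n", compressed, uncompressed }'
}

# Example usage:
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_chunk_sizes
```

Fed the four column lines shown above, it prints 944 1352, and the second number matches TS:1352.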
9. Dump all data
Note: Values are printed column by column rather than row by row.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
    INT64 N_NATIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:2
    (...)

    BINARY N_NAME
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:ALGERIA
    value 2: R:0 D:0 V:ARGENTINA
    value 3: R:0 D:0 V:BRAZIL
    (...)

    INT64 N_REGIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:1
    (...)

    BINARY N_COMMENT
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V: haggle. carefully f
    value 2: R:0 D:0 V:al foxes promise sly
    value 3: R:0 D:0 V:y alongside of the p
    (...)