How to build and use parquet-tools to read parquet files
Goal:
How to build and use parquet-tools to read parquet files.
Solution:
1. Download and install Maven.
Follow the instructions at http://maven.apache.org/download.cgi
2. Download the parquet source code
    git clone https://github.com/Parquet/parquet-mr.git
3. Build the parquet-tools.
    cd parquet-mr/parquet-tools/
    mvn clean package -Plocal
Note: the build may fail with an error such as:
    Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
    <version>1.6.1-SNAPSHOT</version>
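The version change can also be scripted. The sketch below is only an illustration: it assumes the stale version appears literally as <version>1.6.0rc3-SNAPSHOT</version> in both pom.xml files, and set_pom_version is a hypothetical helper name, not something shipped with parquet-mr. It rewrites every matching occurrence, so review the resulting files before building.

```shell
#!/bin/sh
# set_pom_version OLD NEW FILE...
# Replaces each literal <version>OLD</version> with <version>NEW</version>
# in every FILE. Hypothetical helper for illustration only.
set_pom_version() {
    old=$1; new=$2; shift 2
    # Escape dots so they match literally inside the sed pattern.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    for pom in "$@"; do
        # Write to a temp file and move it back (portable alternative to sed -i).
        sed "s|<version>${old_re}</version>|<version>${new}</version>|g" "$pom" > "$pom.tmp" &&
            mv "$pom.tmp" "$pom"
    done
}

# Usage, run from the directory that contains parquet-mr:
# set_pom_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT \
#     parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml
```

After running it, rerun mvn clean package -Plocal from parquet-mr/parquet-tools/.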
4. Show help manual
    cd target
    java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
5. Dump the schema
Take the sample file nation.parquet as an example.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
    message root {
      required int64 N_NATIONKEY;
      required binary N_NAME (UTF8);
      required int64 N_REGIONKEY;
      required binary N_COMMENT (UTF8);
    }
6. Read the data
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    (... ...)
7. Read first n records
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    N_NATIONKEY = 1
    N_NAME = ARGENTINA
    N_REGIONKEY = 1
    N_COMMENT = al foxes promise sly

    N_NATIONKEY = 2
    N_NAME = BRAZIL
    N_REGIONKEY = 1
    N_COMMENT = y alongside of the p
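The records printed by cat and head can be reshaped into one line per record for further processing. This is a rough sketch, assuming each field is printed on its own line as NAME = value and records are separated by a blank line; records_to_tsv is a hypothetical helper name.

```shell
#!/bin/sh
# records_to_tsv: read "NAME = value" records (blank-line separated) on stdin
# and print one tab-separated line of values per record.
records_to_tsv() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per block
    {
        row = ""
        for (i = 1; i <= NF; i++) {
            val = $i
            sub(/^[^=]* = /, "", val)   # drop the leading "NAME = "
            row = row (i > 1 ? "\t" : "") val
        }
        print row
    }'
}

# Sample input taken from the nation.parquet records.
records_to_tsv <<'EOF'
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
EOF
```

With the real tool, the output of the head or cat command would be piped into records_to_tsv in place of the sample text.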
8. Show meta info
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
    file: file:/tmp/nation.parquet
    creator: parquet-mr

    file schema: root
    --------------------------------------------------------------------------------
    N_NATIONKEY: REQUIRED INT64 R:0 D:0
    N_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
    N_REGIONKEY: REQUIRED INT64 R:0 D:0
    N_COMMENT: REQUIRED BINARY O:UTF8 R:0 D:0

    row group 1: RC:25 TS:1352 OFFSET:4
    --------------------------------------------------------------------------------
    N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
    N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
    N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
    N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
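In the row group section, the sample numbers are consistent with SZ meaning compressed bytes / uncompressed bytes / ratio: for N_NATIONKEY, 219 / 130 is about 1.68, and the four uncompressed sizes (219 + 296 + 218 + 619) add up to TS:1352. The sketch below recomputes the ratio with awk as a quick check. sz_ratio is a hypothetical helper name, and it assumes each column line carries a single SZ:a/b/c field with no spaces inside it.

```shell
#!/bin/sh
# sz_ratio: read column-chunk lines from `meta` output on stdin and print
# each column label with its uncompressed/compressed ratio from the SZ field.
sz_ratio() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SZ:/) {
                # After "SZ:": sz[1] = compressed bytes, sz[2] = uncompressed bytes
                split(substr($i, 4), sz, "/")
                printf "%s %.2f\n", $1, sz[2] / sz[1]
            }
        }
    }'
}

# Sample lines taken from the nation.parquet meta output.
sz_ratio <<'EOF'
N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
EOF
```

The recomputed values should match the third number of each SZ field.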
9. Dump all data
Note: values are printed column by column, not row by row.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
    INT64 N_NATIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:2
    (...)

    BINARY N_NAME
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:ALGERIA
    value 2: R:0 D:0 V:ARGENTINA
    value 3: R:0 D:0 V:BRAZIL
    (...)

    INT64 N_REGIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:1
    (...)

    BINARY N_COMMENT
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V: haggle. carefully f
    value 2: R:0 D:0 V:al foxes promise sly
    value 3: R:0 D:0 V:y alongside of the p
    (...)
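Because dump prints one column at a time, it can help to tag every value with its column name. The following is a minimal sketch, assuming the layout shown for nation.parquet: a "TYPE NAME" header line per column, followed by lines of the form value N: R:x D:y V:data. dump_to_tsv is a hypothetical helper name.

```shell
#!/bin/sh
# dump_to_tsv: read `dump` output on stdin and print "column<TAB>value"
# for every value line, using the most recent "TYPE NAME" header as the column.
dump_to_tsv() {
    awk '
        /^[A-Z0-9_]+ [^ ]+$/ { col = $2; next }     # e.g. "INT64 N_NATIONKEY"
        /^value [0-9]+:/ {
            # Strip the "value N: R:x D:y V:" prefix, keep the data as printed.
            sub(/^value [0-9]+: +R:[0-9]+ D:[0-9]+ V:/, "")
            print col "\t" $0
        }
    '
}

# Sample input taken from the nation.parquet dump.
dump_to_tsv <<'EOF'
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:ALGERIA
EOF
```

Row group markers, separator lines, and elided sections are ignored, so only value lines reach the output.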