Apache Spark DataFrames Getting Started Guide: Creating a DataFrame

May 30, 2017

First, create a DataFrame from a CSV file

This article shows how to create a DataFrame from a CSV file.

How to do it

Creating a DataFrame from a CSV file consists of the following steps:
1, add the spark-csv support library to the build.sbt file;
2, create a SparkConf object, which holds the configuration the Spark application runs with;
3, create a SparkContext object, the core entry point into Spark, and use it to create a SQLContext object;
4, use the SQLContext object to load the CSV file;
5, Spark 1.x does not include built-in support for parsing CSV files, but Databricks developed the spark-csv library, which adds it. So we need to add this library to the project's dependency file (pom.xml or build.sbt).
If you have an SBT project, add the following dependency to the build.sbt file:

libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.3.0"

If you have a Maven project, add the following dependency to the pom.xml file:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

6, SparkConf holds the configuration needed to run the Spark program. In this example we run the program locally on 2 cores (local[2]); the relevant code snippet is:

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

7, use the SparkConf object to initialize a SparkContext object, the core entry point into Spark:

import org.apache.spark.SparkContext

val sc = new SparkContext(conf)

One of the easiest ways to query data in Spark is to use SQL queries, so we can define a SQLContext object:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

8, now we can load the data prepared in advance:

import com.databricks.spark.csv._

val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

Here, the students object is of type org.apache.spark.sql.DataFrame.
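
The steps above can be assembled into one small application. The following is only a sketch: it assumes Spark 1.x (spark-core and spark-sql) is on the classpath next to spark-csv 1.3.0, and the names CsvDataFrameApp and loadStudents are hypothetical, chosen for illustration. It also registers the DataFrame as a temporary table so that it can be queried with SQL.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.databricks.spark.csv._ // adds csvFile to SQLContext

// Hypothetical wrapper object; not part of Spark or spark-csv.
object CsvDataFrameApp {

  // Loads a pipe-delimited file whose first line holds the column names.
  def loadStudents(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.csvFile(filePath = path, useHeader = true, delimiter = '|')

  def main(args: Array[String]): Unit = {
    // Run locally on 2 cores, as in the steps above.
    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val students = loadStudents(sqlContext, "StudentData.csv")

      students.printSchema() // column names and types
      students.show(5)       // first five rows

      // Expose the DataFrame to SQL under the (assumed) table name "students".
      students.registerTempTable("students")
      sqlContext.sql("SELECT studentName, email FROM students").show(5)
    } finally {
      sc.stop() // release the local Spark resources
    }
  }
}
```

Stopping the SparkContext in a finally block keeps a failed load from leaving the local Spark instance running.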

How it works

The csvFile method takes the path of the CSV file to load as filePath. If the file has a header line, set useHeader to true so that the first line is read as the column names. delimiter specifies the character that separates the columns in the file.
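
To make the roles of useHeader and delimiter concrete, here is a plain-Scala sketch of the idea, with no Spark involved. It is not how spark-csv is implemented: it ignores quoting and escaping, the object and method names are hypothetical, and the C0, C1, ... fallback names for files without a header are an assumption about the library's defaults.

```scala
// Hypothetical illustration of header and delimiter handling; no Spark needed.
object DelimitedLineSketch {

  // Splits one line on the delimiter, keeping trailing empty fields.
  def splitLine(line: String, delimiter: Char): Array[String] =
    line.split(java.util.regex.Pattern.quote(delimiter.toString), -1)

  // Turns raw lines into one column-name -> value map per data row.
  def parse(lines: Seq[String], useHeader: Boolean, delimiter: Char): Seq[Map[String, String]] =
    if (lines.isEmpty) Seq.empty
    else {
      val first = splitLine(lines.head, delimiter)
      val (names, rows) =
        if (useHeader) (first.toSeq, lines.tail)      // first line names the columns
        else (first.indices.map(i => s"C$i"), lines)  // assumed fallback names
      rows.map(row => names.zip(splitLine(row, delimiter)).toMap)
    }
}
```

For example, DelimitedLineSketch.parse(Seq("name|phone", "Burke|1-300-746-8446"), useHeader = true, delimiter = '|') yields a single row that maps name to Burke and phone to 1-300-746-8446.
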
Besides the csvFile function, we can also load the CSV file through sqlContext's generic reader and its load method:

val options = Map("header" -> "true", "delimiter" -> "|", "path" -> "E:\\StudentData.csv")

val newStudents = sqlContext.read.options(options).format("com.databricks.spark.csv").load()
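
The reader takes its options as a map of strings to strings, so the header flag and the delimiter have to be converted to strings before they are passed in. A small helper can keep them typed until that point. This is a minimal sketch: csvOptions and CsvOptionsSketch are hypothetical names, while "path", "header" and "delimiter" are the spark-csv option keys.

```scala
// Hypothetical helper; not part of Spark or spark-csv.
object CsvOptionsSketch {

  // Builds the String -> String option map for the spark-csv data source.
  def csvOptions(path: String, header: Boolean, delimiter: Char): Map[String, String] =
    Map(
      "path" -> path,
      "header" -> header.toString,      // "true" or "false"
      "delimiter" -> delimiter.toString // single-character column separator
    )
}

// Usage with the reader shown above (needs Spark 1.4+ for sqlContext.read):
//   val opts = CsvOptionsSketch.csvOptions("StudentData.csv", header = true, delimiter = '|')
//   val df = sqlContext.read.format("com.databricks.spark.csv").options(opts).load()
```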

Appendix

To make testing easier, here is part of the StudentData.csv data set:

id|studentName|phone|email

Burke|1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk

Kamal|1-668-571-5046|pede.Suspendisse@interdumenim.edu

Olga|1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu

Belle|1-246-894-6340|vitae.aliquet.nec@neque.co.uk

Trevor|1-300-527-4967|dapibus.id@acturpisegestas.net

Laurel|1-691-379-9921|adipiscing@consectetueripsum.edu

Sara|1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu

Kaseem|1-881-586-2689|cursus.et.magna@euismod.org

Lev|1-916-367-5608|Vivamus.nisi@ipsumdolor.com

Maya|1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu

Emi|1-467-270-1337|est@nunc.com

Caleb|1-683-212-0896|Suspendisse@Quisque.edu

Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca

Anika|1-856-828-7883|euismod@ligulaelit.co.uk

Tarik|1-398-171-2268|turpis@felisorci.com

Amena|1-878-250-3129|lorem.luctus.ut@scelerisque.com

Blossom|1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk

Guy|1-869-521-3230|senectus.et.netus@lectusrutrum.com

Malachi|1-608-637-2772|Proin.mi.Aliquam@estarcu.net

Edward|1-711-710-6552|lectus@aliquetlibero.co.uk
