Apache Spark DataFrames Getting Started Guide: Creating a DataFrame

First, let's create a DataFrame from a CSV file

This article will show you how to create a DataFrame from a CSV file.

How to do it

Creating a DataFrame from a CSV file consists of the following steps:
1. Add the spark-csv support library to the build.sbt file;
2. Create a SparkConf object, which holds all the environment information needed to run Spark;
3. Create a SparkContext object, the core entry point into Spark, through which we can then create a SQLContext object;
4. Use the SQLContext object to load the CSV file;
5. Spark does not ship with built-in CSV parsing, but Databricks has developed a library that can parse CSV files, so we need to add this dependency to the project build file (pom.xml or build.sbt).
If yours is an SBT project, add the following dependency to the build.sbt file:
libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.3.0"
If yours is a Maven project, add the following dependency to the pom.xml file:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>
6. SparkConf holds all the information needed to run the Spark program. In this example we will run the program locally, using 2 cores (local[2]); the relevant code snippet is as follows:
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
7. Use the SparkConf to initialize a SparkContext object; SparkContext is the core entry point into Spark:
import org.apache.spark.SparkContext
val sc = new SparkContext(conf)
One of the easiest ways to query data in Spark is with SQL queries, so we can define a SQLContext object:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
8. Now we can load the data we prepared earlier:
import com.databricks.spark.csv._
val students=sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')
The students object returned here is of type org.apache.spark.sql.DataFrame.
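
With the DataFrame in hand we can query it right away. Here is a minimal sketch (it assumes the StudentData.csv sample shown in the appendix below): register the DataFrame as a temporary table, then query it through sqlContext:
// Register the DataFrame as a temporary table so it can be queried with SQL.
students.registerTempTable("students")
// The column names (id, studentName, phone, email) come from the CSV header line.
val firstFive = sqlContext.sql("SELECT id, studentName, email FROM students LIMIT 5")
firstFive.show()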

How it works

The csvFile method receives the path of the CSV file to load through filePath. If the CSV file carries a header line, we can set useHeader to true so that the first line is read as the column names; delimiter specifies the separator between columns in the CSV file.
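A quick way to confirm that the header was picked up is to print the schema. Note that spark-csv reads every column as a string unless schema inference is enabled, so the expected output (a sketch, assuming the appendix data) looks like this:
students.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- studentName: string (nullable = true)
//  |-- phone: string (nullable = true)
//  |-- email: string (nullable = true)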
In addition to the csvFile function, we can also load the CSV file through sqlContext's read/load interface:
val options = Map(
  "header" -> "true",
  "delimiter" -> "|", // StudentData.csv is pipe-delimited (see the appendix)
  "path" -> "E:\\StudentData.csv")
val newStudents = sqlContext.read.options(options).format("com.databricks.spark.csv").load()
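spark-csv can also write a DataFrame back out as CSV. A minimal sketch, assuming a hypothetical output path (Spark writes a directory of part files at that location):
newStudents.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|") // keep the pipe delimiter on output
  .save("E:\\StudentDataOut") // hypothetical output directory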

Appendix

To make it easier for everyone to test, here is part of the StudentData.csv data set:
id|studentName|phone|email
1|Burke|1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk
2|Kamal|1-668-571-5046|pede.Suspendisse@interdumenim.edu
3|Olga|1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu
4|Belle|1-246-894-6340|vitae.aliquet.nec@neque.co.uk
5|Trevor|1-300-527-4967|dapibus.id@acturpisegestas.net
6|Laurel|1-691-379-9921|adipiscing@consectetueripsum.edu
7|Sara|1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu
8|Kaseem|1-881-586-2689|cursus.et.magna@euismod.org
9|Lev|1-916-367-5608|Vivamus.nisi@ipsumdolor.com
10|Maya|1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu
11|Emi|1-467-270-1337|est@nunc.com
12|Caleb|1-683-212-0896|Suspendisse@Quisque.edu
13|Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca
14|Anika|1-856-828-7883|euismod@ligulaelit.co.uk
15|Tarik|1-398-171-2268|turpis@felisorci.com
16|Amena|1-878-250-3129|lorem.luctus.ut@scelerisque.com
17|Blossom|1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk
18|Guy|1-869-521-3230|senectus.et.netus@lectusrutrum.com
19|Malachi|1-608-637-2772|Proin.mi.Aliquam@estarcu.net
20|Edward|1-711-710-6552|lectus@aliquetlibero.co.uk
