Apache Spark DataFrames Getting Started Guide: Working with DataFrame

Part 2: DataFrame operations

In the previous article, we introduced how to create a DataFrame. This article describes how to manipulate the data in a DataFrame and how to print it out.

Print the schema of the DataFrame

After creating a DataFrame, we usually want to look at its schema first. The printSchema function does this by printing the name and type of each column:
students.printSchema
root
 |-- id: string (nullable = true)
 |-- studentName: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)
If you created the DataFrame with the load method, the output of students.printSchema is as follows (the whole header line is treated as a single column):
root
 |-- id|studentName|phone|email: string (nullable = true)
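The schema can also be inspected from code rather than read off the console. The following is a minimal sketch, assuming Spark 2.x or later (SparkSession) and a small in-memory DataFrame whose rows are copied from the sample data in this article. The single-column check is an illustrative guard, not part of the original workflow.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()
import spark.implicits._

// Two sample rows copied from the students data shown in this article.
val students = Seq(
  ("1", "Burke", "1-300-746-8446", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "1-668-571-5046", "pede.Suspendisse@interdumenim.edu")
).toDF("id", "studentName", "phone", "email")

// Column names as a plain array, in schema order.
val columnNames: Array[String] = students.columns

// (name, type) pairs built from the StructType that printSchema renders.
val nameAndType: Array[(String, String)] =
  students.schema.fields.map(f => (f.name, f.dataType.simpleString))

// A header that was not split on the delimiter shows up as one wide column.
val looksUnsplit: Boolean =
  columnNames.length == 1 && columnNames.head.contains("|")

nameAndType.foreach { case (name, tpe) => println(s"$name: $tpe") }
```

A check like looksUnsplit could run right after loading, so that a wrong delimiter is noticed before any select or filter fails on a missing column.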

Sample the data in the DataFrame

After printing the schema, the second thing to do is look at the data loaded into the DataFrame. There are several ways to sample data from a newly created DataFrame, which we introduce below.
The simplest is the show method, which has four versions:
(1) The first takes the number of rows to display: def show(numRows: Int);
(2) The second takes no parameters, in which case show displays 20 rows by default: def show();
(3) The third takes a Boolean that controls whether strings longer than 20 characters are truncated: def show(truncate: Boolean);
(4) The last takes both the number of rows to display and whether to truncate long strings: def show(numRows: Int, truncate: Boolean). In fact, the first three versions are implemented by calling this one.
Unlike the other sampling functions, show not only displays the requested rows but also prints the header, and it writes directly to the default output stream. Here is how to use it:
students.show()  // prints 20 rows
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
|  1|      Burke|1-300-746-8446|ullamcorper.velit...|
|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...|
| 11|        Emi|1-467-270-1337|        est@nunc.com|
| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17|    Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|        Guy|1-869-521-3230|senectus.et.netus...|
| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|     Edward|1-711-710-6552|lectus@aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top 20 rows
students.show(15)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
|  1|      Burke|1-300-746-8446|ullamcorper.velit...|
|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...|
| 11|        Emi|1-467-270-1337|        est@nunc.com|
| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
+---+-----------+--------------+--------------------+
only showing top 15 rows
 
students.show(true)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
|  1|      Burke|1-300-746-8446|ullamcorper.velit...|
|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...|
| 11|        Emi|1-467-270-1337|        est@nunc.com|
| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17|    Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|        Guy|1-869-521-3230|senectus.et.netus...|
| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|     Edward|1-711-710-6552|lectus@aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top 20 rows
 
students.show(false)
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone         |email                                    |
+---+-----------+--------------+-----------------------------------------+
|1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|
|2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |
|3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |
|4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |
|5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |
|6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |
|7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |
|8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |
|9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |
|10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |
|11 |Emi        |1-467-270-1337|est@nunc.com                             |
|12 |Caleb      |1-683-212-0896|Suspendisse@Quisque.edu                  |
|13 |Florence   |1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca   |
|14 |Anika      |1-856-828-7883|euismod@ligulaelit.co.uk                 |
|15 |Tarik      |1-398-171-2268|turpis@felisorci.com                     |
|16 |Amena      |1-878-250-3129|lorem.luctus.ut@scelerisque.com          |
|17 |Blossom    |1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk        |
|18 |Guy        |1-869-521-3230|senectus.et.netus@lectusrutrum.com       |
|19 |Malachi    |1-608-637-2772|Proin.mi.Aliquam@estarcu.net             |
|20 |Edward     |1-711-710-6552|lectus@aliquetlibero.co.uk               |
+---+-----------+--------------+-----------------------------------------+
only showing top 20 rows
 
students.show(10,false)
 
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone         |email                                    |
+---+-----------+--------------+-----------------------------------------+
|1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|
|2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |
|3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |
|4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |
|5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |
|6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |
|7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |
|8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |
|9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |
|10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |
+---+-----------+--------------+-----------------------------------------+
only showing top 10 rows
We can also use the head(n: Int) method to sample the data. It takes the number of rows to sample and returns an array of Row objects, which we have to iterate over to print. We can also call head() with no arguments, which returns only the first row, of type Row.
students.head(5).foreach(println)
[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
[2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
[3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
[4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
[5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]
println(students.head())
[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
Besides show and head, we can also use the first and take functions, which call head() and head(n) respectively:
println(students.first())
[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
students.take(5).foreach(println)
[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
[2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
[3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
[4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
[5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]
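Because head, first and take return Row objects rather than printing them, individual fields can be read from the result. The following is a minimal sketch, assuming Spark 2.x or later and a small in-memory DataFrame built from rows shown in this article. It reads fields by position and by column name.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("row-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the students data shown in this article.
val students = Seq(
  ("1", "Burke", "1-300-746-8446", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "1-668-571-5046", "pede.Suspendisse@interdumenim.edu"),
  ("3", "Olga", "1-956-311-1686", "Aenean.eget.metus@dictumcursusNunc.edu")
).toDF("id", "studentName", "phone", "email")

// head() returns a single Row.
val firstRow: Row = students.head()

// Positional access: studentName is the second column (index 1).
val nameByIndex: String = firstRow.getString(1)

// Access by column name; the type parameter must match the column type.
val emailByName: String = firstRow.getAs[String]("email")

// take(n) returns Array[Row], so ordinary collection methods apply.
val firstTwoNames: Array[String] =
  students.take(2).map(_.getAs[String]("studentName"))

println(s"$nameByIndex <$emailByName>")
```

Access by name is usually less fragile than access by index, since it keeps working if the column order changes.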

Query the columns inside the DataFrame

As you can see, all the columns in the DataFrame are named. The select function picks the required columns from a DataFrame and returns a new DataFrame, as described below.
1. Select a single column. Suppose we only want the email column. Because a DataFrame is immutable, select returns a new DataFrame:
val emailDataFrame: DataFrame = students.select("email")
Now we have a new DataFrame named emailDataFrame that contains only the email column. Let's use show to check that this is the case:
emailDataFrame.show(3)
+--------------------+
|               email|
+--------------------+
|ullamcorper.velit...|
|pede.Suspendisse@...|
|Aenean.eget.metus...|
+--------------------+
only showing top 3 rows
2. Select multiple columns. The select function also accepts several column names:
val studentEmailDF = students.select("studentName", "email")
studentEmailDF.show(5)
+-----------+--------------------+
|studentName|               email|
+-----------+--------------------+
|      Burke|ullamcorper.velit...|
|      Kamal|pede.Suspendisse@...|
|       Olga|Aenean.eget.metus...|
|      Belle|vitae.aliquet.nec...|
|     Trevor|dapibus.id@acturp...|
+-----------+--------------------+
only showing top 5 rows
Note that the columns we select must be valid; in other words, they must be among the columns printed by printSchema. If a column name is invalid, an org.apache.spark.sql.AnalysisException is thrown, as follows:
val studentEmailDF = students.select("studentName", "iteblog")
studentEmailDF.show(5)
 
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'iteblog' given input columns id, studentName, phone, email;
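One way to avoid this exception is to compare the requested names with the DataFrame's columns before calling select. The following is a minimal sketch, assuming Spark 2.x or later and a small in-memory DataFrame. The names requested, known and unknown are illustrative and do not come from the original program.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the students data shown in this article.
val students = Seq(
  ("1", "Burke", "1-300-746-8446", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "1-668-571-5046", "pede.Suspendisse@interdumenim.edu")
).toDF("id", "studentName", "phone", "email")

// Split the requested names into those the DataFrame has and those it lacks.
val requested = Seq("studentName", "iteblog")
val (known, unknown) = requested.partition(name => students.columns.contains(name))

// Select only the columns that exist.
val safeSelection = students.select(known.map(name => col(name)): _*)

// Selecting the invalid name directly fails as soon as select is called.
val failedOnBadColumn: Boolean =
  try { students.select("studentName", "iteblog"); false }
  catch { case _: AnalysisException => true }

println(s"skipped columns: ${unknown.mkString(", ")}")
```

The membership test here is case-sensitive, whereas Spark resolves column names case-insensitively by default, so this check is slightly stricter than Spark itself.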

Filter the data according to the criteria

Now that we know how to select the required columns from a DataFrame, let's look at how to filter its rows by a condition. Since the data is row-based, we can treat the DataFrame like a regular Scala collection and filter it by the relevant conditions. To make the results easy to see, I call show after each filter:
students.filter("id > 5").show(10)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...|
| 11|        Emi|1-467-270-1337|        est@nunc.com|
| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
+---+-----------+--------------+--------------------+
only showing top 10 rows
 
students.filter("studentName =''").show(7)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21|           |1-598-439-7549|consectetuer.adip...|
| 32|           |1-184-895-9602|accumsan.laoreet@...|
| 45|           |1-245-752-0481|Suspendisse.eleif...|
| 83|           |1-858-810-2204|sociis.natoque@eu...|
| 94|           |1-443-410-7878|Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+
Look at the first filter statement: although id was parsed as a String, the numeric comparison still works correctly. We can also filter on multiple conditions:
students.filter("studentName ='' OR studentName = 'NULL'").show(7)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21|           |1-598-439-7549|consectetuer.adip...|
| 32|           |1-184-895-9602|accumsan.laoreet@...|
| 33|       NULL|1-105-503-0141|Donec@Inmipede.co.uk|
| 45|           |1-245-752-0481|Suspendisse.eleif...|
| 83|           |1-858-810-2204|sociis.natoque@eu...|
| 94|           |1-443-410-7878|Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+
We can also use SQL-like syntax to filter the data:
students.filter("SUBSTR(studentName,0,1) ='M'").show(7)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 10|       Maya|1-271-683-2698|accumsan.convalli...|
| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 24|    Marsden|1-477-629-7528|Donec.dignissim.m...|
| 37|      Maggy|1-910-887-6777|facilisi.Sed.nequ...|
| 61|     Maxine|1-422-863-3041|aliquet.molestie....|
| 77|      Maggy|1-613-147-4380| pellentesque@mi.net|
| 97|    Maxwell|1-607-205-1273|metus.In@musAenea...|
+---+-----------+--------------+--------------------+
only showing top 7 rows
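The string filters above can also be written with Column expressions, which makes the cast of the string id explicit. The following is a minimal sketch, assuming Spark 2.x or later and a reduced two-column DataFrame whose ids and names are taken from the outputs above. It mirrors the three string filters rather than replacing them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Reduced sample: only id and studentName, with ids kept as strings.
// "NULL" is the literal four-letter text, as in the data above, not a SQL null.
val students = Seq(
  ("5", "Trevor"),
  ("10", "Maya"),
  ("21", ""),
  ("33", "NULL")
).toDF("id", "studentName")

// Explicit cast, so the comparison is numeric and not text-based.
val idAboveFive = students.filter(col("id").cast("int") > 5)

// Two conditions combined with ||, the Column equivalent of SQL OR.
val blankOrNullText =
  students.filter(col("studentName") === "" || col("studentName") === "NULL")

// Prefix match, comparable to the SUBSTR-based filter.
val startsWithM = students.filter(col("studentName").startsWith("M"))

idAboveFive.show()
```

Column expressions are checked by the Scala compiler for syntax, while a mistake inside a filter string only surfaces when Spark parses it at run time.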

Sort the data inside the DataFrame

Using the sort function, we can sort the DataFrame by the specified columns:
students.sort(students("studentName").desc).show(7)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 50|      Yasir|1-282-511-4445|eget.odio.Aliquam...|
| 52|       Xena|1-527-990-8606|in.faucibus.orci@...|
| 86|     Xandra|1-677-708-5691|libero@arcuVestib...|
| 43|     Wynter|1-440-544-1851|amet.risus.Donec@...|
| 31|    Wallace|1-144-220-8159| lorem.lorem@non.net|
| 66|      Vance|1-268-680-0857|pellentesque@netu...|
| 41|     Tyrone|1-907-383-5293|non.bibendum.sed@...|
+---+-----------+--------------+--------------------+
only showing top 7 rows
You can also sort by multiple columns:
students.sort("studentName", "id").show(10)
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21|           |1-598-439-7549|consectetuer.adip...|
| 32|           |1-184-895-9602|accumsan.laoreet@...|
| 45|           |1-245-752-0481|Suspendisse.eleif...|
| 83|           |1-858-810-2204|sociis.natoque@eu...|
| 94|           |1-443-410-7878|Praesent.eu.nulla...|
| 91|       Abel|1-530-527-7467|    urna@veliteu.edu|
| 69|       Aiko|1-682-230-7013|turpis.vitae.puru...|
| 47|       Alma|1-747-382-6775|    nec.enim@non.org|
| 26|      Amela|1-526-909-2605| in@vitaesodales.edu|
| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
+---+-----------+--------------+--------------------+
only showing top 10 rows
From the results above, we can see that the default sort order is ascending. The statement above can also be written as follows:
students.sort(students("studentName").asc, students("id").asc).show(10)
Both statements produce the same result.
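Since id is a string column in this data, sorting on it compares text rather than numbers. The following is a minimal sketch, assuming Spark 2.x or later and a reduced two-column DataFrame with ids taken from the data above. It contrasts the text order with a numeric order obtained through an explicit cast.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import spark.implicits._

// Reduced sample: ids are strings, as in the schema printed earlier.
val students = Seq(
  ("9", "Lev"),
  ("10", "Maya"),
  ("2", "Kamal")
).toDF("id", "studentName")

// Sorting the string column compares character by character.
val idsAsText: Array[String] =
  students.sort("id").select("id").as[String].collect()

// Casting inside the sort expression gives a numeric order.
val idsAsNumbers: Array[String] =
  students.sort(col("id").cast("int")).select("id").as[String].collect()

// Directions can be mixed per column.
val mixed = students.sort(col("studentName").desc, col("id").cast("int").asc)

println(idsAsText.mkString(", "))
println(idsAsNumbers.mkString(", "))
```

The cast only affects the ordering; the id column in the result keeps its string type.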

Rename the column

If we do not like a default column name in the DataFrame, we can rename the column with as when selecting it. The following statement renames the studentName column to name and leaves the email column name unchanged:
students.select(students("studentName").as("name"), students("email")).show(10)
+--------+--------------------+
|    name|               email|
+--------+--------------------+
|   Burke|ullamcorper.velit...|
|   Kamal|pede.Suspendisse@...|
|    Olga|Aenean.eget.metus...|
|   Belle|vitae.aliquet.nec...|
|  Trevor|dapibus.id@acturp...|
|  Laurel|adipiscing@consec...|
|    Sara|Donec.nibh@enimEt...|
|  Kaseem|cursus.et.magna@e...|
|     Lev|Vivamus.nisi@ipsu...|
|    Maya|accumsan.convalli...|
+--------+--------------------+
only showing top 10 rows
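Renaming through select also drops every column that is not listed. When the other columns should be kept, withColumnRenamed is an alternative. The following is a minimal sketch, assuming Spark 2.x or later and a small in-memory DataFrame built from rows shown in this article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the students data shown in this article.
val students = Seq(
  ("1", "Burke", "1-300-746-8446", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "1-668-571-5046", "pede.Suspendisse@interdumenim.edu")
).toDF("id", "studentName", "phone", "email")

// Rename one column and keep all the others in place.
val renamedInPlace = students.withColumnRenamed("studentName", "name")

// Rename while selecting: only the listed columns survive.
val renamedAndPicked = students.select(col("studentName").alias("name"), col("email"))

renamedInPlace.printSchema()
```

Both calls return new DataFrames; the original students DataFrame still has its studentName column.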

Think of DataFrame as a relational data table

One of the strengths of DataFrame is that we can think of it as a relational data table and then run SQL queries on it as long as we do the following two steps:
(1) Register the DataFrame as a table named students:
students.registerTempTable("students")
(2) Then run a standard SQL query against it:
sqlContext.sql("select * from students where studentName!='' order by email desc").show(7)
 
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 87|      Selma|1-601-330-4409|vulputate.velit@p...|
| 96|   Channing|1-984-118-7533|viverra.Donec.tem...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
| 78|       Finn|1-213-781-6969|vestibulum.massa@...|
| 53|     Kasper|1-155-575-9346|velit.eget@pedeCu...|
| 63|      Dylan|1-417-943-8961|vehicula.aliquet@...|
| 35|     Cadman|1-443-642-5919|ut.lacus@adipisci...|
+---+-----------+--------------+--------------------+
only showing top 7 rows
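On Spark 2.x and later, registerTempTable is deprecated in favor of createOrReplaceTempView, and queries go through a SparkSession instead of sqlContext. The following is a minimal sketch of the same two steps under that assumption. The blank-name row and its email address are hypothetical, added only so that the where clause has something to exclude.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()
import spark.implicits._

// Three rows from this article plus one hypothetical row with an empty name.
val students = Seq(
  ("1", "Burke", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "pede.Suspendisse@interdumenim.edu"),
  ("3", "Olga", "Aenean.eget.metus@dictumcursusNunc.edu"),
  ("21", "", "blank@example.com") // hypothetical row
).toDF("id", "studentName", "email")

// Step 1: expose the DataFrame to SQL under the name "students".
students.createOrReplaceTempView("students")

// Step 2: run the query; the result is again a DataFrame.
val named = spark.sql(
  "select * from students where studentName != '' order by email desc")

val namesInOrder: Seq[String] =
  named.select("studentName").as[String].collect().toSeq

named.show(false)
```

The descending order compares strings byte by byte, so addresses starting with a lowercase letter sort ahead of those starting with an uppercase letter.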

Join two DataFrames

We already know how to register a DataFrame as a table. Now let's look at how to join two DataFrames using the join types familiar from SQL.
1. Inner join: the inner join is the default join operation. It returns only the rows that match in both DataFrames. Take a look at the following example:
val students1 = sqlContext.csvFile(filePath = "E:\\StudentPrep1.csv", useHeader = true, delimiter = '|')
val students2 = sqlContext.csvFile(filePath = "E:\\StudentPrep2.csv", useHeader = true, delimiter = '|')
val studentsJoin = students1.join(students2, students1("id") === students2("id"))
studentsJoin.show(studentsJoin.count.toInt)
 
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
| id|studentName|         phone|               email| id|       studentName|         phone|               email|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
|  1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|              Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|            Trevor|1-300-527-4967|dapibusDifferentE...|
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|LaurelInvalidPhone|     000000000|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...| 10|              Maya|1-271-683-2698|accumsan.convalli...|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
2. Right outer join: in addition to the inner join results, it also includes every row of the right table that does not satisfy the join condition, with the left-side columns filled with NULL. See the following example:
val studentsRightOuterJoin = students1.join(students2, students1("id") === students2("id"), "right_outer")
studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|  id|studentName|         phone|               email| id|         studentName|         phone|               email|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|   1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|  BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
|   2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|  KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
|   3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|                Olga|1-956-311-1686|Aenean.eget.metus...|
|   4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|  BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
|   5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|              Trevor|1-300-527-4967|dapibusDifferentE...|
|   6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|  LaurelInvalidPhone|     000000000|adipiscing@consec...|
|   7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|                Sara|1-608-140-1995|Donec.nibh@enimEt...|
|   8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|              Kaseem|1-881-586-2689|cursus.et.magna@e...|
|   9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|                 Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
|  10|       Maya|1-271-683-2698|accumsan.convalli...| 10|                Maya|1-271-683-2698|accumsan.convalli...|
|null|       null|          null|                null|999|LevUniqueToSecondRDD|1-916-367-5608|Vivamus.nisi@ipsu...|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
3. Left outer join: in addition to the inner join results, it also includes every row of the left table that does not satisfy the join condition, with the right-side columns filled with NULL. Again, see the following example:
val studentsLeftOuterJoin = students1.join(students2, students1("id") === students2("id"), "left_outer")
studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
| id|studentName|         phone|               email|  id|       studentName|         phone|               email|
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
|  1|      Burke|1-300-746-8446|ullamcorper.velit...|   1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|   2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|   3|              Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|   4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|   5|            Trevor|1-300-527-4967|dapibusDifferentE...|
|  6|     Laurel|1-691-379-9921|adipiscing@consec...|   6|LaurelInvalidPhone|     000000000|adipiscing@consec...|
|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|   7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|   8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|   9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|       Maya|1-271-683-2698|accumsan.convalli...|  10|              Maya|1-271-683-2698|accumsan.convalli...|
| 11|    iteblog|        999999| iteblog@iteblog.com|null|              null|          null|                null|
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
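The three join types can be compared side by side on a small data set that needs no CSV files. The following is a minimal sketch, assuming Spark 2.x or later and reduced two-column DataFrames whose ids and names are taken from the outputs above. It also shows one way to tell the identically named columns apart after the join.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Reduced samples: id 11 exists only on the left, id 999 only on the right.
val students1 = Seq(
  ("1", "Burke"), ("10", "Maya"), ("11", "iteblog")
).toDF("id", "studentName")
val students2 = Seq(
  ("1", "BurkeDifferentName"), ("10", "Maya"), ("999", "LevUniqueToSecondRDD")
).toDF("id", "studentName")

val onId = students1("id") === students2("id")

// Inner join: only ids present on both sides.
val inner = students1.join(students2, onId)

// Outer joins: unmatched rows of one side are kept, the other side is null.
val leftOuter = students1.join(students2, onId, "left_outer")
val rightOuter = students1.join(students2, onId, "right_outer")

// Both inputs have a studentName column; referring to each column through its
// source DataFrame and renaming it makes the result unambiguous.
val sideBySide = inner.select(
  students1("id"),
  students1("studentName").as("name1"),
  students2("studentName").as("name2"))

sideBySide.show(false)
```

Filtering on students2("id").isNull in the left outer result, or on students1("id").isNull in the right outer result, isolates the rows that found no match.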

Save the DataFrame as a file

Now let's look at how to save a DataFrame to a file. We used the load function to load the CSV file; to save one, we can use the save function. This takes the following two steps:
1. First create a Map object to hold the options that the save function needs. Here I specify the path to save to and whether to write a CSV header.
val saveOptions = Map("header" -> "true", "path" -> "iteblog.csv")
For practice, we select the studentName and email columns from the DataFrame and rename the studentName column to name.
val copyOfStudents = students.select(students("studentName").as("name"), students("email"))
2. Call the save function as follows to save the DataFrame above to the iteblog.csv folder:
import org.apache.spark.sql.SaveMode
copyOfStudents.write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).options(saveOptions).save()
The mode function accepts Overwrite, Append, Ignore, and ErrorIfExists. The names are largely self-explanatory: Overwrite replaces any data that already exists in the directory; Append adds the data to the specified directory; Ignore does nothing if the directory already contains data; ErrorIfExists throws an exception if the target directory already contains data.
Note that the path option specified above names the folder to save into, not the final output file.
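On Spark 2.x and later, CSV support is built in, so the same save can be written without the com.databricks.spark.csv format name. The following is a minimal sketch under that assumption. It writes to a temporary directory (a stand-in for iteblog.csv), lists the part files to show that the path is a folder, and reads the data back.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumption: Spark 2.x+ running locally; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("save-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the students data shown in this article.
val students = Seq(
  ("1", "Burke", "1-300-746-8446", "ullamcorper.velit.in@ametnullaDonec.co.uk"),
  ("2", "Kamal", "1-668-571-5046", "pede.Suspendisse@interdumenim.edu"),
  ("3", "Olga", "1-956-311-1686", "Aenean.eget.metus@dictumcursusNunc.edu")
).toDF("id", "studentName", "phone", "email")

val copyOfStudents =
  students.select(students("studentName").as("name"), students("email"))

// Hypothetical output location inside a fresh temporary directory.
val outDir: String =
  Files.createTempDirectory("students-csv").resolve("out").toString

// Built-in CSV writer; Overwrite replaces the folder if it already exists.
copyOfStudents.write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .csv(outDir)

// The path is a folder holding one or more part files, not a single file.
val partFiles: Array[String] =
  new File(outDir).listFiles().map(_.getName).filter(_.startsWith("part-"))

// Reading the folder back restores the saved columns.
val readBack = spark.read.option("header", "true").csv(outDir)

partFiles.foreach(println)
```

The number of part files follows the number of partitions of the DataFrame being written, so it can differ from run to run and from machine to machine.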
