package sql
Type Members
-
class
Column extends Logging
A column that will be computed based on the data in a DataFrame.
A new column can be constructed based on the input columns present in a DataFrame:
df("columnName")            // On a specific `df` DataFrame.
col("columnName")           // A generic column not yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field.
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala short hand for a named column.
Column objects can be composed to form complex expressions:
$"a" + 1
- Since
3.4.0
-
class
ColumnName extends Column
A convenient class used for constructing schema.
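For example, the $ interpolator from spark.implicits returns a ColumnName, whose typed accessors each produce a StructField (a minimal sketch; the field names are illustrative):
import org.apache.spark.sql.types.StructType
// Requires `import spark.implicits._` for the $ interpolator.
val schema = StructType(Seq($"name".string, $"age".int, $"salary".double))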
- Since
3.4.0
-
trait
CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]]
Trait to restrict calls to create and replace operations.
- Since
3.4.0
- type DataFrame = Dataset[Row]
-
final
class
DataFrameNaFunctions extends AnyRef
Functionality for working with missing data in DataFrames.
- Since
3.4.0
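A minimal sketch of DataFrameNaFunctions, reached through Dataset.na (column names and fill values are illustrative):
// Assumes an existing DataFrame `df`.
// Drop rows with null or NaN values in the listed columns, then fill remaining nulls per column.
val cleaned = df.na.drop(Seq("age", "salary"))
  .na.fill(Map("city" -> "unknown", "salary" -> 0.0))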
-
class
DataFrameReader extends Logging
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this.
- Annotations
- @Stable()
- Since
3.4.0
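A minimal sketch of DataFrameReader usage via SparkSession.read (the path and options are illustrative):
// Assumes an existing SparkSession `spark`.
val users = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/users.csv")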
-
final
class
DataFrameStatFunctions extends AnyRef
Statistic functions for DataFrames.
- Since
3.4.0
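A minimal sketch of DataFrameStatFunctions, reached through Dataset.stat (column names are illustrative):
// Assumes an existing DataFrame `df` with numeric columns.
val corr = df.stat.corr("height", "weight")                            // Pearson correlation
val quantiles = df.stat.approxQuantile("age", Array(0.5, 0.9), 0.01)   // median and 90th percentile, 1% relative error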
-
final
class
DataFrameWriter[T] extends AnyRef
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
- Annotations
- @Stable()
- Since
3.4.0
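A minimal sketch of DataFrameWriter usage via Dataset.write (the output path and partition column are illustrative):
import org.apache.spark.sql.SaveMode
// Assumes an existing DataFrame `df` with a `year` column.
df.write
  .mode(SaveMode.Overwrite)   // replace any existing data at the target
  .partitionBy("year")        // one directory per distinct `year` value
  .parquet("/path/to/output")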
-
final
class
DataFrameWriterV2[T] extends CreateTableWriter[T]
Interface used to write a org.apache.spark.sql.Dataset to external storage using the v2 API.
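A minimal sketch of the v2 writer, reached through Dataset.writeTo (the table name and partition column are illustrative and must be resolvable in your catalog):
import org.apache.spark.sql.functions.col
// Assumes an existing DataFrame `df`.
df.writeTo("catalog.db.events")
  .using("parquet")
  .partitionedBy(col("event_date"))
  .createOrReplace()   // a CreateTableWriter operation; append() or overwritePartitions() insert instead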
- Annotations
- @Experimental()
- Since
3.4.0
-
class
Dataset[T] extends Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING); // in Java
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
val ageCol = people("age")  // in Scala
Column ageCol = people.col("age"); // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
- Since
3.4.0
-
case class
DatasetHolder[T] extends Product with Serializable
A container for a Dataset, used for implicit conversions in Scala.
To use this, import implicit conversions in SQL:
val spark: SparkSession = ...
import spark.implicits._
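With those implicits in scope, local collections gain toDS() and toDF() through DatasetHolder (a minimal sketch; the values are illustrative):
val ds = Seq(1, 2, 3).toDS()                                  // Dataset[Int]
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")  // DataFrame with named columns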
- Since
3.4.0
-
trait
LowPrioritySQLImplicits extends AnyRef
Lower priority implicit methods for converting Scala objects into Datasets. Conflicting implicits are placed here to disambiguate resolution.
Reasons for including specific implicits: newProductEncoder - to disambiguate for Lists which are both Seq and Product
class
RelationalGroupedDataset extends AnyRef
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
- Since
3.4.0
- Note
This class was named GroupedData in Spark 1.x.
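A minimal sketch of a grouped aggregation (column names are illustrative; groupBy returns a RelationalGroupedDataset, and agg returns a DataFrame):
import org.apache.spark.sql.functions.{avg, max}
// Assumes a DataFrame `df` with columns department, salary and age.
df.groupBy("department")
  .agg(avg("salary"), max("age"))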
-
class
RuntimeConfig extends Logging
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
- Since
3.4.0
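A minimal sketch of RuntimeConfig via SparkSession.conf (the key is a standard Spark SQL setting; the value is illustrative):
// Assumes an existing SparkSession `spark`.
spark.conf.set("spark.sql.shuffle.partitions", "64")
val partitions = spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.isModifiable("spark.sql.shuffle.partitions")  // true for keys that can be changed at runtime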
-
abstract
class
SQLImplicits extends LowPrioritySQLImplicits
A collection of implicit methods for converting names and Symbols into Columns, and for converting common Scala objects into Datasets.
-
sealed abstract final
class
SaveMode extends Enum[SaveMode]
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
- Annotations
- @Stable()
- Since
3.4.0
-
class
SparkSession extends Serializable with Closeable with Logging
The entry point to programming Spark with the Dataset and DataFrame API.
In environments where a session has already been created up front (e.g. REPL, notebooks), use the builder to get the existing session:
SparkSession.builder().getOrCreate()
The builder can also be used to create a new session:
SparkSession.builder
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
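Once obtained, the session is the entry point for creating Datasets and running SQL. A minimal sketch (the data, view name and query are illustrative):
val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()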
-
class
TypedColumn[-T, U] extends Column
A Column where an Encoder has been given for the expected input and return type. To create a TypedColumn, use the as function on a Column.
- T
The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
- U
The output type of this column.
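For example, calling as[...] with an implicit encoder turns an untyped aggregate Column into a TypedColumn, so select can return a typed result (a sketch; the Person type and data are illustrative):
import org.apache.spark.sql.functions.sum
// Requires `import spark.implicits._` for toDS(), the $ syntax and the Long encoder.
case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
val totalAge: Long = people.select(sum($"age").as[Long]).head()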
- Since
3.4.0
-
trait
WriteConfigMethods[R] extends AnyRef
Configuration methods common to create/replace operations and insert/overwrite operations.
- R
builder type to return
- Since
3.4.0
Value Members
- object SparkSession extends Logging with Serializable
-
object
functions
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the functions.expr() API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation of your Spark version; see also the latest list at https://spark.apache.org/docs/latest/api/sql/index.html.
As an example, isnan is a function that is defined here. You can use isnan(col("myCol")) to invoke the isnan function. This way the programming language's compiler ensures isnan exists and is of the proper form. You can also use expr("isnan(myCol)") to invoke the same function. In this case, Spark itself will ensure isnan exists when it analyzes the query. regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
These function APIs usually have methods with a Column signature only, because the Column variant can also support other types such as a native string. The other variants currently exist for historical reasons.
- Since
3.4.0
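A minimal sketch contrasting the two styles described above (the column name is illustrative):
import org.apache.spark.sql.functions.{col, expr, isnan}
// Assumes a DataFrame `df` with a numeric column myCol.
df.select(isnan(col("myCol")))            // checked at compile time: isnan must exist and take a Column
df.select(expr("isnan(myCol)"))           // checked when Spark analyzes the query
df.select(expr("regr_count(yCol, xCol)")) // built-in but not exposed here, so reached through expr()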