[SPARK-4131] Support "Writing data into the filesystem from queries" #18975
```scala
@@ -17,6 +17,8 @@
package org.apache.spark.sql.execution.command

import org.apache.hadoop.fs.FileSystem

import org.apache.spark.SparkException
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.catalog._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
```

```scala
@@ -52,12 +54,19 @@ case class InsertIntoDataSourceDirCommand(
    // Create the relation based on the input logical plan: `query`.
    val pathOption = storage.locationUri.map("path" -> CatalogUtils.URIToString(_))

    val dataSource = DataSource(
```
Member: This test case only covers …

Contributor (author): @gatorsmile I am not familiar with data sources. Could you give me some hints on how to limit this to only `FileFormat`?
```scala
      sparkSession,
      className = provider,
      options = storage.properties ++ pathOption,
      catalogTable = None)
```
Member: You can add an extra check here. If … A simple example:

```scala
sql(
  s"""
     |INSERT OVERWRITE DIRECTORY '${path}'
     |USING JDBC
     |OPTIONS (uRl '$url1', DbTaBlE 'TEST.PEOPLE1', User 'testUser', PassWord 'testPass')
     |SELECT 1, 2
   """.stripMargin)
```

Currently, the above query can pass. We should get an exception instead.

Contributor (author): updated.
```scala
    val isFileFormat = classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
    if (!isFileFormat) {
      throw new SparkException(
        "Only Data Sources providing FileFormat are supported.")
    }

    val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
    try {
      sparkSession.sessionState.executePlan(dataSource.planForWriting(saveMode, query))
```
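A hedged sketch of how this new guard could be exercised in a test (not part of this diff; the suite mixins, JDBC URL, and table name are assumptions): a provider that does not implement `FileFormat`, such as JDBC, should now fail with the `SparkException` above instead of being planned for a directory write.

```scala
// Hypothetical test sketch, assuming a suite that mixes in SQLTestUtils /
// SharedSQLContext so that `sql`, `withTempDir`, and ScalaTest's `intercept`
// are in scope. The H2 URL and table name below are placeholders.
test("INSERT OVERWRITE DIRECTORY should reject non-FileFormat data sources") {
  withTempDir { dir =>
    val e = intercept[SparkException] {
      sql(
        s"""
           |INSERT OVERWRITE DIRECTORY '${dir.toURI}'
           |USING JDBC
           |OPTIONS (url 'jdbc:h2:mem:testdb', dbtable 'TEST.PEOPLE')
           |SELECT 1, 2
         """.stripMargin)
    }
    assert(e.getMessage.contains("Only Data Sources providing FileFormat are supported"))
  }
}
```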
```scala
@@ -20,6 +20,8 @@ package org.apache.spark.sql.execution.datasources
import java.util.Locale
import java.util.concurrent.Callable

import org.apache.hadoop.fs.Path

import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
```

```scala
@@ -141,8 +143,12 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
        parts, query, overwrite, false) if parts.isEmpty =>
      InsertIntoDataSourceCommand(l, query, overwrite)

    case InsertIntoDir(_, storage, provider, query, overwrite)
    case InsertIntoDir(isLocal, storage, provider, query, overwrite)
        if provider.isDefined && provider.get.toLowerCase(Locale.ROOT) != DDLUtils.HIVE_PROVIDER =>
      val outputPath = new Path(storage.locationUri.get)
      if (overwrite) DDLUtils.verifyNotReadPath(query, outputPath)

      InsertIntoDataSourceDirCommand(storage, provider.get, query, overwrite)
```
Member: We need to block both …

Contributor (author): updated.
```scala
      case i @ InsertIntoTable(
```

```scala
@@ -181,15 +187,9 @@ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with Cast
      }

      val outputPath = t.location.rootPaths.head
      val inputPaths = actualQuery.collect {
        case LogicalRelation(r: HadoopFsRelation, _, _, _) => r.location.rootPaths
      }.flatten
      if (overwrite) DDLUtils.verifyNotReadPath(actualQuery, outputPath)

      val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
      if (overwrite && inputPaths.contains(outputPath)) {
        throw new AnalysisException(
          "Cannot overwrite a path that is also being read from.")
      }

      val partitionSchema = actualQuery.resolve(
        t.partitionSchema, t.sparkSession.sessionState.analyzer.resolver)
```
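This hunk replaces the inline "output path is also an input path" check with a shared `DDLUtils.verifyNotReadPath` helper, so the directory-insert case above can reuse the same guard. A hedged sketch of what such a helper plausibly does, reconstructed from the inline check it replaces (the actual signature and body in `DDLUtils` may differ):

```scala
// Sketch only: inferred from the removed inline check above, not necessarily
// the exact implementation of DDLUtils.verifyNotReadPath.
def verifyNotReadPath(query: LogicalPlan, outputPath: Path): Unit = {
  // Collect the root paths of every file-based relation read by the query.
  val inputPaths = query.collect {
    case LogicalRelation(r: HadoopFsRelation, _, _, _) => r.location.rootPaths
  }.flatten
  // Refuse to overwrite a path the query is also reading from.
  if (inputPaths.contains(outputPath)) {
    throw new AnalysisException(
      "Cannot overwrite a path that is also being read from.")
  }
}
```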
```scala
@@ -20,18 +20,19 @@ package org.apache.spark.sql.hive.execution
import java.util.Properties

import scala.language.existentials

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.common.FileUtils
import org.apache.hadoop.hive.ql.plan.TableDesc
import org.apache.hadoop.hive.serde.serdeConstants
import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
import org.apache.hadoop.mapred._

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.client.HiveClientImpl
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.Utils

/**
```

```scala
@@ -63,28 +64,40 @@ case class InsertIntoHiveDirCommand(
    assert(children.length == 1)
    assert(storage.locationUri.nonEmpty)

    val Array(cols, types) = children.head.output.foldLeft(Array("", "")) { case (r, a) =>
      r(0) = r(0) + a.name + ","
      r(1) = r(1) + a.dataType.catalogString + ":"
      r
    }

    val properties = new Properties()
    properties.put("columns", cols.dropRight(1))
    properties.put("columns.types", types.dropRight(1))

    val sqlContext = sparkSession.sqlContext

    properties.put(serdeConstants.SERIALIZATION_LIB,
//    val Array(cols, types) = children.head.output.foldLeft(Array("", "")) { case (r, a) =>
//      r(0) = r(0) + a.name + ","
//      r(1) = r(1) + a.dataType.catalogString + ":"
//      r
//    }
//
//    val properties = new Properties()
//    properties.put("columns", cols.dropRight(1))
//    properties.put("columns.types", types.dropRight(1))
//    properties.put(serdeConstants.SERIALIZATION_LIB,
//      storage.serde.getOrElse(classOf[LazySimpleSerDe].getName))
//
//    import scala.collection.JavaConverters._
//    properties.putAll(storage.properties.asJava)
//
//    val tableDesc = new TableDesc(
//      Utils.classForName(storage.inputFormat.get).asInstanceOf[Class[_ <: InputFormat[_, _]]],
//      Utils.classForName(storage.outputFormat.get),
//      properties
//    )

    val hiveTable = HiveClientImpl.toHiveTable(CatalogTable(
      identifier = TableIdentifier(storage.locationUri.get.toString, Some("default")),
      tableType = org.apache.spark.sql.catalyst.catalog.CatalogTableType.VIEW,
      storage = storage,
      schema = query.schema
    ))
    hiveTable.getMetadata.put(serdeConstants.SERIALIZATION_LIB,
      storage.serde.getOrElse(classOf[LazySimpleSerDe].getName))

    import scala.collection.JavaConverters._
    properties.putAll(storage.properties.asJava)

    var tableDesc = new TableDesc(
      Utils.classForName(storage.inputFormat.get).asInstanceOf[Class[_ <: InputFormat[_, _]]],
      Utils.classForName(storage.outputFormat.get),
      properties
    val tableDesc = new TableDesc(
      hiveTable.getInputFormatClass,
      hiveTable.getOutputFormatClass,
      hiveTable.getMetadata
    )
```
Member: I am not 100% sure the above logic works well for all Hive versions and all file formats. Another, safer way is to use our existing path:

```scala
val hiveTable = HiveClientImpl.toHiveTable(dummyCatalogTableWithUserSpecifiedStorage)
val tableDesc = new TableDesc(
  hiveTable.getInputFormatClass,
  hiveTable.getOutputFormatClass,
  hiveTable.getMetadata
)
```

Member: If we use the schema of the query as the `dummyTableSchema`, do we still need to populate the properties ourselves?

Contributor (author): If we use `val hiveTable = HiveClientImpl.toHiveTable(dummyCatalogTableWithUserSpecifiedStorage)`, it requires a table name and a database name, but in our case we don't have them. I commented out my original code and implemented the dummy Hive table instead. Let me know if that's OK.
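Putting the added lines of the hunk above together, the net new code path looks roughly like this (a consolidated sketch assembled from the diff, not a verbatim excerpt of the final file; the `"default"` database and the location-URI-as-table-name are the placeholder identifier discussed above):

```scala
// Consolidated sketch of the approach shown in the hunk above: build a dummy
// CatalogTable around the user-specified storage and the query schema,
// convert it with HiveClientImpl.toHiveTable, and derive the TableDesc from
// the resulting Hive table instead of hand-building Properties.
val hiveTable = HiveClientImpl.toHiveTable(CatalogTable(
  identifier = TableIdentifier(storage.locationUri.get.toString, Some("default")),
  tableType = CatalogTableType.VIEW,
  storage = storage,
  schema = query.schema))
hiveTable.getMetadata.put(serdeConstants.SERIALIZATION_LIB,
  storage.serde.getOrElse(classOf[LazySimpleSerDe].getName))

val tableDesc = new TableDesc(
  hiveTable.getInputFormatClass,
  hiveTable.getOutputFormatClass,
  hiveTable.getMetadata)
```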
```scala
    val hadoopConf = sparkSession.sessionState.newHadoopConf()
```
Comment: If we don't support LOCAL for data sources, should we remove it from the parsing rule?

Reply: Originally, LOCAL was not added. @gatorsmile commented that the parser might hit some weird exception without it, so he requested adding it.