@huaxingao
Contributor

@huaxingao huaxingao commented Aug 21, 2019

What changes were proposed in this pull request?

Document LOAD DATA statement in SQL Reference

Why are the changes needed?

To complete the SQL Reference

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Tested using jekyll build --serve

Here are the screenshots:

image

image

image

@SparkQA

SparkQA commented Aug 21, 2019

Test build #109464 has finished for PR 25522 at commit efea0b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

@huaxingao Could you please attach a screenshot of the page?


**This page is under construction**
### Description
The LOAD DATA statement can be used to load data from a file into a table or a partition in the table. The target table must not be temporary. A partition spec must be provided if and only if the target table is partitioned. The LOAD DATA statement is only supported for tables created using the Hive format.
Member

Back-tick the reserved words for clarity.
Nit: I'd write "..., or into a partition ..." to make sure it doesn't suggest it's the alternative to "from a file"

One or more partition column name and value pairs.

##### ***LOCAL***:
If specified, local file system is used. Otherwise, the default file system is used.
Member

Maybe say that it causes the INPATH to be resolved against the local file system, instead of the default file system, which is typically distributed storage.
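
To make the `LOCAL` distinction concrete, here is a minimal Python sketch that assembles the statement text with and without the keyword. The helper name `build_load_data` and the example paths, tables, and partition column are hypothetical, and quoting partition values as strings is an assumption. The sketch only builds a string; it does not run anything against Spark and does no escaping.

```python
from typing import Mapping, Optional


def build_load_data(path: str, table: str, local: bool = False,
                    partition: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a LOAD DATA statement as text (hypothetical helper)."""
    parts = ["LOAD DATA"]
    if local:
        # LOCAL: INPATH is resolved against the local file system
        # instead of the default (typically distributed) file system.
        parts.append("LOCAL")
    parts.append(f"INPATH '{path}' INTO TABLE {table}")
    if partition:
        # Assumption: partition values are rendered as quoted strings.
        spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
        parts.append(f"PARTITION ({spec})")
    return " ".join(parts)


# Hypothetical paths and table names, for illustration only.
print(build_load_data("/data/students.csv", "students", local=True))
print(build_load_data("/data/2019", "events", partition={"year": "2019"}))
```

The first call yields a statement whose path is read from the local file system; the second leaves resolution to the default file system and targets a single partition.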

@SparkQA

SparkQA commented Aug 21, 2019

Test build #109520 has finished for PR 25522 at commit 19deb6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2019

Test build #109522 has finished for PR 25522 at commit c680337.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

updated

@SparkQA

SparkQA commented Aug 21, 2019

Test build #109535 has finished for PR 25522 at commit 08ba5f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2019

Test build #109932 has finished for PR 25522 at commit 8b23eae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

@dilipbiswal @srowen @gatorsmile
Could you please review? Thanks!


**This page is under construction**
### Description
`LOAD DATA` loads data from a directory or a file into a table or into a partition in the table. A partition spec should be specified whenever the target table is partitioned. The `LOAD DATA` statement can only be used with tables created using the Hive format.
Contributor

@huaxingao What if we say it this way? Please feel free to change it as you see fit.

LOAD DATA statement loads the data into a table from the user-specified directory or file. If a directory is specified, then all the files from the directory are loaded. If a file is specified, then only that single file is loaded. Additionally, the LOAD DATA statement takes an optional partition specification. When a partition spec is specified, the data files (when the input source is a directory) or the single file (when the input source is a file) are loaded into that partition of the target table.
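
The directory-versus-file rule in this wording can be sketched in a few lines of Python. This only illustrates the described behaviour on a local file system and is not Spark's implementation; the function name `files_to_load` is hypothetical, and treating a directory as non-recursive (only files directly inside it) is an assumption.

```python
import os
import tempfile
from typing import List


def files_to_load(inpath: str) -> List[str]:
    """Return the files a load would pick up for `inpath` (illustrative only)."""
    if os.path.isdir(inpath):
        # Directory input: every regular file directly inside it is loaded.
        # Assumption: subdirectories are not descended into.
        return sorted(
            os.path.join(inpath, name)
            for name in os.listdir(inpath)
            if os.path.isfile(os.path.join(inpath, name))
        )
    # File input: only that single file is loaded.
    return [inpath]


with tempfile.TemporaryDirectory() as src:
    for name in ("part-0.txt", "part-1.txt"):
        with open(os.path.join(src, name), "w") as f:
            f.write("1,a\n")
    print(len(files_to_load(src)))                              # directory input
    print(len(files_to_load(os.path.join(src, "part-0.txt"))))  # single-file input
```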

### Parameters
<dl>
<dt><code><em>path</em></code></dt>
<dd>Path of the file system.</dd>
Contributor

@huaxingao I was looking at the Hive documentation. It mentions that the load deletes the data from the source directory, i.e. it performs a move operation. I think this is important to state if that is the behaviour.

Contributor

@huaxingao It also seems that this path can be either an absolute path or a relative path. Do we need to mention that here?

Contributor Author

@dilipbiswal I checked; the data in the source directory is actually not deleted.

@SparkQA

SparkQA commented Sep 1, 2019

Test build #109993 has finished for PR 25522 at commit b29223f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AbhishekNew

`LOAD DATA` should also document a limitation. When the user runs the following:
load data local inpath '/opt/abhi/test/_typeddate.txt' into table wild1;
the command succeeds, but no data is loaded into the table, because Hadoop treats the file as a hidden file.
The behavior is the same whether loading from the local file system or from HDFS.
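
A quick way to check a path before loading is to mirror the naming rule used by Hadoop's default input-path filter, which skips file names that start with `_` or `.`. The Python sketch below assumes that filter is what causes the file in this report to be skipped; the function name `is_hidden_for_hadoop` and the second, renamed path are hypothetical.

```python
import posixpath


def is_hidden_for_hadoop(path: str) -> bool:
    """True if the file name starts with '_' or '.', the prefixes that
    Hadoop's default input-path filter skips."""
    name = posixpath.basename(path)
    return name.startswith(("_", "."))


# Path from the report above: the leading underscore makes it hidden.
print(is_hidden_for_hadoop("/opt/abhi/test/_typeddate.txt"))
# Hypothetical renamed copy without the underscore: not hidden.
print(is_hidden_for_hadoop("/opt/abhi/test/typeddate.txt"))
```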

Member

@srowen srowen left a comment

@dilipbiswal Any final comments, based on your other reviews? This is looking OK.

@dilipbiswal
Contributor

@srowen Actually, this PR may need a minor clarification. I have already discussed it with Huaxin. We just need to clarify the "move" vs "copy" behavior.

Member

@srowen srowen left a comment

What's the move vs copy issue?


**This page is under construction**
### Description
`LOAD DATA` statement loads the data into a table from the user specified directory or file. If a directory is specified then all the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the `LOAD DATA` statement takes an optional partition specification. When a partition is specified, the data files ( when input source is a directory ) or the single file ( when input source is a file ) are loaded into the partition of the target table.
Member

Nit: remove spaces inside parentheses

@dilipbiswal
Contributor

@srowen

> What's the move vs copy issue?

If we look at https://cwiki.apache.org/confluence/display/Hive/GettingStarted and search for the "LOAD DATA" command, we see the following comments under NOTES:
NO verification of data against the schema is performed by the load command.

  1. If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
  2. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

The question I had was: should we document our exact behaviour, i.e. do we move the data from the original location to the target location, or do we copy it? Can we move ahead with this PR as is and clarify it in a follow-up?
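
For readers unfamiliar with the distinction, the difference a user would observe between the two semantics can be sketched with plain local files in Python. This says nothing about which one Spark actually performs; the function names `load_by_copy` and `load_by_move` and the `warehouse` directory are hypothetical stand-ins.

```python
import os
import shutil
import tempfile


def load_by_copy(src: str, dest_dir: str) -> str:
    """Copy semantics: the source file is still present afterwards."""
    return shutil.copy(src, dest_dir)


def load_by_move(src: str, dest_dir: str) -> str:
    """Move semantics: the source file is gone afterwards."""
    return shutil.move(src, dest_dir)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for the table's storage location.
    warehouse = os.path.join(root, "warehouse")
    os.mkdir(warehouse)
    for loader in (load_by_copy, load_by_move):
        src = os.path.join(root, f"{loader.__name__}.txt")
        with open(src, "w") as f:
            f.write("1,a\n")
        loader(src, warehouse)
        print(loader.__name__, "source still exists:", os.path.exists(src))
```

Checking whether the source path still exists after a load, as the last line does, is the kind of observation that would settle the question for a given file system.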

@srowen
Member

srowen commented Oct 21, 2019

Looks OK as-is then; I have one minor comment above otherwise.

@dilipbiswal
Contributor

dilipbiswal commented Oct 21, 2019

@srowen Thanks a lot.
cc @huaxingao

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112410 has finished for PR 25522 at commit d7fec55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 22, 2019

Merged to master

@srowen srowen closed this in 8779938 Oct 22, 2019
@huaxingao
Contributor Author

Thanks a lot! @srowen @dilipbiswal

@huaxingao huaxingao deleted the spark-28787 branch October 22, 2019 14:49