Question about CarbonDataFrameWriter


xuchuanyin
Hi, community:

When I go through the DataFrame.write related code in CarbonData, I find there is an option that controls whether to save the dataframe's data to a temporary directory as CSV on disk.

My question is: why do we need this procedure, which consumes extra disk IO, and why does the option (tempCSV) default to true?

Related code for reference:

https://github.com/apache/carbondata/blob/master/integration/spark2/src/main/scala/org/apache/spark/sql/CarbonDataFrameWriter.scala#L45

https://github.com/apache/carbondata/blob/master/integration/spark-common/src/main/scala/org/apache/carbondata/spark/CarbonOption.scala#L43
Re: Question about CarbonDataFrameWriter

Jacky Li
Hi,

When writing a dataframe to a carbon table, if computing the dataframe is costly, it is better to materialize it by saving it to temporary CSV files and then loading those into the carbon table. If computing the dataframe is cheap, for example when the dataframe is just the scan result of a Hive table, the user can set the tempCSV option to false, and carbon will load it directly.
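For a cheap-to-compute dataframe, skipping the temporary CSV step would look roughly like the sketch below. Only the tempCSV option name is confirmed by the linked CarbonOption code; the "carbondata" format string, the tableName option, and the table names are assumptions for illustration:

```scala
// Sketch, not a definitive API reference: the "tableName" option and the
// "carbondata" format string are assumptions about the Spark integration.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("carbon-write").getOrCreate()

// A cheap-to-compute dataframe, e.g. a plain scan of an existing Hive table
val df = spark.sql("SELECT * FROM some_hive_table")

df.write
  .format("carbondata")                          // CarbonData's Spark datasource
  .option("tableName", "target_carbon_table")    // hypothetical target table
  .option("tempCSV", "false")                    // skip the temporary CSV files
  .mode(SaveMode.Append)
  .save()
```

With tempCSV left at its default of true, the same write would first spill the dataframe to CSV on disk before loading, which pays extra IO but avoids recomputing an expensive dataframe during the load.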

Regards,
Jacky


> On Oct 17, 2017, at 11:17 PM, 徐传印 <[hidden email]> wrote:
>
> Hi, community:
> [...]