Question about CarbonDataFrameWriter


xuchuanyin
Hi, community:

When I go through the DataFrame.write related code in CarbonData, I find there is an option that controls whether to save the dataframe's data to a temporary directory as CSV on disk.

My question is: why do we need this procedure, which consumes extra disk IO, and why does the option (tempCSV) default to true?

Related code for reference:

https://github.com/apache/carbondata/blob/master/integration/spark2/src/main/scala/org/apache/spark/sql/CarbonDataFrameWriter.scala#L45

https://github.com/apache/carbondata/blob/master/integration/spark-common/src/main/scala/org/apache/carbondata/spark/CarbonOption.scala#L43
Re: Question about CarbonDataFrameWriter

Jacky Li
Hi,

When writing a dataframe to a carbon table, if computing the dataframe is costly, it is better to materialize it by saving it to temporary CSV files and then loading those into the carbon table. If computing the dataframe is cheap, for example when the dataframe is just the scan result of a Hive table, the user can set the tempCSV option to false, and carbon will load it directly.
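For a cheap-to-compute dataframe, skipping the temporary CSV step would look roughly like the sketch below. Only the tempCSV option name is confirmed by the linked CarbonOption code; the "carbondata" format string, the tableName option, and the table names are assumptions for illustration:

```scala
// Sketch, not a definitive API reference: the "tableName" option and the
// "carbondata" format string are assumptions about the Spark integration.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("carbon-write").getOrCreate()

// A cheap-to-compute dataframe, e.g. a plain scan of an existing Hive table
val df = spark.sql("SELECT * FROM some_hive_table")

df.write
  .format("carbondata")                          // CarbonData's Spark datasource
  .option("tableName", "target_carbon_table")    // hypothetical target table
  .option("tempCSV", "false")                    // skip the temporary CSV files
  .mode(SaveMode.Append)
  .save()
```

With tempCSV left at its default of true, the same write would first spill the dataframe to CSV on disk before loading, which pays extra IO but avoids recomputing an expensive dataframe during the load.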

Regards,
Jacky


> On Oct 17, 2017, at 11:17 PM, 徐传印 <[hidden email]> wrote:
>
> Hi, community:
> [...]