Storing Data Frame as CarbonData Table

Storing Data Frame as CarbonData Table

Michael Shtelma
Hi Team,

I am new to CarbonData and wanted to test it using a couple of my test queries.
In my tests I have used CarbonData 1.3.1 and Spark 2.2.1.

I have tried saving my data frame as a CarbonData table using the following command:

myDF.write.format("carbondata").mode("overwrite").saveAsTable("MyTable")

As a result I got the following exception:

java.lang.IllegalArgumentException: requirement failed: 'path' should not be specified, the path to store carbon file is the 'storePath' specified when creating CarbonContext
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:90)
  at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:449)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:419)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:398)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  ... 54 elided

I am now wondering whether there is a way to save any Spark data frame as a Hive table backed by the CarbonData format.
Am I doing something wrong?

Best,
Michael

Re: Storing Data Frame as CarbonData Table

Liang Chen
Administrator
Hi Michael

Yes, it is very easy to save any Spark data frame to CarbonData.
You just need to make a small change to your script, as below:
import org.apache.spark.sql.SaveMode

myDF.write
  .format("carbondata")
  .option("tableName", "MyTable")
  .mode(SaveMode.Overwrite)
  .save()
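
Assuming the write goes through a carbon-enabled SparkSession (e.g. one built via CarbonSession, as in the example linked below), the table is registered in the metastore and can be queried back with plain Spark SQL. A minimal sketch, not taken from the example:

// Read the table written above back via Spark SQL.
// Assumes `spark` is the same carbon-enabled session used for the write.
val readBack = spark.sql("SELECT * FROM MyTable")
readBack.show(10)
println(s"row count: ${readBack.count()}")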

For more details, you can refer to this example:
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/CarbonDataFrameExample.scala


HTH.

Regards
Liang



Re: Storing Data Frame as CarbonData Table

Michael Shtelma
Hi Liang,

Many thanks for your answer!
It worked this way.
I am now wondering how I should configure Carbon to get performance comparable with Parquet.
Right now I am using the default properties, actually no properties at all.
I tried saving one table to Carbon, and it took ages compared to Parquet.
Should I configure the number of writer threads somewhere, or something like that?
I started the Spark shell with the local[*] option, so I hoped that the write process would use all available cores, but this was not the case.
It looks like only one or two cores are actively used.

Another question: where can I place carbon.properties? If I place it in the same folder as spark-defaults.conf, will Carbon pick it up automatically?

Best,
Michael



Re: Storing Data Frame as CarbonData Table

mohdshahidkhan
Hi Michael,
I hope the details below will help you.

1. How should I configure Carbon to get good performance?
Please refer to the link below to optimize data loading performance in Carbon:
https://github.com/apache/carbondata/blob/master/docs/useful-tips-on-carbondata.md#configuration-for-optimizing-data-loading-performance-for-massive-data
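
Properties commonly used for data-loading tuning (check the guide above for the authoritative list) include carbon.number.of.cores.while.loading, carbon.sort.size and enable.unsafe.sort. A hypothetical carbon.properties sketch, just to show the shape; the values are illustrative, not recommendations:

# carbon.properties (illustrative values only -- tune per the guide above)
carbon.number.of.cores.while.loading=8
carbon.sort.size=100000
enable.unsafe.sort=true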


2. How to configure carbon.properties?


Property: spark.driver.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.

Property: spark.executor.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space.
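
Concretely, a hedged sketch of what this could look like in spark-defaults.conf (the path is illustrative; use the actual location of your carbon.properties):

# spark-defaults.conf -- point both driver and executors at carbon.properties
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties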





Re: Storing Data Frame as CarbonData Table

mohdshahidkhan
Hi Michael,
I hope the details below will help you.

1. How should I configure Carbon to get good performance?
Please refer to the link below to optimize data loading performance in Carbon:
https://github.com/apache/carbondata/blob/master/docs/useful-tips-on-carbondata.md#configuration-for-optimizing-data-loading-performance-for-massive-data


2. How to configure carbon.properties?


Property: spark.driver.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.

Property: spark.executor.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space.

For more details, you can refer to the link below:
https://github.com/apache/carbondata/blob/master/docs/installation-guide.md#installing-and-configuring-carbondata-on-standalone-spark-cluster
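
If you launch jobs with spark-submit or spark-shell rather than editing spark-defaults.conf, the same settings can be passed on the command line. A hedged sketch (the application class, jar and path are hypothetical placeholders):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties" \
  --conf "spark.executor.extraJavaOptions=-Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties" \
  --class com.example.MyCarbonApp \
  my-carbon-app.jar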

