Storing Data Frame as CarbonData Table

Storing Data Frame as CarbonData Table

Michael Shtelma
Hi Team,

I am new to CarbonData and wanted to test it using a couple of my test queries.
In my tests I have used CarbonData 1.3.1 and Spark 2.2.1.

I have tried saving my data frame as a CarbonData table using the following command:

myDF.write.format("carbondata").mode("overwrite").saveAsTable("MyTable")

As a result I got the following exception:

java.lang.IllegalArgumentException: requirement failed: 'path' should not be specified, the path to store carbon file is the 'storePath' specified when creating CarbonContext
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:90)
  at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:449)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:419)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:398)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  ... 54 elided

I am now wondering whether there is a way to save any Spark data frame as a Hive table backed by the CarbonData format.
Am I doing something wrong?

Best,
Michael

Re: Storing Data Frame as CarbonData Table

Liang Chen
Administrator
Hi Michael

Yes, it is very easy to save any Spark data frame to CarbonData.
You just need to make a small change to your script, as below:
import org.apache.spark.sql.SaveMode

myDF.write
  .format("carbondata")
  .option("tableName", "MyTable")
  .mode(SaveMode.Overwrite)
  .save()
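
Assuming the write goes through a carbon-enabled SparkSession (e.g. one built via CarbonSession, as in the example linked below), the table is registered in the metastore and can be queried back with plain Spark SQL. A minimal sketch, not taken from the example:

// Read the table written above back via Spark SQL.
// Assumes `spark` is the same carbon-enabled session used for the write.
val readBack = spark.sql("SELECT * FROM MyTable")
readBack.show(10)
println(s"row count: ${readBack.count()}")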

For more details, you can refer to this example:
https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/CarbonDataFrameExample.scala


HTH.

Regards
Liang



Re: Storing Data Frame as CarbonData Table

Michael Shtelma
Hi Liang,

Many thanks for your answer!
It worked this way.
I am now wondering how I should configure Carbon to get performance comparable with Parquet.
Right now I am using the default properties, actually no properties at all.
I tried saving one table to Carbon, and it took ages compared to Parquet.
Should I configure the number of writer threads somewhere, or something like that?
I started the Spark shell with the local[*] option, so I hoped that the write process would use all available cores, but this was not the case.
It looks like only one or two cores are actively used.

Another question: where can I place carbon.properties? If I place it in the same folder as spark-defaults.conf, will Carbon pick it up automatically?

Best,
Michael



Re: Storing Data Frame as CarbonData Table

mohdshahidkhan
Hi Michael,
I hope the details below will help you.

1. How should I configure Carbon to get good performance?
Please refer to the link below to optimize data loading performance in Carbon:
https://github.com/apache/carbondata/blob/master/docs/useful-tips-on-carbondata.md#configuration-for-optimizing-data-loading-performance-for-massive-data
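
Properties commonly used for data-loading tuning (check the guide above for the authoritative list) include carbon.number.of.cores.while.loading, carbon.sort.size and enable.unsafe.sort. A hypothetical carbon.properties sketch, just to show the shape; the values are illustrative, not recommendations:

# carbon.properties (illustrative values only -- tune per the guide above)
carbon.number.of.cores.while.loading=8
carbon.sort.size=100000
enable.unsafe.sort=true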


2. How to configure carbon.properties?


Property: spark.driver.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.

Property: spark.executor.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space.
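
Concretely, a hedged sketch of what this could look like in spark-defaults.conf (the path is illustrative; use the actual location of your carbon.properties):

# spark-defaults.conf -- point both driver and executors at carbon.properties
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties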





Re: Storing Data Frame as CarbonData Table

mohdshahidkhan
Hi Michael,
I hope the details below will help you.

1. How should I configure Carbon to get good performance?
Please refer to the link below to optimize data loading performance in Carbon:
https://github.com/apache/carbondata/blob/master/docs/useful-tips-on-carbondata.md#configuration-for-optimizing-data-loading-performance-for-massive-data


2. How to configure carbon.properties?


Property: spark.driver.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.

Property: spark.executor.extraJavaOptions
Value: -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Description: A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by space.

For more details, you can refer to the link below:
https://github.com/apache/carbondata/blob/master/docs/installation-guide.md#installing-and-configuring-carbondata-on-standalone-spark-cluster
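
If you launch jobs with spark-submit or spark-shell rather than editing spark-defaults.conf, the same settings can be passed on the command line. A hedged sketch (the application class, jar and path are hypothetical placeholders):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties" \
  --conf "spark.executor.extraJavaOptions=-Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties" \
  --class com.example.MyCarbonApp \
  my-carbon-app.jar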

