Apache CarbonData Dev Mailing List archive

[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Akash R Nilugal (Jira)

[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Sanoj MG created CARBONDATA-836:
-----------------------------------

Summary: Error in load using dataframe - columns containing comma
Key: CARBONDATA-836
URL: https://issues.apache.org/jira/browse/CARBONDATA-836
Project: CarbonData
Issue Type: Bug
Components: spark-integration
Affects Versions: 1.1.0-incubating
Environment: HDP sandbox 2.5, Spark 1.6.2
Reporter: Sanoj MG
Priority: Minor
Fix For: NONE

While trying to load data into a CarbonData table using a DataFrame, columns containing commas are not loaded properly.

Example:
scala> df.show(false)
+-------+------+-----------+----------------+---------+------+
|Country|Branch|Name       |Address         |ShortName|Status|
+-------+------+-----------+----------------+---------+------+
|2      |1     |Main Branch|XXXX, Dubai, UAE|UHO      |256   |
+-------+------+-----------+----------------+---------+------+

scala> df.write.format("carbondata").option("tableName", "Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()

scala> cc.sql("select * from branch1").show(false)

+-------+------+-----------+-------+---------+------+
|country|branch|name       |address|shortname|status|
+-------+------+-----------+-------+---------+------+
|2      |1     |Main Branch|XXXX   | Dubai   |null  |
+-------+------+-----------+-------+---------+------+
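
The shifted columns above are consistent with the row having been written as comma-delimited text without quoting and then split on every comma. The following is a minimal sketch of that failure mode and of a quote-aware round trip. It is an assumption about the cause, not CarbonData's actual code, and every name in it (CommaSplitSketch, writeNaive, readNaive, writeQuoted, readQuoted) is hypothetical.

```scala
// Hypothetical illustration only; none of these helpers are CarbonData APIs.
object CommaSplitSketch {
  // The row from the report above.
  val row: Seq[String] =
    Seq("2", "1", "Main Branch", "XXXX, Dubai, UAE", "UHO", "256")

  // Naive writer: joins fields with commas and applies no quoting.
  def writeNaive(fields: Seq[String]): String = fields.mkString(",")

  // Naive reader: splits on every comma, including those inside a value.
  def readNaive(line: String): Seq[String] = line.split(",", -1).toSeq

  // Quoting writer: wraps each field in double quotes and doubles embedded quotes.
  def writeQuoted(fields: Seq[String]): String =
    fields.map(f => "\"" + f.replace("\"", "\"\"") + "\"").mkString(",")

  // Quote-aware reader: a comma inside quotes stays part of the field.
  def readQuoted(line: String): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          cur.append('"') // doubled quote inside a quoted field
          i += 1
        } else if (c == '"') {
          inQuotes = false
        } else {
          cur.append(c)
        }
      } else if (c == '"') {
        inQuotes = true
      } else if (c == ',') {
        out += cur.toString // field boundary
        cur.clear()
      } else {
        cur.append(c)
      }
      i += 1
    }
    out += cur.toString
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val naive = readNaive(writeNaive(row))
    println(naive.length)        // 8 fields instead of 6
    println(naive.mkString("|")) // the address is split into three fields
    println(readQuoted(writeQuoted(row)) == row) // true
  }
}
```

In the naive round trip the fourth and fifth fields come back as "XXXX" and " Dubai", which matches the address and shortname values in the query result above.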

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Sanoj MG

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Hi All,

In CarbonDataFrameWriter, there is an option to load via a temporary CSV file.

if (options.tempCSV) {
  loadTempCSV(options)
} else {
  loadDataFrame(options)
}

Why is this choice required? Is there any issue if we load the data directly
without going through CSV?

I have many dimension tables with commas in string columns, so I always use
.option("tempCSV", "false"). Could we set the default value in CarbonOption
to "false", as below?

def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
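
To make the effect of the proposed default concrete, here is a small sketch that models the option lookup with a plain Map. OptionSketch is a hypothetical stand-in, not the real CarbonOption class, and it assumes the current default is "true", which the proposal implies but this thread does not state.

```scala
// Stand-in for illustration only; not the real CarbonOption class.
final case class OptionSketch(options: Map[String, String], defaultTempCSV: String) {
  // Same lookup shape as the line above, with the default made a parameter.
  def tempCSV: Boolean = options.getOrElse("tempCSV", defaultTempCSV).toBoolean
}

object OptionSketchDemo {
  // Options as a caller would pass them without mentioning tempCSV.
  val unset: Map[String, String] = Map("tableName" -> "Branch1")
  // The same options with the workaround applied explicitly.
  val explicit: Map[String, String] = unset + ("tempCSV" -> "false")

  def main(args: Array[String]): Unit = {
    println(OptionSketch(unset, "true").tempCSV)     // assumed current default
    println(OptionSketch(unset, "false").tempCSV)    // proposed default
    println(OptionSketch(explicit, "true").tempCSV)  // explicit setting wins
  }
}
```

With the proposed default, callers who never set the option would take the direct DataFrame path, while an explicit .option("tempCSV", "true") would still select the CSV path.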

Thanks,
Sanoj

Jacky Li

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Hi Sanoj,

This is because the CarbonData loading flow needs to scan the input data twice: once to generate the global dictionary and once for the actual load. If the user writes to CarbonData from a DataFrame and the input DataFrame is costly to compute, it is better to save it as a temporary CSV file first and load that into CarbonData than to compute the DataFrame twice.

However, there is another option that performs a single-pass data load: .option("single_pass", "true"). In that case, the input DataFrame should be computed only once. But when I checked the code just now, it seems this behavior is not implemented. :(
Feel free to create a JIRA ticket for it if you want.
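
The trade-off described above can be sketched in plain Scala without Spark. This is only an illustration of the two-scan idea under simplifying assumptions: the "dictionary" is a map from distinct values to integer keys, the stored copy is an in-memory Seq standing in for the temporary CSV file, and all names (TwoPassSketch, computeInput, buildDictionary, encode) are hypothetical.

```scala
// Hypothetical illustration only; not CarbonData's loading code.
object TwoPassSketch {
  // Counts how often the (supposedly expensive) input is computed.
  var computeCount = 0

  def computeInput(): Seq[String] = {
    computeCount += 1
    Seq("Dubai", "Abu Dhabi", "Dubai")
  }

  // Scan 1: assign an integer key to each distinct value.
  def buildDictionary(rows: Seq[String]): Map[String, Int] =
    rows.distinct.zipWithIndex.toMap

  // Scan 2: encode every row using the dictionary.
  def encode(rows: Seq[String], dict: Map[String, Int]): Seq[Int] =
    rows.map(dict)

  // Direct load: each scan recomputes the input.
  def loadRecomputing(): Seq[Int] = {
    val dict = buildDictionary(computeInput())
    encode(computeInput(), dict)
  }

  // Materialized load: compute once, then scan the stored copy twice.
  def loadMaterialized(): Seq[Int] = {
    val stored = computeInput()
    encode(stored, buildDictionary(stored))
  }

  def main(args: Array[String]): Unit = {
    computeCount = 0
    loadRecomputing()
    println(computeCount) // 2
    computeCount = 0
    loadMaterialized()
    println(computeCount) // 1
  }
}
```

Both paths produce the same encoded rows. They differ only in how many times the input is computed, which is the cost the temporary CSV file is meant to avoid.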

Regards,
Jacky

Sanoj MG

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Thanks, Jacky. I have created a JIRA ticket for this:
https://issues.apache.org/jira/browse/CARBONDATA-909

Thanks,
Sanoj
