[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

[jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Akash R Nilugal (Jira)
Sanoj MG created CARBONDATA-836:
-----------------------------------

             Summary: Error in load using dataframe  - columns containing comma
                 Key: CARBONDATA-836
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-836
             Project: CarbonData
          Issue Type: Bug
          Components: spark-integration
    Affects Versions: 1.1.0-incubating
         Environment: HDP sandbox 2.5, Spark 1.6.2
            Reporter: Sanoj MG
            Priority: Minor
             Fix For: NONE


While trying to load data into a CarbonData table using a dataframe, columns containing commas are not loaded properly.

Eg:
scala> df.show(false)
+-------+------+-----------+----------------+---------+------+
|Country|Branch|Name       |Address         |ShortName|Status|
+-------+------+-----------+----------------+---------+------+
|2      |1     |Main Branch|XXXX, Dubai, UAE|UHO      |256   |
+-------+------+-----------+----------------+---------+------+


scala>  df.write.format("carbondata").option("tableName", "Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()


scala> cc.sql("select * from branch1").show(false)

+-------+------+-----------+-------+---------+------+
|country|branch|name       |address|shortname|status|
+-------+------+-----------+-------+---------+------+
|2      |1     |Main Branch|XXXX   | Dubai   |null  |
+-------+------+-----------+-------+---------+------+
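
The shifted values above are what you would expect if the row passes through an unquoted, comma-separated intermediate step. The following is only a sketch of that failure mode in plain Scala, not CarbonData code: the object and method names are hypothetical, and the assumption that fields are written without quoting is inferred from the output, not confirmed in this thread.

```scala
object CommaSplitSketch {
  // The sample row from the report, one string per column.
  val row: Vector[String] =
    Vector("2", "1", "Main Branch", "XXXX, Dubai, UAE", "UHO", "256")

  // Join fields with commas and no quoting (assumed behaviour of the CSV step).
  def toNaiveCsv(fields: Seq[String]): String = fields.mkString(",")

  // Split on every comma and keep only as many values as the table has columns.
  def parseNaive(line: String, width: Int): Vector[String] =
    line.split(",", -1).toVector.take(width)

  // Wrap each field in double quotes, doubling any embedded quote.
  def toQuotedCsv(fields: Seq[String]): String =
    fields.map(f => "\"" + f.replace("\"", "\"\"") + "\"").mkString(",")

  // Minimal quote-aware parser: commas inside quotes stay part of the field.
  def parseQuoted(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"')
          i += 1 // skip the second quote of an escaped pair
        } else if (c == '"') {
          inQuotes = false
        } else {
          current.append(c)
        }
      } else {
        if (c == '"') {
          inQuotes = true
        } else if (c == ',') {
          fields += current.toString
          current.clear()
        } else {
          current.append(c)
        }
      }
      i += 1
    }
    fields += current.toString
    fields.result()
  }
}
```

With the naive round trip, the address is cut to "XXXX", " Dubai" lands in the short-name column, and " UAE" lands in the status column, where it cannot be read as a number. That matches the XXXX / Dubai / null pattern shown above. The quoted round trip returns the original six values.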






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Sanoj MG
Hi All,

In CarbonDataFrameWriter, there is an option to load through a temporary CSV file:

if (options.tempCSV) {
  loadTempCSV(options)
} else {
  loadDataFrame(options)
}

Why is this choice required? Is there any issue if we load the dataframe
directly, without going through CSV?

I have many dimension tables with commas in string columns, so I always use
.option("tempCSV", "false"). Could the default value in CarbonOption be set
to "false", as below?

def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
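
To make the effect of that default concrete, here is a small stand-alone sketch. It does not use CarbonData: `OptionSketch` and `TempCsvDefaultSketch` are hypothetical stand-ins that model only the `tempCSV` lookup and the branch quoted above. The assumption that the current default is "true" is inferred from the proposal, not confirmed in this thread.

```scala
// Hypothetical stand-in for CarbonOption: only the tempCSV lookup is modeled.
final case class OptionSketch(options: Map[String, String], defaultTempCSV: String) {
  // Same shape as the accessor above, with the default made a parameter.
  def tempCSV: Boolean = options.getOrElse("tempCSV", defaultTempCSV).toBoolean
}

object TempCsvDefaultSketch {
  // Mirrors the branch quoted from CarbonDataFrameWriter, but returns the name
  // of the method that would be called instead of running a real load.
  def chosenPath(o: OptionSketch): String =
    if (o.tempCSV) "loadTempCSV" else "loadDataFrame"
}
```

With an empty options map, a default of "true" selects the temporary CSV path and a default of "false" selects the direct path. An explicit .option("tempCSV", ...) overrides the default either way, so changing the default would only affect callers who do not set the option.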

Thanks,
Sanoj


Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Jacky Li
Hi Sanoj,

This is because the CarbonData loading flow needs to scan the input data twice: once to generate the global dictionary and once for the actual load. If you write a DataFrame to CarbonData and that DataFrame is costly to compute, it is better to save it as a temporary CSV file first and load that into CarbonData than to compute the DataFrame twice.
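
The trade-off can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions: the names are hypothetical, an in-memory Vector stands in for the temporary CSV file, and the "dictionary" is just a map from distinct values to integer keys, which simplifies whatever CarbonData does internally.

```scala
object TwoPassSketch {
  // Counts how many times the "expensive" input is computed.
  var computeCount: Int = 0

  // Stand-in for a costly DataFrame: every call recomputes the rows.
  def expensiveRows(): Iterator[String] = {
    computeCount += 1
    Iterator("Dubai", "Abu Dhabi", "Dubai")
  }

  // Pass 1: build a dictionary from each distinct value to an integer key.
  def buildDictionary(rows: Iterator[String]): Map[String, Int] =
    rows.toVector.distinct.zipWithIndex.toMap

  // Pass 2: the actual load, replacing each value with its dictionary key.
  def encode(rows: Iterator[String], dict: Map[String, Int]): Vector[Int] =
    rows.map(dict).toVector

  // Direct load: both passes pull from the source, so it is computed twice.
  def loadDirect(): Vector[Int] = {
    val dict = buildDictionary(expensiveRows())
    encode(expensiveRows(), dict)
  }

  // Materialize once (stand-in for the temp CSV), then scan the copy twice.
  def loadViaMaterialized(): Vector[Int] = {
    val saved = expensiveRows().toVector
    val dict = buildDictionary(saved.iterator)
    encode(saved.iterator, dict)
  }
}
```

Both functions produce the same encoded rows. The difference is in `computeCount`: the direct load increments it twice per call, the materialized load once. The real cost of the extra write and read is not modeled here.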

However, there is another option that does a single-pass data load: .option("single_pass", "true"). In that case the input DataFrame should be computed only once. But when I checked the code just now, it seems this behavior is not implemented. :(
Feel free to create a JIRA ticket for it if you want.

Regards,
Jacky

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Sanoj MG
Thanks Jacky. I have created a JIRA -
https://issues.apache.org/jira/browse/CARBONDATA-909 for this.



Thanks,
Sanoj
