Sanoj MG created CARBONDATA-836:
-----------------------------------

Summary: Error in load using dataframe - columns containing comma
Key: CARBONDATA-836
URL: https://issues.apache.org/jira/browse/CARBONDATA-836
Project: CarbonData
Issue Type: Bug
Components: spark-integration
Affects Versions: 1.1.0-incubating
Environment: HDP sandbox 2.5, Spark 1.6.2
Reporter: Sanoj MG
Priority: Minor
Fix For: NONE

While trying to load data into a CarbonData table using a dataframe, columns containing commas are not loaded properly.

Eg:
scala> df.show(false)
+-------+------+-----------+----------------+---------+------+
|Country|Branch|Name       |Address         |ShortName|Status|
+-------+------+-----------+----------------+---------+------+
|2      |1     |Main Branch|XXXX, Dubai, UAE|UHO      |256   |
+-------+------+-----------+----------------+---------+------+

scala> df.write.format("carbondata").option("tableName", "Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()

scala> cc.sql("select * from branch1").show(false)
+-------+------+-----------+-------+---------+------+
|country|branch|name       |address|shortname|status|
+-------+------+-----------+-------+---------+------+
|2      |1     |Main Branch|XXXX   | Dubai   |null  |
+-------+------+-----------+-------+---------+------+

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
Hi All,

In CarbonDataFrameWriter, there is an option to load using a CSV file:

if (options.tempCSV) {
  loadTempCSV(options)
} else {
  loadDataFrame(options)
}

Why is this choice required? Is there any issue if we load the data directly without using CSV?

I have many dimension tables with commas in string columns, and so I always use .option("tempCSV", "false"). In CarbonOption, can we set the default value to "false" as below?

def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean

Thanks,
Sanoj
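The workaround described above can be sketched as an end-to-end write; this is a hypothetical snippet assuming a dataframe `df` like the one in the bug report (the option names `tableName`, `compress`, and `tempCSV` all come from this thread; it needs a running Spark + CarbonData session, so it is not standalone-runnable):

```scala
import org.apache.spark.sql.SaveMode

// Write the dataframe directly (tempCSV = false), taking the
// loadDataFrame path instead of loadTempCSV, so string columns
// such as Address = "XXXX, Dubai, UAE" are not split on commas
// by the intermediate CSV file.
df.write
  .format("carbondata")
  .option("tableName", "Branch1")
  .option("compress", "true")
  .option("tempCSV", "false") // skip the temporary CSV round-trip
  .mode(SaveMode.Overwrite)
  .save()
```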
Hi Sanoj,

This is because the CarbonData loading flow needs to scan the input data twice (once for generating the global dictionary, and once for the actual loading). If the user is writing a Dataframe to CarbonData and the input dataframe is costly to compute, it is better to save it as a temporary CSV file first and load that into CarbonData, instead of computing the dataframe twice.

However, there is another option that does a single-pass data load, by using .option("single_pass", "true"); in this case, the input dataframe should be computed only once. But when I checked the code just now, it seems this behavior is not implemented. :(

I think you are free to create a JIRA ticket if you want.

Regards,
Jacky
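The single-pass alternative Jacky mentions could look like the sketch below. Note this is only illustrative: the option key "single_pass" is taken from his message, the rest of the call chain is assumed to match the bug report's example, and per the thread this behavior may not actually be implemented for the dataframe write path yet.

```scala
import org.apache.spark.sql.SaveMode

// Single-pass load: dictionary generation happens during the load
// itself, so the input dataframe is (in principle) computed once.
df.write
  .format("carbondata")
  .option("tableName", "Branch1")
  .option("single_pass", "true") // avoid the second scan of the input
  .mode(SaveMode.Overwrite)
  .save()
```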
Thanks Jacky. I have created a JIRA for this:
https://issues.apache.org/jira/browse/CARBONDATA-909

Thanks,
Sanoj