
Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 on Oct 21, 2017; 3:12am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Update-statement-failed-with-Multiple-input-rows-matched-for-same-row-in-version-1-2-0-tp24222p24455.html

dev

Test data (CSV):
Link: http://pan.baidu.com/s/1boVAqeF  Password: qtdh
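
Once downloaded, the CSVs could be loaded with CarbonData's LOAD DATA command
(a sketch only; the HDFS path and file names below are placeholders, and the
FILEHEADER assumes the id,remark schema from the reproduction steps further
down the thread):

cc.sql("LOAD DATA INPATH 'hdfs://<namenode>/tmp/c_indextest2.csv' " +
  "INTO TABLE c_indextest2 OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,remark')")
cc.sql("LOAD DATA INPATH 'hdfs://<namenode>/tmp/c_indextest1.csv' " +
  "INTO TABLE c_indextest1 OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,remark')")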


yixu2001
 
From: Raghunandan S
Date: 2017-10-20 16:56
To: dev
Subject: Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Should be fine. Do we need to create any login to access it?
On Fri, 20 Oct 2017 at 1:31 PM, yixu2001 <[hidden email]> wrote:
 

> dev
>        Can I share it via Baidu SkyDrive?
>
>
> yixu2001
>
> From: sounak
> Date: 2017-10-20 16:14
> To: dev
> Subject: Re: Re: Update statement failed with "Multiple input rows matched
> for same row" in version 1.2.0,
> As the tupleIds are also unique, the only option left is to get hold of
> your data and reproduce it in-house. Is it possible for us to generate
> your data in-house, or can you share it?
>
> On Fri, Oct 20, 2017 at 12:14 PM, yixu2001 <[hidden email]> wrote:
>
> > dev
> >
> > scala> cc.sql(" select c.tupleId, count(*) from (select a.remark, a.id,
> > getTupleId() as tupleId from c_indextest1 a, c_indextest2 b where
> > a.id=b.id) c group by c.tupleId having count(*)>1").show;
> > +-------+--------+
> > |tupleId|count(1)|
> > +-------+--------+
> > +-------+--------+
> >
> > No multiple input rows matched.
> >
> >
> > yixu2001
> >
> > From: sounak
> > Date: 2017-10-20 10:57
> > To: dev
> > Subject: Re: Re: Update statement failed with "Multiple input rows
> > matched for same row" in version 1.2.0,
> > A slight correction to a typo in the query syntax:
> >
> > " select a.remark,a.id, getTupleId() as *tupleId* from c_indextest1 a,
> > c_indextest2
> > b where a.id=b.id;"
> >
> > On Fri, Oct 20, 2017 at 7:54 AM, sounak <[hidden email]> wrote:
> >
> > > Internally, we call a UDF to generate the TupleId, and based on that
> > > tupleId we decide whether the row is duplicated. You can run a
> > > slightly tweaked query:
> > >
> > > " select a.remark,a.id, getTupleId() as TupleId from c_indextest1 a,
> > > c_indextest2 b where a.id=b.id; "
> > >
> > > If "Multiple input rows" is reported, then most probably getTupleId()
> > > is giving duplicate entries.
> > >
> > > Meanwhile, your given query runs fine with the data I have generated
> > > in-house, so this is data dependent. It would be good if we could get
> > > *origin_data* or the script that built the data for origin_data.
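> > >
> > > In the meantime, a stand-in for origin_data can be synthesized in the
> > > shell, since step 2 of the reproduction below uses only constant fields
> > > and a row count (a sketch; assumes cc is the Spark/Carbon session):
> > >
> > > import cc.implicits._
> > > // Sketch: a 17-million-row substitute for origin_data. PKID and remark
> > > // are the constants used in step 2, so only the row count matters.
> > > cc.range(17000000L).map(_ => ("1", "dfsdd")).toDF("PKID", "remark")
> > >   .createOrReplaceTempView("origin_data")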
> > >
> > > On Fri, Oct 20, 2017 at 7:28 AM, yixu2001 <[hidden email]> wrote:
> > >
> > >> dev
> > >>    It returns one row record.
> > >>
> > >>
> > >> yixu2001
> > >>
> > >> From: Liang Chen
> > >> Date: 2017-10-19 22:26
> > >> To: dev
> > >> Subject: Re: Re: Update statement failed with "Multiple input rows
> > >> matched for same row" in version 1.2.0,
> > >> Hi
> > >>
> > >> Does executing the below query return one row record or multiple
> > >> row records?
> > >> ---------------------------------
> > >> select a.remark from  c_indextest1 a where a.id=b.id
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >>
> > >> yixu2001 wrote
> > >> > dev
> > >> >  You can follow the steps below to reproduce the problem.
> > >> > Table c_indextest2 has 17 million (1700w) records and table
> > >> > c_indextest1 has about 300 thousand (30w) records.
> > >> >
> > >> > step 1:
> > >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest2 (id string ,remark
> > >> string)
> > >> > STORED BY 'carbondata'").show;
> > >> >
> > >> > step 2: origin_data is an existing table with 17 million records;
> > >> > the fields in the table do not matter.
> > >> > cc.sql("insert into c_indextest2 select row_number() over(partition
> by
> > >> > a.PKid order by a.pkid) id,a.remark from  (SELECT '1' PKID,'dfsdd'
> > >> remark
> > >> > from origin_data limit 17000000) a").show;
> > >> >
> > >> > step 3:
> > >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest1 (id string ,remark
> > >> string)
> > >> > STORED BY 'carbondata'").show;
> > >> >
> > >> > step 4:
> > >> > cc.sql("insert into c_indextest1 select * from  c_indextest2 where
> > >> > pmod(cast(ID as int),50)=43").show;
> > >> >
> > >> > step 5:
> > >> > cc.sql("update c_indextest2 b set (b.remark)=(select a.remark from
> > >> > c_indextest1 a where a.id=b.id)").show;
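> > >> >
> > >> > Before step 5, the uniqueness of id in both tables can be confirmed
> > >> > with a quick check (the same counts appear in the original report
> > >> > further below):
> > >> > cc.sql("select count(*), count(distinct id) from c_indextest1").show;
> > >> > cc.sql("select count(*), count(distinct id) from c_indextest2").show;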
> > >> >
> > >> >
> > >> > yixu2001
> > >> >
> > >> > From: sounak
> > >> > Date: 2017-10-19 18:26
> > >> > To: dev
> > >> > Subject: Re: Re: Update statement failed with "Multiple input rows
> > >> > matched for same row" in version 1.2.0,
> > >> > Hi Yixu,
> > >> >
> > >> > Thanks for the DDL. But I am particularly interested in the data in
> > >> > order to reproduce the problem. Would it be feasible to share the
> > >> > data or, if it is huge, the scripts that generate it? I cannot access
> > >> > the data location from the public network.
> > >> >
> > >> >
> > >> > On Thu, Oct 19, 2017 at 3:00 PM, yixu2001 <[hidden email]> wrote:
> > >> >
> > >> >> dev
> > >> >>
> > >> >> Step 1: I create a Hive table qqdata2.h_indextest1:
> > >> >> CREATE EXTERNAL TABLE `qqdata2.h_indextest1`(
> > >> >> `id` INT,
> > >> >> `CUST_ORDER_ID` STRING,
> > >> >> `ORDER_ITEM_IDATTR_ID` STRING,
> > >> >> `ATTR_VALUE_IDATTR_VALUE` STRING,
> > >> >> `CREATE_DATE` STRING,
> > >> >> `UPDATE_DATE` STRING,
> > >> >> `STATUS_CD` STRING,
> > >> >> `STATUS_DATE` STRING,
> > >> >> `AREA_ID` STRING,
> > >> >> `REGION_CD` STRING,
> > >> >> `UPDATE_STAFF` STRING,
> > >> >> `CREATE_STAFF` STRING,
> > >> >> `SHARDING_ID` STRING,
> > >> >> `ORDER_ATTR_ID` STRING )
> > >> >> ROW FORMAT DELIMITED
> > >> >>   FIELDS TERMINATED BY '\u0001'
> > >> >>   LINES TERMINATED BY '\n'
> > >> >>   NULL DEFINED AS ''
> > >> >> STORED AS INPUTFORMAT
> > >> >>   'org.apache.hadoop.mapred.TextInputFormat'
> > >> >> OUTPUTFORMAT
> > >> >>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> > >> >> LOCATION
> > >> >>   'hdfs://hdp78.ffcs.cn:8020/user/bigdata/streamcql/dist1';
> > >> >>
> > >> >> There are CSV files in the HDFS dir user/bigdata/streamcql/dist1;
> > >> >> the data format in the CSV files is as follows:
> > >> >>
> > >> >> 1939505130,171483932,287305502,813463930,20160709134396669,
> > >> >> 201607101469099594,1299,20160711996575390,10,73,302063,302064,
> > >> >> 127859875,9999999
> > >> >>
> > >> >> Step 2: I create a Carbon table qqdata2.c_indextest1:
> > >> >> cc.sql("CREATE TABLE IF NOT EXISTS qqdata2.c_indextest1 (id STRING,
> > >> >> CUST_ORDER_ID STRING, ORDER_ITEM_IDATTR_ID STRING,
> > >> >> ATTR_VALUE_IDATTR_VALUE STRING, CREATE_DATE STRING, UPDATE_DATE
> > >> >> STRING, STATUS_CD STRING, STATUS_DATE STRING, AREA_ID STRING,
> > >> >> REGION_CD STRING, UPDATE_STAFF STRING, CREATE_STAFF STRING,
> > >> >> SHARDING_ID STRING, ORDER_ATTR_ID STRING) STORED BY 'carbondata'")
> > >> >>
> > >> >> Step 3: Insert data:
> > >> >> cc.sql("insert into qqdata2.c_indextest1 select * from
> > >> >> qqdata2.h_indextest1").show(100,false);
> > >> >>
> > >> >> Step 4: Repeating steps 1 to 3, I create another Carbon table,
> > >> >> qqdata2.c_indextest2.
> > >> >> qqdata2.c_indextest1 has 300 thousand (30w) records;
> > >> >> qqdata2.c_indextest2 has 17 million (1700w) records.
> > >> >>
> > >> >>
> > >> >>
> > >> >> yixu2001
> > >> >>
> > >> >> From: sounak
> > >> >> Date: 2017-10-19 14:13
> > >> >> To: dev
> > >> >> Subject: Re: Update statement failed with "Multiple input rows
> > >> >> matched for same row" in version 1.2.0,
> > >> >> Hi Yixu,
> > >> >>
> > >> >> Can you please share the DDLs and the data for the above problem
> > >> >> with us?
> > >> >>
> > >> >> Thanks
> > >> >> Sounak
> > >> >>
> > >> >> On Wed, Oct 18, 2017 at 12:44 PM, yixu2001 <[hidden email]> wrote:
> > >> >>
> > >> >> > dev
> > >> >> >
> > >> >> >
> > >> >> > In CarbonData version 1.2.0, I execute an "update" statement with
> > >> >> > a sub-query and it fails.
> > >> >> > No rows in the 2 tables are duplicated, and the same statement
> > >> >> > succeeds in CarbonData version 1.1.1.
> > >> >> >
> > >> >> > The test log is as follows:
> > >> >> > scala> cc.sql("select count(*), count(distinct id) from
> > >> >> > qqdata2.c_indextest1").show(100,false);
> > >> >> > +--------+------------------+
> > >> >> > |count(1)|count(DISTINCT id)|
> > >> >> > +--------+------------------+
> > >> >> > |300000  |300000            |
> > >> >> > +--------+------------------+
> > >> >> >
> > >> >> > scala> cc.sql("select count(*), count(distinct id) from
> > >> >> > qqdata2.c_indextest2").show(100,false);
> > >> >> > +--------+------------------+
> > >> >> > |count(1)|count(DISTINCT id)|
> > >> >> > +--------+------------------+
> > >> >> > |71223220|71223220          |
> > >> >> > +--------+------------------+
> > >> >> >
> > >> >> > scala> cc.sql("update qqdata2.c_indextest2 a
> > >> >> set(a.CUST_ORDER_ID,a.ORDER_
> > >> >> > ITEM_IDATTR_ID,a.ATTR_VALUE_IDATTR_VALUE,a.CREATE_DATE,a.
> > >> >> > UPDATE_DATE,a.STATUS_CD,a.STATUS_DATE,a.AREA_ID,a.
> > >> >> > REGION_CD,a.UPDATE_STAFF,a.CREATE_STAFF,a.SHARDING_ID,a.ORDE
> > >> R_ATTR_ID)
> > >> >> =
> > >> >> > (select b.CUST_ORDER_ID,b.ORDER_ITEM_
> > IDATTR_ID,b.ATTR_VALUE_IDATTR_
> > >> >> > VALUE,b.CREATE_DATE,b.UPDATE_DATE,b.STATUS_CD,b.STATUS_
> > >> >> > DATE,b.AREA_ID,b.REGION_CD,b.UPDATE_STAFF,b.CREATE_STAFF,b.
> > >> >> SHARDING_ID,b.ORDER_ATTR_ID
> > >> >> > from qqdata2.c_indextest1 b where a.id = b.id
> )").show(100,false);
> > >> >> > 17/10/18 11:32:46 WARN Utils: Truncated the string representation
> > >> >> > of a plan since it was too large. This behavior can be adjusted by
> > >> >> > setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
> > >> >> > 17/10/18 11:33:20 AUDIT deleteExecution$: [hdp84.ffcs.cn][bigdata]
> > >> >> > [Thread-1]Delete data operation is failed for qqdata2.c_indextest2
> > >> >> > 17/10/18 11:33:20 ERROR deleteExecution$: main Delete data
> > >> >> > operation is failed due to failure in creating delete delta file
> > >> >> > for segment : null block : null
> > >> >> > 17/10/18 11:33:20 ERROR ProjectForUpdateCommand$: main Exception in
> > >> >> > update operation java.lang.Exception: Multiple input rows matched
> > >> >> > for same row.
> > >> >> > java.lang.RuntimeException: Update operation failed. Multiple input
> > >> >> > rows matched for same row.
> > >> >> >   at scala.sys.package$.error(package.scala:27)
> > >> >> >   at org.apache.spark.sql.execution.command.ProjectForUpdateCommand.processData(IUDCommands.scala:239)
> > >> >> >   at org.apache.spark.sql.execution.command.ProjectForUpdateCommand.run(IUDCommands.scala:141)
> > >> >> >   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> > >> >> >   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> > >> >> >   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeTake(commands.scala:71)
> > >> >> >   at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
> > >> >> >   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2378)
> > >> >> >   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> > >> >> >   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
> > >> >> >   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377)
> > >> >> >   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2384)
> > >> >> >   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)
> > >> >> >   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)
> > >> >> >   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
> > >> >> >   at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
> > >> >> >   at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
> > >> >> >   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
> > >> >> >   at org.apache.spark.sql.Dataset.show(Dataset.scala:640)
> > >> >> >   ... 50 elided
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > yixu2001
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Thanks
> > >> >> Sounak
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Thanks
> > >> > Sounak
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks
> > > Sounak
> > >
> >
> >
> >
> > --
> > Thanks
> > Sounak
> >
>
>
>
> --
> Thanks
> Sounak
>