Hi, Using latest Carbondata and Parquet - comparing full scan, aggregation queries on 5 node SPARK cluster each consists of 8cores, 32GB ram - we are having 18 million records for a minute which has 40 columns and 20 are of string type. We have a use case where in we want to do a full scan and aggregate the data based on 32 keys which mostly integer or long type. Even after changing / playing around carbon properties such DICTIONARY_EXCLUDE/INCLUDE, INVERTED_INDEX, block and blocklet size - Carbon query takes 2 secs for 1 aggregation (2 group by keys for SUM, MIN, MAX, COUNT) where as Parquet takes 0.9 secs. Carbon table creation is as follows - carbon.sql("create table Mar18(flowSeqNum integer ,protocolId integer ,srcTos integer ,dstTos integer ,tcpBits integer ,srcPort integer ,dstPort integer ,workerId integer ,epoch long ,rowNum long ,exportMs long ,pktSeqNum long ,inIfId long ,inIfEntityId long ,outIfId long ,outIfEntityId long ,inOctets long ,outOctets long ,inPackets long ,outPackets long ,bgpSrcAsNum long ,bgpDstAsNum long ,endMs long ,startMs long ,policyQosQueueId long ,exporterIp string ,srcIp string ,dstIp string ,srcMask string ,dstMask string ,direction string ,nextHopIp string ,bgpNextHopIp string ,appName string ,srcIpGroup string ,dstIpGroup string ,appId string ,policyQosClassificationHierarchy string ,inIfEnabled boolean ,outIfEnabled boolean) stored as carbondata ") Can u help here where we are going wrong? BR, Deepak -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Can you give us the full properties of creating table sql? also the sql which
is low performance. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Thx. Please see below create table SQL
carbon.sql("create table carbon_test_new_7(flowSeqNum integer ,protocolId integer ,srcTos integer ,dstTos integer ,tcpBits integer ,srcPort integer ,dstPort integer ,workerId integer ,epoch long ,rowNum long ,exportMs long ,pktSeqNum long ,inIfId long ,inIfEntityId long ,outIfId long ,outIfEntityId long ,inOctets long ,outOctets long ,inPackets long ,outPackets long ,bgpSrcAsNum long ,bgpDstAsNum long ,endMs long ,startMs long ,policyQosQueueId long ,exporterIp string ,srcIp string ,dstIp string ,srcMask string ,dstMask string ,direction string ,nextHopIp string ,bgpNextHopIp string ,appName string ,srcIpGroup string ,dstIpGroup string ,appId string ,policyQosClassificationHierarchy string ,inIfEnabled boolean ,outIfEnabled boolean) stored as carbondata TBLPROPERTIES('INVERTED_INDEX'='inIfId,dstIpGroup,appName')") ").show() Also, the queries we use are like select appName, sum(inPackets), sum(inPackets) from carbon_test_new7 group by inIfid, appName Also, when we are creating datamaps on above table for specific queries (we have where condition in it) using EXPLAIN command we are not coming to know whether it is using datamap or main table only. Please help in having what columns to include in DICTIONARY, INVERTED index, sorted columns and datamap creations. BR, Deepak -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Can you try the below schema with mentioned tableproperties and check the
query performance with Parquet once. carbon.sql("create table carbon_test_new_7(flowSeqNum integer ,protocolId integer ,srcTos integer ,dstTos integer ,tcpBits integer ,srcPort integer ,dstPort integer ,workerId integer ,epoch long ,rowNum long ,exportMs long ,pktSeqNum long ,inIfId long ,inIfEntityId long ,outIfId long ,outIfEntityId long ,inOctets long ,outOctets long ,inPackets long ,outPackets long ,bgpSrcAsNum long ,bgpDstAsNum long ,endMs long ,startMs long ,policyQosQueueId long ,exporterIp string ,srcIp string ,dstIp string ,srcMask string ,dstMask string ,direction string ,nextHopIp string ,bgpNextHopIp string ,appName string ,srcIpGroup string ,dstIpGroup string ,appId string ,policyQosClassificationHierarchy string ,inIfEnabled boolean ,outIfEnabled boolean) stored as carbondata TBLPROPERTIES('no_inverted_index'='flowSeqNum,protocolId,srcTos,dstTos,tcpBits,srcPort,dstPort,workerId,epoch,rowNum,exportMs,pktSeqNum,inIfId,inIfEntityId,outIfId,outIfEntityId,inOctets,outOctets,inPackets,outPackets,bgpSrcAsNum,bgpDstAsNum,endMs,startMs,policyQosQueueId,exporterIp,srcIp,dstIp,srcMask,dstMask,direction,nextHopIp,bgpNextHopIp,appName,srcIpGroup,dstIpGroup,appId,policyQosClassificationHierarchy,inIfEnabled,outIfEnabled','SORT_COLUMNS'='', 'local_dictionary_enable'='true','table_blocklet_size'='90','table_blocksize'='120')") ").show() Please let us know once after testing. -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
Free forum by Nabble | Edit this page |