MarvinLitt opened a new pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523

…to carbondata. As discussed with likun, this adds some Chinese documentation to CarbonData.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done. Please provide details on:
  - Whether new unit test cases have been added or why no new tests are required?
  - How it is tested? Please attach test report.
  - Is it a performance related change? Please attach the performance test report.
  - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-567896563 Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1229/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-567915196 Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1239/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-567937751 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1248/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-570506223 Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1421/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-570515793 Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1423/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-570533450 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1443/
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-570534045 Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1431/
jackylk commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363074263

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated from Chinese):

# A typical CarbonData application for detail data queries: point query + filter conditions

## Background

This document focuses on how to configure table-creation, loading, and query parameters when using CarbonData for detail data query scenarios, and guides users in choosing suitable dictionary, SORT_COLUMNS, and SORT_SCOPE settings at table creation. It also gives load and query times for a set of different configurations, so that users can pick parameters that fit their own business characteristics and scenarios.

The tables and queries in this document have the following characteristics: large row counts and many columns (roughly 100 to 600), with table sizes ranging from tens of millions to tens of billions of rows. Queries are mainly point lookups and filters, with no aggregation and only occasional dimension-table joins. Data is ingested in batches with a period of about 5 minutes, and a new table is created for each day. At query time there may be no fewer than 20 concurrent queries.

Typical query patterns (the fifth one, sum, is included only for performance comparison):

1. Point query: select * from table where id_a=' ' limit 1000;
2. Fuzzy query: select * from table where id_a like '1234%' limit 1000;
3. Total record count: select count(1) from table;
4. Max/min: select max(id_a), min(id_a) from table;
5. Sum (for comparison only): select sum(id_a) from table;

Data characteristics: the columns are mainly int, bigint, and string, describing phone numbers, times, IDs, and so on; there are no complex data types.

## Test environment

| Cluster | CPU | vCore | Memory | Disk | Description |
| --- | --- | --- | --- | --- | --- |
| Hadoop cluster | Gold 6132 CPU @ 2.60GHz | 56 | 256GB | SATA | 2 namenodes, 6 datanodes; the query queue is allocated 1/6 of the resources, equivalent to one node |

## Create table

### Choosing a dictionary at table creation

When creating a table, users can choose a local dictionary, a global dictionary, or no dictionary. Generally, loading takes longer with a global dictionary but queries are faster; a local dictionary loads faster than a global dictionary but queries are not as fast; with no dictionary, loading is fastest but queries are slowest. Users can choose the mode that fits their business requirements.

To verify this, tables with no dictionary, a local dictionary, and a global dictionary were loaded and point-queried, and their load and query performance was analyzed.

The create-table statements are structured as follows:

1) No dictionary:

```
create table if not exists test.detail_benchmark1
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256')
```

2) Local dictionary:

```
create table if not exists test.detail_loacl_dict
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_INCLUDE'='IMSI,MSISDN,IMEI', 'table_blocksize'='256')
```

3) Global dictionary:

```
create table if not exists test.detail_global_dict
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'DICTIONARY_INCLUDE'='IMSI,MSISDN,IMEI')
```

A sample of 1.6 billion rows was loaded into each of the tables above in 168 batches, with the following load performance:

| Table | Rows | Load time (s) |
| --- | --- | --- |
| test.detail_benchmark1 | 1613149548 | 1109 |
| test.detail_loacl_dict | 1613149548 | 1876 |
| test.detail_global_dict | 1613149548 | 5191 |

The data shows that loading takes significantly longer with a global dictionary, and clearly less time with no dictionary or a local dictionary.

The following table records the load time of each batch:

| Load batch | Time (s) detail_benchmark | Time (s) detail_loacl_dict | Time (s) detail_global_dict |

Review comment: please remove detail_global_dict, global dictionary feature is deprecated
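Since the review notes that the global dictionary feature is deprecated, here is a minimal sketch of how the third table could instead rely only on the local dictionary, mirroring the document's second example. The table name is illustrative, and, as in the document's own snippets, the storage-format clause and the full column list are omitted.

```sql
-- Illustrative replacement for the deprecated global-dictionary table:
-- DICTIONARY_INCLUDE removed; the high-cardinality string columns use the
-- local dictionary instead.
create table if not exists test.detail_local_dict_only (
  `id` BIGINT,
  `imsi` STRING,
  `msisdn` STRING,
  `imei` STRING
  -- ... remaining columns elided, as in the document
)
TBLPROPERTIES (
  'LOCAL_DICTIONARY_ENABLE'='true',
  'LOCAL_DICTIONARY_INCLUDE'='IMSI,MSISDN,IMEI',
  'table_blocksize'='256'
)
```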
jackylk commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363074426

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated from Chinese), continuing the same section:

| Load batch | Time (s) detail_benchmark | Time (s) detail_loacl_dict | Time (s) detail_global_dict |
| --- | --- | --- | --- |
| 1 | 6.584 | 19.31 | 17.237 |
| ... (batches 2-167) | | | |
| 168 | 11.692 | 8.565 | 35.502 |
| Total | 1109.742 | 1876.579 | 5191.277 |

Query performance comparison

A set of point-query SQL statements was run against the tables with no dictionary, a local dictionary, and a global dictionary, and the results were recorded.

The query SQL statements used were:

```
select * from test.detail_benchmark where MSISDN="13900000000" limit 2000;
select * from test.detail_loacl_dict where MSISDN="13900000000" limit 2000;
select * from test.detail_global_dict where MSISDN="13900000000" limit 2000;
```

The recorded query times were:

| | detail_benchmark | detail_loacl_dict | detail_global_dict |
| --- | --- | --- | --- |
| Average of five queries (s) | 18.5 | 14.543 | 7.750 |

The results show that point queries on dictionary columns are fastest with a global dictionary, second fastest with a local dictionary, and slowest with no dictionary. Users can choose a dictionary according to their own loading and query requirements; because this case has strict requirements on ingestion timeliness, the no-dictionary option was chosen at table creation to avoid load delays.

### Choosing SORT_COLUMNS and SORT_SCOPE at table creation

For the SORT_COLUMNS configuration, see the CarbonData [documentation](http://carbondata.apache.org/ddl-of-carbondata.html#create-table). If SORT_COLUMNS is not specified, no sort columns are created by default; SORT_COLUMNS currently supports only the string, date, timestamp, short, int, long, byte, and boolean types. SORT_COLUMNS builds an MDK index on the specified columns, which helps query performance but slightly affects load performance, so set SORT_COLUMNS only on the columns that need it rather than on every column.

Once SORT_COLUMNS has been specified, the [SORT_SCOPE](http://carbondata.apache.org/ddl-of-carbondata.html#create-table) parameter controls the scope of sorting during data loading. The currently supported scopes are:

NO_SORT: loaded data is not sorted

LOCAL_SORT: loaded data is sorted locally

GLOBAL_SORT: loaded data is sorted globally; this improves point-query efficiency under high concurrency but has a larger impact on load performance

BATCH_SORT: rarely used

Review comment: please remove it, BATCH_SORT is deprecated
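The document loads the sample in 168 batches but does not show the load statement itself. Below is a minimal sketch of what one batch load might look like with CarbonData's LOAD DATA syntax, assuming CSV input; the HDFS path and the option values are illustrative assumptions, not taken from the document.

```sql
-- One load batch (illustrative); the document describes 168 such batches
-- per table, roughly one every 5 minutes.
LOAD DATA INPATH 'hdfs://hacluster/data/detail/2020-01-07/batch_001.csv'
INTO TABLE test.detail_benchmark1
OPTIONS('DELIMITER'=',', 'HEADER'='false')
```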
MarvinLitt commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363229910

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated): | Load batch | Time (s) detail_benchmark | Time (s) detail_loacl_dict | Time (s) detail_global_dict |

Review comment: done
MarvinLitt commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363229944

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated): BATCH_SORT: rarely used

Review comment: done
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-571105723 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1492/
ndwangsen commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-571400072 Very well written; it would be even better to use official site data.
qiuchenjian commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363565379

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated from Chinese):

## Test environment

| Cluster | CPU | vCore | Memory | Disk | Description |
| --- | --- | --- | --- | --- | --- |
| Hadoop cluster | Gold 6132 CPU @ 2.60GHz | 56 | 256GB | SATA | 2 namenodes, 6 datanodes; the query queue is allocated 1/6 of the resources, equivalent to one node |

Review comment: better to list the number of disks
qiuchenjian commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363565860

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated from Chinese, from the revised document):

From the statistics over the 168 batches, using a local dictionary adds about 70% to the total load time compared with no dictionary (1876.579 s vs. 1109.742 s), but query performance improves, as detailed below.

Query performance comparison. Point-query SQL:

```
select * from test.detail_benchmark where MSISDN="13900000000" limit 2000;
select * from test.detail_loacl_dict where MSISDN="13900000000" limit 2000;
```

Fuzzy-query SQL:

```
select * from test.detail_benchmark where MSISDN like "1361%" limit 2000;
select * from test.detail_loacl_dict where MSISDN like "1391%" limit 2000;
```

Recorded query times:

| Query case | Time (s) detail_benchmark | Time (s) detail_loacl_dict |
| --- | --- | --- |
| Average of five point queries | 18.5 | 14.543 |
| Average of five fuzzy queries | 10.832 | 4.118 |

The results show that queries with a local dictionary are considerably faster than without one: point-query performance improves by about 30%, and fuzzy-query performance more than doubles. Users can choose a dictionary according to their own loading and query requirements; because this case has strict requirements on ingestion timeliness, the no-dictionary option was chosen at table creation to avoid load delays, sacrificing some query performance.

### Choosing SORT_COLUMNS and SORT_SCOPE at table creation

If SORT_COLUMNS is not specified, no sort columns are created by default; SORT_COLUMNS currently supports only the string, date, timestamp, short, int, long, byte, and boolean types. SORT_COLUMNS builds an MDK index on the specified columns, which helps query performance but slightly affects load performance, so set it only on the columns that need it. Once SORT_COLUMNS is specified, the SORT_SCOPE parameter controls how data is sorted during loading: NO_SORT (no sorting), LOCAL_SORT (sorted locally), or GLOBAL_SORT (sorted globally; better for highly concurrent point queries, but with a larger impact on load performance).

The performance of NO_SORT, LOCAL_SORT, and GLOBAL_SORT with SORT_COLUMNS was analyzed. Create-table statements:

Global sort:

```
create table if not exists test.detail_global_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='GLOBAL_SORT')
```

Local sort:

```
create table if not exists test.detail_local_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='LOCAL_SORT')
```

No sort:

```
create table if not exists test.detail_no_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='NO_SORT')
```

The resulting load performance:

| Table | Rows | Load time (s) |
| --- | --- | --- |
| test.detail_benchmark | 1613149548 | 1109 |

Review comment: better to give a description of test.detail_benchmark
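The review asks for a description of test.detail_benchmark. Going by the document's earlier baseline definition, it appears to be the variant with no dictionary and no sort columns; the sketch below shows what that baseline DDL would look like under that assumption, with the column list abbreviated as in the document.

```sql
-- Assumed baseline table: no local/global dictionary and no SORT_COLUMNS,
-- so its load times reflect plain ingestion without sorting or dictionary encoding.
create table if not exists test.detail_benchmark (
  `id` BIGINT,
  `imsi` STRING,
  `msisdn` STRING,
  `imei` STRING
  -- ... remaining columns elided, as in the document
)
TBLPROPERTIES (
  'LOCAL_DICTIONARY_ENABLE'='false',
  'table_blocksize'='256'
)
```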
MarvinLitt commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363616752

File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md

Quoted context (translated): | Hadoop cluster | Gold 6132 CPU @ 2.60GHz | 56 | 256GB | SATA | 2 namenodes, 6 datanodes; the query queue is allocated 1/6 of the resources, equivalent to one node |

Review comment: okay, I added it. done
MarvinLitt commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363616804 ########## File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md ########## @@ -0,0 +1,517 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# 明细数据查询的典型CarbonData应用-点查+过滤条件 + +## 背景 + + 本文主要针对使用CarbonData在明细数据查询场景下如何配置建表、加载、查询时参数为主要阐述对象,指导用户在建表时选择合适的字典配置及SORT_COLUMNS、SORT_SCOPE配置。并且给出了一组不同配置是进行加载,查询的耗时情况,用户可以根据自己的业务特点和场景选择合适的参数。 + + 本文中数据表及查询的主要特点是:表记录数都比较大,列数比较多,大约在100-600行之间,表的大小从数千万到数百亿之间。在查询的时候主要是进行点查和过滤,没有汇聚计算,偶尔有关联维表的场景。数据入库采取分批入库的方式,周期约为5分钟,按天建表。查询时可能有不少于20的并发查询。 + +典型的查询的使用框架,其中第五个求sum仅为作性能对比。 + +1.点查: select * from table where id_a=‘ ’ limit 1000; + +2.模糊查询: select * from table where id_a like '1234%' limit 1000; + +3.求记录总数: select count(1) from table; + +4.求最大/最小值: select max(id_a), min(id_a) from table; + +5.求sum(仅为了做性能对比): select sum(id_a) from table; + +数据的特点,列主要是以int, bigint, string列构成,描述一些号码列,时间列,ID列等,无复杂数据类型。 + + + +## 测试环境 + +| 集群 | CPU | vCore | Memory | 硬盘 | 描述 | +| ---------- | -------------------- | ----- | ------ | ---- | ------------------------------------------------------------ | +| Hadoop集群 | Gold 6132 CPU@2.60GZ | 56 | 256GB | SATA | 2个namenode,6个datanode, 查询队列分配1/6的资源,等同于一个节点 | + + + +## 建表 Create table + +### 建表 Create table 时字典的选择 + + 建表时用户可以选择使用本地字典或者不使用字典。通常情况下使用本地字典数据加载时比较耗时,但是查询时较快; 而不使用本地字典时加载较快,查询不如使用本地字典。用户在实际使用时可以根据自身业务的诉求来选择合适的模式。关于CarbonData本地字典的具体描述可以参考官网的[文档](http://carbondata.apache.org/ddl-of-carbondata.html#create-table)。 + + 为此这里做了一个验证,分别对不使用字典,使用本地字典的数据表进行加载和点查操作,分析其加载和查询性能。 + +建表语句结构如下: + +1)不使用字典: + +``` +create table if not exists test.detail_benchmark1 +('id', BIGINT, 'imsi' STRING,'msisdn' STRING, `imei` STRING, ...) +TBLPROPERTIES ( 'LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256') +``` + +2) 使用本地字典: + +``` +create table if not exists test.detail_loacl_dict +('id', BIGINT, 'imsi' STRING,'msisdn' STRING, `imei` STRING, ...) 
+TBLPROPERTIES ( 'LOCAL_DICTIONARY_INCLUDE'='IMSI,MSISDN,IMEI','table_blocksize'='256') +``` + +使用16亿数据量样本作为表数据,分168个批次分别加载到如上表中,其加载性能如下: + +| 表 | 数据量 | 加载耗时 秒 | +| ---------------------- | ---------- | ----------- | +| test.detail_benchmark1 | 1613149548 | 1109 | +| test.detail_loacl_dict | 1613149548 | 1876 | + +下面列表记录了每一个批次的加载耗时 + +| 加载批次 | 耗时 秒 detail_benchmark | 耗时 秒 detail_loacl_dict | +| -------- | ------------------------- | ------------------------- | +| 1 | 6.584 | 19.31 | +| 2 | 10.255 | 21.406 | +| 3 | 5.079 | 21.619 | +| 4 | 5.004 | 21.538 | +| 5 | 10.877 | 18.191 | +| 6 | 7.868 | 17.457 | +| 7 | 5.295 | 20.919 | +| 8 | 4.43 | 17.048 | +| 9 | 5.373 | 18.584 | +| 10 | 6.074 | 17.668 | +| 11 | 5.889 | 19.553 | +| 12 | 4.8 | 14.819 | +| 13 | 4.972 | 12 | +| 14 | 5.144 | 11.909 | +| 15 | 5.403 | 18.428 | +| 16 | 5.314 | 19.536 | +| 17 | 4.519 | 22.373 | +| 18 | 9.327 | 24.507 | +| 19 | 5.277 | 7.219 | +| 20 | 10.882 | 6.552 | +| 21 | 4.74 | 11.911 | +| 22 | 5.38 | 11.908 | +| 23 | 9.046 | 7.076 | +| 24 | 4.399 | 6.391 | +| 25 | 5.079 | 7.697 | +| 26 | 4.177 | 6.114 | +| 27 | 4.441 | 6.349 | +| 28 | 4.951 | 7.034 | +| 29 | 5.09 | 13.404 | +| 30 | 4.286 | 7.455 | +| 31 | 7.821 | 13.776 | +| 32 | 4.623 | 7.025 | +| 33 | 9.304 | 7.273 | +| 34 | 4.806 | 13.23 | +| 35 | 5.469 | 6.562 | +| 36 | 10.045 | 6.581 | +| 37 | 4.619 | 11.716 | +| 38 | 5.807 | 5.939 | +| 39 | 9.101 | 6.531 | +| 40 | 4.968 | 6.675 | +| 41 | 4.654 | 6.389 | +| 42 | 4.763 | 6.774 | +| 43 | 4.155 | 6.203 | +| 44 | 4.672 | 6.288 | +| 45 | 5.052 | 12.845 | +| 46 | 4.812 | 6.185 | +| 47 | 7.941 | 6.653 | +| 48 | 5.401 | 5.775 | +| 49 | 4.684 | 5.776 | +| 50 | 4.698 | 10.646 | +| 51 | 4.604 | 6.359 | +| 52 | 8.843 | 5.899 | +| 53 | 4.785 | 11.284 | +| 54 | 4.969 | 7.116 | +| 55 | 7.634 | 6.291 | +| 56 | 4.424 | 6.534 | +| 57 | 5.086 | 11.372 | +| 58 | 4.461 | 5.613 | +| 59 | 7.809 | 19.169 | +| 60 | 4.852 | 7.34 | +| 61 | 8.486 | 7.182 | +| 62 | 4.412 | 10.598 | +| 63 | 4.168 | 17.974 | +| 64 | 4.287 | 6.998 | +| 65 | 8.867 | 6.526 | +| 66 | 4.37 | 6.457 | +| 67 | 4.23 | 12.335 | +| 68 | 5.28 | 13.638 | +| 69 | 7.945 | 7.597 | +| 70 | 4.647 | 41.695 | +| 71 | 4.429 | 22.057 | +| 72 | 8.659 | 11.863 | +| 73 | 10.062 | 17.561 | +| 74 | 5.514 | 8.366 | +| 75 | 7.608 | 14.011 | +| 76 | 4.586 | 7.205 | +| 77 | 7.692 | 6.602 | +| 78 | 5.319 | 6.103 | +| 79 | 4.884 | 18.043 | +| 80 | 4.13 | 6.89 | +| 81 | 8.511 | 11.561 | +| 82 | 4.983 | 7.434 | +| 83 | 8.486 | 5.779 | +| 84 | 5.158 | 6.033 | +| 85 | 4.224 | 6.359 | +| 86 | 4.444 | 6.736 | +| 87 | 4.546 | 6.397 | +| 88 | 4.799 | 6.904 | +| 89 | 3.967 | 7.01 | +| 90 | 4.437 | 5.612 | +| 91 | 4.541 | 10.904 | +| 92 | 4.649 | 6.64 | +| 93 | 4.665 | 6.508 | +| 94 | 62.925 | 17.063 | +| 95 | 4.335 | 7.104 | +| 96 | 7.366 | 7.699 | +| 97 | 5.028 | 6.881 | +| 98 | 5.097 | 5.803 | +| 99 | 4.345 | 6.36 | +| 100 | 4.251 | 9.569 | +| 101 | 4.671 | 6.121 | +| 102 | 7.432 | 5.491 | +| 103 | 4.508 | 6.412 | +| 104 | 4.866 | 6.104 | +| 105 | 4.58 | 12.243 | +| 106 | 4.94 | 6.362 | +| 107 | 7.842 | 5.697 | +| 108 | 4.332 | 11.934 | +| 109 | 4.646 | 29.469 | +| 110 | 4.534 | 14.249 | +| 111 | 9.206 | 13.585 | +| 112 | 4.802 | 10.479 | +| 113 | 4.119 | 8.238 | +| 114 | 4.139 | 9.334 | +| 115 | 4.588 | 8.438 | +| 116 | 4.537 | 8.621 | +| 117 | 4.688 | 7.742 | +| 118 | 4.055 | 8.27 | +| 119 | 4.703 | 6.958 | +| 120 | 4.51 | 23.409 | +| 121 | 4.178 | 9.167 | +| 122 | 9.22 | 6.772 | +| 123 | 4.359 | 12.718 | +| 124 | 4.926 | 7.611 | +| 125 | 4.487 | 14.572 | +| 126 | 4.571 | 13.044 | +| 127 | 8.108 
The table below records the load time of each batch:

| Load batch | Time (s) detail_benchmark | Time (s) detail_loacl_dict |
| ---------- | ------------------------- | -------------------------- |
| 1 | 6.584 | 19.31 |
| 2 | 10.255 | 21.406 |
| 3 | 5.079 | 21.619 |
| 4 | 5.004 | 21.538 |
| 5 | 10.877 | 18.191 |
| 6 | 7.868 | 17.457 |
| 7 | 5.295 | 20.919 |
| 8 | 4.43 | 17.048 |
| 9 | 5.373 | 18.584 |
| 10 | 6.074 | 17.668 |
| 11 | 5.889 | 19.553 |
| 12 | 4.8 | 14.819 |
| 13 | 4.972 | 12 |
| 14 | 5.144 | 11.909 |
| 15 | 5.403 | 18.428 |
| 16 | 5.314 | 19.536 |
| 17 | 4.519 | 22.373 |
| 18 | 9.327 | 24.507 |
| 19 | 5.277 | 7.219 |
| 20 | 10.882 | 6.552 |
| 21 | 4.74 | 11.911 |
| 22 | 5.38 | 11.908 |
| 23 | 9.046 | 7.076 |
| 24 | 4.399 | 6.391 |
| 25 | 5.079 | 7.697 |
| 26 | 4.177 | 6.114 |
| 27 | 4.441 | 6.349 |
| 28 | 4.951 | 7.034 |
| 29 | 5.09 | 13.404 |
| 30 | 4.286 | 7.455 |
| 31 | 7.821 | 13.776 |
| 32 | 4.623 | 7.025 |
| 33 | 9.304 | 7.273 |
| 34 | 4.806 | 13.23 |
| 35 | 5.469 | 6.562 |
| 36 | 10.045 | 6.581 |
| 37 | 4.619 | 11.716 |
| 38 | 5.807 | 5.939 |
| 39 | 9.101 | 6.531 |
| 40 | 4.968 | 6.675 |
| 41 | 4.654 | 6.389 |
| 42 | 4.763 | 6.774 |
| 43 | 4.155 | 6.203 |
| 44 | 4.672 | 6.288 |
| 45 | 5.052 | 12.845 |
| 46 | 4.812 | 6.185 |
| 47 | 7.941 | 6.653 |
| 48 | 5.401 | 5.775 |
| 49 | 4.684 | 5.776 |
| 50 | 4.698 | 10.646 |
| 51 | 4.604 | 6.359 |
| 52 | 8.843 | 5.899 |
| 53 | 4.785 | 11.284 |
| 54 | 4.969 | 7.116 |
| 55 | 7.634 | 6.291 |
| 56 | 4.424 | 6.534 |
| 57 | 5.086 | 11.372 |
| 58 | 4.461 | 5.613 |
| 59 | 7.809 | 19.169 |
| 60 | 4.852 | 7.34 |
| 61 | 8.486 | 7.182 |
| 62 | 4.412 | 10.598 |
| 63 | 4.168 | 17.974 |
| 64 | 4.287 | 6.998 |
| 65 | 8.867 | 6.526 |
| 66 | 4.37 | 6.457 |
| 67 | 4.23 | 12.335 |
| 68 | 5.28 | 13.638 |
| 69 | 7.945 | 7.597 |
| 70 | 4.647 | 41.695 |
| 71 | 4.429 | 22.057 |
| 72 | 8.659 | 11.863 |
| 73 | 10.062 | 17.561 |
| 74 | 5.514 | 8.366 |
| 75 | 7.608 | 14.011 |
| 76 | 4.586 | 7.205 |
| 77 | 7.692 | 6.602 |
| 78 | 5.319 | 6.103 |
| 79 | 4.884 | 18.043 |
| 80 | 4.13 | 6.89 |
| 81 | 8.511 | 11.561 |
| 82 | 4.983 | 7.434 |
| 83 | 8.486 | 5.779 |
| 84 | 5.158 | 6.033 |
| 85 | 4.224 | 6.359 |
| 86 | 4.444 | 6.736 |
| 87 | 4.546 | 6.397 |
| 88 | 4.799 | 6.904 |
| 89 | 3.967 | 7.01 |
| 90 | 4.437 | 5.612 |
| 91 | 4.541 | 10.904 |
| 92 | 4.649 | 6.64 |
| 93 | 4.665 | 6.508 |
| 94 | 62.925 | 17.063 |
| 95 | 4.335 | 7.104 |
| 96 | 7.366 | 7.699 |
| 97 | 5.028 | 6.881 |
| 98 | 5.097 | 5.803 |
| 99 | 4.345 | 6.36 |
| 100 | 4.251 | 9.569 |
| 101 | 4.671 | 6.121 |
| 102 | 7.432 | 5.491 |
| 103 | 4.508 | 6.412 |
| 104 | 4.866 | 6.104 |
| 105 | 4.58 | 12.243 |
| 106 | 4.94 | 6.362 |
| 107 | 7.842 | 5.697 |
| 108 | 4.332 | 11.934 |
| 109 | 4.646 | 29.469 |
| 110 | 4.534 | 14.249 |
| 111 | 9.206 | 13.585 |
| 112 | 4.802 | 10.479 |
| 113 | 4.119 | 8.238 |
| 114 | 4.139 | 9.334 |
| 115 | 4.588 | 8.438 |
| 116 | 4.537 | 8.621 |
| 117 | 4.688 | 7.742 |
| 118 | 4.055 | 8.27 |
| 119 | 4.703 | 6.958 |
| 120 | 4.51 | 23.409 |
| 121 | 4.178 | 9.167 |
| 122 | 9.22 | 6.772 |
| 123 | 4.359 | 12.718 |
| 124 | 4.926 | 7.611 |
| 125 | 4.487 | 14.572 |
| 126 | 4.571 | 13.044 |
| 127 | 8.108 | 8.975 |
| 128 | 4.472 | 18.712 |
| 129 | 4.66 | 45.553 |
| 130 | 7.791 | 12.255 |
| 131 | 8.569 | 13.897 |
| 132 | 5.671 | 12.073 |
| 133 | 9.43 | 21.305 |
| 134 | 4.015 | 6.391 |
| 135 | 8.961 | 47.173 |
| 136 | 5.955 | 9.856 |
| 137 | 7.095 | 6.996 |
| 138 | 4.641 | 11.802 |
| 139 | 5.156 | 6.072 |
| 140 | 7.757 | 6.365 |
| 141 | 4.71 | 10.944 |
| 142 | 4.346 | 7.061 |
| 143 | 8.663 | 6.19 |
| 144 | 4.641 | 13.49 |
| 145 | 4.864 | 6.598 |
| 146 | 7.932 | 7.13 |
| 147 | 4.56 | 11.359 |
| 148 | 3.941 | 6.113 |
| 149 | 7.729 | 10.987 |
| 150 | 4.7 | 7.347 |
| 151 | 8.089 | 7.309 |
| 152 | 4.016 | 12.875 |
| 153 | 4.468 | 7.873 |
| 154 | 9.136 | 6.559 |
| 155 | 5.69 | 16.854 |
| 156 | 4.253 | 6.499 |
| 157 | 8.223 | 5.983 |
| 158 | 4.733 | 16.357 |
| 159 | 4.347 | 15.032 |
| 160 | 4.276 | 13.735 |
| 161 | 17.02 | 13.094 |
| 162 | 4.765 | 22.683 |
| 163 | 5.941 | 12.033 |
| 164 | 66.877 | 11.795 |
| 165 | 15.101 | 6.563 |
| 166 | 8.748 | 9.21 |
| 167 | 5.412 | 10.025 |
| 168 | 11.692 | 8.565 |
| Total | 1109.742 | 1876.579 |

The data shows that loading is noticeably slower with the local dictionary enabled: over the 168 batches, loading with the local dictionary took about 70% longer than loading without it. Query performance, on the other hand, improves, as described in the next section.

Query performance comparison

A group of point-query SQL statements was run against the tables using a local dictionary, a global dictionary, and no dictionary, and the results were recorded.

The point-query SQL for each table:

```
select * from test.detail_benchmark where MSISDN="13900000000" limit 2000;
select * from test.detail_loacl_dict where MSISDN="13900000000" limit 2000;
```

The fuzzy (LIKE) query SQL for each table:

```
select * from test.detail_benchmark where MSISDN like "1361%" limit 2000;
select * from test.detail_loacl_dict where MSISDN like "1391%" limit 2000;
```

The recorded query times are as follows:

| Query case | Time (s) detail_benchmark | Time (s) detail_loacl_dict |
| --------------------------------- | ------------------------ | -------------------------- |
| Point query, average of five runs | 18.5 | 14.543 |
| Fuzzy query, average of five runs | 10.832 | 4.118 |

The results show that queries on the local-dictionary table are considerably faster than on the table without a dictionary: point queries improve by roughly 30%, and fuzzy queries are more than twice as fast.

When deciding whether to use a dictionary, users can weigh their own loading and query requirements. In this case the data had strict ingestion-latency requirements, so to avoid load delays the tables were created without a dictionary, trading away some query performance.
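For reference, the no-dictionary choice is made at table creation time. A minimal sketch in the spirit of this case study (the table name is hypothetical, the column list is abridged, and the STORED AS clause is assumed):

```sql
-- Hypothetical table with the local dictionary disabled, trading query speed for load latency.
CREATE TABLE IF NOT EXISTS test.detail_no_dict_example (
  `id` BIGINT,
  `imsi` STRING,
  `msisdn` STRING,
  `imei` STRING
  -- remaining columns omitted
)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256');
```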
### Choosing SORT_COLUMNS and SORT_SCOPE at CREATE TABLE time

For the SORT_COLUMNS configuration, see the CarbonData [documentation](http://carbondata.apache.org/ddl-of-carbondata.html#create-table). If SORT_COLUMNS is not specified, no sort columns are created by default. SORT_COLUMNS currently supports only the string, date, timestamp, short, int, long, byte and boolean types. SORT_COLUMNS builds an MDK (multi-dimensional key) index on the specified columns, which helps query performance but slightly affects load performance, so set SORT_COLUMNS only on the columns that need it rather than on all columns.

The [SORT_SCOPE](http://carbondata.apache.org/ddl-of-carbondata.html#create-table) property controls how data is sorted during loading once SORT_COLUMNS has been specified. The currently supported settings are:

NO_SORT: the loaded data is not sorted.

LOCAL_SORT: the loaded data is sorted locally.

GLOBAL_SORT: the loaded data is sorted globally, which improves point-query efficiency under high concurrency but has a larger impact on load performance.

The following analyzes the performance of "NO_SORT", "LOCAL_SORT" and "GLOBAL_SORT" when SORT_COLUMNS is used at table creation.

Create table statements:

```Global sort```

```
create table if not exists test.detail_global_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='GLOBAL_SORT')
```

```Local sort```

```
create table if not exists test.detail_local_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='LOCAL_SORT')
```

```No sort```

```
create table if not exists test.detail_no_sort
(`id` BIGINT, `imsi` STRING, `msisdn` STRING, `imei` STRING, ...)
STORED AS carbondata
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='false', 'table_blocksize'='256', 'SORT_COLUMNS'='msisdn,req_time_sec,req_succed_flag', 'SORT_SCOPE'='NO_SORT')
```

Their load performance was verified as follows:

| Table | Rows | Load time (s) |
| ----------------------- | ---------- | ------------- |
| test.detail_benchmark | 1613149548 | 1109 |

Review comment: Okay, I have added it.
CarbonDataQA1 commented on issue #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#issuecomment-571497276 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1502/
xuchuanyin commented on a change in pull request #3523: [doc_chinese_doc ]add Create table scene by day sortColumn effect analysis chinese doc …
URL: https://github.com/apache/carbondata/pull/3523#discussion_r363694813
##########
File path: docs/zh_cn/明细数据查询的典型Carbon应用-点查 过滤条件.md
##########

@@ -0,0 +1,525 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to you under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# 明细数据查询的典型CarbonData应用-点查+过滤条件

Review comment: Suggest changing the title to "CarbonData典型应用场景之明细数据查询:点查+过滤条件" (roughly, "A typical CarbonData application scenario, detail data query: point query + filter conditions"), and renaming the file in the same way.