[ANNOUNCE] Apache CarbonData 1.5.2 release


sraghunandan
Hi All,


The Apache CarbonData community is pleased to announce the release of
version 1.5.2 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments; in one of the
largest deployments it supports queries on a single table with 3 PB of data
(more than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.2/ and to send
feedback through the CarbonData user mailing list <[hidden email]>!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.2?

The intention of CarbonData 1.5.2 is to move closer to unified analytics. We
want to enable CarbonData files to be read from more engines and libraries to
support various use cases. To this end, we have enhanced and stabilized the
Presto integration and delivered the following features and improvements.

In this version of CarbonData, more than 68 JIRA tickets related to new
features, improvements, and bug fixes have been resolved. The following is a
summary.
CarbonData Core
Support Compaction for No-sort Load Segments

During data loading, if the sort scope is set to no-sort, loading
performance increases significantly because the data is not sorted and is
written as it is received. However, no-sort loading degrades query
performance, since indexes are not built on these segments. Compacting
these no-sort segments converts them into sorted segments and thereby
improves query performance, as indexes get generated. The ideal scenario
for this feature is when high-speed data loading matters more than query
performance until compaction is done.
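As a minimal Python sketch (not CarbonData's actual segment or index format), the trade-off above can be illustrated with two hypothetical no-sort loads: lookups on unsorted data require a full scan, while compacting them into one sorted segment enables an index-style binary search.

```python
import bisect
import random

def filter_unsorted(segment, value):
    # No index on a no-sort segment: every row must be scanned.
    return [row for row in segment if row == value]

def filter_sorted(segment, value):
    # Sorted data permits an index-style binary search (O(log n) seek).
    lo = bisect.bisect_left(segment, value)
    hi = bisect.bisect_right(segment, value)
    return segment[lo:hi]

def compact(segments):
    # "Compaction": merge no-sort segments into a single sorted segment,
    # after which indexed lookups become possible.
    merged = [row for seg in segments for row in seg]
    merged.sort()
    return merged

random.seed(0)
# Two no-sort loads: fast to write, kept in arrival order.
load1 = [random.randint(0, 999) for _ in range(1000)]
load2 = [random.randint(0, 999) for _ in range(1000)]

expected = sorted(filter_unsorted(load1, 42) + filter_unsorted(load2, 42))
compacted = compact([load1, load2])
assert filter_sorted(compacted, 42) == expected
```

Both lookups return the same rows; the difference is only in how much data must be touched to find them.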
Support Rename of Column Names

Column names can be renamed to reflect the business scenario or
conventions.
Support GZIP Compressor for CarbonData Files

GZIP compression is supported to compress each page of CarbonData file.
GZIP offers better compression ratio there by reducing the store size. On
the average GZIP compression reduces store size by 20-30% as compared to
Snappy compression. GZIP compression is supported to compress sort temp
files written during data loading. GZIP also has support from hardware.
Hence data loading performance would increase on those machines where GZIP
is supported natively from hardware.
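The effect of page-level GZIP compression can be sketched with Python's standard-library gzip module on a mock column page (illustrative only; the data and sizes are made up, and actual CarbonData pages and ratios will differ):

```python
import gzip

# A mock column "page": repetitive column values compress very well.
page = ("city_0,city_1,city_2," * 2000).encode("utf-8")

compressed = gzip.compress(page)
ratio = len(compressed) / len(page)

assert len(compressed) < len(page)
print(f"raw={len(page)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2%}")
```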
Performance Improvements
Support Range Partitioned Sort During Data Load

Global sort during data loads ensures the data is entirely sorted and hence
groups identical values onto a particular node/machine. This helps optimize
Spark scan performance and also increases concurrency. The drawback of
global sort is that it is very slow, since the data has to be globally
sorted (heavy shuffle). Local sort, on the other hand, partitions the data
across multiple nodes/machines and ensures the data local to each
node/machine is sorted. This improves data loading performance, but query
performance degrades a bit, as more Spark tasks have to be launched to scan
the data. Range sort splits the data based on value ranges and loads using
local sort. This gives balanced performance for both load and query.
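The idea behind range-partitioned sort can be sketched in a few lines of Python (a simplified single-process model, not CarbonData's distributed implementation; the bounds and data are hypothetical): rows are split into value ranges, each range is sorted locally, and concatenating the ranges yields globally ordered data without a full global shuffle-and-sort.

```python
import bisect
import random

def range_partition_sort(rows, bounds):
    # Route each row to a partition by its value range (one partition
    # per node in the distributed case), then sort each partition
    # locally. Partitions are ordered by range, so concatenating them
    # gives globally sorted data.
    partitions = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        partitions[bisect.bisect_right(bounds, row)].append(row)
    for part in partitions:
        part.sort()  # local sort, done in parallel per node
    return partitions

random.seed(0)
rows = [random.randint(0, 999) for _ in range(10_000)]
parts = range_partition_sort(rows, bounds=[250, 500, 750])

flattened = [row for part in parts for row in part]
assert flattened == sorted(rows)
```

Each partition only needed a local sort, yet the concatenation is globally sorted, which is what gives the balanced load/query behavior described above.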
Other Improvements
Presto Enhancements

CarbonData implemented features to better integrate with Presto. Presto can
now recognise CarbonData as a native format, and many bugs were fixed to
enhance stability.
Support Map Data Type through DDL

Version 1.5.0 supported adding the Map data type through the CarbonData
SDK. This version supports adding the Map data type through DDL.
Behaviour Change

   1. If the user doesn't specify sort columns during table creation, the
   default sort scope is set to no-sort during data loading
   2. The default complex-value delimiters are changed from '$' and ':' to
   '\001' and '\002' respectively
   3. Inverted index generation is disabled by default
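How the two delimiter levels of behaviour change 2 interact can be sketched in Python (the row string and field names here are made up for illustration): the level-1 delimiter separates complex-value elements, and the level-2 delimiter separates fields inside each element.

```python
# New default delimiters (previously '$' and ':').
L1, L2 = "\001", "\002"   # level-1: between elements; level-2: inside an element

# A hypothetical array-of-struct value as it would appear in load input.
raw = L1.join([L2.join(["alice", "30"]), L2.join(["bob", "25"])])

# Split on level-1 first, then on level-2 within each element.
parsed = [element.split(L2) for element in raw.split(L1)]
assert parsed == [["alice", "30"], ["bob", "25"]]
```

The non-printing \001/\002 characters make collisions with real data far less likely than the old '$' and ':' defaults.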

New Configuration Parameters
Configuration Name                  Default Value   Range
carbon.table.load.sort.scope        LOCAL_SORT      LOCAL_SORT, NO_SORT, GLOBAL_SORT, BATCH_SORT
carbon.range.column.scale.factor    3               1-300


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321
Sub-task

   - [CARBONDATA-2755
   <https://issues.apache.org/jira/browse/CARBONDATA-2755>] - Compaction of
   Complex DataType (STRUCT AND ARRAY)
   - [CARBONDATA-2838
   <https://issues.apache.org/jira/browse/CARBONDATA-2838>] - Add SDV test
   cases for Local Dictionary Support
   - [CARBONDATA-3017
   <https://issues.apache.org/jira/browse/CARBONDATA-3017>] - Create DDL
   Support for Map Type
   - [CARBONDATA-3073
   <https://issues.apache.org/jira/browse/CARBONDATA-3073>] - Support other
   interface in carbon writer of C++ SDK
   - [CARBONDATA-3160
   <https://issues.apache.org/jira/browse/CARBONDATA-3160>] - Compaction
   support with MAP data type
   - [CARBONDATA-3182
   <https://issues.apache.org/jira/browse/CARBONDATA-3182>] - Fix SDV
   TestCase Failures in Delimiters
   - [CARBONDATA-3259
   <https://issues.apache.org/jira/browse/CARBONDATA-3259>] - Documentation
   Update

Bug

   - [CARBONDATA-3080
   <https://issues.apache.org/jira/browse/CARBONDATA-3080>] - Supporting
   local dictionary enable by default for SDK
   - [CARBONDATA-3102
   <https://issues.apache.org/jira/browse/CARBONDATA-3102>] - There are
   some error when use thriftServer and beeline
   - [CARBONDATA-3116
   <https://issues.apache.org/jira/browse/CARBONDATA-3116>] - set
   carbon.query.directQueryOnDataMap.enabled=true not working
   - [CARBONDATA-3127
   <https://issues.apache.org/jira/browse/CARBONDATA-3127>] - Hive module
   test case has been commented off,can' t run.
   - [CARBONDATA-3147
   <https://issues.apache.org/jira/browse/CARBONDATA-3147>] - Preaggregate
   dataload fails in case of concurrent load in some cases
   - [CARBONDATA-3153
   <https://issues.apache.org/jira/browse/CARBONDATA-3153>] - Change of
   Complex Delimiters
   - [CARBONDATA-3154
   <https://issues.apache.org/jira/browse/CARBONDATA-3154>] - Fix spark-2.1
   test error
   - [CARBONDATA-3159
   <https://issues.apache.org/jira/browse/CARBONDATA-3159>] - Issue with
   SDK Write when empty array is given
   - [CARBONDATA-3162
   <https://issues.apache.org/jira/browse/CARBONDATA-3162>] - Range filters
   doesn't remove null values for no_sort direct dictionary dimension columns.
   - [CARBONDATA-3165
   <https://issues.apache.org/jira/browse/CARBONDATA-3165>] - Query of
   BloomFilter java.lang.NullPointerException
   - [CARBONDATA-3174
   <https://issues.apache.org/jira/browse/CARBONDATA-3174>] - Fix trailing
   space issue with varchar column for SDK
   - [CARBONDATA-3181
   <https://issues.apache.org/jira/browse/CARBONDATA-3181>] -
   IllegalAccessError for BloomFilter.bits when bloom_compress is false
   - [CARBONDATA-3184
   <https://issues.apache.org/jira/browse/CARBONDATA-3184>] - Fix DataLoad
   failure with "using carbondata"
   - [CARBONDATA-3188
   <https://issues.apache.org/jira/browse/CARBONDATA-3188>] - Create carbon
   table as hive understandable metastore table needed by Presto and Hive
   - [CARBONDATA-3196
   <https://issues.apache.org/jira/browse/CARBONDATA-3196>] - Compaction
   Failing for Complex datatypes with Dictionary Include
   - [CARBONDATA-3203
   <https://issues.apache.org/jira/browse/CARBONDATA-3203>] - Compaction
   failing for table which is retstructured
   - [CARBONDATA-3205
   <https://issues.apache.org/jira/browse/CARBONDATA-3205>] - Fix Get Local
   Dictionary for empty Array of Struct
   - [CARBONDATA-3212
   <https://issues.apache.org/jira/browse/CARBONDATA-3212>] - Select * is
   failing with java.lang.NegativeArraySizeException in SDK flow
   - [CARBONDATA-3216
   <https://issues.apache.org/jira/browse/CARBONDATA-3216>] - There are
   some bugs in CSDK
   - [CARBONDATA-3221
   <https://issues.apache.org/jira/browse/CARBONDATA-3221>] - SDK don't
   support read multiple file from S3
   - [CARBONDATA-3222
   <https://issues.apache.org/jira/browse/CARBONDATA-3222>] - Fix dataload
   failure after creation of preaggregate datamap on main table with
   long_string_columns
   - [CARBONDATA-3224
   <https://issues.apache.org/jira/browse/CARBONDATA-3224>] - SDK should
   validate the improper value
   - [CARBONDATA-3233
   <https://issues.apache.org/jira/browse/CARBONDATA-3233>] - JVM is
   getting crashed during dataload while compressing in snappy
   - [CARBONDATA-3238
   <https://issues.apache.org/jira/browse/CARBONDATA-3238>] - Throw
   StackOverflowError exception using MV datamap
   - [CARBONDATA-3239
   <https://issues.apache.org/jira/browse/CARBONDATA-3239>] - Throwing
   ArrayIndexOutOfBoundsException in DataSkewRangePartitioner
   - [CARBONDATA-3243
   <https://issues.apache.org/jira/browse/CARBONDATA-3243>] -
   CarbonTable.getSortScope() is not considering session property
   CARBON.TABLE.LOAD.SORT.SCOPE
   - [CARBONDATA-3246
   <https://issues.apache.org/jira/browse/CARBONDATA-3246>] - SDK reader
   fails if vectorReader is false for concurrent read scenario and batch size
   is zero.
   - [CARBONDATA-3260
   <https://issues.apache.org/jira/browse/CARBONDATA-3260>] - Broadcast
   join is not properly in carbon with spark-2.3.2
   - [CARBONDATA-3262
   <https://issues.apache.org/jira/browse/CARBONDATA-3262>] - Failure to
   write merge index file results in merged segment being deleted when cleanup
   happens
   - [CARBONDATA-3265
   <https://issues.apache.org/jira/browse/CARBONDATA-3265>] - Memory Leak
   and Low Query Performance Issues in Range Partition
   - [CARBONDATA-3267
   <https://issues.apache.org/jira/browse/CARBONDATA-3267>] - Data loading
   is failing with OOM using range sort
   - [CARBONDATA-3268
   <https://issues.apache.org/jira/browse/CARBONDATA-3268>] - Query on
   Varchar showing as Null in Presto
   - [CARBONDATA-3269
   <https://issues.apache.org/jira/browse/CARBONDATA-3269>] - Range_column
   throwing ArrayIndexOutOfBoundsException when using KryoSerializer
   - [CARBONDATA-3273
   <https://issues.apache.org/jira/browse/CARBONDATA-3273>] - For table
   without SORT_COLUMNS, Loading data is showing SORT_SCOPE=LOCAL_SORT instead
   of NO_SORT
   - [CARBONDATA-3275
   <https://issues.apache.org/jira/browse/CARBONDATA-3275>] - There are 4
   errors in CI after PR 3094 merged
   - [CARBONDATA-3282
   <https://issues.apache.org/jira/browse/CARBONDATA-3282>] - presto carbon
   doesn't work with Hadoop conf in cluster.

New Feature

   - [CARBONDATA-45 <https://issues.apache.org/jira/browse/CARBONDATA-45>]
   - Support MAP type
   - [CARBONDATA-3149
   <https://issues.apache.org/jira/browse/CARBONDATA-3149>] - Support alter
   table column rename
   - [CARBONDATA-3194
   <https://issues.apache.org/jira/browse/CARBONDATA-3194>] - Support Hive
   Metastore in Presto CarbonData.

Improvement

   - [CARBONDATA-3023
   <https://issues.apache.org/jira/browse/CARBONDATA-3023>] - Alter add
   column issue with SORT_COLUMNS
   - [CARBONDATA-3133
   <https://issues.apache.org/jira/browse/CARBONDATA-3133>] - Update
   carbondata build document
   - [CARBONDATA-3142
   <https://issues.apache.org/jira/browse/CARBONDATA-3142>] - The names of
   threads created by CarbonThreadFactory are all the same
   - [CARBONDATA-3157
   <https://issues.apache.org/jira/browse/CARBONDATA-3157>] - Integrate
   carbon lazy loading to presto carbon integration
   - [CARBONDATA-3158
   <https://issues.apache.org/jira/browse/CARBONDATA-3158>] - support
   presto-carbon to read sdk cabron files
   - [CARBONDATA-3166
   <https://issues.apache.org/jira/browse/CARBONDATA-3166>] - Changes in
   Document and Displaying Carbon Column Compressor used in Describe Formatted
   Command
   - [CARBONDATA-3176
   <https://issues.apache.org/jira/browse/CARBONDATA-3176>] - Optimize
   quick-start-guide documentation
   - [CARBONDATA-3215
   <https://issues.apache.org/jira/browse/CARBONDATA-3215>] - Optimize the
   documentation
   - [CARBONDATA-3219
   <https://issues.apache.org/jira/browse/CARBONDATA-3219>] - support range
   partition the input data for local_sort/global sort data loading
   - [CARBONDATA-3220
   <https://issues.apache.org/jira/browse/CARBONDATA-3220>] - Should
   support presto to read stream segment data
   - [CARBONDATA-3230
   <https://issues.apache.org/jira/browse/CARBONDATA-3230>] - Add ALTER
   test case with datasource for using parquet and carbon
   - [CARBONDATA-3241
   <https://issues.apache.org/jira/browse/CARBONDATA-3241>] - Refactor the
   requested scan columns and the projection columns
   - [CARBONDATA-3242
   <https://issues.apache.org/jira/browse/CARBONDATA-3242>] - Range_Column
   should be table level property
   - [CARBONDATA-3253
   <https://issues.apache.org/jira/browse/CARBONDATA-3253>] - Remove test
   case of bloom datamap using search mode
   - [CARBONDATA-3261
   <https://issues.apache.org/jira/browse/CARBONDATA-3261>] - support float
   and byte reading from presto

Test

   - [CARBONDATA-3141
   <https://issues.apache.org/jira/browse/CARBONDATA-3141>] - Remove Carbon
   Table Detail Test Case
   - [CARBONDATA-3175
   <https://issues.apache.org/jira/browse/CARBONDATA-3175>] - Fix Testcase
   failures in complex delimiters



Regards
Raghunandan