[ANNOUNCE] Apache CarbonData 1.5.2 release


sraghunandan
Hi All,


The Apache CarbonData community is pleased to announce the release of
version 1.5.2 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments; in one of the
largest deployments it supports queries on a single table with 3 PB of data
(more than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.2/ and to send
feedback through the CarbonData user mailing list <[hidden email]>!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.2?

The intention of CarbonData 1.5.2 is to move closer to unified analytics. We
want to enable CarbonData files to be read from more engines and libraries to
support various use cases. To this end, we have enhanced and stabilized the
Presto integration and delivered the following features and improvements.

In this version of CarbonData, more than 68 JIRA tickets related to new
features, improvements, and bug fixes have been resolved. The following is a
summary.
CarbonData Core
Support Compaction for No-sort Load Segments

During data loading, if the sort scope is set to no-sort, loading
performance increases significantly because the data is not sorted and is
written as it is received. However, no-sort loading degrades query
performance, since indexes are not built on these segments. Compacting
these no-sort segments converts them into sorted segments and thereby
improves query performance, as indexes get generated. The ideal scenario
for this feature is when high-speed data loading matters more than query
performance until compaction is done.
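As a minimal Python sketch (not CarbonData's actual segment or index format), the trade-off above can be illustrated with two hypothetical no-sort loads: lookups on unsorted data require a full scan, while compacting them into one sorted segment enables an index-style binary search.

```python
import bisect
import random

def filter_unsorted(segment, value):
    # No index on a no-sort segment: every row must be scanned.
    return [row for row in segment if row == value]

def filter_sorted(segment, value):
    # Sorted data permits an index-style binary search (O(log n) seek).
    lo = bisect.bisect_left(segment, value)
    hi = bisect.bisect_right(segment, value)
    return segment[lo:hi]

def compact(segments):
    # "Compaction": merge no-sort segments into a single sorted segment,
    # after which indexed lookups become possible.
    merged = [row for seg in segments for row in seg]
    merged.sort()
    return merged

random.seed(0)
# Two no-sort loads: fast to write, kept in arrival order.
load1 = [random.randint(0, 999) for _ in range(1000)]
load2 = [random.randint(0, 999) for _ in range(1000)]

expected = sorted(filter_unsorted(load1, 42) + filter_unsorted(load2, 42))
compacted = compact([load1, load2])
assert filter_sorted(compacted, 42) == expected
```

Both lookups return the same rows; the difference is only in how much data must be touched to find them.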
Support Rename of Column Names

Column names can be renamed to reflect the business scenario or
conventions.
Support GZIP Compressor for CarbonData Files

GZIP compression is supported to compress each page of CarbonData file.
GZIP offers better compression ratio there by reducing the store size. On
the average GZIP compression reduces store size by 20-30% as compared to
Snappy compression. GZIP compression is supported to compress sort temp
files written during data loading. GZIP also has support from hardware.
Hence data loading performance would increase on those machines where GZIP
is supported natively from hardware.
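The effect of page-level GZIP compression can be sketched with Python's standard-library gzip module on a mock column page (illustrative only; the data and sizes are made up, and actual CarbonData pages and ratios will differ):

```python
import gzip

# A mock column "page": repetitive column values compress very well.
page = ("city_0,city_1,city_2," * 2000).encode("utf-8")

compressed = gzip.compress(page)
ratio = len(compressed) / len(page)

assert len(compressed) < len(page)
print(f"raw={len(page)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2%}")
```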
Performance Improvements
Support Range Partitioned Sort During Data Load

Global sort during data loads ensures the data is entirely sorted and hence
groups identical values onto a particular node/machine. This helps optimize
Spark scan performance and also increases concurrency. The drawback of
global sort is that it is very slow, since the data has to be globally
sorted (heavy shuffle). Local sort, on the other hand, partitions the data
across multiple nodes/machines and ensures the data local to each
node/machine is sorted. This improves data loading performance, but query
performance degrades a bit, as more Spark tasks have to be launched to scan
the data. Range sort splits the data based on value ranges and loads using
local sort. This gives balanced performance for both load and query.
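The idea behind range-partitioned sort can be sketched in a few lines of Python (a simplified single-process model, not CarbonData's distributed implementation; the bounds and data are hypothetical): rows are split into value ranges, each range is sorted locally, and concatenating the ranges yields globally ordered data without a full global shuffle-and-sort.

```python
import bisect
import random

def range_partition_sort(rows, bounds):
    # Route each row to a partition by its value range (one partition
    # per node in the distributed case), then sort each partition
    # locally. Partitions are ordered by range, so concatenating them
    # gives globally sorted data.
    partitions = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        partitions[bisect.bisect_right(bounds, row)].append(row)
    for part in partitions:
        part.sort()  # local sort, done in parallel per node
    return partitions

random.seed(0)
rows = [random.randint(0, 999) for _ in range(10_000)]
parts = range_partition_sort(rows, bounds=[250, 500, 750])

flattened = [row for part in parts for row in part]
assert flattened == sorted(rows)
```

Each partition only needed a local sort, yet the concatenation is globally sorted, which is what gives the balanced load/query behavior described above.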
Other Improvements
Presto Enhancements

CarbonData implemented features to better integrate with Presto. Presto can
now recognise CarbonData as a native format, and many bugs were fixed to
enhance stability.
Support Map Data Type through DDL

Version 1.5.0 supported adding the Map data type through the CarbonData
SDK. This version supports adding the Map data type through DDL.
Behaviour Change

   1. If the user doesn't specify sort columns during table creation, the
   default sort scope is set to no-sort during data loading
   2. The default complex-value delimiters are changed from '$' and ':' to
   '\001' and '\002' respectively
   3. Inverted index generation is disabled by default
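How the two delimiter levels of behaviour change 2 interact can be sketched in Python (the row string and field names here are made up for illustration): the level-1 delimiter separates complex-value elements, and the level-2 delimiter separates fields inside each element.

```python
# New default delimiters (previously '$' and ':').
L1, L2 = "\001", "\002"   # level-1: between elements; level-2: inside an element

# A hypothetical array-of-struct value as it would appear in load input.
raw = L1.join([L2.join(["alice", "30"]), L2.join(["bob", "25"])])

# Split on level-1 first, then on level-2 within each element.
parsed = [element.split(L2) for element in raw.split(L1)]
assert parsed == [["alice", "30"], ["bob", "25"]]
```

The non-printing \001/\002 characters make collisions with real data far less likely than the old '$' and ':' defaults.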

New Configuration Parameters
Configuration Name                  Default Value   Range
carbon.table.load.sort.scope        LOCAL_SORT      LOCAL_SORT, NO_SORT, GLOBAL_SORT, BATCH_SORT
carbon.range.column.scale.factor    3               1-300


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321
Sub-task

   - [CARBONDATA-2755
   <https://issues.apache.org/jira/browse/CARBONDATA-2755>] - Compaction of
   Complex DataType (STRUCT AND ARRAY)
   - [CARBONDATA-2838
   <https://issues.apache.org/jira/browse/CARBONDATA-2838>] - Add SDV test
   cases for Local Dictionary Support
   - [CARBONDATA-3017
   <https://issues.apache.org/jira/browse/CARBONDATA-3017>] - Create DDL
   Support for Map Type
   - [CARBONDATA-3073
   <https://issues.apache.org/jira/browse/CARBONDATA-3073>] - Support other
   interface in carbon writer of C++ SDK
   - [CARBONDATA-3160
   <https://issues.apache.org/jira/browse/CARBONDATA-3160>] - Compaction
   support with MAP data type
   - [CARBONDATA-3182
   <https://issues.apache.org/jira/browse/CARBONDATA-3182>] - Fix SDV
   TestCase Failures in Delimiters
   - [CARBONDATA-3259
   <https://issues.apache.org/jira/browse/CARBONDATA-3259>] - Documentation
   Update

Bug

   - [CARBONDATA-3080
   <https://issues.apache.org/jira/browse/CARBONDATA-3080>] - Supporting
   local dictionary enable by default for SDK
   - [CARBONDATA-3102
   <https://issues.apache.org/jira/browse/CARBONDATA-3102>] - There are
   some error when use thriftServer and beeline
   - [CARBONDATA-3116
   <https://issues.apache.org/jira/browse/CARBONDATA-3116>] - set
   carbon.query.directQueryOnDataMap.enabled=true not working
   - [CARBONDATA-3127
   <https://issues.apache.org/jira/browse/CARBONDATA-3127>] - Hive module
   test case has been commented off,can' t run.
   - [CARBONDATA-3147
   <https://issues.apache.org/jira/browse/CARBONDATA-3147>] - Preaggregate
   dataload fails in case of concurrent load in some cases
   - [CARBONDATA-3153
   <https://issues.apache.org/jira/browse/CARBONDATA-3153>] - Change of
   Complex Delimiters
   - [CARBONDATA-3154
   <https://issues.apache.org/jira/browse/CARBONDATA-3154>] - Fix spark-2.1
   test error
   - [CARBONDATA-3159
   <https://issues.apache.org/jira/browse/CARBONDATA-3159>] - Issue with
   SDK Write when empty array is given
   - [CARBONDATA-3162
   <https://issues.apache.org/jira/browse/CARBONDATA-3162>] - Range filters
   doesn't remove null values for no_sort direct dictionary dimension columns.
   - [CARBONDATA-3165
   <https://issues.apache.org/jira/browse/CARBONDATA-3165>] - Query of
   BloomFilter java.lang.NullPointerException
   - [CARBONDATA-3174
   <https://issues.apache.org/jira/browse/CARBONDATA-3174>] - Fix trailing
   space issue with varchar column for SDK
   - [CARBONDATA-3181
   <https://issues.apache.org/jira/browse/CARBONDATA-3181>] -
   IllegalAccessError for BloomFilter.bits when bloom_compress is false
   - [CARBONDATA-3184
   <https://issues.apache.org/jira/browse/CARBONDATA-3184>] - Fix DataLoad
   failure with "using carbondata"
   - [CARBONDATA-3188
   <https://issues.apache.org/jira/browse/CARBONDATA-3188>] - Create carbon
   table as hive understandable metastore table needed by Presto and Hive
   - [CARBONDATA-3196
   <https://issues.apache.org/jira/browse/CARBONDATA-3196>] - Compaction
   Failing for Complex datatypes with Dictionary Include
   - [CARBONDATA-3203
   <https://issues.apache.org/jira/browse/CARBONDATA-3203>] - Compaction
   failing for table which is retstructured
   - [CARBONDATA-3205
   <https://issues.apache.org/jira/browse/CARBONDATA-3205>] - Fix Get Local
   Dictionary for empty Array of Struct
   - [CARBONDATA-3212
   <https://issues.apache.org/jira/browse/CARBONDATA-3212>] - Select * is
   failing with java.lang.NegativeArraySizeException in SDK flow
   - [CARBONDATA-3216
   <https://issues.apache.org/jira/browse/CARBONDATA-3216>] - There are
   some bugs in CSDK
   - [CARBONDATA-3221
   <https://issues.apache.org/jira/browse/CARBONDATA-3221>] - SDK don't
   support read multiple file from S3
   - [CARBONDATA-3222
   <https://issues.apache.org/jira/browse/CARBONDATA-3222>] - Fix dataload
   failure after creation of preaggregate datamap on main table with
   long_string_columns
   - [CARBONDATA-3224
   <https://issues.apache.org/jira/browse/CARBONDATA-3224>] - SDK should
   validate the improper value
   - [CARBONDATA-3233
   <https://issues.apache.org/jira/browse/CARBONDATA-3233>] - JVM is
   getting crashed during dataload while compressing in snappy
   - [CARBONDATA-3238
   <https://issues.apache.org/jira/browse/CARBONDATA-3238>] - Throw
   StackOverflowError exception using MV datamap
   - [CARBONDATA-3239
   <https://issues.apache.org/jira/browse/CARBONDATA-3239>] - Throwing
   ArrayIndexOutOfBoundsException in DataSkewRangePartitioner
   - [CARBONDATA-3243
   <https://issues.apache.org/jira/browse/CARBONDATA-3243>] -
   CarbonTable.getSortScope() is not considering session property
   CARBON.TABLE.LOAD.SORT.SCOPE
   - [CARBONDATA-3246
   <https://issues.apache.org/jira/browse/CARBONDATA-3246>] - SDK reader
   fails if vectorReader is false for concurrent read scenario and batch size
   is zero.
   - [CARBONDATA-3260
   <https://issues.apache.org/jira/browse/CARBONDATA-3260>] - Broadcast
   join is not properly in carbon with spark-2.3.2
   - [CARBONDATA-3262
   <https://issues.apache.org/jira/browse/CARBONDATA-3262>] - Failure to
   write merge index file results in merged segment being deleted when cleanup
   happens
   - [CARBONDATA-3265
   <https://issues.apache.org/jira/browse/CARBONDATA-3265>] - Memory Leak
   and Low Query Performance Issues in Range Partition
   - [CARBONDATA-3267
   <https://issues.apache.org/jira/browse/CARBONDATA-3267>] - Data loading
   is failing with OOM using range sort
   - [CARBONDATA-3268
   <https://issues.apache.org/jira/browse/CARBONDATA-3268>] - Query on
   Varchar showing as Null in Presto
   - [CARBONDATA-3269
   <https://issues.apache.org/jira/browse/CARBONDATA-3269>] - Range_column
   throwing ArrayIndexOutOfBoundsException when using KryoSerializer
   - [CARBONDATA-3273
   <https://issues.apache.org/jira/browse/CARBONDATA-3273>] - For table
   without SORT_COLUMNS, Loading data is showing SORT_SCOPE=LOCAL_SORT instead
   of NO_SORT
   - [CARBONDATA-3275
   <https://issues.apache.org/jira/browse/CARBONDATA-3275>] - There are 4
   errors in CI after PR 3094 merged
   - [CARBONDATA-3282
   <https://issues.apache.org/jira/browse/CARBONDATA-3282>] - presto carbon
   doesn't work with Hadoop conf in cluster.

New Feature

   - [CARBONDATA-45 <https://issues.apache.org/jira/browse/CARBONDATA-45>]
   - Support MAP type
   - [CARBONDATA-3149
   <https://issues.apache.org/jira/browse/CARBONDATA-3149>] - Support alter
   table column rename
   - [CARBONDATA-3194
   <https://issues.apache.org/jira/browse/CARBONDATA-3194>] - Support Hive
   Metastore in Presto CarbonData.

Improvement

   - [CARBONDATA-3023
   <https://issues.apache.org/jira/browse/CARBONDATA-3023>] - Alter add
   column issue with SORT_COLUMNS
   - [CARBONDATA-3133
   <https://issues.apache.org/jira/browse/CARBONDATA-3133>] - Update
   carbondata build document
   - [CARBONDATA-3142
   <https://issues.apache.org/jira/browse/CARBONDATA-3142>] - The names of
   threads created by CarbonThreadFactory are all the same
   - [CARBONDATA-3157
   <https://issues.apache.org/jira/browse/CARBONDATA-3157>] - Integrate
   carbon lazy loading to presto carbon integration
   - [CARBONDATA-3158
   <https://issues.apache.org/jira/browse/CARBONDATA-3158>] - support
   presto-carbon to read sdk cabron files
   - [CARBONDATA-3166
   <https://issues.apache.org/jira/browse/CARBONDATA-3166>] - Changes in
   Document and Displaying Carbon Column Compressor used in Describe Formatted
   Command
   - [CARBONDATA-3176
   <https://issues.apache.org/jira/browse/CARBONDATA-3176>] - Optimize
   quick-start-guide documentation
   - [CARBONDATA-3215
   <https://issues.apache.org/jira/browse/CARBONDATA-3215>] - Optimize the
   documentation
   - [CARBONDATA-3219
   <https://issues.apache.org/jira/browse/CARBONDATA-3219>] - support range
   partition the input data for local_sort/global sort data loading
   - [CARBONDATA-3220
   <https://issues.apache.org/jira/browse/CARBONDATA-3220>] - Should
   support presto to read stream segment data
   - [CARBONDATA-3230
   <https://issues.apache.org/jira/browse/CARBONDATA-3230>] - Add ALTER
   test case with datasource for using parquet and carbon
   - [CARBONDATA-3241
   <https://issues.apache.org/jira/browse/CARBONDATA-3241>] - Refactor the
   requested scan columns and the projection columns
   - [CARBONDATA-3242
   <https://issues.apache.org/jira/browse/CARBONDATA-3242>] - Range_Column
   should be table level property
   - [CARBONDATA-3253
   <https://issues.apache.org/jira/browse/CARBONDATA-3253>] - Remove test
   case of bloom datamap using search mode
   - [CARBONDATA-3261
   <https://issues.apache.org/jira/browse/CARBONDATA-3261>] - support float
   and byte reading from presto

Test

   - [CARBONDATA-3141
   <https://issues.apache.org/jira/browse/CARBONDATA-3141>] - Remove Carbon
   Table Detail Test Case
   - [CARBONDATA-3175
   <https://issues.apache.org/jira/browse/CARBONDATA-3175>] - Fix Testcase
   failures in complex delimiters



Regards
Raghunandan