+1,
If the CarbonData SDK can load data from Parquet, ORC, CSV, Avro and
JSON files, it will be more convenient for users to use CarbonData. It
saves every user from having to write code to parse each file format and
convert it to the CarbonData format.
The CarbonData SDK can refer to the spark-sql implementation, but it
should not depend on Spark for this feature, because the SDK is usually
used in environments without Spark. CarbonData can instead integrate the
Parquet, JSON, and CSV parser SDKs to implement it.
For API design:
1. For loading Parquet, Avro, JSON, and CSV files:
(1) It's better to support loading data multiple times after creating the
CarbonWriter.
(2) It's better to support schema mapping. CarbonData and Parquet may have
different schemas, e.g. different column names: a column named col1 in
carbon may be named col2 in Parquet. The SDK should support a schema map
that maps Parquet col2 to carbon col1; see the sketch below.
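To make (2) concrete, here is a rough sketch of how the builder could
look. withSchemaMap and withParquetPath are proposed names, not existing
SDK methods, and the paths are just placeholders:

    import java.util.HashMap;
    import java.util.Map;

    // Map source (Parquet) column names to carbon column names.
    Map<String, String> schemaMap = new HashMap<>();
    schemaMap.put("col2", "col1");   // parquet col2 -> carbon col1

    CarbonWriter writer = CarbonWriter.builder()
        .outputPath("/tmp/carbon_out")      // existing builder method
        .withSchemaMap(schemaMap)           // proposed API
        .withParquetPath("/data/parquet")   // proposed API
        .build();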
2. API name:
public CarbonWriterBuilder withParquetFile(String filePath)
=> suggestion:
A Parquet dataset may consist of many files in the same folder, and
withParquetFile is easily misunderstood as taking a single Parquet file,
so we can use withParquet(String filePath) or withParquetPath instead.
The same applies to withOrcFile and withCsvFile.
It would also be better to support loading a list of files, so that users
can, for example, select three of the five files in a folder; see the
sketch below.
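A possible shape for the two variants (both hypothetical, including the
withParquetFiles name):

    import java.util.Arrays;
    import java.util.List;

    // Proposed builder methods (names are only suggestions):
    public CarbonWriterBuilder withParquetPath(String path);          // file or folder
    public CarbonWriterBuilder withParquetFiles(List<String> files);  // explicit list

    // e.g. select three of the five files in a folder:
    List<String> files = Arrays.asList(
        "/data/parquet/part-00000.parquet",
        "/data/parquet/part-00002.parquet",
        "/data/parquet/part-00004.parquet");
    CarbonWriter writer = CarbonWriter.builder()
        .outputPath("/tmp/carbon_out")
        .withParquetFiles(files)
        .build();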
3. public void write() throws IOException;
=> We should support loading data after creating the CarbonWriter: the
data is loaded into memory and converted to carbon format only when the
load method is called; see the sketch below.
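A sketch of the intended call sequence, assuming the proposed load()
method, with creation and loading as separate steps:

    CarbonWriter writer = CarbonWriter.builder()
        .outputPath("/tmp/carbon_out")
        .withParquetPath("/data/parquet")   // proposed API
        .build();

    writer.load();   // proposed: read the files into memory, convert to carbon
    writer.load();   // callable again for the next batch on the same writer
    writer.close();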
4. Currently data is written to disk/OBS/HDFS only when the close method
is called; it's better to also support flush, as sketched below.
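flush() could sit beside the existing write() and close() methods on
CarbonWriter; in this sketch only flush() is new:

    import java.io.IOException;

    public abstract class CarbonWriter {
      public abstract void write(Object object) throws IOException;  // existing
      // Proposed: persist the data buffered so far to disk/OBS/HDFS
      // without closing the writer.
      public abstract void flush() throws IOException;
      public abstract void close() throws IOException;               // existing
    }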