Java ETL: hard to find a suitable one [closed]

2022-09-01 23:40:46

I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.

I'm finding it surprisingly hard to find a suitable one.

I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.

I'd like the following features:

  • the ability to specify the simple mappings externally, e.g., text column 5 to database column foo, specified in some XML mapping file
  • the ability to give the database node a javax.sql.DataSource

CloverETL allows the mapping to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have a javax.sql.DataSource set up by my dependency injection framework, properties files seem painful and non-robust, especially if I want this to work in several environments (dev, test, prod).

KETL tells me that "We are currently in the process of completely overhauling our documentation for KETL™. Because of this, only the installation guide has been updated." Honest, but not helpful.

Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".

Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather than using a DataSource, and Pentaho's documentation for calling its engine from Java code is hard to find.

Basically, I'd really like to be able to write something like this pseudo-code:

extractTransformLoad(
        getInputFile( "input.csv" ),
        getXMLMapping( "myMappingFile.xml" ),
        new DatabaseWriter( getDatasource() ) );

Any suggestions?


Answer 1

Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.

It's a lightweight open-source ETL engine with one-line integration from Java. It also supports the Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data sources.

Example of importing a CSV file into a table:

<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
  <connection id="in" driver="csv" url="data.csv" />
  <connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL" 
      classpath="ojdbc14.jar" user="scott" password="tiger" />
  <!-- Copy all CSV rows to a database table -->
  <query connection-id="in">
      <!-- Empty query means select all columns -->
      <script connection-id="out">
          INSERT INTO Table_Name VALUES (?id, ?priority, ?summary, ?status)
      </script>
  </query>
</etl>

Running from Java:

// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();
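
In application code you will probably want to handle failures explicitly; execute() reports them via EtlExecutorException. A minimal sketch, assuming the etl.xml file shown above sits in the working directory:

import java.io.File;

import scriptella.execution.EtlExecutor;
import scriptella.execution.EtlExecutorException;

public class EtlRunner {
    public static void main(String[] args) {
        try {
            // Parses etl.xml and runs its <query>/<script> elements in order
            EtlExecutor.newExecutor(new File("etl.xml")).execute();
        } catch (EtlExecutorException e) {
            // Raised when the file cannot be parsed or a statement fails
            throw new RuntimeException("ETL job failed", e);
        }
    }
}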

Running from command-line:

scriptella [file_name]

Integration with Spring:

  1. Use the "spring" driver and the name of the bean to reference data sources. Example:

    <connection id="spring" driver="spring" url="datasourceBeanName" />
    
  2. Add an EtlExecutorBean to the application context in order to execute the job:

    <bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
        <property name="configLocation" value="create-db.etl.xml" />
        <property name="progressIndicator"><ref local="progress" /></property>
        <property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
    </bean>
    

For additional details see the Spring example.
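
Putting the two together: the application context needs a javax.sql.DataSource bean whose name matches the url of the "spring" connection. A hypothetical minimal context (the bean names are placeholders, and DriverManagerDataSource is only a stand-in to keep the sketch self-contained; any DataSource bean produced by your dependency injection setup works the same way):

<beans>
    <!-- Any javax.sql.DataSource bean will do; DriverManagerDataSource
         is used here only to keep the example self-contained -->
    <bean id="datasourceBeanName"
          class="org.springframework.jdbc.datasource.DriverManagerDataSource">
        <property name="driverClassName" value="oracle.jdbc.OracleDriver" />
        <property name="url" value="jdbc:oracle:thin:@localhost:1521:ORCL" />
        <property name="username" value="scott" />
        <property name="password" value="tiger" />
    </bean>

    <!-- Runs etl.xml on context startup; its <connection driver="spring"
         url="datasourceBeanName" /> element resolves to the bean above -->
    <bean id="etlJob" class="scriptella.driver.spring.EtlExecutorBean">
        <property name="configLocation" value="etl.xml" />
        <property name="autostart" value="true" />
    </bean>
</beans>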


Answer 2

Do you know Talend?

It's an Eclipse-based tool (Talend Open Studio), but you can also use it directly from Java, either by writing your own code or by exporting jobs to Java classes.
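
If memory serves, a job exported from Talend Open Studio is just a generated class you instantiate and run; a hypothetical sketch (the package, class, and parameter names below are placeholders that depend on your project and Talend version):

// "DemoJob" stands in for a job exported from Talend Open Studio; the
// generated package and class names depend on your project and job names.
import demoproject.demojob_0_1.DemoJob;

public class TalendCaller {
    public static void main(String[] args) {
        DemoJob job = new DemoJob();
        // Exported jobs expose a runJob(String[]) entry point; context
        // parameters are passed as "--context_param name=value" arguments.
        job.runJob(new String[] { "--context_param", "dbUser=scott" });
    }
}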

