Xml files with Talend

Published Feb 07, 2018
Xml files with Talend

What we want to do

To join 2 input xml files to produce an output xml file

What we'll need

Talend v6.4.1. installed (should work with previous version of Talend too)
Two xml files :

  • authors.xml containing a list of authors like this:


  • books.xml containing a list of books with authorId referencing an author in authors.xml file, like this:

The results

We should get a new Xml file books_with_authorname.xml containing a list of book with author name, like this:

It is the same file as books.xml plus the authorname field and minus the publish_date and description fields.

Step 1 : Configure metadata

First things first.
Always configure the metadata before doing anything on Talend.

Metadata for authors.xml file

Right click on Metadata > File xml and choose Create file xml

Enter the file information as below and click Next:

Mark the file as Input XML and click Next:

Browse to the authors.xml file. If it is a valid Xml file, the structure of the file should appears.
Choose the right encoding, then click Next

Here's the step to tell Talend how it should read the Xml file:

Here we want to get all the authors information, so we need to loop on the <author> tag of the xml file and select the id attribute (marked as @id) and the <authorName> and <authorAge> tags.
Here how it looks:

Great ! Click on Next when you get the desired output.

N.B. : At this stage we could export this file configuration into a context by clicking on Export as context button. But I will skip this part since I think it deserves a dedicated post to explain context environments.

Let's move on the final step to create our xml file metadata that we will use in all of our Talend jobs.
Here you can rename the fields and correct/change the data type of each fields.

N.B. : Can be useful specially to increase size of String fields and to change date pattern of Date fields.

For example in the picture above I've changed the size of authorName from 20 to 50 characters. Indeed Talend estimates the number of characters by using data sample, so it can under-estimate the maximum number of characters for a field.

When you're done, click the Finish button

You have now a new Xml file metadata named authors that you can use in all Xml related components in all jobs you want.


Metadata for books.xml file

For the books.xml file steps are the same.
Here is the configuration of the Step 4 for the books.xml file:

Metadata for books_with_authors.xml file (output file)

Right click on Metadata > File xml and choose Create file xml

Enter the file information as below and click Next:

Mark the file as Output XML and click Next:

Choose to create the file manually since it doesn't exist yet.

N.B. : you could also create a result xml file and use it by choosing Create from a file

Click on Schema Management button and add fields to build the schema of the output file:

Change the name of the root node by "catalog" and right-click on the row and choose Add Sub-element:
Name the element book and click Ok
Right-click on the book element and choose Add Attribute named id.
Drag and drop the book_id field on the left to the column Related Column of the @id attribute.
This will map the book_id field to the book attribute in the output xml file.

Then create sub-element for each fields under the book element

Finally set the book element as loop element

Here's the summary of all the steps:

When done click Next
Review the name of the fields and data type and length and click Finish

Step 2 : Place components on the job

Create a job named j_xml_output and drag each of the metadata created in step 1 to the job canvas. For both authors and books choose tFileInputXML component when asked, for books_with_authors it won't ask since it is an output xml file metadata.
You should end up with the following:

Step 3 : Joining input files

To join the two input xml files we'll use the tMap component. Find it on the Palette and add it to the design job between the two input files and the ouptut file:

Right click on the books component and choose Row > Main
A path should appears inviting you to link the book component with another one. Go to the tMap component and left-click.
This should link the book component to the tMap component so that the both components now are a sub-job.

Do the exact same link with the authors component. You should get the following:
You can see on the screenshot above that the link between authors and tMap is dashed. This means the authors flow is a lookup.
The main flow remains the books list but we are going to look into authors file to get some information to complete the books list.

Double click on the tMap component. It should open the tMap editor.
Drag the authorId field of the row1 flow (books) to the Expr. key of the id field of the row2 flow (authors).

N.B. : for more convenience rename rowX with a more meaningful name. For example here row1 should have been renamed to books and row2 to authors. The cool thing is that you can do it after linking fields. So you'll see the modification in the next screenshots.

So after having map the authorId of the books with the id of the authors, you should end up with this:

And that's all for a basic join of two xml files. By clicking on the little magnifying glass with a red 1 in it you can set options of the join. Basically it is the same thing as inner join or left join in the SQL world.

Step 4 : Inserting output and mapping fields

Now link the tMap component with the output xml file component

Double click on the tMap component to open the tMap editor and map the input fields with the output fields like this:

Validate your changes by clicking the Ok button

Step 5 : Run the job and get result file

Below the job canvas, open the Run (Job j_xml_lookup) tab and hit the Run button
You should get the final results:
The number of rows written in the output file is the same number of rows as the main flow (books) produced (since no filter has been set).

Open the /.../books_with_authors.xml file, you should get the expected result:

That's the end of the post.

Don't hesitate to reach me for more explaination or if some elements are incorrect.

Have fun with Talend !

Discover and read more posts from Julien Wazné
get started