This particular post might not be very “edgy” but rather something to get me going, as well as functioning as a personal “how the h-ll did I do that” for the future.
The problem
A client presented us with a rather standard flatfile. It was about 20 rows long, separated by the ever usual CR LF, and they contained tag identifiers.
The identifiers were numbered 10, 20, 30 and 40 and they could appear in any order, may or may not be present and the records were positional.
Seems easy, right? Well it took me some time and there are a couple of gotchas along the way so I thought I would share them.
The file
Here is a subsection of the file. Notice the repeating identifiers and the difference in length between identical tags.
10000000000004083006019800324E50000John Doe
2500000000000000433 Q00086Jane doe
3000000000000008448H00001Joe Doe
4000000000000008448H00001Hanna Montana
10000000000004083006019800324E50000
The solution
There were three issues that needed to be solved and, again I am not presenting anything new here.
- Tag identifiers in positional records.
- Repeating tag identifiers.
- The last column could be empty or not padded.
I started by adding a flat file schema, adding a root, and setting the Child delimiter Type
to Hexadecimal
and Child Delimiter
to 0xD 0xA
, now for the interesting part.
Tag identifiers in positional records
Tag identifiers are simply put, just that: some kind of text at the start of a record that shows what kind of record it is. It is a very useful feature and is often used in mainframe scenarios.
The tag identifiers in the file above are the first two positions. So I added a new child record and set the property Tag identifier
to 10
, and the Structure
to Positional
.
I then proceeded to add all the different columns based of the relative length and position.
Moving on I did the same for the other tag-types as well and ended up with a schema looking quite simple and straightforward.
This is when I ran into my first error: The tag is not removed from the row of data when it is used. So all the other positions was offset by two positions and the parser could not find any columns, or rather the rows were too long.
In order to make it work you have to offset the first field by the length of your tag identifier, or add a column at the front to handle the identifier. I opted for the first one and updated the property “Positional Offset” to “2” for the first field in every record.
Repeating identifiers
You may, or may not know this, but XML actually infer some kind of tag order when writing data defined from an XSD. That is why you get an error when you skip a tag that is not MinOccurs: 0
.
So how do you handle the fact that some records might repeat later in a file, like the one in question? The answer is to put all the root children
in a Choice-node. So I right-clicked the root
node and selected to add a Choice Group
. I set the property Max Occurs
to Unbounded
for the choice node, as well as all the other child records. For those I also added Min Occurs: 0
Lastly I added all the child nodes to the Choice Node and now the row types (tags) may be in any order and may occur zero or unbounded times.
The last column may be empty
One very popular way to export data from a mainframe is to make all the rows be of the same length. One very popular length is 80 characters. However, many exporters choose to end the rows when the data ends. So rather then putting in enough white space to reach a total length of 80, a CR LF is inserted and the row is ended.
You might think that simply setting the tag to “Min Occurs: 0” might solve this but you, like I was, would be wrong. The concept is called “Early Termination” and once you know that it is easy.
In order to allow early termination, which is set to false by default, I had to enable it. The property was located at Schema level and is called Allow Early Termination
. Once I set that to True
, everything worked fine.
Note that if you do not set Min Occurs: 0
, the tag will be created but set to null if the row terminates early.
Data: "...Mikael" = <NAME>Mikael</NAME>
Data: "..." = <NAME />
In conclusion
The fairly standard way of exporting data from a mainframe to a textfile, should from now on not really be a problem.