This particular post might not be very “edgy” but rather something to get me going, as well as functioning as a personal “how the h*ll did I do that” for the future.
A client presented us with a rather standard flatfile. It was about 20 rows long, separated by the ever usual CR LF, and they contained tag identifiers.
The identifiers were numbered 10, 20, 30 and 40 and they could appear in any order, may or may not be present and the records were positional.
Seems easy, right? Well it took me some time and there are a couple of gotchas along the way so I thought I would share them.
Here is a subsection of the file. Notice the repeating identifiers and the difference in length between identical tags.
10000000000004083006019800324E50000John Doe 2500000000000000433 Q00086Jane doe 3000000000000008448H00001Joe Doe 4000000000000008448H00001Hanna Montana 10000000000004083006019800324E50000
There were three issues that needed to be solved and, again I am not presenting anything new here;
- Tag identifiers in positional records.
- Repeating tag identifiers.
- The last column could be empty or not padded.
I started by adding a flat file schema, adding a root, and setting the “Child delimiter Type” to “Hexadecimal” and “Child Delimiter” to “0xD 0xA”, now for the interesting part.
Tag identifiers in positional records
Tag identifiers are simply put, just that: some kind of text at the start of a record that shows what kind of record it is. It is a very useful feature and is often used in mainframe scenarios.
The tag identifiers in the file above are the first two positions. So I added a new child record and set the property “Tag identifier” to “10”, and the “Structure” to “Positional”.
I then proceeded to add all the different columns based of the relative length and position.
Moving on I did the same for the other tag-types as well and ended up with a schema looking quite simple and straightforward.
This is when I ran into my first error: The tag is not removed from the row of data when it is used. So all the other positions was offset by two positions and the parser could not find any columns, or rather the rows were too long.
In order to make it work you have to offset the first field by the length of your tag identifier, or add a column at the front to handle the identifier. I opted for the first one and updated the property “Positional Offset” to “2” for the first field in every record.
You may, or may not know this, but XML actually infer some kind of tag order when writing data defined from an XSD. That is why you get an error when you skip a tag that is not “MinOccurs: 0”.
So how do you handle the fact that some records might repeat later in a file, like the one in question? The answer is to put all the “root children” in a Choice-node. So I right-clicked the “root” node and selected to add a “Choice Group”. I set the property “Max Occurs” to “Unbounded” for the choice node, as well as all the other child records. For those I also added “Min Occurs: 0”.
Lastly I added all the child nodes to the Choice Node and now the row types (tags) may be in any order and may occur zero or unbounded times.
The last column may be empty
One very popular way to export data from a mainframe is to make all the rows be of the same length. One very popular length is 80 characters. However, many exporters choose to end the rows when the data ends. So rather then putting in enough white space to reach a total length of 80, a CR LF is inserted and the row is ended.
In order to allow early termination, which is set to false by default, I had to enable it. The property was located at “Schema” level and is called “Allow Early Termination”. Once I set that to “True”, everything worked fine.
Note that if you do not set “Min Occurs: 0”, the tag will be created but set to null if the row terminates early.
Data: "…Mikael" = <NAME>Mikael</NAME> Data: "…" = <NAME />
The fairly standard way of exporting data from a mainframe to a textfile, should from now on not really be a problem.
By the way, the files can be found here: RepeatingTagsFF