Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


I am trying to split a big XML file into multiple files and have used the following code in an AWK script.

/<fileItem>/ {
        rfile="fileItem" count ".xml"
        print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" > rfile
        print $0 > rfile
        getline
        while ($0 !~ "</fileItem>" ) {
                print > rfile
                getline
        }
        print $0 > rfile
        close(rfile)
        count++
}

The code above generates a list of XML files whose names read "fileItem_1", "fileItem_2", "fileItem_3", etc.

However, I would like the file name to be something like "item_XXXXX", where XXXXX is the value of a node inside the XML, as shown below:

<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>

So, basically I want the "id" node to be the filename. Can anyone please help me with this?



1 Answer

I would not use getline. (I have even read in an AWK book that its use is not recommended.) I think using global variables to hold the state is even simpler. (Expressions with global variables may be used in patterns, too.)
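
To illustrate the remark about global variables in patterns, here is a minimal sketch. The function name extract_items, the flag inside and the <item> tags are made up for this example and are not part of the script below. Two rules set and clear the flag, and the flag on its own serves as a pattern, so no getline is needed:

```shell
# Print only the lines between <item> and </item> (both tag lines excluded).
extract_items() {
  awk '
    /<\/item>/ { inside = 0 }   # closing tag: stop printing
    inside                      # a bare variable as pattern; the default action prints the line
    /<item>/   { inside = 1 }   # opening tag: start printing with the next line
  '
}

printf '%s\n' 'a' '<item>' 'b' 'c' '</item>' 'd' | extract_items
```

This should print the two lines b and c. The order of the rules decides whether the tag lines themselves are included.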

The script could look like this:

test-split-xml.awk:

/<fileItem>/ {
  # start collecting a new item; numbered file name as fallback
  collect = 1 ; buffer = "" ; file = "fileItem_"count".xml"
  ++count
}

collect > 0 {
  # append the current line to the buffer, separated by a newline
  if (buffer != "") buffer = buffer"\n"
  buffer = buffer $0
}

collect > 0 && /<name>.+<\/name>/ {
  # cut "...<name>"
  i = index($0, "<name>") ; file = substr($0, i + 6)
  # cut "</name>..."
  i = index(file, "</name>") ; file = substr(file, 1, i - 1)
  file = file".xml"
}

/<\/fileItem>/ {
  collect = 0;
  print file
  print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" >file
  print buffer >file
  # close the file so that big inputs do not exhaust the open file handles
  close(file)
}
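
Since the question asks for the "id" node rather than "name", the same approach can be adapted. The following is only a sketch. It assumes that each <id>...</id> sits on a single line inside its <fileItem> and that the value is safe to use as a file name. The wrapper name split_by_id is hypothetical, and match() with RSTART/RLENGTH is swapped in for the two index()/substr() steps:

```shell
# Hypothetical wrapper: split_by_id FILE writes item_<id>.xml into the current directory.
split_by_id() {
  awk '
    /<fileItem>/ {
      # start a new item; numbered fallback name in case no <id> is found
      collect = 1; buffer = ""; file = "fileItem_" count ".xml"; count++
    }
    collect { buffer = buffer $0 "\n" }
    collect && match($0, /<id>[^<]+<\/id>/) {
      # match() sets RSTART and RLENGTH; skip the 4 chars of <id>, drop the 5 of </id>
      file = "item_" substr($0, RSTART + 4, RLENGTH - 9) ".xml"
    }
    collect && /<\/fileItem>/ {
      collect = 0
      printf "%s\n%s", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>", buffer > file
      close(file)   # keep the number of open files small on big inputs
    }
  ' "$@"
}
```

Called as split_by_id input.xml on the fragment from the question, this would be expected to write item_12345.xml.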

I prepared some sample data for a small test:

test-split-xml.xml:

<?xml version="1.0" encoding="UTF-8"?>
<top>
  <some>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>
  </some>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>
  <any> other input </any>
</top>

... and got the following output:

$ awk -f test-split-xml.awk test-split-xml.xml
X1.xml
X2.xml
fileItem_2.xml

$ more X1.xml 
<?xml version="1.0" encoding="UTF-8"?>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>

$ more X2.xml
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>

$ more fileItem_2.xml 
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>

$

tripleee's comment is reasonable. Such processing should therefore be limited to personal use, because different (but equally legal) formatting of the XML file could cause this script to fail.
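
One formatting variant that a line-oriented script cannot handle is easy to show. In this sketch (the function name count_items_by_line is made up), <fileItem> elements are counted the way a /<fileItem>/ rule sees them, i.e. at most once per input line:

```shell
# Count <fileItem> elements as a line-based rule sees them.
count_items_by_line() {
  awk '/<fileItem>/ { n++ } END { print n + 0 }'
}

# Two elements, one per line: counted as 2.
printf '%s\n' '<fileItem><id>1</id></fileItem>' '<fileItem><id>2</id></fileItem>' | count_items_by_line

# The same two elements on a single line (equally valid XML): counted as 1.
printf '%s\n' '<fileItem><id>1</id></fileItem><fileItem><id>2</id></fileItem>' | count_items_by_line
```

For input like the second case, the split script would write a single file containing both elements.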

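As you will notice, there is no next in the whole script. This is intentional: a single input line may have to be handled by more than one rule (e.g. the line with <fileItem> starts the collection and must also be appended to the buffer).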
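
The effect of next can be seen in a small sketch with two hypothetical functions that differ only in that one statement:

```shell
with_next() {
  awk '
    /<fileItem>/ { print "open: " $0; next }   # next: skip all later rules for this line
    /<id>/       { print "id: " $0 }
  '
}

without_next() {
  awk '
    /<fileItem>/ { print "open: " $0 }
    /<id>/       { print "id: " $0 }           # also runs for a line that matched above
  '
}

line='<fileItem><id>7</id></fileItem>'
printf '%s\n' "$line" | with_next      # one line of output (open: ...)
printf '%s\n' "$line" | without_next   # two lines of output (open: ... and id: ...)
```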

