Wednesday, January 22, 2014

How to read an XML file into a pandas DataFrame using lxml

This is probably not the most efficient way, but it's convenient and simple.

Let's pretend that we're analyzing the file with the content listed below:

<xml_root>

    <object>
        <id>1</id>
        <name>First</name>
    </object>

    <object>
        <id>2</id>
        <name>Second</name>
    </object>

    <object>
        <id>3</id>
        <name>Third</name>
    </object>

    <object>
        <id>4</id>
        <name>Fourth</name>
    </object>

</xml_root>

First, we need to import objectify from lxml:

from lxml import objectify

Then, open the file:

path = 'file_path'
xml = objectify.parse(path)  # parse() accepts a path, so no file handle is left open

Get the root node:

root = xml.getroot()

Now we can access child nodes, and with
root.getchildren()[0].getchildren()
we're able to get the actual content of the first child node as a simple Python list:

[1, 'First']
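
To see what sits behind that list, here is a minimal sketch that parses the same sample from a string instead of a file. The XML constant and the variable names are only for illustration; the point is that each element exposes its tag, its raw text and a typed Python value (pyval).

```python
from lxml import objectify

# The sample document from above, inlined so the snippet runs without a file.
XML = """<xml_root>
    <object><id>1</id><name>First</name></object>
    <object><id>2</id><name>Second</name></object>
    <object><id>3</id><name>Third</name></object>
    <object><id>4</id><name>Fourth</name></object>
</xml_root>"""

root = objectify.fromstring(XML)
first = root.getchildren()[0]  # the first <object> node

tags = [child.tag for child in first.iterchildren()]      # element names
texts = [child.text for child in first.iterchildren()]    # raw text, always str
values = [child.pyval for child in first.iterchildren()]  # typed Python values

print(tags, texts, values)
```

Note that .text returns strings only, while .pyval returns the value with the type objectify guessed for it.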

Now we obviously want to convert this data into a data frame.

Let's import pandas:

import pandas as pd

Prepare an empty data frame that will hold our data:

df = pd.DataFrame(columns=('id', 'name'))

Now we go through our XML file, adding each object to this data frame as a new row:

for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    df = pd.concat([df, row_s.to_frame().T])

(The name of the Series object becomes the row's index label when the row is added to the DataFrame.)

And here is our fresh data frame:

  id    name
0  1   First
1  2  Second
2  3   Third
3  4  Fourth
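
If the number of objects is not known in advance, one possible variant is to collect the rows in a list first and build the data frame once at the end. This is only a sketch under the same assumptions as above: the sample is inlined as a string (the XML constant is illustrative), and every child of the root is treated as one row.

```python
from lxml import objectify
import pandas as pd

# The sample document from above, inlined so the snippet runs without a file.
XML = """<xml_root>
    <object><id>1</id><name>First</name></object>
    <object><id>2</id><name>Second</name></object>
    <object><id>3</id><name>Third</name></object>
    <object><id>4</id><name>Fourth</name></object>
</xml_root>"""

root = objectify.fromstring(XML)

# One dict per <object>: child tag -> child text (all values stay strings).
rows = [
    {child.tag: child.text for child in obj.iterchildren()}
    for obj in root.iterchildren()
]

# Build the data frame in one step; the index defaults to 0..n-1.
df = pd.DataFrame(rows, columns=['id', 'name'])
print(df)
```

This avoids the hard-coded range(0, 4) and does not copy the data frame on every iteration. Using child.pyval instead of child.text would keep the ids as integers.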


Full source code:

from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(path)  # parse() accepts a path, so no file handle is left open
root = xml.getroot()
print(root.getchildren()[0].getchildren())  # content of the first child node
df = pd.DataFrame(columns=('id', 'name'))

for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    df = pd.concat([df, row_s.to_frame().T])
