This is probably not the most effective way, but it's convenient and simple.
Let's pretend that we're analyzing the file with the content listed below:
<xml_root>
  <object>
    <id>1</id>
    <name>First</name>
  </object>
  <object>
    <id>2</id>
    <name>Second</name>
  </object>
  <object>
    <id>3</id>
    <name>Third</name>
  </object>
  <object>
    <id>4</id>
    <name>Fourth</name>
  </object>
</xml_root>
First, we need to import objectify from lxml:
from lxml import objectify
Then, open the file:
path = 'file_path'
xml = objectify.parse(open(path))
Get the root node:
root = xml.getroot()
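As a small sketch of the parsing step, the snippet below feeds objectify.parse a file-like object instead of a path, so it can be run without a file on disk. The in-memory XML string is a shortened copy of the sample above and the variable names are only illustrative.

```python
import io
from lxml import objectify

# Shortened copy of the sample document, kept in memory for illustration
XML = b"""<xml_root>
  <object><id>1</id><name>First</name></object>
  <object><id>2</id><name>Second</name></object>
</xml_root>"""

# objectify.parse accepts a file name or a file-like object;
# the with-block closes the handle once parsing is done
with io.BytesIO(XML) as handle:
    tree = objectify.parse(handle)

root = tree.getroot()
print(root.tag)              # tag name of the root element
print(root.countchildren())  # number of direct child elements
```

With a real file, the same pattern would be "with open(path, 'rb') as handle:", which avoids leaving the file open.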
Now we can access the child nodes. With
root.getchildren()[0].getchildren()
we get the content of the first child node as a Python list:
[1, 'First']
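The index-based access above can also be written as a loop that does not need to know how many children there are. This is only a sketch: it uses iterchildren() together with the tag and text of each element, and the in-memory XML is a shortened copy of the sample file.

```python
from lxml import objectify

# Shortened copy of the sample document, kept in memory for illustration
XML = b"""<xml_root>
  <object><id>1</id><name>First</name></object>
  <object><id>2</id><name>Second</name></object>
</xml_root>"""

root = objectify.fromstring(XML)

rows = []
for obj in root.iterchildren():
    # Map each field's tag name to its text content
    rows.append({field.tag: field.text for field in obj.iterchildren()})

print(rows)
```

Each entry of rows is a plain dict of strings, which is a convenient shape for building a data frame later.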
Now we obviously want to convert this data into a data frame.
Let's import pandas:
import pandas as pd
Prepare an empty data frame that will hold our data:
df = pd.DataFrame(columns=('id', 'name'))
Now we go through our XML file, appending data to this data frame:
for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
(The name of the Series object becomes the index label of the row when it is appended to the DataFrame. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this loop requires an older pandas version.)
And here is our fresh data frame:
  id    name
0  1   First
1  2  Second
2  3   Third
3  4  Fourth
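DataFrame.append no longer exists in pandas 2.0 and later, so on a current installation the loop above raises an error. One possible alternative, sketched here and not the approach used in this post, is to collect all rows first and build the data frame in a single call. It uses the pyval attribute of objectify data elements, so the id column ends up holding integers, not strings; use text in its place to keep strings.

```python
import pandas as pd
from lxml import objectify

# Copy of the sample document, kept in memory for illustration
XML = b"""<xml_root>
  <object><id>1</id><name>First</name></object>
  <object><id>2</id><name>Second</name></object>
  <object><id>3</id><name>Third</name></object>
  <object><id>4</id><name>Fourth</name></object>
</xml_root>"""

root = objectify.fromstring(XML)

# One dict per <object>; pyval returns the value with its inferred Python type
records = [
    {field.tag: field.pyval for field in obj.iterchildren()}
    for obj in root.iterchildren()
]

# Build the data frame once, with a fixed column order
df = pd.DataFrame(records, columns=['id', 'name'])
print(df)
```

Building the frame once also avoids copying the whole frame on every iteration, which is what repeated appends did.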
Full source code:
from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()

df = pd.DataFrame(columns=('id', 'name'))
for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
Saved me a lot of work with that. Many thanks.
thanks