I have an HTML page that consists of a table, and I want to get all the values in the td cells of each tr row in that table. I have tried BeautifulSoup, but now I would like to do this with lxml or Python's built-in HTML parser.
I have included an example below.
I want to get the values as a list of lists of tuples, like this:
[
[(value of 2050 jan, value of main subject-part1-sub-part1-subject1), (value of 2050 feb, value of main subject-part1-sub-part1-subject1), ...],
[(value of 2050 jan, value of main subject-part1-sub-part1-subject2), (value of 2050 feb, value of main subject-part1-sub-part1-subject2), ...]
]
and so on.
Can anyone tell me how to process this as efficiently as possible using lxml or Python's HTML parser?
Example: test.html
<HTML>
<HEAD>
<TITLE>Title</TITLE>
</HEAD>
<BODY>
<TABLE BORDER>
<TR ALIGN=LEFT>
<TH COLSPAN=38>Main Subject</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=18>part1</TH>
<TH VALIGN=TOP COLSPAN=18>part2</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=9>sub-part1</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part2</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part3</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part4</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=1>subject1</TH>
<TH VALIGN=TOP COLSPAN=1>subject2</TH>
<TH VALIGN=TOP COLSPAN=1>subject10</TH>
<TH VALIGN=TOP COLSPAN=1>subject11</TH>
<TH VALIGN=TOP COLSPAN=1>subject12</TH>
<TH VALIGN=TOP COLSPAN=1>subject13</TH>
<TH VALIGN=TOP COLSPAN=1>subject14</TH>
<TH VALIGN=TOP COLSPAN=1>subject15</TH>
<TH VALIGN=TOP COLSPAN=1>subject16</TH>
<TH VALIGN=TOP COLSPAN=1>subject17</TH>
<TH VALIGN=TOP COLSPAN=1>subject18</TH>
<TH VALIGN=TOP COLSPAN=1>subject19</TH>
<TH VALIGN=TOP COLSPAN=1>subject20</TH>
<TH VALIGN=TOP COLSPAN=1>subject21</TH>
<TH VALIGN=TOP COLSPAN=1>subject22</TH>
<TH VALIGN=TOP COLSPAN=1>subject23</TH>
<TH VALIGN=TOP COLSPAN=1>subject24</TH>
<TH VALIGN=TOP COLSPAN=1>subject25</TH>
<TH VALIGN=TOP COLSPAN=1>subject26</TH>
<TH VALIGN=TOP COLSPAN=1>subject27</TH>
<TH VALIGN=TOP COLSPAN=1>subject28</TH>
<TH VALIGN=TOP COLSPAN=1>subject29</TH>
<TH VALIGN=TOP COLSPAN=1>subject30</TH>
<TH VALIGN=TOP COLSPAN=1>subject31</TH>
<TH VALIGN=TOP COLSPAN=1>subject32</TH>
<TH VALIGN=TOP COLSPAN=1>subject33</TH>
<TH VALIGN=TOP COLSPAN=1>subject34</TH>
<TH VALIGN=TOP COLSPAN=1>subject35</TH>
<TH VALIGN=TOP COLSPAN=1>subject36</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP ROWSPAN=12>2050</TH>
<TH ALIGN=LEFT>January</TH>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>4</TD>
<TD>16</TD>
<TD>0</TD>
<TD>6</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>0</TD>
<TD>26</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>February</TH>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>2</TD>
<TD>4</TD>
<TD>1</TD>
<TD>6</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>25</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>4</TD>
<TD>14</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>March</TH>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>7</TD>
<TD>0</TD>
<TD>9</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>9</TD>
<TD>0</TD>
<TD>45</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>10</TD>
<TD>16</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>April</TH>
<TD>1</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>3</TD>
<TD>12</TD>
<TD>1</TD>
<TD>11</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>34</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>6</TD>
<TD>18</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>May</TH>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>4</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>7</TD>
<TD>1</TD>
<TD>30</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>12</TD>
<TD>0</TD>
<TD>4</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>6</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>June</TH>
<TD>0</TD>
<TD>1</TD>
<TD>14</TD>
<TD>0</TD>
<TD>7</TD>
<TD>15</TD>
<TD>0</TD>
<TD>17</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>24</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>6</TD>
<TD>13</TD>
<TD>1</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>July</TH>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>17</TD>
<TD>1</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>2</TD>
<TD>15</TD>
<TD>2</TD>
<TD>53</TD>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>6</TD>
<TD>0</TD>
<TD>7</TD>
<TD>16</TD>
<TD>0</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>August</TH>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>8</TD>
<TD>15</TD>
<TD>1</TD>
<TD>17</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>16</TD>
<TD>0</TD>
<TD>33</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>11</TD>
<TD>0</TD>
<TD>2</TD>
<TD>25</TD>
<TD>4</TD>
<TD>8</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>September</TH>
<TD>2</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>16</TD>
<TD>22</TD>
<TD>2</TD>
<TD>19</TD>
<TD>4</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>8</TD>
<TD>0</TD>
<TD>27</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>11</TD>
<TD>31</TD>
<TD>1</TD>
<TD>9</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>October</TH>
<TD>3</TD>
<TD>1</TD>
<TD>8</TD>
<TD>0</TD>
<TD>4</TD>
<TD>28</TD>
<TD>0</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>15</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>9</TD>
<TD>26</TD>
<TD>1</TD>
<TD>8</TD>
<TD>4</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>November</TH>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>0</TD>
<TD>6</TD>
<TD>23</TD>
<TD>1</TD>
<TD>8</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>7</TD>
<TD>1</TD>
<TD>20</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>3</TD>
<TD>18</TD>
<TD>3</TD>
<TD>7</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>December</TH>
<TD>1</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>13</TD>
<TD>2</TD>
<TD>15</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>29</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>3</TD>
<TD>20</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
</TABLE>
</BODY>
</HTML>
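For reference, here is a minimal sketch of one possible approach using only the standard-library `html.parser` module (lxml is not used here). The names `TableParser`, `columns_as_tuples` and `SAMPLE` are hypothetical, and `SAMPLE` is a cut-down table with the same layout as test.html. The sketch assumes that each tuple should be (row label, cell value), that there is one list per data column, that every data row starts with two label columns (year and month), and that all TD cells hold integers. Note that the test.html above lists fewer subject headers than it has data cells per row, so the generated column names would not line up exactly for that file.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table row as a list of (tag, colspan, text) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            colspan = int(dict(attrs).get("colspan") or 1)
            self._cell = [tag, colspan, ""]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[2] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._cell[2] = self._cell[2].strip()
            self._row.append(tuple(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def columns_as_tuples(html, label_columns=2):
    """Return (column names, one list of (row label, value) per data column)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()

    header_rows = [r for r in parser.rows if all(t == "th" for t, _, _ in r)]
    data_rows = [r for r in parser.rows if any(t == "td" for t, _, _ in r)]

    # Repeat each header text COLSPAN times so header rows line up by column.
    grid = [[text for _, span, text in row for _ in range(span)]
            for row in header_rows]

    year = None
    columns = {}  # data column index -> list of (row label, value)
    for row in data_rows:
        labels = [text for tag, _, text in row if tag == "th"]
        values = [int(text) for tag, _, text in row if tag == "td"]
        if len(labels) == 2:  # first row of a ROWSPAN group carries the year
            year = labels[0]
        row_label = f"{year} {labels[-1]}"
        for i, value in enumerate(values):
            columns.setdefault(i, []).append((row_label, value))

    names = []
    for i in sorted(columns):
        col = i + label_columns
        parts = [g[col] for g in grid if col < len(g) and g[col]]
        names.append("-".join(parts))
    return names, [columns[i] for i in sorted(columns)]


SAMPLE = """
<TABLE BORDER>
<TR><TH COLSPAN=4>Main Subject</TH></TR>
<TR><TH COLSPAN=2> </TH><TH COLSPAN=2>part1</TH></TR>
<TR><TH COLSPAN=2> </TH><TH>subject1</TH><TH>subject2</TH></TR>
<TR><TH ROWSPAN=2>2050</TH><TH>January</TH><TD>0</TD><TD>1</TD></TR>
<TR><TH>February</TH><TD>1</TD><TD>0</TD></TR>
</TABLE>
"""

names, columns = columns_as_tuples(SAMPLE)
for name, column in zip(names, columns):
    print(name, column)
```

The parser makes a single pass over the document and keeps only the cell text, so it should scale linearly with the size of the table. If the column names are not needed, `columns` alone already has the list-of-lists-of-tuples shape described in the question.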