Как разобрать файл XLS в Python, который содержит HTML-код?

Я скачал набор данных, который дает вам файлы .XLS, но когда я открываю его с помощью блокнота, я вижу HTML-теги / код.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Monthly Volume Report NRA 000000001508 2019-08-01 to 2019-08-31</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
/*<![CDATA[*/
.key td { white-space: nowrap; }
.key td.A,.key td.W,.key td.S,.key td.C,.key td.H { white-space: pre; }
td.A { background-color: #DD9CB3; } /* Salmon */
td.W { background-color: #80F7B5; } /* SeaGreen */
td.S { background-color: #FFFF80; } /* Yellow */
td.C { background-color: #A6CAF0; } /* Blue */
td.H { background-color: #CCCCCC; } /* Weekend grey */
td.WeA { background-color: #99586F; }
td.WeW { background-color: #3CB371; }
td.WeS { background-color: #CCCC40; }
td.WeC { background-color: #6286AC; }
td.WeH { background-color: #CCCCCC; }
td.We,th.We { background-color: #CCCCCC; }
td.P { border: 1px solid #808080; } /* Patched - border */
td.TLI { color:#A0A0A0; } /* Too Little Information - grey text */
td.NOI { color:red; } /* Too Little Information - red text */
.grid td { border:none; } /* No borders by default */
.grid .TBL td, .grid .TBL { border-left:2px solid #000000; } /* Thick left border for intermediate totals */
.grid .TBR td, .grid .TBR { border-right:2px solid #000000; } /* Thick right border for intermediate totals */
.grid .TBT td, .grid .TBT { border-top:2px solid #000000; } /* Thick top border for intermediate totals */
.grid .TBB td, .grid .TBB { border-bottom:2px solid #000000; } /* Thick bottom border for intermediate totals */
.grid .BL td, .grid .BL { border-left:1px solid #000000; } /* Thin left border for intermediate totals */
.grid .BR td, .grid .BR { border-right:1px solid #000000; } /* Thin right border for intermediate totals */
.grid .BT td, .grid .BT { border-top:1px solid #000000; } /* Thin top border for intermediate totals */
.grid .BB td, .grid .BB { border-bottom:1px solid #000000; } /* Thin bottom border for intermediate totals */
.grid .TBP th, .grid .TBP { padding-top: 1.5ex; padding-bottom: 1.5ex; mso-ignore: padding; } /* Top and bottom padding */
th { mso-number-format: @; } /* So dates in headers are interpreted as strings */
th.XlDM { mso-number-format:"d mmm"; } /* Force short dates */
th.WeXlDM { background-color: #CCCCCC; mso-number-format:"d mmm"; } /* Force short dates */
/*]]>*/
</style>
</head>
<body class="fixed">
<div id="wrapper">
<b>Monthly Volume Report NRA 000000001508 2019-08-01 to 2019-08-31</b> <div id="container">
  <div id="content">
<div class="panel panel-default">  <div class="panel-heading">   <table>
    <tr><td>Site Name</td><td style='mso-number-format:"@";' colspan=3>TMU M50 015.0 S </td></tr>
    <tr><td>Site ID</td><td style='mso-number-format:"@";' colspan=3>000000001508</td></tr>
    <tr><td>Grid</td><td style='mso-number-format:"@";' colspan=3>308164236836</td></tr>
    <tr><td>Description</td><td colspan=3>M50 Between Jn06 N03/M50 and Jn07 N04/M50, Castleknock, Co. Dublin</td></tr>
   </table>
</div>  <div class="panel-body">   <div id="selector">
    <ul id="dropdownmenu">
<table><tr><td>Setup</td><td colspan=3>M50 1508</td></tr></table>
<table><tr><td>Channel</td><td colspan=3>Each Direction
</td></tr></table>
<table><tr><td>Time Period</td><td colspan=3>1 hour
</td></tr></table>
<table><tr><td>Class</td><td colspan=3>Any</td></tr></table>
<table><tr><td>Exclude data:</td><td colspan=3>None</td></tr></table>
    </ul>
   </div> <!--selector-->
<table class="grid" id="gridTable" BORDER=1>
 <tr><th style="text-align:left; font-size:110%;" colspan=35>All directions</th></tr>
 <tr>
  <th></th>
  <th>Thu</th>
  <th>Fri</th>
  <th class="We">Sat</th>
  <th class="We">Sun</th>
  <th class="We">Mon</th>
  <th>Tue</th>
  <th>Wed</th>
  <th>Thu</th>
  <th>Fri</th>
  <th class="We">Sat</th>
  <th class="We">Sun</th>
  <th>Mon</th>
  <th>Tue</th>
  <th>Wed</th>
  <th>Thu</th>
  <th>Fri</th>
  <th class="We">Sat</th>
  <th class="We">Sun</th>
  <th>Mon</th>
  <th>Tue</th>
  <th>Wed</th>
  <th>Thu</th>
  <th>Fri</th>
  <th class="We">Sat</th>
  <th class="We">Sun</th>
  <th>Mon</th>
  <th>Tue</th>
  <th>Wed</th>
  <th>Thu</th>
  <th>Fri</th>
  <th class="We">Sat</th>
  <th colspan=2>Average</th>
  <th>Total</th>
 </tr>
 <tr class="BP">
  <th></th>
  <th>2019-08-01</th>
  <th>2019-08-02</th>
  <th class="We">2019-08-03</th>
  <th class="We">2019-08-04</th>
  <th class="We">2019-08-05</th>
  <th>2019-08-06</th>
  <th>2019-08-07</th>
  <th>2019-08-08</th>
  <th>2019-08-09</th>
  <th class="We">2019-08-10</th>
  <th class="We">2019-08-11</th>
  <th>2019-08-12</th>
  <th>2019-08-13</th>
  <th>2019-08-14</th>
  <th>2019-08-15</th>
  <th>2019-08-16</th>
  <th class="We">2019-08-17</th>
  <th class="We">2019-08-18</th>
  <th>2019-08-19</th>
  <th>2019-08-20</th>
  <th>2019-08-21</th>
  <th>2019-08-22</th>
  <th>2019-08-23</th>
  <th class="We">2019-08-24</th>
  <th class="We">2019-08-25</th>
  <th>2019-08-26</th>
  <th>2019-08-27</th>
  <th>2019-08-28</th>
  <th>2019-08-29</th>
  <th>2019-08-30</th>
  <th class="We">2019-08-31</th>
  <th>Workday</th>
  <th>7 Day</th>
  <th>Count</th>
 </tr>
 <tr class="T">
  <td class="RHR" style='mso-number-format:"@";'>00:00:00</td>
  <td class="S">1535</td>
  <td class="S">1549</td>
  <td class="WeS">1673</td>
  <td class="WeS">1536</td>
  <td class="WeS">1437</td>
  <td class="S">1410</td>
  <td>1326</td>
  <td>1492</td>
  <td>1612</td>
  <td class="We">1904</td>
  <td class="We">1705</td>
  <td>1389</td>
  <td>1410</td>
  <td>1386</td>
  <td>1600</td>
  <td>1476</td>
  <td class="We">1957</td>
  <td class="We">1792</td>
  <td>1346</td>
  <td>1437</td>
  <td>1465</td>
  <td>1594</td>
  <td>1625</td>
  <td class="We">1871</td>
  <td class="We">1737</td>
  <td>1449</td>
  <td>1349</td>
  <td>1462</td>
  <td>1358</td>
  <td>1489</td>
  <td class="We">1749</td>
  <td>1463</td>
  <td>1551</td>
  <td>48120</td>
 </tr>
 <tr>
  <td class="RHR" style='mso-number-format:"@";'>01:00:00</td>
  <td class="S">1096</td>
  <td class="S">1076</td>
  <td class="WeS">1262</td>
  <td class="WeS">1194</td>
  <td class="WeS">1042</td>
  <td class="S">846</td>
  <td>926</td>
  <td>948</td>
  <td>1102</td>
  <td class="We">1282</td>
  <td class="We">1190</td>
  <td>891</td>
  <td>984</td>
  <td>869</td>
  <td>1002</td>
  <td>1095</td>
  <td class="We">1080</td>
  <td class="We">1344</td>
  <td>948</td>
  <td>952</td>
  <td>904</td>
  <td>983</td>
  <td>1121</td>
  <td class="We">1083</td>
  <td class="We">1333</td>
  <td>757</td>
  <td>864</td>
  <td>868</td>
  <td>889</td>
  <td>1005</td>
  <td class="We">1103</td>
  <td>962</td>
  <td>1032</td>
  <td>32039</td>
 </tr>
 <tr>
  <td class="RHR" style='mso-number-format:"@";'>02:00:00</td>
  <td class="S">591</td>
  <td class="S">759</td>
  <td class="WeS">776</td>
  <td class="WeS">851</td>
  <td class="WeS">706</td>
  <td class="S">567</td>
  <td>652</td>
  <td>674</td>
  <td>758</td>
  <td class="We">1007</td>
  <td class="We">836</td>
  <td>507</td>
  <td>801</td>
  <td>623</td>
  <td>638</td>
  <td>726</td>
  <td class="We">844</td>
  <td class="We">812</td>
  <td>538</td>
  <td>677</td>
  <td>620</td>
  <td>688</td>
  <td>720</td>
  <td class="We">733</td>
  <td class="We">843</td>
  <td>518</td>
  <td>593</td>
  <td>666</td>
  <td>664</td>
  <td>675</td>
  <td class="We">809</td>
  <td>653</td>
  <td>705</td>
  <td>21872</td>
 </tr>
 <tr class="T">
  <td class="RHR" style='mso-number-format:"@";'>03:00:00</td>
  <td class="S">787</td>
  <td class="S">979</td>
  <td class="WeS">831</td>
  <td class="WeS">844</td>
  <td class="WeS">788</td>
  <td class="S">781</td>
  <td>831</td>
  <td>817</td>
  <td>974</td>
  <td class="We">947</td>
  <td class="We">886</td>
  <td>868</td>
  <td>763</td>
  <td>802</td>
  <td>762</td>
  <td>904</td>
  <td class="We">882</td>
  <td class="We">863</td>
  <td>828</td>
  <td>772</td>
  <td>745</td>
  <td>798</td>
  <td>905</td>
  <td class="We">854</td>
  <td class="We">796</td>
  <td>837</td>
  <td>740</td>
  <td>740</td>
  <td>727</td>
  <td>881</td>
  <td class="We">896</td>
  <td>819</td>
  <td>833</td>
  <td>25828</td>
 </tr>
 <tr>
  <td class="RHR" style='mso-number-format:"@";'>04:00:00</td>
  <td class="S">1356</td>
  <td class="S">1449</td>
  <td class="WeS">1318</td>
  <td class="WeS">1133</td>
  <td class="WeS">1200</td>
  <td class="S">1432</td>
  <td>1351</td>
  <td>1321</td>
  <td>1637</td>
  <td class="We">1287</td>
  <td class="We">1117</td>
  <td>1448</td>
  <td>1358</td>
  <td>1342</td>
  <td>1337</td>
  <td>1389</td>
  <td class="We">1348</td>
  <td class="We">1064</td>
  <td>1367</td>
  <td>1408</td>
  <td>1447</td>
  <td>1346</td>
  <td>1538</td>
  <td class="We">1275</td>
  <td class="We">1048</td>
  <td>1317</td>
  <td>1391</td>
  <td>1196</td>
  <td>1394</td>
  <td>1498</td>
  <td class="We">1375</td>
  <td>1387</td>
  <td>1339</td>
  <td>41487</td>
 </tr>
 <tr>
  <td class="RHR" style='mso-number-format:"@";'>05:00:00</td>
  <td class="S">2516</td>
  <td class="S">2444</td>
  <td class="WeS">1659</td>
  <td class="WeS">1136</td>
  <td class="WeS">1512</td>
  <td class="S">2609</td>
  <td>2518</td>
  <td>2490</td>
  <td>2466</td>
  <td class="We">1608</td>
  <td class="We">1203</td>
  <td>2651</td>
  <td>2554</td>
  <td>2441</td>
  <td>2441</td>
  <td>2419</td>
  <td class="We">1628</td>
  <td class="We">1254</td>
  <td>2676</td>
  <td>2528</td>
  <td>2496</td>
  <td>2443</td>
  <td>2452</td>

Вот как это выглядит при открытии в Excel

введите описание изображения здесь

Поэтому мой вопрос заключается в том, как мне прочитать эту таблицу, которая отображается в Excel через Python. Когда я попытался разобрать его через функцию чтения XLS в пандах, это выдало мне ошибки

редактировать: все содержимое файла XLS можно найти здесь на pastebin

https://pastebin.com/2rpyAZhE

Всего 1 ответ


Один из вариантов - открыть файл в Excel и сохранить его в формате .xlsx. Вы могли бы прочитать это непосредственно в информационный кадр как это:

import pandas as pd
df = pd.read_excel(ƈrpyAZhE.xlsx')

Другой вариант будет использовать BeautifulSoup. Что-то вроде этого:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(ƈrpyAZhE.html', 'r'), "html.parser")
table = soup.find_all('table', {"class": "grid"})
df = pd.read_html(str(table[0]))[0]

Поскольку источник довольно грязный (объединенные ячейки, пустые строки и т. Д.), Вы можете выполнить некоторую предварительную обработку, чтобы немного его очистить / быть более избирательным при поиске красивого супа, прежде чем читать его в массив данных.


Есть идеи?

10000