TeachOpenCADD | 011 Querying web APIs
Cheminformatics
TeachOpenCADD
YangHe
Published on 2023-06-15
AI4SCUP-CNS-BBB(v1)

011 Querying online API web services

🏃🏻 Quick start
You can run this document directly on Bohrium Notebook. First, click the Connect button at the top of the interface, then select the bohrium-notebook:05-31 image and a suitable machine configuration; after a short wait you can start running.

📖 Source
This notebook comes from https://github.com/volkamerlab/teachopencadd.


Aim of this talktorial

In this notebook, you will learn how to programmatically use online web-services from Python, in the context of drug design. By the end of this talktorial, you will be familiar with REST services and web scraping.


Contents in Theory

  • Data access from a server-side perspective

Contents in Practical

  • Downloading static files
  • Accessing dynamically generated content
  • Programmatic interfaces
  • Document parsing
  • Browser remote control

References

This guide is very practical and omits some technical definitions for the sake of clarity. However, you should also be familiar with some basic terminology to fully understand what is going on behind the scenes.


Theory


The internet is a collection of connected computers that exchange data. In a way, you essentially query machines (servers) with certain parameters to retrieve specific data. That data will be either:

  • A. Served straight away, since the server is simply a repository of files. E.g. you can download the ChEMBL database dump from their servers.
  • B. Retrieved from a database and formatted in a particular way. The result you see on your browser is either:
    • B1. Pre-processed on the server, e.g. the HTML page you see when you visit any article in Wikipedia.
    • B2. Dynamically generated on the client (your browser) as you use the website, e.g. Twitter, Facebook, or any modern web-app.
  • C. Computed through the execution of one or more programs on the server-side, e.g. estimating the protonation states of a protein-ligand complex using Protoss.

In a way, configuration C is a special type of B1. You are just replacing the type of task that runs on the server: database querying and HTML rendering vs. computations that process your query and return data formatted in a domain-specific way.

Another way of categorizing online services is by the format of the returned data. Most pages you see on your browser are using HTML, usually focusing on presenting data in a human-readable way. However, some servers might structure that data in a way that is machine-readable. This data can be processed in a reliable way because it's formatted using a consistent set of rules that can be easily encoded in a program. Such programs are usually called parsers. HTML can be labeled in such a way that data can be obtained reliably, but it is not designed with that purpose in mind. As a result, we will usually prefer using services that provide machine-readable formats, like JSON, CSV or XML.
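
As a tiny illustration of why machine-readable formats are easier to work with, compare parsing a JSON record with extracting the same information from HTML. This is only a sketch with made-up data, not a real web response:

import json

# A hypothetical machine-readable (JSON) record for one kinase
json_payload = '{"name": "ABL1", "uniprot_id": "P00519", "group": "TK"}'
record = json.loads(json_payload)  # one call turns it into a Python dict
print(record["uniprot_id"])        # -> P00519

# The same information embedded in HTML is meant for humans;
# extracting it reliably requires a dedicated parser (e.g. BeautifulSoup)
html_payload = "<tr><td>ABL1</td><td>P00519</td><td>TK</td></tr>"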

In practice, both ways of data presentation (should) coexist in harmony. Modern web architecture strives to separate data retrieval tasks from end-user presentation. One popular implementation consists of using a programmatic endpoint that returns machine-readable JSON data, which is then consumed by the user-facing web application. The latter renders HTML, either on the server (option B1) or on the user's browser (option B2). Unfortunately, unlike the user-facing application, the programmatic endpoint (API) is not guaranteed to be publicly available, and is sometimes restricted to internal usage on the server side.

In the following sections, we will discuss how to make the most out of each type of online service using Python and some libraries!


Practical

[1]
from pathlib import Path

HERE = Path(_dh[-1])
DATA = HERE / "data"
TMPDATA = DATA / "_tmp" # this dir is gitignored
TMPDATA.mkdir(parents=True, exist_ok=True)

Downloading static files

In this case, the web server is hosting files that you will download and consume right away. All you need to do is query the server for the right address or URL (Uniform Resource Locator). You do this all the time when you browse the internet, and you can also do it with Python!

For example, let's get this kinase-related CSV dataset from GitHub, which contains a list of kinases and their identifiers.

Tip: Whenever you want to download a file hosted on GitHub, use the Raw button to obtain the downloadable URL!

(Screenshot: the Raw button on a GitHub file page)

While Python provides a standard library to deal with HTTP queries (urllib), people often prefer the third-party requests library because it is much simpler to use.

[2]
import requests

url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
response = requests.get(url)
response.raise_for_status()
response

# NBVAL_CHECK_OUTPUT
<Response [200]>

When you use requests.get(...) you obtain a Response object. This is not the file you want to download, but an object that wraps the HTTP query and the response the server gave you. Before we inspect the content, we always call .raise_for_status(), which will raise an exception if the server told us that the request could not be fulfilled. How does the server do that? With HTTP status codes, which are 3-digit numbers. There are several, but the most common ones are:

  • 200: Everything OK!
  • 404: File not found.
  • 500: Server error.

.raise_for_status() will raise an HTTPError if your response came back with an error code (4xx or 5xx) instead of a successful 200. As such, it's good practice to call it after every query!
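
If you prefer to inspect the status code yourself instead of (or before) raising an exception, you can check response.status_code directly. A minimal sketch reusing the response from above:

# `response` comes from the requests.get() call above
if response.status_code == 200:
    print("Everything OK!")
elif response.status_code == 404:
    print("File not found on the server.")
elif response.status_code >= 500:
    print("Something went wrong on the server side.")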

See this example of a bad URL: it contains an error, since there's no TXT file at that location, just a CSV.

[3]
# NBVAL_RAISES_EXCEPTION
bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
bad_response = requests.get(bad_url)
bad_response.raise_for_status()
bad_response
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[3], line 4
      2 bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
      3 bad_response = requests.get(bad_url)
----> 4 bad_response.raise_for_status()
      5 bad_response

File /opt/conda/lib/python3.8/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt

Ok, now let's get to the contents of the CSV file! Depending on what you are looking for, you will need one of these attributes:

  • response.content: The bytes returned by the server.
  • response.text: The contents of the file, as a string, if possible.
  • response.json(): If the server returns JSON data (more on this later), this method will parse it and return the corresponding dictionary.

Which one should you use? If you want to display some text in the Notebook output, then go for .text. Everything that involves binary files (images, archives, PDFs...) or downloading to disk should use .content.
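
To make the difference tangible, here is a small sketch using the response object from above (the exact output depends on the file):

# .content gives raw bytes, .text gives a decoded string
print(type(response.content))  # <class 'bytes'>
print(type(response.text))     # <class 'str'>
print(response.text[:60])      # first characters of the CSV
# response.json() would raise an error here, because this file is CSV, not JSON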

Since this is a CSV file, we know it's a plain text file, so we can use the usual Python string methods on it! Let's print the first 10 lines:

[4]
print(*response.text.splitlines()[:10], sep="\n")
xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
ADCK4,ADCK4,ADCK4,Uncharacterized aarF domain-containing protein kinase 4,Atypical,ABC1,ABC1-A,Q96D53
Trb1,Trb1,TRIB1,Tribbles homolog 1,CAMK,Trbl,,Q96RU8
BRSK2,BRSK2,BRSK2,Serine/threonine-protein kinase BRSK2,CAMK,CAMKL,BRSK,Q8IWQ3
Wnk2,Wnk2,WNK2,Serine/threonine-protein kinase WNK2,Other,WNK,,Q9Y3S1
AKT1,AKT1,AKT1,RAC-alpha serine/threonine-protein kinase,AGC,Akt,,P31749

Of course, you can save this to disk using the usual Python constructs. Since we are writing a downloaded file, it's recommended to use the raw bytes content, not the text version! This means you should use response.content and open your file in bytes mode (the b in wb):

[5]
with open(TMPDATA / "kinhub.csv", "wb") as f:
    f.write(response.content)

Open it again to check we wrote something.

[ ]
# We need the encoding="utf-8-sig" to ensure correct encoding
# under all platforms
with open(TMPDATA / "kinhub.csv", encoding="utf-8-sig") as f:
    # zip() will stop iterating with the shortest iterator,
    # so passing `range(5)` allows us to get just five lines ;)
    for _, line in zip(range(5), f):
        print(line.rstrip())

# NBVAL_CHECK_OUTPUT

Tip: If all you want to do is download a CSV file and open it with pandas, just pass the raw URL to pandas.read_csv. It will download the file for you!

[ ]
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
)
df.head()
# NBVAL_CHECK_OUTPUT

One note about file downloads: the method above reads the whole file into memory, which can be a problem for very big files. If you intend to download a very large file, you can stream it to disk in chunks using a streaming request. As an example, let's pretend this 1 MB video is too big to fit in memory:

[ ]
import shutil
from IPython.display import Video

response = requests.get(
    "https://archive.org/download/SlowMotionFlame/slomoflame_512kb.mp4", stream=True
)
response.raise_for_status()

with open(TMPDATA / "video.mp4", "wb") as tmp:
    for chunk in response.iter_content(chunk_size=8192):
        tmp.write(chunk)

# Let's play the movie in Jupyter!
# Paths passed to widgets need to be relative to notebook or they will 404 :)
display(Video(Path(tmp.name).relative_to(HERE)))

Accessing dynamically generated content

So far, we have been able to retrieve files that were present on a remote server. To do that, we used requests.get and a URL that points to the file.

Well, it turns out that the same technique will work for many more types of content! What the server does with the URL is not our concern! Whether the server only needs to give you a file on disk or query a database and assemble different parts into the returned content does not matter at all.

That concept alone is extremely powerful, as you will see now. Remember: We just need to make sure we request the correct URL!

Let's work on something fun now! The spike protein of SARS-CoV-2 has been one of the most talked-about proteins lately. Can we get some information about it from UniProt using requests? Its UniProt ID is P0DTC2. Go check with your browser first, you should see something like this:

UniProt entry for SARS-CoV-2

One of the things UniProt provides is the amino acid sequence of the listed protein. Scroll down until you see this part:

Sequence for SARS-CoV-2

Do you think we can get only the sequence using Python? Let's see!

To query a protein, you simply need to add its UniProt ID to the URL.

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2")
r.raise_for_status()
print(r.text[:5000])

Wow, what is all that noise? You are seeing the HTML content of the webpage! That's the markup language web developers use to write webpages.

There are libraries to process HTML and extract the actual content (like BeautifulSoup; more below), but we will not need them here yet. Fortunately, UniProt provides alternative representations of the data.

UniProt formats

Some formats are more convenient for programmatic use. If you click on Text you will see something different in your browser: just plain text! Also, notice how the URL is now different.

Just adding the .txt extension was enough to change the format. This is a nice feature UniProt provides: it mimics a file system, but it's actually changing the representation of the returned content. Elegant! And more importantly, easier to use programmatically! Check it:

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.txt")
r.raise_for_status()
print(r.text[:1000])

This is exactly what we see on our browser! Plain text is nice for these things. However, the sequence is all the way at the end of the file. To retrieve it, you need to get creative and analyze those little tags each line has. See how it begins with SQ and finishes with //:

SQ   SEQUENCE   1273 AA;  141178 MW;  B17BE6D9F1C4EA34 CRC64;
     MFVFLVLLPL VSSQCVNLTT RTQLPPAYTN SFTRGVYYPD KVFRSSVLHS TQDLFLPFFS
     NVTWFHAIHV SGTNGTKRFD NPVLPFNDGV YFASTEKSNI IRGWIFGTTL DSKTQSLLIV
     NNATNVVIKV CEFQFCNDPF LGVYYHKNNK SWMESEFRVY SSANNCTFEY VSQPFLMDLE
     GKQGNFKNLR EFVFKNIDGY FKIYSKHTPI NLVRDLPQGF SALEPLVDLP IGINITRFQT
     LLALHRSYLT PGDSSSGWTA GAAAYYVGYL QPRTFLLKYN ENGTITDAVD CALDPLSETK
     CTLKSFTVEK GIYQTSNFRV QPTESIVRFP NITNLCPFGE VFNATRFASV YAWNRKRISN
     CVADYSVLYN SASFSTFKCY GVSPTKLNDL CFTNVYADSF VIRGDEVRQI APGQTGKIAD
     YNYKLPDDFT GCVIAWNSNN LDSKVGGNYN YLYRLFRKSN LKPFERDIST EIYQAGSTPC
     NGVEGFNCYF PLQSYGFQPT NGVGYQPYRV VVLSFELLHA PATVCGPKKS TNLVKNKCVN
     FNFNGLTGTG VLTESNKKFL PFQQFGRDIA DTTDAVRDPQ TLEILDITPC SFGGVSVITP
     GTNTSNQVAV LYQDVNCTEV PVAIHADQLT PTWRVYSTGS NVFQTRAGCL IGAEHVNNSY
     ECDIPIGAGI CASYQTQTNS PRRARSVASQ SIIAYTMSLG AENSVAYSNN SIAIPTNFTI
     SVTTEILPVS MTKTSVDCTM YICGDSTECS NLLLQYGSFC TQLNRALTGI AVEQDKNTQE
     VFAQVKQIYK TPPIKDFGGF NFSQILPDPS KPSKRSFIED LLFNKVTLAD AGFIKQYGDC
     LGDIAARDLI CAQKFNGLTV LPPLLTDEMI AQYTSALLAG TITSGWTFGA GAALQIPFAM
     QMAYRFNGIG VTQNVLYENQ KLIANQFNSA IGKIQDSLSS TASALGKLQD VVNQNAQALN
     TLVKQLSSNF GAISSVLNDI LSRLDKVEAE VQIDRLITGR LQSLQTYVTQ QLIRAAEIRA
     SANLAATKMS ECVLGQSKRV DFCGKGYHLM SFPQSAPHGV VFLHVTYVPA QEKNFTTAPA
     ICHDGKAHFP REGVFVSNGT HWFVTQRNFY EPQIITTDNT FVSGNCDVVI GIVNNTVYDP
     LQPELDSFKE ELDKYFKNHT SPDVDLGDIS GINASVVNIQ KEIDRLNEVA KNLNESLIDL
     QELGKYEQYI KWPWYIWLGF IAGLIAIVMV TIMLCCMTSC CSCLKGCCSC GSCCKFDEDD
     SEPVLKGVKL HYT
//

Hence, you could do something like this:

[ ]
sequence_block = False
lines = []
for line in r.text.splitlines():
    if line.startswith("SQ"):
        sequence_block = True
    elif line.startswith("//"):
        sequence_block = False

    if sequence_block:
        line = line.strip()  # delete spaces and newlines at the beginning and end of the line
        line = line.replace(" ", "")  # delete spaces in the middle of the line
        lines.append(line)
sequence = "".join(lines[1:])  # the first line is the metadata header
print(f"This is your sequence: {sequence}")

# NBVAL_CHECK_OUTPUT

Ta-da! We got it! It required some processing, but it works... However, you should always wonder if there's an easier way. Given that UniProt has a nice way of providing the text representation, how come they don't offer a URL that only returns the sequence for a given UniProt ID? Well, they do! Just change .txt to .fasta: https://www.uniprot.org/uniprot/P0DTC2.fasta

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

This is returned in FASTA format, a common syntax in bioinformatics. You could use established libraries like Biopython to parse it too!

[ ]
from Bio import SeqIO
from tempfile import NamedTemporaryFile
import os

# Write response into a temporary text file
with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
    tmp.write(r.text)

# Create the BioPython object for sequence data:
sequence = SeqIO.read(tmp.name, format="fasta")

# Delete temporary file now that we have read it
os.remove(tmp.name)

print(sequence.description)
print(sequence.seq)

# NBVAL_CHECK_OUTPUT

All these ways to access different representations or sections of the data contained in UniProt constitute a URL-based API (Application Programming Interface). The foundational principle is that the URL contains all the parameters needed to ask the server for a specific type of content. Yes, you read that correctly: parameters. If you think about it, a URL specifies two parts: the machine you are connecting to and the page on that machine you want to access. When the page part is missing, the server assumes you are asking for index.html or an equivalent.

Let's compare it to a command-line interface:

@ # this is your browser
@ uniprot.org/uniprot/P0DTC2.fasta
$ # this is your terminal
$ uniprot --id=P0DTC2 --format=FASTA

Each part of the URL can be considered a positional argument! So, if you want the sequence of a different protein, just input its UniProt ID in the URL, done! For example, P00519 is the ID for the ABL1 kinase.

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P00519.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

What if we parameterize the URL with an f-string and provide a function to make it super Pythonic? Even better, what if we provide the Bio.SeqIO parsing functionality too?

[ ]
def sequence_for_uniprot_id(uniprot_id):
    """
    Returns the FASTA sequence of a given UniProt ID using
    the UniProt URL-based API.

    Parameters
    ----------
    uniprot_id : str

    Returns
    -------
    Bio.SeqIO.SeqRecord
    """
    # ⬇ this is the key part!
    r = requests.get(f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta")
    r.raise_for_status()

    with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
        tmp.write(r.text)

    sequence = SeqIO.read(tmp.name, format="fasta")
    os.remove(tmp.name)

    return sequence

Now you can use it for any UniProt ID. This is for the Src kinase:

[ ]
sequence = sequence_for_uniprot_id("P12931")
print(sequence)

# NBVAL_CHECK_OUTPUT

Congratulations! You have used your first online API in Python and adapted it to a workflow!


Programmatic interfaces

What UniProt does with their URLs is one way of providing access to their database, i.e., through specific URL schemes. However, if each web service came up with its own scheme, developers would need to figure out which scheme each website is using and then implement, adapt, or customize their scripts on a case-by-case basis. Fortunately, there are some standardized ways of providing programmatic access to online resources. Some of them include:

  • HTTP-based RESTful APIs (wiki)
  • GraphQL
  • SOAP
  • gRPC

In this talktorial, we will focus on the first one, REST.


HTTP-based RESTful APIs

This type of programmatic access defines a specific entry point for clients (scripts, libraries, programs), something like api.webservice.com. This is usually different from the website itself (webservice.com). These entry points can be versioned, so the provider can update the scheme without disrupting existing implementations (api.webservice.com/v1 will still work even when api.webservice.com/v2 has been deployed).

This kind of API is usually accompanied by well-written documentation explaining all the available actions on the platform. For example, look at the KLIFS API documentation. KLIFS is a database of kinase targets and small-molecule inhibitors. You can see how every argument and option is documented, along with usage examples.

If you want to list all the kinase groups available in KLIFS, you need to access this URL:

https://klifs.net/api/kinase_groups
Result:
[
  "AGC",
  "CAMK",
  "CK1",
  "CMGC",
  "Other",
  "STE",
  "TK",
  "TKL"
]

This response happens to be JSON-formatted! This is easily parsed into a Python object using the json library. The best news is that you don't even need that. Using requests, the following operation can be done in three lines thanks to the .json() method:

[ ]
import requests

response = requests.get("https://klifs.net/api/kinase_groups")
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

That's a Python list!

[ ]
result[0]

Let's see if we can get all the kinase families contained in a specific group. Reading the documentation, it looks like we need this kind of URL:

https://klifs.net/api/kinase_families?kinase_group={{ NAME }}

What follows after the ? symbol is the query. It's formatted with a key-value syntax like this: key=value. Multiple parameters can be expressed with &:

https://api.webservice.com/some/endpoint?parameter1=value1&parameter2=value2

Let's see the returned object for the tyrosine kinase (TK) group: kinase_group=TK

[ ]
response = requests.get("https://klifs.net/api/kinase_families?kinase_group=TK")
response.raise_for_status()
result = response.json()
result

Since passing parameters to the URL is a very common task, requests provides a more convenient way. This will save you from building the URLs manually or having to URL-escape the values. The key idea is to pass the key-value pairs as a dictionary. The previous query can be (and should be, if you ask us) done like this:

[ ]
response = requests.get("https://klifs.net/api/kinase_families", params={"kinase_group": "TK"})
# You can see how requests formatted the URL for you
print("Queried", response.url)
response.raise_for_status()
result = response.json()
result

Sometimes the returned JSON object is not a list, but a dict. Or a combination of dictionaries and lists. Maybe even nested! You can still access them using the Python tools you already know.

For example, the kinase_information endpoint requires a numeric ID, and will return a lot of information on a single kinase:

[ ]
response = requests.get("https://klifs.net/api/kinase_information", params={"kinase_ID": 22})
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

If you want to know the UniProt ID for this kinase, you will need to access the first (and only) element in the returned list, and ask for the value of the uniprot key:

[ ]
result[0]["uniprot"]

It turns out we can use this to get the full sequence of the protein (and not just the pocket sequence) using our UniProt function from before!

[ ]
mastl = sequence_for_uniprot_id(result[0]["uniprot"])
print(mastl.seq)

# NBVAL_CHECK_OUTPUT

We are using two web services together, awesome!


Generating a client for any API

Did you find that convenient? Well, we are not done yet! You might have noticed that all the endpoints in the KLIFS API have a similar pattern. You specify the name of the endpoint (kinase_groups, kinase_families, kinase_information, ...), pass some (optional) parameters if needed, and then get a JSON-formatted response. Is there a way you can avoid having to format the URLs yourself? The answer is... yes!

The REST API scheme can be expressed programmatically in a document called a Swagger/OpenAPI definition, which makes it possible to dynamically generate a Python client for any REST API that implements the Swagger/OpenAPI schema. This is the one for KLIFS.

Of course, there are libraries for doing that in Python, like bravado.

[ ]
from bravado.client import SwaggerClient

KLIFS_SWAGGER = "https://klifs.net/swagger/swagger.json"
client = SwaggerClient.from_url(KLIFS_SWAGGER, config={"validate_responses": False})
client

Then, you can have fun inspecting the client object for all the API actions as methods.

Tip: Type client. and press Tab to inspect the client in this notebook.

[ ]
?client.Information.get_kinase_names

bravado is auto-generating classes and functions that mirror the API we were using before! How cool is that? The same query can now be done without requests.

[ ]
client.Information.get_kinase_information(kinase_ID=[22])

Note that bravado does not return the response right away. It creates a promise that it will do so when you ask for it. This makes it usable in asynchronous programming, but for our purposes, it means that you need to call .result() to obtain the actual response.

[ ]
results = client.Information.get_kinase_information(kinase_ID=[22]).result()
result = results[0]
result
[ ]
result.uniprot

# NBVAL_CHECK_OUTPUT

bravado also builds result objects for you, so instead of the result["property"] syntax you can use result.property. Some more convenience for the end user ;)


Document parsing

Sometimes the web service will not provide a standardized API that produces machine-readable documents. Instead, you will have to use the regular webpage and parse through the HTML code to obtain the information you need. This is called (web) scraping, which usually involves finding the right HTML tags and IDs that contain the valuable data (ignoring things such as the sidebars, top menus, footers, ads, etc).

In scraping, you basically do two things:

  1. Access the webpage with requests and obtain the HTML contents.
  2. Parse the HTML string with BeautifulSoup or requests-html.

Let's parse the proteinogenic amino acids table in this Wikipedia article:

[ ]
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

r = requests.get("https://en.wikipedia.org/wiki/Proteinogenic_amino_acid")
r.raise_for_status()

# To guess the correct steps here, you will have to inspect the HTML code by hand
# Tip: use right-click + inspect content in any webpage to land in the HTML definition ;)
html = BeautifulSoup(r.text)
header = html.find("span", id="General_chemical_properties")
table = header.find_all_next()[4]
table_body = table.find("tbody")

data = []
for row in table_body.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        data.append([])
        for cell in cells:
            cell_content = cell.text.strip()
            try:  # convert to float if possible
                cell_content = float(cell_content)
            except ValueError:
                pass
            data[-1].append(cell_content)

# Empty fields are denoted with "?" which casts respective columns to object types
# (here mix of strings and floats) but we want float64, therefore replace "?" with NaN values
pd.DataFrame.from_records(data).replace("?", np.nan)

# NBVAL_CHECK_OUTPUT

Browser remote control

The trend some years ago was to build servers that dynamically generate HTML documents with some JavaScript here and there (such as Wikipedia). In other words, the HTML is built in the server and sent to the client (your browser).

However, more recent trends point towards full applications built entirely with JavaScript frameworks. This means that the HTML content is dynamically generated on the client. Traditional parsing will not work, because you will only download the placeholder HTML code that hosts the JavaScript framework. To work around this, the HTML must be rendered with a client-side JavaScript engine.

We won't cover this in depth in the current notebook, but you can look into browser-automation projects such as Selenium if you are interested; a minimal sketch follows below.
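
For instance, Selenium can drive a real (headless) browser from Python and hand you the HTML after the page's JavaScript has run. This is only a sketch and assumes the selenium package and a matching Chrome driver are installed; it is not executed in this notebook:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run the browser without opening a window
driver = webdriver.Chrome(options=options)  # requires Chrome + chromedriver

driver.get("https://example.org")           # the page's JavaScript is executed here
rendered_html = driver.page_source          # HTML *after* client-side rendering
driver.quit()

# rendered_html can now be parsed with BeautifulSoup, as shown above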


Discussion

In this theoretical introduction you have seen how different methods to programmatically access online web services can be used from a Python interpreter. Leveraging these techniques, you will be able to build automated pipelines inside Jupyter Notebooks. In the end, querying a database or downloading a file involves the same kind of tooling.

Unfortunately, there is too much material about web APIs to cover in a single lesson. For example, how do you send or upload content from Python? Can you submit forms? If you are interested in knowing more, the requests documentation should be your go-to resource.


Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Querying online API webservices

Authors:
