TeachOpenCADD | 011 Querying web APIs
Cheminformatics
TeachOpenCADD
YangHe
Published on 2023-06-15
AI4SCUP-CNS-BBB(v1)

011 Querying online API web services

🏃🏻 Quick start
You can run this document directly on Bohrium Notebook. First, click the Connect button at the top of the interface, then select the bohrium-notebook:05-31 image and a suitable machine configuration; after a short wait you can start running.

📖 Source
This notebook comes from https://github.com/volkamerlab/teachopencadd.


Aim of this talktorial

In this notebook, you will learn how to programmatically use online web-services from Python, in the context of drug design. By the end of this talktorial, you will be familiar with REST services and web scraping.


Contents in Theory

  • Data access from a server-side perspective

Contents in Practical

  • Downloading static files
  • Accessing dynamically generated content
  • Programmatic interfaces
  • Document parsing
  • Browser remote control

References

This guide is very practical and omits some technical definitions for the sake of clarity. However, you should also be familiar with some basic terminology to fully understand what is going on behind the scenes.


Theory


The internet is a collection of connected computers that exchange data. In a way, you essentially query machines (servers) with certain parameters to retrieve specific data. That data will be either:

  • A. Served straight away, since the server is simply a repository of files. E.g. you can download the ChEMBL database dump from their servers.
  • B. Retrieved from a database and formatted in a particular way. The result you see on your browser is either:
    • B1. Pre-processed on the server, e.g. the HTML page you see when you visit any article in Wikipedia.
    • B2. Dynamically generated on the client (your browser) as you use the website, e.g. Twitter, Facebook, or any modern web-app.
  • C. Computed through the execution of one or more programs on the server-side, e.g. estimating the protonation states of a protein-ligand complex using Protoss.

In a way, configuration C is a special type of B1. You are just replacing the type of task that runs on the server: database querying and HTML rendering vs. computations that process your query and return data formatted in a domain-specific way.

Another way of categorizing online services is by the format of the returned data. Most pages you see on your browser are using HTML, usually focusing on presenting data in a human-readable way. However, some servers might structure that data in a way that is machine-readable. This data can be processed in a reliable way because it's formatted using a consistent set of rules that can be easily encoded in a program. Such programs are usually called parsers. HTML can be labeled in such a way that data can be obtained reliably, but it is not designed with that purpose in mind. As a result, we will usually prefer using services that provide machine-readable formats, like JSON, CSV or XML.
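
As a tiny illustration of why machine-readable formats are easier to work with, compare parsing a JSON record with extracting the same information from HTML. This is only a sketch with made-up data, not a real web response:

import json

# A hypothetical machine-readable (JSON) record for one kinase
json_payload = '{"name": "ABL1", "uniprot_id": "P00519", "group": "TK"}'
record = json.loads(json_payload)  # one call turns it into a Python dict
print(record["uniprot_id"])        # -> P00519

# The same information embedded in HTML is meant for humans;
# extracting it reliably requires a dedicated parser (e.g. BeautifulSoup)
html_payload = "<tr><td>ABL1</td><td>P00519</td><td>TK</td></tr>"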

In practice, both ways of data presentation (should) coexist in harmony. Modern web architecture strives to separate data retrieval tasks from end-user presentation. One popular implementation consists of using a programmatic endpoint that returns machine-readable JSON data, which is then consumed by the user-facing web application. The latter renders HTML, either on the server (option B1) or on the user's browser (option B2). Unfortunately, unlike the user-facing application, the programmatic endpoint (API) is not guaranteed to be publicly available, and is sometimes restricted to internal usage on the server side.

In the following sections, we will discuss how to make the most out of each type of online service using Python and some libraries!


Practical

[1]
from pathlib import Path

HERE = Path(_dh[-1])
DATA = HERE / "data"
TMPDATA = DATA / "_tmp" # this dir is gitignored
TMPDATA.mkdir(parents=True, exist_ok=True)

Downloading static files

In this case, the web server is hosting files that you will download and consume right away. All you need to do is query the server for the right address or URL (Uniform Resource Locator). You do this all the time when you browse the internet, and you can also do it with Python!

For example, let's get this kinase-related CSV dataset from GitHub, which contains a list of kinases and their identifiers.

Tip: Whenever you want to download a file hosted on GitHub, use the Raw button to obtain the downloadable URL!

(Screenshot: the Raw button on a GitHub file page)

While Python provides a standard library to deal with HTTP queries (urllib), people often prefer the third-party requests library because it is much simpler to use.

[2]
import requests

url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
response = requests.get(url)
response.raise_for_status()
response

# NBVAL_CHECK_OUTPUT
<Response [200]>

When you use requests.get(...) you obtain a Response object. This is not the file you want to download, but an object that wraps the HTTP query and the response the server gave you. Before we inspect the content, we always call .raise_for_status(), which will raise an exception if the server told us that the request could not be fulfilled. How does the server do that? With HTTP status codes, which are 3-digit numbers. There are several, but the most common ones are:

  • 200: Everything OK!
  • 404: File not found.
  • 500: Server error.

.raise_for_status() will raise an HTTPError if your response came back with an error code (4xx or 5xx) instead of a successful 200. As such, it's good practice to call it after every query!
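
If you prefer to inspect the status code yourself instead of (or before) raising an exception, you can check response.status_code directly. A minimal sketch reusing the response from above:

# `response` comes from the requests.get() call above
if response.status_code == 200:
    print("Everything OK!")
elif response.status_code == 404:
    print("File not found on the server.")
elif response.status_code >= 500:
    print("Something went wrong on the server side.")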

See this example of a bad URL: it contains an error, since there's no TXT file at that location, just a CSV.

[3]
# NBVAL_RAISES_EXCEPTION
bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
bad_response = requests.get(bad_url)
bad_response.raise_for_status()
bad_response
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[3], line 4
      2 bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
      3 bad_response = requests.get(bad_url)
----> 4 bad_response.raise_for_status()
      5 bad_response

File /opt/conda/lib/python3.8/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt

Ok, now let's get to the contents of the CSV file! Depending on what you are looking for, you will need one of these attributes:

  • response.content: The bytes returned by the server.
  • response.text: The contents of the file, as a string, if possible.
  • response.json(): If the server returns JSON data (more on this later), this method will parse it and return the corresponding dictionary.

Which one should you use? If you want to display some text in the Notebook output, then go for .text. Everything that involves binary files (images, archives, PDFs...) or downloading to disk should use .content.
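
To make the difference tangible, here is a small sketch using the response object from above (the exact output depends on the file):

# .content gives raw bytes, .text gives a decoded string
print(type(response.content))  # <class 'bytes'>
print(type(response.text))     # <class 'str'>
print(response.text[:60])      # first characters of the CSV
# response.json() would raise an error here, because this file is CSV, not JSON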

Since this is a CSV file, we know it's a plain text file, so we can use the usual Python string methods on it! Let's print the first 10 lines:

[4]
print(*response.text.splitlines()[:10], sep="\n")
xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
ADCK4,ADCK4,ADCK4,Uncharacterized aarF domain-containing protein kinase 4,Atypical,ABC1,ABC1-A,Q96D53
Trb1,Trb1,TRIB1,Tribbles homolog 1,CAMK,Trbl,,Q96RU8
BRSK2,BRSK2,BRSK2,Serine/threonine-protein kinase BRSK2,CAMK,CAMKL,BRSK,Q8IWQ3
Wnk2,Wnk2,WNK2,Serine/threonine-protein kinase WNK2,Other,WNK,,Q9Y3S1
AKT1,AKT1,AKT1,RAC-alpha serine/threonine-protein kinase,AGC,Akt,,P31749

Of course, you can save this to disk using the usual Python constructs. Since we are writing a downloaded file, it's recommended to use the raw bytes content, not the text version! This means you should use response.content and open your file in bytes mode (the b in wb):

[5]
with open(TMPDATA / "kinhub.csv", "wb") as f:
    f.write(response.content)

Open it again to check we wrote something.

[ ]
# We need the encoding="utf-8-sig" to ensure correct encoding
# under all platforms
with open(TMPDATA / "kinhub.csv", encoding="utf-8-sig") as f:
    # zip() will stop iterating with the shortest iterator,
    # so passing `range(5)` allows us to get just five lines ;)
    for _, line in zip(range(5), f):
        print(line.rstrip())

# NBVAL_CHECK_OUTPUT

Tip: If all you want to do is download a CSV file and open it with pandas, just pass the raw URL to pandas.read_csv. It will download the file for you!

[ ]
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
)
df.head()
# NBVAL_CHECK_OUTPUT

One note about file downloads: the method above reads the whole file into memory, which can be a problem for very big files. If you intend to download a very large file, you can stream it to disk in chunks using a streaming request. As an example, let's pretend this 1 MB video is too big to fit in memory:

[ ]
import shutil
from IPython.display import Video

response = requests.get(
    "https://archive.org/download/SlowMotionFlame/slomoflame_512kb.mp4", stream=True
)
response.raise_for_status()

with open(TMPDATA / "video.mp4", "wb") as tmp:
    for chunk in response.iter_content(chunk_size=8192):
        tmp.write(chunk)

# Let's play the movie in Jupyter!
# Paths passed to widgets need to be relative to notebook or they will 404 :)
display(Video(Path(tmp.name).relative_to(HERE)))

Accessing dynamically generated content

So far, we have been able to retrieve files that were present on a remote server. To do that, we used requests.get and a URL that points to the file.

Well, it turns out that the same technique will work for many more types of content! What the server does with the URL is not our concern! Whether the server only needs to give you a file on disk or query a database and assemble different parts into the returned content does not matter at all.

That concept alone is extremely powerful, as you will see now. Remember: We just need to make sure we request the correct URL!

Let's work on something fun now! The spike protein of SARS-CoV-2 has been one of the most talked-about proteins lately. Can we get some information about it from UniProt using requests? Its UniProt ID is P0DTC2. Go check with your browser first, you should see something like this:

UniProt entry for SARS-CoV-2

One of the things UniProt provides is the amino acid sequence of the listed protein. Scroll down until you see this part:

Sequence for SARS-CoV-2

Do you think we can get only the sequence using Python? Let's see!

To query a protein, you simply need to add its UniProt ID to the URL.

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2")
r.raise_for_status()
print(r.text[:5000])

Wow, what is all that noise? You are seeing the HTML content of the webpage! That's the markup language web developers use to write webpages.

There are libraries to process HTML and extract the actual content (like BeautifulSoup; more below), but we will not need them here yet. Fortunately, UniProt provides alternative representations of the data.

UniProt formats

Some formats are more convenient for programmatic use. If you click on Text you will see something different in your browser: just plain text! Also, notice how the URL is now different.

Just adding the .txt extension was enough to change the format. This is a nice feature UniProt provides: it mimics a file system, but it's actually changing the representation of the returned content. Elegant! And more importantly, easier to use programmatically! Check it:

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.txt")
r.raise_for_status()
print(r.text[:1000])

This is exactly what we see on our browser! Plain text is nice for these things. However, the sequence is all the way at the end of the file. To retrieve it, you need to get creative and analyze those little tags each line has. See how it begins with SQ and finishes with //:

SQ   SEQUENCE   1273 AA;  141178 MW;  B17BE6D9F1C4EA34 CRC64;
     MFVFLVLLPL VSSQCVNLTT RTQLPPAYTN SFTRGVYYPD KVFRSSVLHS TQDLFLPFFS
     NVTWFHAIHV SGTNGTKRFD NPVLPFNDGV YFASTEKSNI IRGWIFGTTL DSKTQSLLIV
     NNATNVVIKV CEFQFCNDPF LGVYYHKNNK SWMESEFRVY SSANNCTFEY VSQPFLMDLE
     GKQGNFKNLR EFVFKNIDGY FKIYSKHTPI NLVRDLPQGF SALEPLVDLP IGINITRFQT
     LLALHRSYLT PGDSSSGWTA GAAAYYVGYL QPRTFLLKYN ENGTITDAVD CALDPLSETK
     CTLKSFTVEK GIYQTSNFRV QPTESIVRFP NITNLCPFGE VFNATRFASV YAWNRKRISN
     CVADYSVLYN SASFSTFKCY GVSPTKLNDL CFTNVYADSF VIRGDEVRQI APGQTGKIAD
     YNYKLPDDFT GCVIAWNSNN LDSKVGGNYN YLYRLFRKSN LKPFERDIST EIYQAGSTPC
     NGVEGFNCYF PLQSYGFQPT NGVGYQPYRV VVLSFELLHA PATVCGPKKS TNLVKNKCVN
     FNFNGLTGTG VLTESNKKFL PFQQFGRDIA DTTDAVRDPQ TLEILDITPC SFGGVSVITP
     GTNTSNQVAV LYQDVNCTEV PVAIHADQLT PTWRVYSTGS NVFQTRAGCL IGAEHVNNSY
     ECDIPIGAGI CASYQTQTNS PRRARSVASQ SIIAYTMSLG AENSVAYSNN SIAIPTNFTI
     SVTTEILPVS MTKTSVDCTM YICGDSTECS NLLLQYGSFC TQLNRALTGI AVEQDKNTQE
     VFAQVKQIYK TPPIKDFGGF NFSQILPDPS KPSKRSFIED LLFNKVTLAD AGFIKQYGDC
     LGDIAARDLI CAQKFNGLTV LPPLLTDEMI AQYTSALLAG TITSGWTFGA GAALQIPFAM
     QMAYRFNGIG VTQNVLYENQ KLIANQFNSA IGKIQDSLSS TASALGKLQD VVNQNAQALN
     TLVKQLSSNF GAISSVLNDI LSRLDKVEAE VQIDRLITGR LQSLQTYVTQ QLIRAAEIRA
     SANLAATKMS ECVLGQSKRV DFCGKGYHLM SFPQSAPHGV VFLHVTYVPA QEKNFTTAPA
     ICHDGKAHFP REGVFVSNGT HWFVTQRNFY EPQIITTDNT FVSGNCDVVI GIVNNTVYDP
     LQPELDSFKE ELDKYFKNHT SPDVDLGDIS GINASVVNIQ KEIDRLNEVA KNLNESLIDL
     QELGKYEQYI KWPWYIWLGF IAGLIAIVMV TIMLCCMTSC CSCLKGCCSC GSCCKFDEDD
     SEPVLKGVKL HYT
//

Hence, you could do something like this:

[ ]
sequence_block = False
lines = []
for line in r.text.splitlines():
    if line.startswith("SQ"):
        sequence_block = True
    elif line.startswith("//"):
        sequence_block = False

    if sequence_block:
        line = line.strip()  # delete spaces and newlines at the beginning and end of the line
        line = line.replace(" ", "")  # delete spaces in the middle of the line
        lines.append(line)
sequence = "".join(lines[1:])  # the first line is the metadata header
print(f"This is your sequence: {sequence}")

# NBVAL_CHECK_OUTPUT

Ta-da! We got it! It required some processing, but it works... However, you should always wonder if there's an easier way. Given that UniProt has a nice way of providing the text representation, how come they don't offer a URL that only returns the sequence for a given UniProt ID? Well, they do! Just change .txt to .fasta: https://www.uniprot.org/uniprot/P0DTC2.fasta

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

This is returned in FASTA format, a common syntax in bioinformatics. You could use established libraries like Biopython to parse it too!

[ ]
from Bio import SeqIO
from tempfile import NamedTemporaryFile
import os

# Write response into a temporary text file
with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
    tmp.write(r.text)

# Create the BioPython object for sequence data:
sequence = SeqIO.read(tmp.name, format="fasta")

# Delete temporary file now that we have read it
os.remove(tmp.name)

print(sequence.description)
print(sequence.seq)

# NBVAL_CHECK_OUTPUT

All these ways to access different representations or sections of the data contained in UniProt constitute a URL-based API (Application Programming Interface). The foundational principle is that the URL contains all the parameters needed to ask the server for a specific type of content. Yes, you read that correctly: parameters. If you think about it, a URL specifies two parts: the machine you are connecting to and the page on that machine you want to access. When the page part is missing, the server assumes you are asking for index.html or an equivalent.

Let's compare it to a command-line interface:

@ # this is your browser
@ uniprot.org/uniprot/P0DTC2.fasta
$ # this is your terminal
$ uniprot --id=P0DTC2 --format=FASTA

Each part of the URL can be considered a positional argument! So, if you want the sequence of a different protein, just input its UniProt ID in the URL, done! For example, P00519 is the ID for the ABL1 kinase.

[ ]
r = requests.get("https://www.uniprot.org/uniprot/P00519.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

What if we parameterize the URL with an f-string and provide a function to make it super Pythonic? Even better, what if we provide the Bio.SeqIO parsing functionality too?

[ ]
def sequence_for_uniprot_id(uniprot_id):
    """
    Returns the FASTA sequence of a given UniProt ID using
    the UniProt URL-based API.

    Parameters
    ----------
    uniprot_id : str

    Returns
    -------
    Bio.SeqIO.SeqRecord
    """
    # ⬇ this is the key part!
    r = requests.get(f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta")
    r.raise_for_status()

    with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
        tmp.write(r.text)

    sequence = SeqIO.read(tmp.name, format="fasta")
    os.remove(tmp.name)

    return sequence

Now you can use it for any UniProt ID. This is for the Src kinase:

[ ]
sequence = sequence_for_uniprot_id("P12931")
print(sequence)

# NBVAL_CHECK_OUTPUT

Congratulations! You have used your first online API in Python and adapted it to a workflow!


Programmatic interfaces

What UniProt does with their URLs is one way of providing access to their database, i.e., through specific URL schemes. However, if each web service came up with its own scheme, developers would need to figure out which scheme each website is using and then implement, adapt, or customize their scripts on a case-by-case basis. Fortunately, there are some standardized ways of providing programmatic access to online resources. Some of them include:

  • HTTP-based RESTful APIs (wiki)
  • GraphQL
  • SOAP
  • gRPC

In this talktorial, we will focus on the first one, REST.


HTTP-based RESTful APIs

This type of programmatic access defines a specific entry point for clients (scripts, libraries, programs), something like api.webservice.com. This is usually different from the website itself (webservice.com). These entry points can be versioned, so the provider can update the scheme without disrupting existing implementations (api.webservice.com/v1 will still work even when api.webservice.com/v2 has been deployed).

This kind of API is usually accompanied by well-written documentation explaining all the available actions on the platform. For example, look at the KLIFS API documentation. KLIFS is a database of kinase targets and small-molecule inhibitors. You can see how every argument and option is documented, along with usage examples.

If you want to list all the kinase groups available in KLIFS, you need to access this URL:

https://klifs.net/api/kinase_groups
Result:
[
  "AGC",
  "CAMK",
  "CK1",
  "CMGC",
  "Other",
  "STE",
  "TK",
  "TKL"
]

This response happens to be JSON-formatted! This is easily parsed into a Python object using the json library. The best news is that you don't even need that. Using requests, the following operation can be done in three lines thanks to the .json() method:

[ ]
import requests

response = requests.get("https://klifs.net/api/kinase_groups")
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

That's a Python list!

[ ]
result[0]

Let's see if we can get all the kinase families contained in a specific group. Reading the documentation, it looks like we need this kind of URL:

https://klifs.net/api/kinase_families?kinase_group={{ NAME }}

What follows after the ? symbol is the query. It's formatted with a key-value syntax like this: key=value. Multiple parameters can be expressed with &:

https://api.webservice.com/some/endpoint?parameter1=value1&parameter2=value2

Let's see the returned object for the tyrosine kinase (TK) group: kinase_group=TK

[ ]
response = requests.get("https://klifs.net/api/kinase_families?kinase_group=TK")
response.raise_for_status()
result = response.json()
result

Since passing parameters to the URL is a very common task, requests provides a more convenient way. This will save you from building the URLs manually or having to URL-escape the values. The key idea is to pass the key-value pairs as a dictionary. The previous query can be (and should be, if you ask us) done like this:

[ ]
response = requests.get("https://klifs.net/api/kinase_families", params={"kinase_group": "TK"})
# You can see how requests formatted the URL for you
print("Queried", response.url)
response.raise_for_status()
result = response.json()
result

Sometimes the returned JSON object is not a list, but a dict. Or a combination of dictionaries and lists. Maybe even nested! You can still access them using the Python tools you already know.

For example, the kinase_information endpoint requires a numeric ID, and will return a lot of information on a single kinase:

[ ]
response = requests.get("https://klifs.net/api/kinase_information", params={"kinase_ID": 22})
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

If you want to know the UniProt ID for this kinase, you will need to access the first (and only) element in the returned list, and ask for the value of the uniprot key:

[ ]
result[0]["uniprot"]

It turns out we can use this to get the full sequence of the protein (and not just the pocket sequence) using our UniProt function from before!

[ ]
mastl = sequence_for_uniprot_id(result[0]["uniprot"])
print(mastl.seq)

# NBVAL_CHECK_OUTPUT

We are using two web services together, awesome!


Generating a client for any API

Did you find that convenient? Well, we are not done yet! You might have noticed that all the endpoints in the KLIFS API have a similar pattern. You specify the name of the endpoint (kinase_groups, kinase_families, kinase_information, ...), pass some (optional) parameters if needed, and then get a JSON-formatted response. Is there a way you can avoid having to format the URLs yourself? The answer is... yes!

The REST API scheme can be expressed programmatically in a document called a Swagger/OpenAPI definition, which makes it possible to dynamically generate a Python client for any REST API that implements the Swagger/OpenAPI schema. This is the one for KLIFS.

Of course, there are libraries for doing that in Python, like bravado.

[ ]
from bravado.client import SwaggerClient

KLIFS_SWAGGER = "https://klifs.net/swagger/swagger.json"
client = SwaggerClient.from_url(KLIFS_SWAGGER, config={"validate_responses": False})
client

Then, you can have fun inspecting the client object for all the API actions as methods.

Tip: Type client. and press Tab to inspect the client in this notebook.

[ ]
?client.Information.get_kinase_names

bravado is auto-generating classes and functions that mirror the API we were using before! How cool is that? The same query can now be done without requests.

[ ]
client.Information.get_kinase_information(kinase_ID=[22])

Note that bravado does not return the response right away. It creates a promise that it will do so when you ask for it. This makes it usable in asynchronous programming, but for our purposes, it means that you need to call .result() to obtain the actual response.

[ ]
results = client.Information.get_kinase_information(kinase_ID=[22]).result()
result = results[0]
result
[ ]
result.uniprot

# NBVAL_CHECK_OUTPUT

bravado also builds result objects for you, so instead of the result["property"] syntax you can use result.property. Some more convenience for the end user ;)


Document parsing

Sometimes the web service will not provide a standardized API that produces machine-readable documents. Instead, you will have to use the regular webpage and parse through the HTML code to obtain the information you need. This is called (web) scraping, which usually involves finding the right HTML tags and IDs that contain the valuable data (ignoring things such as the sidebars, top menus, footers, ads, etc).

In scraping, you basically do two things:

  1. Access the webpage with requests and obtain the HTML contents.
  2. Parse the HTML string with BeautifulSoup or requests-html.

Let's parse the proteinogenic amino acids table in this Wikipedia article:

[ ]
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

r = requests.get("https://en.wikipedia.org/wiki/Proteinogenic_amino_acid")
r.raise_for_status()

# To guess the correct steps here, you will have to inspect the HTML code by hand
# Tip: use right-click + inspect content in any webpage to land in the HTML definition ;)
html = BeautifulSoup(r.text)
header = html.find("span", id="General_chemical_properties")
table = header.find_all_next()[4]
table_body = table.find("tbody")

data = []
for row in table_body.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        data.append([])
        for cell in cells:
            cell_content = cell.text.strip()
            try:  # convert to float if possible
                cell_content = float(cell_content)
            except ValueError:
                pass
            data[-1].append(cell_content)

# Empty fields are denoted with "?" which casts respective columns to object types
# (here mix of strings and floats) but we want float64, therefore replace "?" with NaN values
pd.DataFrame.from_records(data).replace("?", np.nan)

# NBVAL_CHECK_OUTPUT

Browser remote control

The trend some years ago was to build servers that dynamically generate HTML documents with some JavaScript here and there (such as Wikipedia). In other words, the HTML is built in the server and sent to the client (your browser).

However, more recent trends point towards full applications built entirely with JavaScript frameworks. This means that the HTML content is dynamically generated on the client. Traditional parsing will not work, because you will only download the placeholder HTML code that hosts the JavaScript framework. To work around this, the HTML must be rendered with a client-side JavaScript engine.

We won't cover this in depth in the current notebook, but you can look into browser-automation projects such as Selenium if you are interested; a minimal sketch follows below.
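
For instance, Selenium can drive a real (headless) browser from Python and hand you the HTML after the page's JavaScript has run. This is only a sketch and assumes the selenium package and a matching Chrome driver are installed; it is not executed in this notebook:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run the browser without opening a window
driver = webdriver.Chrome(options=options)  # requires Chrome + chromedriver

driver.get("https://example.org")           # the page's JavaScript is executed here
rendered_html = driver.page_source          # HTML *after* client-side rendering
driver.quit()

# rendered_html can now be parsed with BeautifulSoup, as shown above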


Discussion

In this theoretical introduction you have seen how different methods to programmatically access online web services can be used from a Python interpreter. Leveraging these techniques, you will be able to build automated pipelines inside Jupyter Notebooks. In the end, querying a database or downloading a file involves the same kind of tooling.

Unfortunately, there is too much material about web APIs to cover in a single lesson. For example, how do you send or upload content from Python? Can you submit forms? If you are interested in knowing more, the requests documentation should be your go-to resource.


Reference

https://github.com/volkamerlab/teachopencadd

Reprint statement

Original title: Querying online API webservices

Authors:
