Text lab in Pavia
Motivation
Use existing tools to save your time. Learn some programming to save even more of your time.
Sketch Engine (account)
ELEXIS login for European universities (including Pavia). Everyone else can sign up (join a multiuser account) with the phrase paviaMay2019, which is valid for one month starting 8 May 2019. Access the new Sketch Engine.
Sketch Engine API
REST: HTTP GET; accessible from any programming language with network support. Examples online in bash, Java, R and Python.
JSON
JSON (JavaScript Object Notation) is a simple format for storing data (lists, dictionaries, strings, numbers and boolean values). Example
Online parser and validator and formatter.
To process JSON in Python you need the json library. It is part of the standard library, so import json should work out of the box. You can then use two functions: loads for converting a JSON string into a Python object, and dumps for the reverse.
Let’s try to process the example JSON from above.
import json
jsondata = json.loads('{"key": 1, "key2": "test"}')
print(jsondata["key"])
print(jsondata["key2"])
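The reverse direction, dumps, is only mentioned above. A minimal sketch of a round trip could look like this; the record dictionary is made up for illustration and does not come from Sketch Engine.

```python
import json

# A made-up Python object for illustration (not workshop data).
record = {"lemma": "suonare", "freq": 42, "tags": ["VERB"], "checked": True}

# dumps: Python object -> JSON string (Python's True becomes JSON's true).
text = json.dumps(record)
print(text)

# loads: JSON string -> Python object; the round trip gives back an equal object.
restored = json.loads(text)
print(restored == record)
```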
Alternatively you may use repl.it online Python 3 IDE.
Python requests module
In your console (or in a short script you execute), try import requests. If this doesn’t work, have a look at this how-to and install it using pip in your console.
python -m pip install requests
or
py -m pip install requests
should work.
You can download a webpage (HTML source) by using the get function.
import requests
response = requests.get("https://www.google.com")
# response.text response.json() response.status_code
print(response.status_code)
Now we want to send some parameters (query string in URL after “?”):
import requests
r = requests.get("https://www.google.com/search", params={"q": "pavia"})
print("Pavia" in r.text)
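To see what the params dictionary turns into, the query string can be built by hand. This is only a sketch using the standard library's urllib.parse; it approximates what requests does internally and is not needed in normal use.

```python
from urllib.parse import urlencode

base = "https://www.google.com/search"
params = {"q": "pavia"}

# urlencode turns the dictionary into the part of the URL after "?".
url = base + "?" + urlencode(params)
print(url)

# Spaces and quotes are escaped, which matters once queries contain CQL.
print(urlencode({"q": 'lemma "suonare"'}))
```

Passing a dictionary via params is therefore safer than gluing the URL together yourself, because the escaping is done for you.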
Supported languages in SkE
import requests
r = requests.get("https://app.sketchengine.eu/ca/api/languages")
for language in r.json()["data"]:
    print(language["name"])
List of corpora
import requests
r = requests.get("https://app.sketchengine.eu/ca/api/corpora")
for corpus in r.json()["data"]:
    print(corpus["name"], corpus["language_name"], corpus["sizes"]["wordcount"])
API server, username, API key
In the UI, click the top right menu and open My account. There you should see your API key, or you can generate a new one. The key tells the SkE API that the request really comes from you and your existing account.
We use a separate API server; its address is https://api.sketchengine.co.uk/corpus/. We will always send three parameters in our queries:
username=YOUR-USERNAME
api_key=YOUR-API-KEY
format=json
A trick
The new UI sends API queries in the background. Open the browser console (Network tab) and have a look at what is sent to our server. You can easily copy the query into your own API requests.
Request template
import requests
base_url = "https://api.sketchengine.co.uk/corpus/"
method = "..." # view, thes, wsketch, wordlist, ...
parameters = {
    "corpname": "...",
    "username": "...",
    "api_key": "...",
    "format": "json"
}
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
# process jsondata dictionary
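Since every request repeats the same three parameters, the template can be wrapped in a small helper. The sketch below is one possible arrangement; the function name build_request is hypothetical, and it only assembles the URL and parameters without sending anything.

```python
BASE_URL = "https://api.sketchengine.co.uk/corpus/"

def build_request(method, corpname, username, api_key, **extra):
    """Return (url, params) for one API call; nothing is sent here."""
    params = {
        "corpname": corpname,
        "username": username,
        "api_key": api_key,
        "format": "json",
    }
    # Method-specific parameters such as "q" or "lemma" are added on top.
    params.update(extra)
    return BASE_URL + method, params

url, params = build_request(
    "view", "ittenten16_2", "YOUR-USERNAME", "YOUR-API-KEY", pagesize=10
)
print(url)
print(sorted(params))
```

The returned pair can then be passed on as requests.get(url, params=params).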
API capabilities
Concordance (full-text search), parallel concordances, word sketch (collocational profiles of words), distributional thesaurus, word sketch difference (comparing two words and their collocational behaviour), wordlists, n-grams, keyword and term extraction, good dictionary examples (GDEX), bilingual terminology extraction (OneClick Terms), diachronic analysis (trends in time), …
Full-text search
CQL: corpus query language
Have a look at the cheat sheets and at the online documentation.
Basics.
What every linguist should know: regular expressions.
We will try to get sentences from an Italian corpus which contain the lemma suonare followed by any noun. We will use the method view.
import requests
base_url = "https://api.sketchengine.co.uk/corpus/"
method = "view"
parameters = {
    "corpname": "ittenten16_2",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "q": 'q[lemma="suonare"][tag="NOUN"]', # CQL
    "viewmode": "sen", # KWIC vs. sentence mode
    "pagesize": 10, # limit the number of sentences per page
    "structs": "g" # what structures should be in the result
}
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
for line in jsondata["Lines"]:
    ...
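What goes into the loop body depends on the shape of the response. The sketch below assumes that each item in "Lines" has "Left", "Kwic" and "Right" lists whose elements carry the text under a "str" key. That shape is an assumption here, so check it against a real response first. The jsondata below is a hand-written stand-in, not API output.

```python
# Hand-written stand-in for jsondata; the real response may be shaped differently.
jsondata = {
    "Lines": [
        {
            "Left": [{"str": "Ho sentito"}],
            "Kwic": [{"str": "suonare campane"}],
            "Right": [{"str": "in lontananza."}],
        }
    ]
}

def line_to_text(line):
    """Join left context, match and right context into one string."""
    parts = []
    for side in ("Left", "Kwic", "Right"):
        parts.extend(token["str"] for token in line.get(side, []))
    return " ".join(part.strip() for part in parts if part.strip())

for line in jsondata["Lines"]:
    print(line_to_text(line))
```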
Things to be aware of:
- FUP (fair use policy): pause between requests, e.g. time.sleep(1),
- AQP: asynchronous query processing and
- download limits.
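The pause between requests can be put in one place. This is a minimal sketch, assuming a one-second delay is enough; the names fetch_all and fake_fetch are hypothetical, and fake_fetch stands in for a real request so that the example runs offline.

```python
import time

def fetch_all(queries, fetch, delay=1.0):
    """Call fetch(query) for each query, pausing `delay` seconds between calls."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay)  # no pause is needed before the first call
        results.append(fetch(query))
    return results

# Stand-in for a real request so the sketch runs without network access.
def fake_fetch(query):
    return {"lemma": query, "ok": True}

print(fetch_all(["orange", "red"], fake_fetch, delay=0.1))
```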
Wordlists
For any positional attribute from the corpus, you may access its lexicon (types).
The task is to get Italian verbs starting with “s”. We will use the console trick to eavesdrop on the right API request.
Or you may want to see all tags from the Italian tagset used in a corpus.
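Filtering a lexicon for items starting with “s” comes down to a regular expression. The sketch below tries the pattern locally on a made-up list of lemmas. It assumes the pattern has to match the whole item, and it does not show the actual wordlist request, which you should copy from the console.

```python
import re

# Hypothetical sample of Italian lemmas; a real wordlist would come from the API.
lemmas = ["suonare", "sentire", "andare", "scrivere", "essere", "Sapere"]

pattern = re.compile(r"s.*")  # "s" followed by anything

# fullmatch requires the whole lemma to match, not just a part of it.
starting_with_s = [lemma for lemma in lemmas if pattern.fullmatch(lemma)]
print(starting_with_s)
```

Note that the match is case-sensitive, so “Sapere” is left out.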
Thesaurus (similar words)
Distributional semantics in SkE. Very similar to word embeddings, in some situations even better.
The task is to find which similar words the adjectives “orange” and “red” have in common.
import requests
base_url = "https://api.sketchengine.co.uk/corpus/"
method = "thes"
parameters = {
    "corpname": "ententen15_tt21",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "lemma": "orange",
    "pos": "-j"
}
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
orange_items = [x["word"] for x in jsondata["Words"]]
parameters["lemma"] = "red"
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
red_items = [x["word"] for x in jsondata["Words"]]
for common in set(orange_items) & set(red_items):
    print(common)
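A set intersection does not keep the order of the original lists. If the order matters, a small helper can preserve it. This is a sketch with made-up lists, and common_in_order is a hypothetical name.

```python
def common_in_order(first, second):
    """Items present in both lists, in the order they appear in `first`."""
    second_set = set(second)  # set membership tests are fast
    seen = set()
    result = []
    for item in first:
        if item in second_set and item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Made-up lists for illustration, not real thesaurus output.
orange_items = ["yellow", "pink", "purple", "lemon", "blue"]
red_items = ["blue", "yellow", "green", "purple"]
print(common_in_order(orange_items, red_items))
```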
Keyword/term extraction
Create a corpus from text documents in the web interface. Extract keywords and terminology from it via API.
Simple math (Kilgarriff, 2009)
$${{\rm relfreq}_{\rm focus} + N} \over {{\rm relfreq}_{\rm ref} + N}$$
Have a look at the parameters; in particular, we need a reference corpus for the relative frequency.
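The score can be worked through by hand. The sketch below assumes that relative frequency means frequency per million tokens; the counts are invented, and the function names are hypothetical.

```python
def per_million(freq, corpus_size):
    """Relative frequency per million tokens."""
    return freq * 1_000_000 / corpus_size

def simple_maths(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """(relfreq_focus + N) / (relfreq_ref + N)."""
    focus = per_million(freq_focus, size_focus)
    ref = per_million(freq_ref, size_ref)
    return (focus + n) / (ref + n)

# Made-up counts: 50 hits in 100,000 tokens vs. 1,000 hits in 100,000,000 tokens.
for n in (1, 100):
    print(n, round(simple_maths(50, 100_000, 1_000, 100_000_000, n), 2))
```

With these numbers the relative frequencies are 500 and 10 per million, so the score is 501/11 for N = 1 and 600/110 for N = 100. A larger N dampens the score of this fairly rare word.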
import requests
base_url = "https://api.sketchengine.co.uk/corpus/"
method = "extract_keywords"
parameters = {
    "corpname": "user/xbaisa/kw_test",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "ref_corpname": "preloaded/ententen_13_tt2_1"
}
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
for kw in jsondata["keywords"]:
    print(kw["item"])
parameters["ref_corpname"] = "preloaded/ententen13_tt2_1_term_ref"
r = requests.get(base_url + "extract_terms", params=parameters)
jsondata = r.json()
for tm in jsondata["terms"]:
    print(tm["item"])
Here we obtain an “error” key in the JSON response. You should always check not just for “error” in the JSON but also status_code for other issues (FUP, authentication problems, …).
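Both checks can be combined in one helper. This is only a sketch: check_response is a hypothetical name, treating every status other than 200 as a failure is an assumption, and the values below are hand-written rather than real responses.

```python
def check_response(status_code, jsondata):
    """Raise RuntimeError if the HTTP status or the JSON body signals a problem."""
    if status_code != 200:
        raise RuntimeError(f"HTTP status {status_code}")
    if isinstance(jsondata, dict) and "error" in jsondata:
        raise RuntimeError(f"API error: {jsondata['error']}")
    return jsondata

# Offline demonstration with hand-written values instead of a real response.
print(check_response(200, {"keywords": []}))
try:
    check_response(200, {"error": "made-up message"})
except RuntimeError as exc:
    print(exc)
```

With requests this would be called as check_response(r.status_code, r.json()). Keep in mind that r.json() itself can fail when the body is not JSON.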
Further functions and resources
- SkELL (Sketch Engine for English Learning) and itSkELL for Italian
- Corpus building via API
- At corpus.tools you find a few tools for working with texts, especially if you want to prepare a text corpus. The scripts are usually in Python 2.
- Statistics used in Sketch Engine
last modified: 2023-11-20
https://vit.baisa.cz/notes/learn/pavia/