The Main Function
scrape_and_parse_equations()
Let's begin with the main function that scrapes Wikipedia, parses the equations, and derives priors all at once. We will use the scrape_and_parse_equations()
function to scrape a category page using the Super
tag.
from equation_scraper import scrape_and_parse_equations
scrape_and_parse_equations(['Super:Cognitive_psychology'])
All data produced will be stored in a data
folder at your current working directory. If this folder does not exists, the package will create it for you. Each search is further organized by creating its own specific sub-folder named after your keywords, separated by underscores. For example ['Super:Cognitive_psychology', 'Neuroscience']
would create the folder SUPERCognitivePsychology_Neuroscience
within the data
folder. Note, keywords using the Direct:
tag are not included in this filename nomenclature.
If you repeat the exact same search, this folder is first deleted and then rebuilt with the new search. After running this function, you will notice a series of files, as well as a debug
folder with more files, but we will explain what each file does in the next section when running each of the two sub-functions individually.
An example structure when using the keywords ['Cognitive_psychology']
would be:
data
|
└───CognitivePsychology
│ equations_CognitivePsychology.txt
│ priors_CognitivePsychology.pkl
| parsed_equations_CognitivePsychology.txt
| (optional) figure_CognitivePsychology.png
|
└───debug
│ debug_parsed_CognitivePsychology.txt
│ skipped_equations_CognitivePsychology.txt
│ wordsRemoved_equations_CognitivePsychology.txt
The Sub-Functions
scrape_equations()
The scrape_equations()
function searches Wikipedia and derives a list of equations to be parsed.
from equation_scraper import scrape_equations
scrape_equations(['Super:Cognitive_psychology', 'Neuroscience'])
This function produces a single file: equations_*.txt
where *
corresponds to the keywords you searched for (i.e., the name of the search's sub-folder), for example: equations_SUPERCognitivePsychology_Neuroscience.txt
in the case of the example code above. The first line of this file lists meta-data of the keywords used for the search (e.g., #CATEGORIES: ['Super:Cognitive_Psychology', 'Neuroscience']
). After this, you will see a pattern of each Wikipedia page scraped that looks like this:
#ROOT: Biological Motion Perception
#LINK: /wiki/
Biological_motion_perception
{\displaystyle \nu _{\psi }(t)={\frac {R_{\psi }(t)-{\bar {R}}}{\bar {R}}}}
{\displaystyle o_{l}(x)={\sqrt {max(g_{p}(x_{i}))max(g_{r}(x_{j}))}}}
...
#ROOT
is the title of the corresponding Wikipedia page#LINK
is the Wikipedia URL without thehttps://en.wikipedia.org
prefix- e.g.,
/wiki/
Biological_motion_perception
can be used ashttps://en.wikipedia.org/wiki/Biological_motion_perception
- e.g.,
- Everything after this is a scraped equation. The format of these equations has not been modified in any way up to this point, so these are exactly as they were scraped from Wikipedia. For example, from the above example, the equations are:
{\displaystyle \nu _{\psi }(t)={\frac {R_{\psi }(t)-{\bar {R}}}{\bar {R}}}}
{\displaystyle o_{l}(x)={\sqrt {max(g_{p}(x_{i}))max(g_{r}(x_{j}))}}}
parse_equations()
The parse_equations()
function then loads the equations_*.txt
file written from the previous function and iterates through the equations where it parses their operators and functions and builds priors. Running this function still requires that you pass the same keywords as with the previous function because it uses these to determine which folder the data is saved into.
from equation_scraper import parse_equations
parse_equations(['Super:Cognitive_psychology', 'Neuroscience'])
The main file produced by this function is the priors_*.pkl
file where *
corresponds to the keywords you searched for (i.e., the name of the search's sub-folder), for example: priors_SUPERCognitivePsychology_Neuroscience.pkl
. We look into this file in detail in the Priors
section of our documentation. The other file that this function produces is the parsed_equations_*.txt
file, which includes the parsing results per equation.
Additionally, there are three files within the debug folder: debug_parsed_*.txt
, skipped_equations_*.txt
, wordsRemoved_equations_*.txt
.
-
debug_parsed_*.txt
presents the same information asparsed_equations_*.txt
but with a different organization structure. -
skipped_equations_*.txt
presents all equations discarded and not used for priors (these are most often discarded because they are not actual expressions, such as with a variable decleration in text, e.g., "v
is our first indepenent variable"). -
wordsRemoved_equations_*.txt
is a list of words that were turned into variables---this occurs when equations contain words to represent a single variable, for example:WEIGHT = HEIGHT * c
would be transformed toy = x * c
and the wordsWEIGHT
andHEIGHT
would be added to this list.
Other Functions
plot_prior()
The plot_prior()
function will produce a figure---figure_*.png
--- that is a barplot of the frequencies of operators/functions. It uses the same keywords as the aforementioned functions. The scrape_and_parse_equations()
function also automatically runs this function after it has scraped and parsed the equations from your keywords list.
from equation_scraper import plot_prior
plot_prior(['Super:Cognitive_psychology', 'Neuroscience'])
load_prior()
The load_prior()
function will allow you to load the pickle file containing the prior information either using the same keywords as all aforementioned functions:
es_priors = load_prior(['Super:Cognitive_psychology', 'Neuroscience'])
or with a point-and-click interface if no inputs are given to the function:
es_priors = load_prior()
This is the only function that returns a variable, and this variable will contain metadata and the priors. Again, see the Priors
section of our documentation for an in-depth look of this file.
from equation_scraper import load_prior
es_priors = load_prior(['Super:Cognitive_psychology', 'Neuroscience'])
#Print prior information
print('**********METADATA**********\n')
for key in es_priors['metadata'].keys():
print(f"{key}: {es_priors['metadata'][key]}")
print('\n***********PRIORS***********\n')
for key in es_priors['priors'].keys():
print(f"{key}: {es_priors['priors'][key]}")
Command Line Interface
The equation-scraper
can also be run using a command line interface (CLI). To use the CLI, you must clone the repository (see Install
instructions in our documentation). When using the CLI, you provide your keywords as arguments, separated by a space:
python scrape_and_parse_equations.py Super:Cognitive_psychology Neuroscience
python scrape_equations.py Super:Cognitive_psychology Neuroscience
python parse_equations.py Super:Cognitive_psychology Neuroscience
python plot_prior.py Super:Cognitive_psychology Neuroscience
python load_prior.py Super:Cognitive_psychology Neuroscience