Krawler is a research tool for crawling web applications through keyboard interaction and generating a keyboard navigation graph model, i.e., the Keyboard Experience Representation (KER).
The repository is centered on:
src/main/java/edu/usc/sql/krawler/RunApproachKrawler.java
Krawler uses Selenium WebDriver, browser replay through mitmproxy, and configurable keyboard actions to explore keyboard-reachable UI states and transitions.
Krawler:
- opens a target web page in a configured browser;
- routes browser traffic through mitmproxy;
- replays cached web content when available;
- explores the page using keyboard navigation and interaction keys;
- records discovered UI states and keyboard-triggered transitions;
- writes the resulting keyboard navigation graph as JSON.
- Java JDK 8 or newer
- Maven
- A configured
run_configfile - A subject-mapping CSV file
- Cached website content or a configured subject URL
Browser and mitmproxy setup are handled through the repository setup flow. The normal setup path is:
mvn clean installThis compiles the project and runs the repository setup scripts configured in Maven.
keyboard-crawler-public/
├── README.md
├── pom.xml
├── run_config_template
├── download_mitmproxy.sh
├── download_firefox.sh
├── download_chrome.sh
└── src/main/java/edu/usc/sql/krawler/
└── RunApproachKrawler.java
Clone the repository:
git clone git@github.com:USC-SQL/keyboard-crawler-public.git
cd keyboard-crawler-publicBuild the project:
mvn clean installThis is the intended setup step. Do not manually edit browser-driver paths unless you are debugging a local environment issue.
Copy the template config file:
cp run_config_template run_configThen edit run_config for your local experiment.
Example:
run_single_subject===netflix
list_of_subjects_to_run===/path/to/subjects-run.csv
headless_mode===true
num_of_concurrent_webdrivers===2
mitmproxy_folder_name===mitmproxy-5.3.0
browser_type===chrome
proxy_port===9998
subject_mapping_csv===/path/to/subjects.csv
cached_subjects_location===/path/to/cached-subjects/
project_output_dir===/path/to/output-directory/
| Field | Meaning |
|---|---|
run_single_subject |
Name of one subject to crawl. This should match a subject name in subject_mapping_csv. |
list_of_subjects_to_run |
CSV containing subject names to crawl. Used when running multiple subjects. |
headless_mode |
true to run the browser headlessly; false to show the browser window. |
num_of_concurrent_webdrivers |
Number of concurrent WebDriver instances used by the crawler. |
mitmproxy_folder_name |
Name of the mitmproxy folder under src/main/resources. |
browser_type |
Browser used by the crawler. Supported public options are chrome and firefox. |
proxy_port |
Starting proxy port used by mitmproxy. |
subject_mapping_csv |
Local CSV mapping subject names to URLs. |
cached_subjects_location |
Directory where cached web page content file is stored. |
project_output_dir |
Directory where crawler output is written. |
Krawler supports browser-based crawling through Selenium WebDriver.
Choose the browser in run_config:
browser_type===chrome
or:
browser_type===firefox
Chrome and Firefox are the primary supported browsers for the public repository.
The repository includes setup scripts for browser-related setup:
download_chrome.sh
download_firefox.sh
The normal public workflow is still:
mvn clean installThen choose the browser through browser_type.
For debugging, where you want to watch the browser, use:
headless_mode===false
otherwise
headless_mode===true
The subject-mapping CSV maps each subject name to the URL that should be crawled.
Example:
netflix,https://www.netflix.com/loginThe subject name should match the value used in run_single_subject.
For example:
run_single_subject===netflix
should correspond to a row whose subject name is netflix.
Krawler uses mitmproxy to replay cached websites.
Cached subject data should be stored under the directory configured by:
cached_subjects_location
For example, if:
cached_subjects_location===/path/to/cached-subjects/
run_single_subject===netflix
then the crawler expects the cached subject data to be associated with:
/path/to/cached-subjects/netflix
When creating a cache, interact with the parts of the web page that should be available during replay. mitmproxy can only replay content and behavior that was captured during caching.
Avoid unnecessary infinite scrolling, autoplaying content, or unrelated dynamic content when building caches, because large caches can significantly slow later crawling runs.
After setup and configuration, run:
edu.usc.sql.krawler.RunApproachKrawler
In IntelliJ IDEA:
- Open the repository as a Maven project.
- Run
mvn clean install. - Copy
run_config_templatetorun_config. - Edit
run_config. - Open
RunApproachKrawler.java. - Run the
mainmethod.
Krawler writes the generated keyboard navigation graph under:
project_output_dir/output/KERKrawler/<subject-name>/KER.json
Example:
/path/to/output-directory/output/KERKrawler/netflix/KER.json
Krawler explores pages using configured keyboard actions. The key configuration is controlled through the project config and source code. The default configuration is intended to cover common keyboard navigation and activation behavior.
Typical keyboard actions include navigation keys such as Tab and Shift+Tab, activation keys such as Enter and Space, and other supported keys used by the crawler implementation.
If you use Krawler in research, please cite the corresponding Keyboard Krawler publication.