18 changes: 18 additions & 0 deletions .github/ISSUE_TEMPLATE.md
@@ -0,0 +1,18 @@
## Description
A clear and concise description of what the issue is about.

## Screenshots


## Files
A list of relevant files for this issue. This will help people navigate the project and offer some clues of where to start.

## To Reproduce
If this issue is describing a bug, include some steps to reproduce the behavior.

## Tasks
Include specific tasks in the order they need to be done. Include links to the specific lines of code where each task should happen.
- [ ] Task 1
- [ ] Task 2
- [ ] Task 3

25 changes: 25 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,25 @@
## Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

## Type of change

Please delete options that are not relevant.

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update


## Screenshots


## Checklist:

- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
8 changes: 8 additions & 0 deletions .gitignore
@@ -1,2 +1,10 @@
*.pyc
/data/
/.gtm/

# Environment settings
.*env
venv
*.DS_Store
# visual studio code config
.vscode
114 changes: 114 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,114 @@
# Introduction

Are you here to help with HealthTools Kenya Scraper? Awesome! Feel welcome, and read the following
sections to learn what to work on and how. If you get stuck
at any point you can create a ticket on
[GitHub](https://github.com/CodeForAfrica-SCRAPERS/healthtools_ke/issues).

All members of our community are expected to follow our
[Code of Conduct](https://github.com/CodeForAfrica/CodeOfConduct). Please make
sure you are welcoming and friendly in all of our spaces.

Following these guidelines helps to communicate that you respect the time of
the developers managing and developing this open source project. In return,
they should reciprocate that respect in addressing your issue, assessing
changes, and helping you finalize your pull requests.

## Types of Contributions

You can contribute in many ways. For example, you might:
* Add documentation and "how-to" articles in the [README](README.md) or the wiki
* Suggest enhancements
* Fix issues
* Submit bug reports

#### Bug Reports
*If you find a security vulnerability, **DO NOT** open an issue. Email
[security@codeforafrica.org](mailto:security@codeforafrica.org) instead.*

If you're reporting a bug, please include:
* Your operating system name and version
* Any details about your local setup that might be helpful in troubleshooting.
* If you can, provide detailed steps to reproduce the bug.
* If you don't have steps to reproduce the bug, just note your observations in
as much detail as you can. Questions to start a discussion about the issue
are welcome.

To ease the process of reporting bugs and issues, consider using our
[issue template](https://github.com/CodeForAfrica-SCRAPERS/healthtools_ke/blob/master/.github/ISSUE_TEMPLATE.md)
and don't forget to add an appropriate
[label](https://help.github.com/articles/creating-a-label/) to the issue.

#### Writing Documentation
Did you find a typo? Do you think that something should be clarified? Go ahead
and suggest a documentation patch. HealthTools could always use more documentation,
whether as part of the official docs, in docstrings, or even on the web in blog
posts, wiki, articles, and such.

#### Fixing Issues
Look through the GitHub issues for bugs. Anything tagged with "bug" is open to
whoever wants to fix it.

#### Suggesting Enhancements

Before creating enhancement suggestions, please check the issues list as you
might find out that you don't need to create one. When you are creating an
enhancement suggestion fill out the
[issue template](https://github.com/CodeForAfrica-SCRAPERS/healthtools_ke/blob/master/.github/ISSUE_TEMPLATE.md)
and [label](https://help.github.com/articles/creating-a-label/) the issue as a
new feature.

## Your first contribution

Unsure where to begin contributing to HealthTools Kenya Scraper? You can start by looking through
these `beginner` and `help-wanted` issues:

* `Beginner issues` - issues which should only require a few lines of code, and a
test or two.
* `Help wanted issues` - issues which should be a bit more involved than beginner
issues.

Once your changes and tests are ready to submit for review:

1. Test your changes

Run the tests if you have any and at the bare minimum, test your changes
manually.

2. Rebase your changes

Update your local repository with the most recent code from the main healthtools_ke
repository, and rebase your branch on top of the latest develop branch.

3. Submit a pull request

Push your local changes to your forked copy of the repository and submit a
pull request. In the pull request, choose a title which sums up the changes
that you have made, and in the body provide more details about what your
changes do. Also mention the number of the issue where discussion has taken
place. Preferably use our
[PR template](https://github.com/CodeForAfrica-SCRAPERS/healthtools_ke/blob/master/.github/PULL_REQUEST_TEMPLATE.md).

Then sit back and wait. There will probably be discussion about the pull
request and, if any changes are needed, we would love to work with you to
get your pull request merged into healthtools_ke.

## Code Style

Please adhere to the [PEP8](https://www.python.org/dev/peps/pep-0008/) coding
conventions for the Python language. This style guide evolves over time as
additional conventions are identified and past conventions are rendered obsolete
by changes in the language itself, so make sure you stay up to date with it.

## Ground Rules
The goal is to maintain a diverse community that's pleasant for everyone.
That's why we would greatly appreciate it if everyone contributing to and
interacting with the community also followed this
[Code of Conduct](https://github.com/CodeForAfrica/CodeOfConduct).
The [Code of Conduct](https://github.com/CodeForAfrica/CodeOfConduct) covers our
behavior as members of the community, in any forum, mailing list, wiki, website,
Internet relay chat (IRC), public meeting or private correspondence.



Please see [contribution-guide.org](http://www.contribution-guide.org) for details on what we expect from contributors. Thanks!
49 changes: 42 additions & 7 deletions README.md
@@ -1,6 +1,6 @@
# Healthtools Kenya
# HealthTools Kenya Scraper

This is a suite of scrapers that retrieve actionable information for citizens to use.
This is a suite of scrapers that retrieve actionable information for citizens to use. All the data scraped by this is accessible through our [HealthTools API](https://github.com/CodeForAfricaLabs/HealthTools.API).

They retrieve data from the following sites:

@@ -18,13 +18,25 @@ They currently run on [morph.io](http://morph.io) but you are able to set it up

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### How the Scrapers Work

To get the data we follow a couple of steps:

**1. Start by scraping the websites:** In most cases this is done using Beautiful Soup.
**2. Elasticsearch update:** Replace the data on Elasticsearch with the newly scraped data. We only delete the existing documents after the scraping has completed successfully, never before. In the doctors' case, because we pull together foreign and local doctors, we won't update Elasticsearch until both have been scraped successfully.
**3. Archive the data:** We archive the data in a "latest" .json file so that the URL doesn't have to change to get the latest version in a "dump" format. A date-stamped archive is also stored, as we later intend to analyse the changes over time.


Should the scraper fail at any of these points, we print out an error, and if set up, a Slack notification is sent.
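
The ordering of these steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_scraper` and the `scrape`, `replace_index`, `archive` and `notify` callables are hypothetical stand-ins for the real scraping, Elasticsearch, archiving and Slack logic, and the archive file names are assumptions.

```python
import json
from datetime import date


def run_scraper(scrape, replace_index, archive, notify=print):
    """Scrape first, then replace the index, then archive.

    All four callables are hypothetical stand-ins for the project's own
    scraping, Elasticsearch, archiving and notification code.
    """
    try:
        records = scrape()  # step 1: fetch and parse the site
    except Exception as err:
        # Scraping failed, so the existing index is left untouched.
        notify("ERROR: scrape failed: {}".format(err))
        return False

    try:
        replace_index(records)  # step 2: only reached after a full scrape
        payload = json.dumps(records)
        archive("latest.json", payload)  # step 3: stable "latest" dump
        # Date-stamped copy (assumed naming) for later change analysis.
        archive("archive/{}.json".format(date.today().isoformat()), payload)
    except Exception as err:
        notify("ERROR: update or archive failed: {}".format(err))
        return False
    return True
```

The point of the sketch is the ordering: nothing is replaced or archived unless the scrape itself returned successfully, and any failure is reported through a single notification hook.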


### Installing

Clone the repo from Github by running `$ git clone git@github.com:CodeForAfrica-SCRAPERS/healthtools_ke.git`.

Change directory into the package `$ cd healthtools_ke`.

Install the dependencies by running `$ pip install requirements.txt`.
Install the dependencies by running `$ pip install -r requirements.txt`.

You can set the required environment variables like so

@@ -44,7 +56,28 @@ For linux and windows users, follow instructions from this [link](https://www.el

For mac users, run `$ brew install elasticsearch` on your terminal.

#### Slack
#### Error Handling

As with anything beyond our control (the websites we are scraping), we try to catch all errors and display useful and actionable information about them.

As such, we capture the following details:

- Timestamp
- Machine name
- Module / Scraper name + function name
- Error message

This data is printed in terminal in the following way:

[ Timestamp ] { Module / Scraper Name }
[ Timestamp ] Scraper has started.
[ Timestamp ] ERROR: { Module / Scraper Name } / { function name }
[ Timestamp ] ERROR: { Error message }
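
As a rough sketch of how the captured details could be turned into the two `ERROR` lines above: the helper names `error_details` and `format_error` and the timestamp format are assumptions made for illustration, not the project's actual functions.

```python
import socket
from datetime import datetime


def error_details(module, function, message, now=None):
    """Collect the details listed above (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")  # assumed format
    return {
        "timestamp": stamp,
        "machine": socket.gethostname(),
        "source": "{} / {}".format(module, function),
        "message": message,
    }


def format_error(details):
    """Render the two ERROR lines in the terminal layout shown above."""
    return [
        "[ {} ] ERROR: {}".format(details["timestamp"], details["source"]),
        "[ {} ] ERROR: {}".format(details["timestamp"], details["message"]),
    ]


if __name__ == "__main__":
    for line in format_error(error_details("doctors", "scrape_page", "Request timed out")):
        print(line)
```

Keeping the details in a dictionary means the same data can be printed to the terminal and reused for a notification payload.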


We also provide a Slack notification option detailed below.

*Slack Notification:*

We use Slack notifications when the scrapers run into an error.

@@ -54,18 +87,20 @@ If you set up elasticsearch locally run it `$ elasticsearch`

You can now run the scrapers `$ python scraper.py` (It might take a while)


### Development

In development, instead of scraping entire websites, you can scrape only a small batch to ensure your scrapers are working as expected.

Set `SMALL_BATCH`, `SMALL_BATCH_HF` (for health facilities scrapers), and `SMALL_BATCH_NHIF` (for NHIF scrapers) in the config file. Each value limits how many pages or records the corresponding scrapers fetch, so they don't scrape entire sites.

Use `$ python scraper.py small_batch` to run the scrapers.
Usage: `$ python scraper.py --help`
Example: `$ python scraper.py --small-batch --scraper doctors clinical_officers` runs only the named scrapers on a small batch.
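
For illustration, command-line options like the ones in the example could be parsed with `argparse` as sketched below. Only the `--small-batch` and `--scraper` flags come from the example above; `build_parser`, the help texts and the defaults are assumptions, and the real `scraper.py` may be structured differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="scraper.py")
    parser.add_argument(
        "--small-batch",
        action="store_true",
        help="scrape only a small batch instead of entire sites",
    )
    parser.add_argument(
        "--scraper",
        nargs="+",
        default=[],
        metavar="NAME",
        help="names of the scrapers to run; assumed to mean all when omitted",
    )
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--small-batch", "--scraper", "doctors", "clinical_officers"]
)
```

With `nargs="+"`, several scraper names can follow a single `--scraper` flag, which matches the form of the example command.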


## Tests

Use nosetests to run tests (with stdout) like this:
```$ nosetests --nocapture```
```$ nosetests --nocapture``` or ```$ nosetests -s```

_**NB: <ake sure if you use elasticsearch locally, it's running**_
_**NB: Make sure if you use elasticsearch locally, it's running**_
56 changes: 47 additions & 9 deletions healthtools/config.py
@@ -10,28 +10,66 @@
ES = {
"host": os.getenv("MORPH_ES_HOST", "127.0.0.1"),
"port": os.getenv("MORPH_ES_PORT", 9200),
"index": os.getenv("MORPH_ES_INDEX", "healthtools-ke-dev")
"index": os.getenv("MORPH_ES_INDEX", "healthtools-dev")
}

SLACK = {
"url": os.getenv("MORPH_WEBHOOK_URL")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"

SMALL_BATCH = 5 # No of pages from clinical officers, doctors and foreign doctors sites, scraped in development mode
SMALL_BATCH_HF = 100 # No of records scraped from health-facilities sites in development mode
SMALL_BATCH_NHIF = 10 # No of nhif accredited facilities scraped in development mode
# No of records scraped from health-facilities sites in development mode
SMALL_BATCH_HF = 100
SMALL_BATCH_NHIF = 1 # No of nhif accredited facilities scraped in development mode

DATA_DIR = os.getcwd() + "/data/"
if AWS["s3_bucket"]:
    DATA_DIR = "data/"
else:
    # Where we archive the data in case of no s3 Bucket
    DATA_DIR = os.path.dirname(os.path.dirname(
        os.path.abspath(__file__))) + "/data/"
    if not os.path.exists(DATA_DIR):
        os.mkdir(DATA_DIR)
        os.mkdir(DATA_DIR + "archive")
        os.mkdir(DATA_DIR + "test")
        os.mkdir(DATA_DIR + "test/archive")

# sites to be scraped
SITES = {
"DOCTORS": "http://medicalboard.co.ke/online-services/retention/?currpage={}",
"DOCTORS": "https://medicalboard.co.ke/online-services/retention/?currpage={}",
"FOREIGN_DOCTORS": "http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}",
"CLINICAL_OFFICERS": "http://clinicalofficerscouncil.org/online-services/retention/?currpage={}",
"TOKEN_URL": "http://api.kmhfl.health.go.ke/o/token/",
"NHIF-OUTPATIENT_CS": "http://www.nhif.or.ke/healthinsurance/medicalFacilities",
"NHIF-INPATIENT": "http://www.nhif.or.ke/healthinsurance/inpatientServices",
"NHIF-OUTPATIENT": "http://www.nhif.or.ke/healthinsurance/outpatientServices"
"NHIF_INPATIENT": "http://www.nhif.or.ke/healthinsurance/inpatientServices",
"NHIF_OUTPATIENT": "http://www.nhif.or.ke/healthinsurance/outpatientServices",
"NHIF_OUTPATIENT_CS": "http://www.nhif.or.ke/healthinsurance/medicalFacilities"
}

NHIF_SERVICES = ["inpatient", "outpatient", "outpatient-cs"]

# config logging
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S"
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "stream": "ext://sys.stdout"
        }
    },
    "root": {
        "level": "INFO",
        "handlers": ["console"]
    }
}
