Hi all, I am Jan and I am passionate about human-computer interaction, product design and entrepreneurship. I am currently reading for my PhD in Computer Science at the University of Oxford, where I am researching (together with the MoD and Airbus) ways to improve the situational awareness of our nations’ cyber defenders by conceptualising the management of cyber threat intelligence and developing technologies that support their practices.
I used Kimono in my research to efficiently assemble a database of cyber security companies, which I then used to reach out to selected organisations about a potential collaboration (it was very helpful). Right now, I am experimenting with ways to track and analyse supply and demand in the freelance market (Elance and ODesk). This has nothing to do with my research and is just out of curiosity. I am interested in the concept of productised services and am exploring whether some of the concepts of keyword analysis (from SEO) can usefully be transferred to this domain.
But in the bigger picture, I want to explore the usefulness of web data extraction approaches that support day-to-day/operational as well as strategic decision making in SMEs (small and medium-sized enterprises).
I would love to talk to anyone who has an opinion, experiences or ideas in this area.
So here is what I did:
1. Find the pages that output “all results”
For Elance, this is https://www.elance.com/r/jobs/sts-0
/sts-0 displays jobs with “any status” and doesn’t limit the results to jobs whose status is “hiring open”. If I only scraped jobs that are open, I would miss a great deal of data, since I scrape on an hourly basis and jobs are often allocated within minutes.
For ODesk, this is https://www.odesk.com/o/jobs/browse/st/-1/
The same holds for ODesk. /st/-1/ displays all jobs independent of their hiring status.
2. Configuring Kimono (the web scraping application)
Click on the fields that you want to collect and give each a variable name. Easy. See here for help on how to use Kimono.
3. Scraping the base data set
I want to scrape all displayed results once (around 47,000 results for ODesk and 26,000 for Elance). This gives me a base data set that mirrors the current job openings on these platforms and that I will add to with later scrapes. It is important to set the pagination correctly in Kimono (see here for more information) and to set “pagination limit” to “10000 pages max” under the “CRAWL SETUP” tab of both APIs. I left the “AUTO-RUN FREQUENCY” on “manual crawl”, since I only need the base set scraped once.
The first crawl may take a while to run through. I looked at the results of this scrape and made sure the outputs were as expected.
4. Setting up the auto-run feature to track new jobs
To determine the pagination limit for a given crawl frequency, I went to the Elance and ODesk pages several times a day and checked how many pages the job postings from the last 60 minutes stretched across. On Elance, job postings from the last 60 minutes stretch across an average of 6 pages. On ODesk, it is 1 page.
I went back to the “CRAWL SETUP” tab and changed the “pagination limit” to “10 pages max” for Elance and “1 page max” for ODesk.
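The reasoning behind these limits can be written down as a small calculation. This is only a rough sketch in Python, not something Kimono provides: the function names and the `safety_factor` parameter are my own, and the list of presets contains only the three options mentioned in this post (the real dropdown may offer more).

```python
import math

# Pagination presets mentioned in this post; the real Kimono list may differ.
KNOWN_PRESETS = (1, 10, 10000)

def pages_needed(pages_per_hour, crawl_interval_hours=1.0, safety_factor=1.0):
    """Pages that must be crawled to cover everything posted since the last crawl.

    safety_factor is a hypothetical margin for busier-than-average hours.
    """
    if pages_per_hour <= 0 or crawl_interval_hours <= 0 or safety_factor <= 0:
        raise ValueError("all arguments must be positive")
    return max(1, math.ceil(pages_per_hour * crawl_interval_hours * safety_factor))

def smallest_sufficient_preset(required_pages, presets=KNOWN_PRESETS):
    """Pick the smallest preset that still covers the required number of pages."""
    for preset in sorted(presets):
        if preset >= required_pages:
            return preset
    raise ValueError("no preset is large enough")

# Observed averages from this post: 6 pages/hour on Elance, 1 page/hour on ODesk.
elance_limit = smallest_sufficient_preset(pages_needed(6))
odesk_limit = smallest_sufficient_preset(pages_needed(1))
print(elance_limit, odesk_limit)
```

Because the observed figures are averages, a busy hour could exceed them. Raising `safety_factor` above 1.0 is one way to account for that, at the cost of crawling more pages per run.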
Implications for design (Kimono)
- Ability to set a custom number for the pagination limit.
- Ability to add to previous crawls when using the auto-run frequency.
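To illustrate what I mean by adding to previous crawls, here is a minimal sketch of how it could be done outside of Kimono, assuming each crawl is exported as a list of records. The field names `url` and `title` are hypothetical and depend on the variable names chosen when configuring the API. I am also assuming that the job URL identifies a posting uniquely.

```python
def merge_crawl(base, new_rows, key="url"):
    """Append rows from a new crawl to the base set, skipping jobs already seen.

    Returns the number of rows that were actually added.
    """
    seen = {row[key] for row in base}
    added = 0
    for row in new_rows:
        if row[key] not in seen:
            base.append(row)
            seen.add(row[key])
            added += 1
    return added

# Made-up example records, only to show the behaviour.
base_set = [
    {"url": "https://example.com/job/1", "title": "Logo design"},
    {"url": "https://example.com/job/2", "title": "WordPress fix"},
]
hourly_crawl = [
    {"url": "https://example.com/job/2", "title": "WordPress fix"},
    {"url": "https://example.com/job/3", "title": "Data entry"},
]
added_count = merge_crawl(base_set, hourly_crawl)
print(added_count, len(base_set))
```

Jobs that appear in both the base set and the hourly crawl are kept only once, so overlapping pages between two crawls do not inflate the data set.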
[WORK IN PROGRESS]