Kamisama.me

Cal-Heatmap 3 released

The latest 3.x branch of cal-heatmap comes with one big new feature: vertical orientation.

Cal-Heatmap

Implementing it was not easy, since related features, such as positioning the label on the side, had to come along with it.

A quick summary of all the new features in v3 (a short usage sketch follows the list):

  • Calendar vertical orientation
  • Label can be positioned on all four sides
  • Label can be rotated (-90deg/+90deg)
  • New x_ subdomain variants. Prepend x_ to a subdomain name (e.g. x_day, x_month) to rotate its reading order. Most useful with a vertical orientation.
  • New offset option for the label, for finer control of its x/y position.
  • The id option was replaced by itemSelector. It now takes a CSS3-compliant selector string, so you can use something like body > p:first-child > [title="calendar"]. If the selector returns more than one result, the first one is used. itemSelector can also take a DOM Element object directly.
  • New domainMargin option, to add a margin around each domain.
  • New domainDynamicDimension option, to let each domain's width and height fit its content, since not all months have the same number of weeks. Disabling it will give all domains the same dimensions, and may leave gaps between domains.
  • You can now display a text/date inside subDomain cells.
  • highlightToday was replaced by highlight, which takes an array of dates, so you can highlight any subDomain cells.
  • More legend options: you can place the legend either above or below the calendar, and align it left, center or right. The legend size is now independent from cellSize, and can be set via legendCellSize.
  • Oh, and all references to scale were renamed to legend.
  • The browsing option was removed: browsing is now enabled by default, and cal-heatmap will not create the next/previous buttons for you anymore.
  • Two new methods, next() and previous(), were added to browse the calendar.
  • If you don't like the idea of using javascript to navigate the calendar, two new options, previousSelector and nextSelector, were added to attach the next() and previous() methods directly to a DOM Element on mouse click.
  • Domain highlighting: you can control the background color of any domain. Unlike the subDomain highlight option, domain highlighting is passive and controlled solely via CSS.
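
To give an idea of how these options fit together, here is a minimal usage sketch of a v3 setup. The option names follow the list above; verticalOrientation and the label sub-keys are my best guess at the exact spelling, so double-check them against the documentation.

var cal = new CalHeatMap();
cal.init({
    itemSelector: "#calendar",         // CSS3 selector string, or a DOM Element
    domain: "month",
    subDomain: "x_day",                // x_ variant: rotated reading order
    verticalOrientation: true,         // assumed name of the new orientation switch
    label: {position: "left", rotate: "left", offset: {x: 10, y: 0}},
    domainMargin: 10,
    domainDynamicDimension: true,
    highlight: [new Date()],           // array of dates to highlight
    legendCellSize: 10,
    previousSelector: "#previous",     // clicking these elements calls previous()/next()
    nextSelector: "#next"
});

// Or browse programmatically:
// cal.next(); cal.previous();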

With this lot of new features and options, almost all options were renamed for consistency. Some options, like scale, are now named legend; others have just been camelCased.

See the new shiny website for details about each feature.
I’m pretty satisfied with it, as each option is now properly documented and illustrated with a simple example. You can learn a lot more there than in this post.

First steps with PredictionIO : a simple recommendation server

A recommendation system seems to be a must for today's websites, where you want to keep visitors on your site by offering them content that will hold their attention.

Anyone can build a basic recommendation engine by joining a few tables in a relational database, and start recommending an item A based on another item B by looking for similarities between the two items (common tags/categories, common keywords in the name and description, etc.).

If I said I watched The Dark Knight, the two most obvious recommendations you would give me are The Dark Knight Rises and Batman Begins. They share lots of keywords, tags and staff, and there is an obvious "sequel/prequel" relation between these movies.

So far, we were dealing with item-to-item recommendation. It's the easiest kind of recommendation you can implement: you're just dealing with similarities between two entities.

Now, recommend me a third movie… Superman? Because I like superhero movies? Or Inception, because I like the cast? You can't really decide without knowing my preferences. In this user-to-item relation, you have to know all my previously watched movies and my behavior (most viewed genres, actors, themes, etc.) before reaching a conclusion.

So, let's try Superman… Which one? The first, the second, or the third?

Here comes the machine learning system: you feed it user and item data, as well as their relations (likes, ratings, views, etc.), and it will predict the future, based on various algorithms.

Apache Mahout is one of the most popular free machine learning libraries, written in Java. It's used by some big names such as Amazon, Foursquare, Twitter, Yahoo, etc. It uses Hadoop as its backend, can be scaled, and can process a lot of data. Installing and managing these tools can be intimidating and frustrating, but PredictionIO does all these petty tasks for us. In the end, you just have to install PredictionIO and start it; all the Hadoop and Mahout plumbing is hidden from you.

PredictionIO, an open source Machine Learning server

PredictionIO is a "one package" tool that installs and sets up all its dependencies automatically, then starts a tomcat server exposing a REST API, the only gateway to your machine learning server. You can learn more about the server structure here.

PredictionIO depends on several third-party tools (Hadoop, Mahout, etc.).

All these tools can be installed by running

bin/setup-vendors.sh

Next, you set up PredictionIO itself with

bin/setup.sh

And finally, you're ready to start the PredictionIO server

bin/start-all.sh

The dashboard will be available at http://localhost:9000. You're free to use another port by editing ADMIN_PORT in bin/common.sh. On my system, port 9000 was already taken by php5-fpm.

This dashboard is the main advantage of using PredictionIO over a vanilla Hadoop+Mahout installation, as it provides a neat web interface to organize and set up your engines. The REST API can also be consumed by everyone, regardless of your preferred programming language. PHP, Ruby, Python and Java SDKs are already available, and offer the basic functions. You're free to write your own, or to implement more functions on top of the existing ones.

predictionIO - login

The dashboard is password protected, and you can easily create a user account with

bin/users

After login, you’ll be asked to create your first App.

predictionIO - create application page

You'll obtain an App Key, used to authenticate all API calls.

predictionIO - application page

The next step is to create an engine. An engine predicts a relation between 2 entities. If you have some posts, movies, books, etc., one engine can only deal with 2 entities: user-movie, or user-book, or movie-book, etc. Although engines can deal with relations between more items, staying with a two-entity relation raises the accuracy of the prediction.

And as the user and item data are shared among all engines, you're not losing anything.

There are 2 kinds of engines:

  • Item recommendation engine
  • Items similarity prediction engine

predictionIO - create engine page

As of version 0.4, only the Item Recommendation Engine is available. No ETA was given for the availability of the other, more interesting engine.

Each engine can be fine-tuned by choosing a different prediction algorithm.

predictionIO - engine settings

predictionIO - engine settings 2

predictionIO - engine algorithm selection

predictionIO - engine algorithm

The engine is now ready to predict the future. But before that, you need to feed it some user, item, and behavioral data to train the machine. The more data you add, the more accurate your predictions will be.

The PredictionIO docs have some tutorials about building a recommendation engine.

As far as I know, the only way to input data into PredictionIO is to use the API, so when adding a million records, have some fun with the for loop…
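
As a rough sketch, importing data boils down to looping over your records and firing one API call per record through an SDK. The client and command names below are from my memory of the 0.x PHP SDK, and $users, $movies and $likes are hypothetical arrays holding your existing data, so treat the whole thing as an assumption to check against the SDK's README.

<?php
// Assumes the PredictionIO PHP SDK is installed via Composer.
require 'vendor/autoload.php';

use PredictionIO\PredictionIOClient;

$client = PredictionIOClient::factory(array("appkey" => "YOUR_APP_KEY"));

// Import users
foreach ($users as $userId) {
    $client->execute($client->getCommand("create_user", array("pio_uid" => $userId)));
}

// Import items (here, movies)
foreach ($movies as $movieId) {
    $client->execute($client->getCommand("create_item", array(
        "pio_iid" => $movieId,
        "pio_itypes" => "movie"
    )));
}

// Import behaviors: user X liked movie Y
foreach ($likes as $like) {
    $client->identify($like["user"]);
    $client->execute($client->getCommand("record_action_on_item", array(
        "pio_action" => "like",
        "pio_iid" => $like["movie"]
    )));
}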

Alternatives

PredictionIO is still young and in development. There aren't many all-in-one free machine learning servers out there.

The only other one I found is Myrrix, a similar product also based on Apache Mahout, but packaged as a single .jar file.

Usage could not be easier: you just download and run the .jar, and your machine learning server is online. It also uses a REST API for adding/editing data and getting predictions.

A server in Myrrix corresponds to an engine in PredictionIO. So, to have multiple engines, you'll end up running multiple Myrrix servers, on different ports. Each server is isolated, so the user data in the user-movie server cannot be shared with the user-book server.

Myrrix is also in development, and still in beta. Its website is very complete, with tons of examples, tutorials and use cases.

Install, configure and protect Awstats for multiple nginx vhost on Debian

There are already a lot of tutorials on the internet about how to install awstats for nginx. I didn't find any for the configuration I wanted, so I'm writing one, for my own records.

I have some custom needs. Let's suppose I have 3 domains:

  • master-domain.com
  • alpha.com
  • beta.com

And I want to have stats for the last 2 domains. master-domain.com is used as the master domain of the server, with awstats available at awstats.master-domain.com, instead of having alpha.com/awstats and beta.com/awstats. The idea is to group all the server scripts/tools (phpmyadmin, zabbix, etc.) under master-domain.com.

We also want to password protect the stats, but with different credentials for each vhost.

These steps have been tested on Debian Squeeze, on a Kimsufi.

Install Awstats

apt-get install awstats

On Debian Squeeze, awstats installs things in 3 places:

  • /etc/awstats : contains the conf files for each of your awstats configurations
  • /usr/share/awstats : contains the tools and libraries used by awstats
  • /usr/share/doc/awstats : docs, tools for building the static html pages, icons and other static files used by the html pages

Formatting Nginx log

By default, nginx outputs logs that awstats can already read, as long as you use the combined format. If you set your access log like this:

access_log /path/to/log.log;

Then you're good: the combined format is implicit. It's equivalent to

access_log /path/to/log.log combined;
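
For reference, the implicit combined format corresponds to the following log_format definition; it is built into nginx, so you don't have to declare it yourself:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';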

Optional step

Using the default format is fine, but you can log one more field that could be pertinent: http_x_forwarded_for.

It's used to capture the client's IP address when they connect through a proxy or a load balancer.

For that, we define another log format, named main, in /etc/nginx/nginx.conf. In the http scope, add:

log_format main     '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

It's the same as the combined format, plus the $http_x_forwarded_for bit at the end. To use this format, add main at the end of your access_log directive.

access_log /path/to/log.log main;

As this last field is not used by awstats, we should tell it to ignore it. In /etc/awstats/awstats.conf.local, add :

LogFormat = "%host - %host_r %time1 %methodurl %code %bytesd %refererquot %uaquot %otherquot"

This file should be empty by default. It's used to hold the settings shared by all your awstats configs.

It teaches awstats the meaning of each field when parsing the log. The last token (%otherquot) means "that string there doesn't mean anything".

Creating a configuration file for each vhost

Awstats is picky about the configuration files: you should have one config file per vhost, they should be named following the convention awstats.domain.tld.conf, and they should be placed inside the /etc/awstats/ directory.

So, for the vhosts alpha.com and beta.com, you should create these two files:

  1. awstats.alpha.com.conf
  2. awstats.beta.com.conf

The official method

There is already a model configuration file inside the /etc/awstats/ directory: awstats.conf. The documentation says to clone that file when creating your own config files, with

cp /etc/awstats/awstats.conf /etc/awstats/awstats.alpha.com.conf
cp /etc/awstats/awstats.conf /etc/awstats/awstats.beta.com.conf

Then you just edit these files to your needs… a method I'm not fond of. If you take a look at awstats.conf, you'll see that it's a very complete config, with plenty of comments and all the available settings, all of that for just, *suspense music*, 1500 lines.

I'm personally not interested in having multiple conf files of 1500 lines each, when each file differs by just 4 lines.

The DRY method

If you ls the /etc/awstats folder, you'll see that there are 2 files there by default:

  • awstats.conf
  • awstats.conf.local

awstats.conf is the main conf file, the origin of all the other conf files. awstats will also fall back to this file if no other config file exists.

awstats.conf.local is an empty file. It's the parent of all the other config files. If you have some rules that are shared among all your configs, you put them here.

What I do is copy the whole content of awstats.conf into awstats.conf.local, and put only the important rules inside each vhost config, so they're shorter and easier to read.

What to put in the conf files

Let's create the conf file for alpha.com.

vi /etc/awstats/awstats.alpha.com.conf

Start with an empty file, and insert the following lines:

# Path to your nginx vhost log file
LogFile="/var/log/nginx/access.alpha.com.log"

# Domain of your vhost
SiteDomain="www.alpha.com"

# Directory where to store the awstats data
DirData="/var/lib/awstats/"

# Other aliases, basically other domains/subdomains that are the same as the domain above
HostAliases="www.alpha.com"

Awstats stores all its data inside /var/lib/awstats/ by default. You could change that to another directory, or have a subdirectory per vhost, like /var/lib/awstats/alpha.com/.

But even if you use the default setting, you have to set it in each config, as it cannot be inherited from awstats.conf.local.
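
If you prefer the per-vhost subdirectory mentioned above, create the directory first, then point DirData at it (the path is just the example from the previous paragraph):

mkdir -p /var/lib/awstats/alpha.com

# then, in awstats.alpha.com.conf
DirData="/var/lib/awstats/alpha.com/"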

You're free to add more settings if some of your vhosts require additional customization.
Repeat the same steps for each vhost.

Tune the global settings

Edit awstats.conf.local (the resulting file is summarized after this list):

  • Disable DNS lookups: DNSLookup = 0

  • Remove the LogFile, SiteDomain, DirData and HostAliases directives, as they're useless outside their context.

  • Set LogFormat to combined (if you didn't use the optional step when formatting the nginx log): LogFormat = 1

  • You could also enable some plugins, like GeoIP (which requires additional steps besides uncommenting the line).
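
Putting it together, and assuming you used the optional custom main format above (otherwise keep LogFormat = 1), awstats.conf.local ends up containing at least:

DNSLookup = 0
LogFormat = "%host - %host_r %time1 %methodurl %code %bytesd %refererquot %uaquot %otherquot"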

Computing data

Awstats is now configured for each vhost. We will now tell it to read the log files and generate the stats from them. It's a boring operation that should be done regularly (e.g. once a day, every 6 hours, etc.) depending on your needs. The longer you wait, the bigger the log file grows, and the more time it will take to process. It all depends on your website traffic.

To compute the data, a perl script is available in /usr/share/doc/awstats/examples. awstats_updateall.pl will compute the stats for every available config. It's easy, just run:

/usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl

The -awstatsprog flag tells the script where to find the awstats.pl script, because awstats_updateall.pl is just a wrapper that executes awstats.pl for each of your configs.

The obvious solution to run this script regularly is to use a cron job. The drawback is that nginx logs are rotated with logrotate: every X days, the log file is archived (and renamed), and a new log file is created. If you use a cron job to compute the stats:

  • If the computation runs just before the rotation, you'll lose the data logged between that computation and the rotation, as the file is then renamed and no longer read by awstats
  • If the computation runs after the rotation, you'll also lose the data logged between the previous computation and the rotation, for the same reason.
  • Right at the rotation, you'll experience some weird things.

Solution #1

We could prevent the data loss by telling awstats to always parse 2 log files: the current one, and the most recent archived one.

Logrotate renames the archived files using the convention filename.1, filename.2, etc. At each rotation, all the suffixes are incremented and filename becomes filename.1. A new filename is then created, so the newest archive is always filename.1.

In the awstats config for your vhost, edit the LogFile setting

LogFile="/usr/share/awstats/tools/logresolvemerge.pl /path/to/log/access.domain.tld.log /path/to/log/access.domain.tld.log.1 |"

logresolvemerge.pl will combine the 2 log files into one.

You’ll never lose data because of the rotation, since you’ll parse the rotated file too.

Solution #2

Execute the computation just before the rotation, using logrotate's prerotate hook. This is especially useful if your computation interval equals the rotation interval (e.g. you rotate every day at midnight, and you also compute every day at midnight).

Edit the logrotate config for nginx :

vi /etc/logrotate.d/nginx

I like to rotate logs every day, to keep them light. By default, nginx logs are rotated weekly.

/var/log/nginx/*.log {
    daily # rotate daily
    missingok 
    rotate 52 # Keep 52 days
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    prerotate
            # Trigger awstats computation
            /usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl
    endscript
    postrotate
            # Reload Nginx to make it read the new log file
            [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid`
    endscript
}

You can also trigger the computation manually by running

/usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl

directly in the shell, if you don’t want to wait for the log rotation at midnight.

You could use a regular cron job on a single log file if you compute more than once a day, and use the prerotate hook just for the computation near midnight.
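
For the "more than once a day" case, a cron entry could look like this (a hypothetical schedule, here every 6 hours; adjust it to your traffic):

# /etc/cron.d/awstats
0 */6 * * * root /usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl >/dev/null 2>&1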

Building the html reports

awstats_updateall.pl will compute the new stats, but not build the html pages. Awstats comes with 2 options:

  • Build the static html page yourself
  • Use cgi to build the page dynamically

I'll use the dynamic option, explained below. There are already plenty of articles on the internet explaining how to build static pages, if that's the way you want to go.

Exposing awstats

Now that awstats is configured and loaded with data, let's make it viewable on the internet.

Let’s create the subdomain where awstats will live : awstats.master-domain.com, linked to /var/www/awstats.

Assuming the subdomain is already pointed at your server (creating the subdomain is out of the scope of this post), you just have to create the nginx virtual host for awstats.master-domain.com.

How you create it is your own choice; there are multiple ways (a single conf file, 'sites-enabled' à la apache, etc.).

A regular nginx vhost conf should look like this:

server {
    listen 80;
    server_name awstats.master-domain.com;
    root        /var/www/awstats;
}

Let's define the error log, and disable the access log:

error_log /var/log/nginx/awstats.master-domain.com.error.log;
access_log off;
log_not_found off;

Alias the icon folder, so it's viewable online, instead of copying it into the web root.

location ^~ /icon {
    alias /usr/share/awstats/icon/;
}

Finally, configure the /cgi-bin/ scripts to go through php-fastcgi:

location ~ ^/cgi-bin/.*\.(cgi|pl|py|rb) {
    gzip off;
    include         fastcgi_params;
    fastcgi_pass    unix:/var/run/php5-fpm.sock;
    fastcgi_index   cgi-bin.php;
    fastcgi_param   SCRIPT_FILENAME    /etc/nginx/cgi-bin.php;
    fastcgi_param   SCRIPT_NAME        /cgi-bin/cgi-bin.php;
    fastcgi_param   X_SCRIPT_FILENAME  /usr/lib$fastcgi_script_name;
    fastcgi_param   X_SCRIPT_NAME      $fastcgi_script_name;
    fastcgi_param   REMOTE_USER        $remote_user;
}

Edit the fastcgi_pass to your own php-fpm server.

Create the /etc/nginx/cgi-bin.php file

<?php
$descriptorspec = array(
    0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
    1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
    2 => array("pipe", "w")   // stderr is a pipe that the child will write to
);

$newenv = $_SERVER;
$newenv["SCRIPT_FILENAME"] = $_SERVER["X_SCRIPT_FILENAME"];
$newenv["SCRIPT_NAME"] = $_SERVER["X_SCRIPT_NAME"];

if (is_executable($_SERVER["X_SCRIPT_FILENAME"])) {
    $process = proc_open($_SERVER["X_SCRIPT_FILENAME"], $descriptorspec, $pipes, NULL, $newenv);
    if (is_resource($process)) {
        fclose($pipes[0]);
        $head = fgets($pipes[1]);
        while (strcmp($head, "\n")) {
            header($head);
            $head = fgets($pipes[1]);
        }
        fpassthru($pipes[1]);
        fclose($pipes[1]);
        fclose($pipes[2]);
        $return_value = proc_close($process);
    } else {
        header("Status: 500 Internal Server Error");
        echo("Internal Server Error");
    }
} else {
    header("Status: 404 Page Not Found");
    echo("Page Not Found");
}
?>

Final vhost config :

server {
    listen 80;
    server_name awstats.master-domain.com;
    root    /var/www/awstats;

    error_log /var/log/nginx/awstats.master-domain.com.error.log;
    access_log off;
    log_not_found off;

    location ^~ /icon {
        alias /usr/share/awstats/icon/;
    }

    location ~ ^/cgi-bin/.*\.(cgi|pl|py|rb) {
        gzip off;
        include         fastcgi_params;
        fastcgi_pass    unix:/var/run/php5-fpm.sock;
        fastcgi_index   cgi-bin.php;
        fastcgi_param   SCRIPT_FILENAME    /etc/nginx/cgi-bin.php;
        fastcgi_param   SCRIPT_NAME        /cgi-bin/cgi-bin.php;
        fastcgi_param   X_SCRIPT_FILENAME  /usr/lib$fastcgi_script_name;
        fastcgi_param   X_SCRIPT_NAME      $fastcgi_script_name;
        fastcgi_param   REMOTE_USER        $remote_user;
    }
}

Beautifying the url

You can now view the stats of multiple websites from a single site: awstats.master-domain.com.

But awstats doesn't use url rewriting for pretty links, so you end up with long and ugly urls like:

http://awstats.master-domain.com/cgi-bin/awstats.pl?config=alpha.com  
http://awstats.master-domain.com/cgi-bin/awstats.pl?config=beta.com

We can make them easier to share by transforming them into:

http://awstats.master-domain.com/alpha.com  
http://awstats.master-domain.com/beta.com

In the nginx vhost conf for awstats (the server block above), add:

location ~ ^/([a-z0-9-_\.]+)$ {
    return 301 $scheme://awstats.master-domain.com/cgi-bin/awstats.pl?config=$1;
}

Protecting the stats

Let's now protect the stats. The idea is to have different credentials for each awstats config. The login used to view alpha.com's stats should not let the user browse beta.com's stats.

Let's edit the /cgi-bin/ location block in the vhost:

location ~ ^/cgi-bin/.*\.(cgi|pl|py|rb) {

    # Protect each config with a different credential
    if ($args ~ "config=([a-z0-9-_\.]+)") {
        set $domain $1;
    }

    auth_basic            "Admin";
    auth_basic_user_file  /etc/awstats/awstats.$domain.htpasswd;

    gzip off;
    include         fastcgi_params;
    fastcgi_pass    unix:/var/run/php5-fpm.sock;
    fastcgi_index   cgi-bin.php;
    fastcgi_param   SCRIPT_FILENAME    /etc/nginx/cgi-bin.php;
    fastcgi_param   SCRIPT_NAME        /cgi-bin/cgi-bin.php;
    fastcgi_param   X_SCRIPT_FILENAME  /usr/lib$fastcgi_script_name;
    fastcgi_param   X_SCRIPT_NAME      $fastcgi_script_name;
    fastcgi_param   REMOTE_USER        $remote_user;
}

This will protect each awstats config with its own credentials, stored in /etc/awstats/awstats.domain.tld.htpasswd. Authentication is based on HTTP Basic Authentication.

For the example websites alpha.com and beta.com, the logins and passwords are stored in

  • /etc/awstats/awstats.alpha.com.htpasswd
  • /etc/awstats/awstats.beta.com.htpasswd

Each file contains the credentials for the corresponding domain.
You can create these files with htpasswd (a tool shipped with apache):

htpasswd -c /etc/awstats/awstats.alpha.com.htpasswd username

You'll be prompted for the password next.

Final Nginx Awstats vHost

server {
    listen 80;
    server_name awstats.master-domain.com;
    root    /var/www/awstats;

    error_log /var/log/nginx/awstats.master-domain.com.error.log;
    access_log off;
    log_not_found off;

    location ^~ /icon {
        alias /usr/share/awstats/icon/;
    }

    location ~ ^/([a-z0-9-_\.]+)$ {
        return 301 $scheme://awstats.master-domain.com/cgi-bin/awstats.pl?config=$1;
    }

    location ~ ^/cgi-bin/.*\.(cgi|pl|py|rb) {
        if ($args ~ "config=([a-z0-9-_\.]+)") {
            set $domain $1;
        }

        auth_basic            "Admin";
        auth_basic_user_file  /etc/awstats/awstats.$domain.htpasswd;

        gzip off;
        include         fastcgi_params;
        fastcgi_pass    unix:/var/run/php5-fpm.sock;
        fastcgi_index   cgi-bin.php;
        fastcgi_param   SCRIPT_FILENAME    /etc/nginx/cgi-bin.php;
        fastcgi_param   SCRIPT_NAME        /cgi-bin/cgi-bin.php;
        fastcgi_param   X_SCRIPT_FILENAME  /usr/lib$fastcgi_script_name;
        fastcgi_param   X_SCRIPT_NAME      $fastcgi_script_name;
        fastcgi_param   REMOTE_USER        $remote_user;
    }
}

And voila !

The alpha.com webmaster can browse their stats via awstats.master-domain.com/alpha.com, and the beta.com webmaster via awstats.master-domain.com/beta.com. And each is protected with its own credentials, so no peeking.

CakeResque 3.0 : welcoming the Scheduled Jobs

CakeResque 3.0 was just freshly baked. The most important feature of this version is the support of scheduled jobs.

In addition to queuing a job for later execution, you can now specify when to queue the job.

Scheduling jobs

Queuing a job on a future date

You can now specify when to queue the job with CakeResque::enqueueAt(). This function takes 5 arguments:

CakeResque::enqueueAt($time, $queue, $class, $args, $track);

The last 4 arguments are the same as those of the basic CakeResque::enqueue(). The new argument is the first one: the date when you want to queue the job. It can be a DateTime object, or simply an integer timestamp.

Example

CakeResque::enqueueAt(
    new DateTime('2012-01-26 15:56:23'),
    'default', // Queue Name
    'MyPlugin.DummyJobShell', // Job classname 
    array(0,1,2) // Various args
);

Queuing a job after a certain time

You can also queue a job after a certain delay, for example after 5 minutes, in case you don't have an exact absolute time, with CakeResque::enqueueIn(). It also takes 5 arguments:

CakeResque::enqueueIn($seconds, $queue, $class, $args, $track);

As with CakeResque::enqueueAt(), the last 4 arguments are the same as CakeResque::enqueue()'s; the first argument is the number of seconds to wait before queuing the job.

Example

CakeResque::enqueueIn(
    3600, // Queue the job after 1 hour
    'default', // Queue Name
    'MyPlugin.DummyJobShell', // Job classname 
    array(0,1,2) // Various args
);

Limitations

By the worker polling time

Scheduling a job for time X / after Y seconds does not guarantee that the job will run at that exact time. It only means that the job will be added to the specified queue at that time. When it will be executed depends on the worker polling the queue.

Example

Let's suppose we have a queue, with a worker polling it every 15 seconds. Let's say you started the worker precisely at second 00, so the worker will poll the queue each minute at 00, 15, 30 and 45 seconds.

If you scheduled a job for 14:05:04, the job will only be added to the queue at 14:05:04; it then has to wait until 14:05:15 for the worker to execute it.

By the Scheduler Worker polling time

Another point to know: before the job is added to the queue, it 'sleeps' in a special queue. A special worker, the Scheduler Worker, polls that queue to check for due jobs, and adds them to the right queue. That Scheduler Worker also has its own pause between each poll (which you can define yourself), set to 3 seconds by default.

Example

With the previous example, the job is scheduled for 14:05:04. The Scheduler Worker, which runs every 3 seconds (at 03, 06, 09 seconds, etc.), will add it to the queue at 14:05:06, and the regular worker will execute it at 14:05:15. Of course, you can lower the Scheduler Worker's polling time to 1 second, depending on your needs.

Installation

Update the plugin

  • Back up CakeResque's bootstrap.php file, located in Plugin/CakeResque/Config
  • Download the latest version of CakeResque, uncompress it and replace your current CakeResque folder with the new one. Obviously, the folder name should remain 'CakeResque'.
  • There are some new settings in the new bootstrap, from line 135 to 177. Copy and paste them into your backed-up bootstrap.php, then restore it.

Update dependencies

CakeResque 3 makes use of the latest version of php-resque-ex, and the php-resque-ex-scheduler addon.

Updating the dependencies is simple; in the terminal, run:

# go to your Plugin/CakeResque directory
cd path/to/Plugin/CakeResque

# OPTION #1
# If you don't have composer installed, install it
curl -s https://getcomposer.org/installer | php
# Then install dependencies
php composer.phar install

# OPTION #2
# If you already have composer installed
composer install

The Scheduler Worker

The Scheduler Worker is a special worker used to move the jobs to the right queue when they're due. It must be running to handle scheduled jobs, or these jobs will never be pushed to their queue, and thus never be executed.

To start the scheduler worker, run in the terminal :

cake CakeResque.CakeResque startscheduler

You can also set the polling interval with the -i flag.

cake CakeResque.CakeResque startscheduler -i 3

Unlike the regular worker start command, the interval flag is the only flag accepted by the startscheduler command. The default value is 3 seconds, which you can edit in the bootstrap, under CakeResque.Scheduler.Worker.interval.
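
The key named above maps directly onto CakePHP's Configure. As a sketch, lowering the interval would look like the line below; the actual bootstrap defines it inside a larger config array, so adapt rather than paste:

// In Plugin/CakeResque/Config/bootstrap.php
Configure::write('CakeResque.Scheduler.Worker.interval', 1); // seconds between two polls of the scheduled jobs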

You can only have one scheduler worker. Attempting to start another will fail. If you use the load command to start your worker, the scheduler worker will be started automatically.

This worker can be paused, resumed and stopped like any other worker, with the usual command.

Settings

Refer to the bootstrap file for the Scheduler Worker settings; the Scheduler Worker has its own set of settings.

The scheduler is disabled by default. Enable it only if you use it, otherwise the Scheduler Worker will just be a burden.

Other improvements

The stats command

The stats command has been updated with new information. It now displays the number of jobs in each queue, and notifies you when a queue is not monitored by any worker.

It also warns you when there are scheduled jobs but the Scheduler Worker is not running.

Job tracking

You can track a job's status as usual. Scheduled jobs are labeled 'scheduled'.
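
For scheduled jobs, tracking works like for regular ones: enable it with the fifth argument and query the job's status afterwards. The sketch below assumes the enqueue call returns a trackable job ID and that the getJobStatus() helper from previous CakeResque versions still applies; verify both against the plugin's documentation.

$jobId = CakeResque::enqueueIn(
    3600,
    'default',
    'MyPlugin.DummyJobShell',
    array(0, 1, 2),
    true // enable tracking (assumed to return a job ID)
);

// The status should report the job as scheduled until the Scheduler Worker moves it to its queue
$status = CakeResque::getJobStatus($jobId);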

Background jobs with php and resque: part 8, a glance at php-resque-ex

In all the previous parts of this tutorial series, we were using php-resque. But in some cases, the original library is not enough.

Php-resque-ex is a fork of php-resque that provides additional features. It has the same API as php-resque, and can replace it without any problem.

Read the rest of this entry »

Background jobs with php and resque: part 7, Manage workers with Fresque

Fresque is a command line tool to manage your php-resque workers.

Fresque removes all the hassle of manipulating processes, pipes, daemons and other CLI-guru commands when managing php-resque workers.

Instead of starting a worker with this command:

QUEUE=notification php resque.php

You use:

fresque start --queue notification

You begin to see its real usefulness when it converts that command:

nohup QUEUE=notification php resque.php >> /path/to/your/logfile.log 2>&1 &

into that:

fresque start --queue notification --log /path/to/your/logfile.log

Fresque comes with a default fresque.ini file storing the queues' default settings. You can then reduce the above command to:

fresque start

Cool, isn’t it ?

Read the rest of this entry »

Background jobs with php and resque: part 6, integration into CakePHP

Using background jobs inside a php framework with php-resque is a little bit different, as the framework imposes its own conventions. Let's see how to create background jobs in CakePHP, with the CakeResque plugin.

CakeResque is a CakePHP plugin for creating background jobs that can be processed offline later.

CakeResque is more than a wrapper for using php-resque within CakePHP. Where it really shines is the way it handles the dirty work of creating and stopping workers via the cake console.

Read the rest of this entry »

Background jobs with php and resque: part 5, creating jobs

Now that you have some workers running, let's feed them some jobs. In this part, we'll try to understand what a job is, and how to refactor your application to use background jobs.

What's a job

A job is a written order telling the workers to execute a particular task. This order looks like:

Mail, dest@mail.com, hi!, "this is a test content"

An order can only contain strings. It's that string that will be pushed onto the queue (enqueued). Read the rest of this entry »

Background jobs with php and resque: part 4, managing worker

This guide is intended for Linux and OS X users. Windows users will have to adapt some of the code to make it work.

Understanding the internal works

Technically, a worker is a PHP process that runs indefinitely, constantly monitoring for new jobs to execute.

Pseudo-code of a worker's internals:

while (true) {
    $jobs = pullData(); # Pull jobs from the queues

    foreach ($jobs as $class => $args) { # For each job found
        $job = new $class();
        $job->perform($args); # Execute it
    }
    sleep(300); # Then sleep for 5 minutes (300 seconds), and retry
}

Read the rest of this entry »

Background jobs with php and resque: part 3, installation

As said in part 2, we'll use php-resque for our queue system. In this part, I'll explain how to install all the tools needed to run php-resque, a PHP port of Resque.

Resque (pronounced like “rescue”) is a Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later.

Obviously, we’ll also need to install Redis, and its php extension. The full list of tools to install :

Read the rest of this entry »