Connected-device telemetry: 7 tips learned from a Set-top-box usage analytics project.

The broadcast industry has historically struggled to measure its audiences. Radio, TV, cable, and satellite transmissions are essentially unidirectional communication services between a broadcast station and its users. In other words, with the technology behind an over-the-air TV channel, it is impossible to know how many viewers are actually watching that channel.

To obtain audience and ratings measurements, different kinds of solutions were created, from something as basic as telephone surveys to image- and audio-recognition systems installed in a sample of users' homes, with statistical models applied on top to estimate the audience or rating of a specific service. Of course, these methods are inherently expensive: they take time, depend on users' willingness to answer the phone, or require costly devices and some reward to encourage users to install a box at home that tracks how they use the TV.

All of this began to change when Set-Top-Boxes appeared that could receive traditional broadcast signals as well as connect to the Internet and offer users access to services such as YouTube, Netflix, or Hulu. Within the TV industry these devices are called “Hybrid Set-Top-Boxes” or “Hybrid Devices”.

For the industry, the appearance of these hybrid devices was a significant change. Now it is possible to collect and send information about each device's usage: which channels were watched, for how long, the exact moment the user zapped or changed the volume, and whether they muted or turned the device off.

Something we must not forget is that the internet connection is an extra: the device's main function is to receive broadcast signals such as ISDB-T or DVB-C. If there is no internet connection, the linear broadcast functionality should not be affected.

Below is a series of tips we learned implementing this kind of system, though they apply to any telemetry system where the device's internet connection is not vital to its operation.

Tip 1: The Design Elements

To design how the device sends information and how the backend receives, processes, and stores it, it is important to take into account the basic parameters for sizing the system:

  • the number of devices (device_count)
  • the space required to store one data report (report_size)
  • how often each device sends data (report_frequency)
  • how long the reports must be stored (report_days)

With these data, you can get a very quick estimate of how much storage space is required (taking report_frequency in reports per day):

storage = device_count * report_size * report_frequency * report_days

It is also possible to estimate the number of requests per second the server will have to support:

requests_per_second = device_count * report_frequency / 86400

Another very important consideration is whether the information must drive decisions in real time. This detail is key when deciding on the implementation, since it directly impacts both how the device sends the information and how much processing the data-collection system requires.
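As a quick sketch, the two formulas above can be evaluated with illustrative numbers (every concrete value here is an assumption for the example):

```python
# Back-of-envelope sizing, using the formulas above.
# All concrete values are illustrative assumptions.

device_count = 100_000        # connected devices
report_size = 2 * 1024        # bytes per report (assuming ~2 KB of JSON)
report_interval = 15 * 60     # seconds between reports from one device
report_days = 30              # retention period

reports_per_day = 86_400 / report_interval   # 96 reports per device per day
storage_bytes = device_count * report_size * reports_per_day * report_days
requests_per_second = device_count / report_interval

print(f"storage needed: {storage_bytes / 1024**3:.0f} GiB")    # ~549 GiB
print(f"request rate: {requests_per_second:.0f} requests/s")   # ~111 requests/s
```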

Tip 2: Trade-off Low Latency vs. Scalability

Generally, when we face a system that requires taking action in real time, such as when Netflix detects that a user has exceeded the allowed number of concurrent streams, the data sending and collection pipeline becomes a fundamental part of the business logic and must be treated as a critical system.

On the other hand, if what we are looking for is, for example, collecting audience data, it is possible to accumulate events on the device and send them later in batches. If the system is unavailable, because the device lost connectivity or the server is down, the device can keep storing data and retry until delivery succeeds.

To visualize how low latency trades off against scalability, suppose we need to collect data from 100,000 users. If real-time processing is required to detect concurrency (as in the Netflix example), devices will need to send activity reports every, say, 15 seconds, so the system will have to handle more than 6,500 requests/second. If, instead, the data is only analyzed later, devices can send their accumulated information every 15 minutes, which leaves us with about 110 requests/second. The infrastructure needed in each case is very different.

An important detail to keep in mind is that devices must send their information uncoordinated with one another; in other words, reports should arrive at the server evenly distributed over time. It would be a real problem if all 100,000 devices sent their reports at once: the processing peaks would almost certainly cause data loss due to under-provisioned capacity.

A very simple way to achieve a uniform distribution of messages is to have each device report every fixed interval plus a small random offset. For example, instead of reporting exactly every 15 minutes, devices report every 15 minutes +/- 180 seconds, with the +/- 3 minutes recalculated on every send. This idea is very simple to implement.
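A Python sketch of this scheduling idea (the interval and jitter values are taken from the example above):

```python
import random

BASE_INTERVAL = 15 * 60   # report every 15 minutes...
JITTER = 180              # ...plus or minus up to 3 minutes

def next_report_delay() -> float:
    """Seconds to wait before the next report; the random offset is
    recalculated on every send, which spreads devices out over time."""
    return BASE_INTERVAL + random.uniform(-JITTER, JITTER)
```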

Tip 3: Use Web standards to send data

In our experience, adopting web standards has let us reuse the same technologies and infrastructure we had already used in other projects. HTTPS already gives us the security needed to send the data, and JSON as the data structure gives us all the flexibility needed to send any type of data.

As an example, a Set-Top-Box report could be a POST over HTTPS carrying a list of events, each with its time, type, and data. It should be noted that it is always a good idea to use the ISO 8601 standard to represent dates and times.

Example of a POST message to the server with data collected by the Set-Top-Box:

POST /api/v1/report HTTP/1.1
Content-Type: application/json

{
    "created_at": "2007-11-03T12:31:00",
    "events": [
        {
            "time": "2007-11-03T12:15:00",
            "type": "tune_dvb_service",
            "data": "dvb://1.1.4A"
        }, {
            "time": "2007-11-03T12:17:45",
            "type": "set_volume",
            "data": 0
        }, {
            "time": "2007-11-03T12:19:15",
            "type": "set_volume",
            "data": 67
        }
    ]
}
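A device-side sender for a report like the one above can be sketched in a few lines of Python (the collector host is a made-up placeholder; the path and payload shape follow the example):

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical collector endpoint; the path matches the example above.
REPORT_URL = "https://collector.example.com/api/v1/report"

def build_report(events: list) -> dict:
    """Assemble a report in the format shown above, with an ISO 8601 timestamp."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "events": events,
    }

def send_report(events: list) -> int:
    """POST the report as JSON over HTTPS and return the HTTP status code."""
    body = json.dumps(build_report(events)).encode("utf-8")
    request = urllib.request.Request(
        REPORT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```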

Tip 4: Control the time of the next sending of information

If for some reason our data collection system begins to overload, either because of infrastructure problems or because more devices are reporting, then with the implementation described above the only way to reduce server load is to quickly increase processing capacity.

A very simple way to mitigate this is for the server to include, in its response to each device report, how long that device must wait before sending again. We can then ask devices to slow down or speed up their reporting pace. For example, instead of receiving data every 15 minutes (900 seconds), we can ask a device to send its next message in 30 minutes (1800 seconds); server load should then drop by half within about 15 minutes.

As an example, the response to the previous message could again be a JSON body containing the number of seconds the device should wait before sending its next message.

Response of the data collection system to the device report:

HTTP/1.1 200 OK
Content-Type: application/json

{
   "seconds_next_message": 1800
}
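On the device side, honoring this response can be a small piece of state (a Python sketch; the field name comes from the example response, and the fallback behavior is our assumption):

```python
import json

# Device-side pacing state; 900 s (15 min) is the default interval.
current_interval = 900

def handle_report_response(body: bytes) -> int:
    """Adopt whatever pace the server asked for in its response.
    On a malformed or incomplete body, the device simply keeps
    its previous reporting interval."""
    global current_interval
    try:
        current_interval = int(json.loads(body)["seconds_next_message"])
    except (ValueError, KeyError, TypeError):
        pass  # malformed response: keep the previous pace
    return current_interval
```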

Tip 5: Keep the device simple

Whatever device we are working with, whether a Set-Top-Box, a wearable, or a GPS tracking device, it is always complex and expensive to perform firmware updates on it.

These updates are usually done through what is called an “OTA Update” (Over-The-Air Update): the device receives, by some means, a new firmware version, signed so that the device can verify it was actually produced by the manufacturer.

This process tends to be complex for organizations. It usually increases operations and support costs, from customer calls and complaints, and it requires building backward-compatible software to support devices that have not yet updated but are still connected and talking to our servers.

Because of this, it ends up being more agile and less expensive to move all the data processing to the servers. The devices' only responsibility is data collection, regardless of what business use the data may have later.

To visualize this idea, let's use the example report. Suppose we build a first audience-measurement model. Looking at the report above, we conclude that from 12:15:00 the user was watching a channel and, at least until the report time of 12:31:00, did not change it. Under this first model the user has so far watched the channel for 16 minutes.

Then, in a second model, we decide to account for the fact that for a minute and a half the volume was muted (set to “0” at 12:17:45 and restored to “67” at 12:19:15) and exclude that interval from the report. Under this new model, the user watched the service for 14.5 minutes.

If we had implemented the first model in the device software, the device would be sending the computed figure in its reports. If we then wanted to switch to the second model, we would need to update that logic on every device. Keeping the logic on our servers instead lets us act much faster and change implementations in an agile way.
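To make the idea concrete, here is a Python sketch of both models computed entirely server-side from the raw events in the example report (the modeling choices are ours, for illustration only):

```python
from datetime import datetime, timedelta

def watched_seconds(events, report_time, count_muted=True):
    """Estimate watch time from raw device events.
    First model: everything from the tune event to the report time counts.
    Second model (count_muted=False): intervals where volume == 0 are excluded.
    """
    ts = datetime.fromisoformat
    start = ts(events[0]["time"])              # the tune_dvb_service event
    total = (ts(report_time) - start).total_seconds()
    if count_muted:
        return total

    muted = timedelta()
    muted_since = None
    for event in events:
        if event["type"] != "set_volume":
            continue
        if event["data"] == 0 and muted_since is None:
            muted_since = ts(event["time"])
        elif event["data"] > 0 and muted_since is not None:
            muted += ts(event["time"]) - muted_since
            muted_since = None
    if muted_since is not None:                # still muted at report time
        muted += ts(report_time) - muted_since
    return total - muted.total_seconds()
```

With the events from the first example report, the first model yields 960 seconds (16 minutes) and the second 870 seconds (14.5 minutes); switching from one model to the other is just a server-side deployment.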

Following this philosophy, let's continue with the initial example and enrich the report further by also sending the device's configuration data, which can be very useful for support. Now it is possible to know, for each device, its audio and video configuration, signal reception levels, firmware version, and other data.

Having all this information at hand when dealing with a customer complaint helps reduce the MTTR (Mean Time To Repair) and makes it possible to prevent such cases by detecting device configuration and installation problems early.

Example of a POST message with more operating data:

POST /api/v2/report HTTP/1.1
Content-Type: application/json

{
    "created_at": "2007-11-03T12:31:00",
    "system_settings": {
         "default_audio_language": "eng",
         "default_subtitles_language": "spa",
         "system_language": "eng",
         "video_output_resolution": "1080i",
         "device_id": "123456789",
         "software_version": "3.2.1",
         ...
     },
    "events": [
        {
            "time": "2007-11-03T12:15:00",
            "type": "tune_dvb_service",
            "data": {
                "signal": "-32db",
                "uri": "dvb://1.1.4A",
                "volume": 50,
                "audio_language": "eng",
                "subtitles_language": null,
                ...
            }
        }, {
            "time": "2007-11-03T12:17:45",
            "type": "set_volume",
            "data": 0
        }, {
            "time": "2007-11-03T12:19:15",
            "type": "set_volume",
            "data": 67
        }
    ]
}

Tip 6: Use a simple and scalable system to analyze data

Once we have defined the usage-report structure and the events to be sent, we can build the system that collects and processes the data.

To start with an MVP of a data collection and analysis system, we recommend the ELK Stack. It is a very simple set of tools for collecting, processing, and analyzing data. At Qualabs we use it to store audit and usage logs for our software, as well as to centralize incidents and errors. We may write more about this in the future.

ELK is an acronym formed from the names of three Open Source products that, combined, build the entire information-management solution we need:

  • E: Elasticsearch, a search and analytics engine.
  • L: Logstash, a log-processing service.
  • K: Kibana, a data query and visualization tool for Elasticsearch.
How Logstash, Elasticsearch and Kibana are related

Elasticsearch is basically a non-relational database in which we can store the data exactly as our devices send it and run search and analysis operations on it. It can quickly compute the most frequent values, averages, medians, and more. It is also very easy to scale, since it is composed of nodes: if you need more resources, you can add nodes to the cluster to increase capacity and speed up queries.

To get the data into Elasticsearch we will use Logstash, a tool responsible for processing and queuing the messages to be stored and indexed by Elasticsearch. Logstash has a very wide variety of plugins, both for receiving and for sending data; for example, there are input plugins for Tweets and output plugins that email the information of a received event.
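As a sketch, a minimal Logstash pipeline for this setup could receive the devices' HTTPS POSTs directly and index them into Elasticsearch (the port, host, and index name below are assumptions for the example):

```
input {
  http {
    port => 8080            # devices POST their JSON reports here
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "stb-reports-%{+YYYY.MM.dd}"   # one index per day
  }
}
```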

Inputs and outputs of Logstash

Finally, Kibana is nothing more than a graphical interface for exploring the data and building charts from the information Elasticsearch can derive from it. The image below shows an example dashboard configured in Kibana that displays charts generated from visitor data for a website. It is a clear illustration of this tool's versatility and of its potential for exploring and learning more about our data.

Example of dashboard with graphics in Kibana

Tip 7: Encourage the connection of devices

It may seem a minor issue, but it is not. All of this makes sense only if we have devices connected to the internet reporting data. If we don't give our users some value for connecting their devices, none of this will matter.

In the case of Set-Top-Boxes, the incentive for the user is clear: enjoying OTT services such as YouTube, Netflix, and Hulu. But the benefit may not always be so clear in other kinds of products. For example, if I own a washing machine that can connect to the internet, what do I gain as a user by connecting it? Do I really gain enough to bother fixing a connectivity problem if my device goes offline tomorrow?

This point may seem basic, but you have to offer a genuine value proposition so that users are willing to hand over information about how they use their devices in exchange.

How does the story continue?

So far we have had a snapshot of all the processes and systems that interact in a device data-collection solution. Once we start receiving data, the first thing to do is explore it: look for usage patterns and, above all, for empirical evidence that confirms or destroys our beliefs about how users interact with the devices.

This is a mindset change, and we will need to start breaking down our preconceptions. From now on, the culture within the company must change: decisions must have real grounding, must be data-driven, justified by the data our users are giving us first-hand.

If you want to implement a project like this and do not know where to start, at Qualabs we have the resources, flexibility and experience necessary to do so! Write to info@qualabs.com and we’ll help you.

Have other ideas or services you need?
We develop custom solutions

info@qualabs.com