Re: [Tails-ux] Report on Piwik prototype

Author: sajolida
Date:  
To: Tails user experience & user interface design
Subject: Re: [Tails-ux] Report on Piwik prototype
intrigeri:
> sajolida:
>> A single user is defined by:
>
>>     (OS, browser + browser plugins + IP address + browser language)

>
> Did you really mean plugins, or are add-ons taken into account as well?


I don't know.

>> - I'm not sure boum.org will be able or ready to have IP in their logs.
>
> It's worth asking. We could suggest they store these logs in RAM only,
> and either help them get enough RAM to handle it, or retrieve the logs
> as often as needed so they don't need too much memory.


As said in my previous answer to u, I'm putting Piwik "on hold" until I
have more time and maybe a clearer timeline regarding #14588. I would
find it sad to ask boum.org to do weird stuff if we're moving our
website somewhere else shortly after. Now, if we decide to rely on
JavaScript and the IP anonymization feature of Piwik, we could bypass
this limitation ourselves already :)

>> - We won't have analytics from people without JavaScript.
>
> I suspect that's a pretty small portion of our website visitors, but
> it would be sad to simply ignore them.
>
>> - The JavaScript might not give us all the analytics we need.
>> For example the hits on the security upgrade feed by Tails Upgrader.
>
> Can we do both, i.e. importing logs *and* setting up the JS?
> Will Piwik be able to de-duplicate hits?


I'm not sure but I would bet that Piwik cannot de-duplicate incoming
data and merge Apache logs *and* JavaScript. At least I've never read
about that and it seems like you have to do one thing or the other...

For example, the tool to import Apache logs into Piwik is a Python
script that actually generates hits on the Piwik API just as the
JavaScript would do.
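
To illustrate what "generating hits on the Piwik API" means, here is a
minimal sketch that only builds the URL of one tracking request, without
sending it. The piwik.php endpoint and the idsite, rec, url and ua
parameters come from Piwik's Tracking HTTP API. The base URL, the site ID
and the helper name are made up for the example, and this is not meant to
describe what import_logs.py does internally.

```python
from urllib.parse import urlencode


def build_tracking_url(piwik_base, site_id, page_url, user_agent=None):
    """Build (but do not send) the URL of a single Piwik tracking request.

    piwik_base and site_id are hypothetical values for illustration.
    """
    params = {
        "idsite": site_id,  # ID of the website in Piwik
        "rec": 1,           # required for the hit to be recorded
        "url": page_url,    # URL of the page that was visited
    }
    if user_agent is not None:
        params["ua"] = user_agent  # override the User-Agent of the request
    return f"{piwik_base.rstrip('/')}/piwik.php?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical Piwik instance and site ID.
    print(build_tracking_url(
        "https://piwik.example.org/",
        1,
        "https://tails.boum.org/doc/",
        user_agent="Mozilla/5.0",
    ))
```

Both the JavaScript tracker and a log importer would end up producing
requests of this general shape, which is why I doubt that Piwik could tell
them apart and de-duplicate them.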

We could also activate the JavaScript from Piwik for a few days just to
try out stuff, before taking a decision. For example we could compare
the hits with Apache logs and with the Piwik JavaScript and see if we're
talking about 1% of users without JavaScript, 10%, or 0.1%.
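
The comparison itself could be as simple as the sketch below. The visit
counts are invented placeholders, not measurements, and it assumes that
both sources count the same pages over the same period.

```python
def share_without_javascript(log_visits, js_visits):
    """Estimate the percentage of visits that only show up in Apache logs.

    log_visits: visits counted from Apache logs (all visitors)
    js_visits:  visits counted by the Piwik JavaScript (JS-enabled only)
    """
    if log_visits <= 0:
        raise ValueError("log_visits must be positive")
    # Visits seen in the logs but not by the JavaScript tracker.
    missing = max(log_visits - js_visits, 0)
    return 100.0 * missing / log_visits


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    print(share_without_javascript(log_visits=1000, js_visits=990))
```

In practice the two counts would not match so neatly, for example
because of bots in the Apache logs, so the result would only be a rough
order of magnitude.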

I also understand that with only Apache logs, Piwik wouldn't be able to
differentiate Tor Browser users coming out of a single exit node while
this would be possible with the JavaScript. So we're probably losing
something with both options and we would need more data to make an
informed decision.
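
Here is a toy sketch of why log-only data collapses Tor Browser users
behind one exit node. It hashes the attributes from the definition of a
single user quoted at the top of this thread. The hashing scheme, the
function name and all values are assumptions for illustration. This is
not Piwik's actual algorithm.

```python
import hashlib


def visitor_key(os_name, browser, plugins, ip, language):
    """Derive a toy visitor key from the attributes available in logs.

    Illustrative only: not how Piwik really computes visitor identity.
    """
    raw = "|".join([os_name, browser, ",".join(sorted(plugins)), ip, language])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # Two different people using Tor Browser through the same exit node.
    # Tor Browser makes their OS, browser, plugins and language look alike,
    # and the exit node gives them the same IP (documentation address here).
    alice = visitor_key("Windows", "Firefox", [], "192.0.2.10", "en-US")
    bob = visitor_key("Windows", "Firefox", [], "192.0.2.10", "en-US")
    print(alice == bob)
```

With identical attributes the two keys are identical, so both people would
be counted as one visitor. The JavaScript tracker presumably has
additional signals on the client side, such as its first-party visitor
cookie, to tell them apart.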

>> Resources needed
>> ================
>
>> - I'm running it on a dedicated X200 with a Core 2 Duo P8700 and 4 GiB
>> of RAM
>
> What kind of storage hosted the raw logs and DB?


An external spinning hard disk plugged in through USB :)

>> - I imported logs from November 10 to January 19 (71 days) and the
>> database is now 13.0 GiB (from 1.1 GiB of gzip Apache logs).
>
> Wow! Note that this impacts not only storage, but probably also RAM
> requirements to handle the dataset efficiently.


Yep. Note that by default Piwik purges nothing, which means that both
visitor logs (the full activity) and reports (the aggregated stats) are
kept forever. That was the case for my database. But you can of course
configure that and, for example, delete visitor logs after some time:

https://piwik.org/docs/managing-your-databases-size/
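
As a rough back-of-the-envelope illustration of the figures above (71
days, 13.0 GiB of database, 1.1 GiB of gzip logs), the sketch below
computes the growth per day and a naive projection. It assumes constant
traffic, linear growth and no purging at all, which is surely too
simplistic for real capacity planning.

```python
DAYS_IMPORTED = 71      # November 10 to January 19
DB_SIZE_GIB = 13.0      # database size after the import
GZIP_LOGS_GIB = 1.1     # size of the gzip Apache logs imported


def growth_per_day(db_gib, days):
    """Average database growth in GiB per day of imported logs."""
    return db_gib / days


def expansion_factor(db_gib, logs_gib):
    """How many times bigger the database is than the gzip logs."""
    return db_gib / logs_gib


def projected_size(db_gib, days, horizon_days):
    """Naive linear projection of the database size, without purging."""
    return growth_per_day(db_gib, days) * horizon_days


if __name__ == "__main__":
    print(f"{growth_per_day(DB_SIZE_GIB, DAYS_IMPORTED):.2f} GiB per day")
    print(f"{expansion_factor(DB_SIZE_GIB, GZIP_LOGS_GIB):.1f}x the gzip logs")
    print(f"{projected_size(DB_SIZE_GIB, DAYS_IMPORTED, 365):.0f} GiB after a year")
```

Deleting old visitor logs as described in the link above would change
this projection a lot, since only the aggregated reports would keep
growing.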

>> My understanding is that I generated the reports for this period but
>> didn't get rid of the raw data from the database. This is possible to
>> do but is a different process I think.
>
> Good to know. We'll need to learn more about this whenever we
> seriously think of deploying this in production.


See the previous link. That's a config in the admin interface :)

>> - Processing all reports for all this data takes several hours,
>> maybe almost a day. I didn't try to process a report for a single day
>> only.
>
> I'm curious what was the bottleneck:
>
> * Were all CPU cores used during this process?
> * Was I/O a blocker, i.e. were processes blocked waiting for I/O?
> * Was all available memory used by this process?


All this I don't know. I could show you next time we meet because
drawing conclusions on resources over email between the two of us would
take much more time and be more error-prone :)

> * Did you configure MariaDB in any way to optimize for large DBs?


No

>> Lovely sysadmins
>> ================
>>
>> I want guidance from the sysadmins team on how to move this forward and
>> be integrated in our official infrastructure.
>
> Now is a good time to ask, since we'll likely be upgrading our
> hardware later this year.
>
> To start with, we need the list of package dependencies


I installed Apache, MariaDB, PHP (7.0.19) and a bunch of PHP libs:

php php-curl php-gd php-cli php-xml php-mbstring

And that was it.

> what access
> you need beside a shell (e.g. write access to file X, ability to run
> command Y as root)


On the command line I run:

- The log importer (/path/to/piwik/misc/log-analytics/import_logs.py)
- The Piwik console (/path/to/piwik/console)

> the list of DBs and directories to backup


There's not even a single config file (apart from Apache), so backing up
the database and the root directory of Piwik should be enough.

> and
> resources requirements (ideally: current needs & what you'll need in
> 2 years).


This needs more investigation :)

→ #14601

> We can discuss the specifics later of where to draw the line between
> managing the other bits of the setup with Puppet vs. managing things
> by hand. Each has serious pros & cons.


Ok.

>> The sysadmin work I had to do here was very little:
>
>> - Default Apache configuration
>
> Thankfully, it seems that nginx is supported as well.
>
>> But in the long run there might be work needed to monitor the
>> performance issues, tweak for better performance, etc. There is some doc
>> about that on their website [4].
>>
>> [4]: https://piwik.org/docs/optimize-how-to/
>
> Good to know. FWIW, what I found somewhat concerning at first glance:
>
>  * The part about Redis and Queued Tracking (which is currently
>    "BETA"), that will require us to dive into yet another technology
>    we don't know.

>
>  * We have no expertise internally wrt. efficiently handling large
>    datasets in a SQL database, nor about hosting a high-traffic PHP
>    webapp either, so the learning process will be slow and will take
>    us quite some time. Let's keep in mind that we have no such time
>    allocated at the moment.


Of course. And I'm not in a hurry at all. For now, I'm happy that I did
a quick experiment, have a prototype to play with, and know the next
steps, pending discussion and research.