23 stories
1 follower

The data.table Cheat Sheet

1 Share

(This article was first published on DataCamp Blog » R, and kindly contributed to R-bloggers)

The data.table package provides an enhanced version of data.frame that allows you to do blazing fast data manipulations. data.table is being used in different fields such as finance and genomics, and is especially useful for those of you that are working with large data sets (e.g. 1GB to 100GB in RAM).

Although its typical syntax structure is not hard to master, it is unlike other things you might have seen in R. Hence the reason to create this cheat sheet. DataCamp’s data.table cheat sheet is a quick reference for doing data manipulations in R with the data.table package, and is a free-for-all supplement to DataCamp’s interactive course Data Analysis the data.table Way.

The cheat sheet  will guide you from doing simple data manipulations using data.table’s basic i, j, by syntax, to chaining expressions, to using the famous set()-family. You can learn more about data.table at DataCamp.com.

The post The data.table Cheat Sheet appeared first on DataCamp Blog.

To leave a comment for the author, please follow the link and comment on his blog: DataCamp Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Read the whole story
1329 days ago
Share this story

Standards for Scientific Graphic Presentation

1 Share

TL;DR: Over the previous hundred years, a lot of work has gone into standardizing the way scientific data is presented. All of this knowledge has been largely forgotten. I want us to bring it back to life.

Before we talk about science, let’s take a short but scenic detour. A few months ago, while scrolling through the latest popular Kickstarters (as I tend to do in an effort to get motivated to get out of bed), I came across this fantastic project: Full size reissue of the NYCTA Graphics Standards. You see, in the 1960s, taking the subway in New York was a chaotic experience.

Chaotic signs of the New York subway

Walking down the stairs of a subway entrance (once you were lucky enough to find one), you were bombarded by a cacophony of diverse signs that took quite a lot of mental effort to comprehend. Figuring out exactly which train you have to take from which platform1 on which side, was far from trivial.

Massimo Vignelli, Bob Noorda, Unimark

In the late sixties the transit authority finally hired designers Massimo Vignelli and Bob Noorda (Unimark) to organize this chaos and devise a new system for wayfinding in New York’s subway system. When the work was finished in 1970 with the publication of the Graphic Standards Manual, it not only succeeded in its initial goals, but also became a timeless example of great design elegantly solving a real problem.

As a result, all of the various styles of conveying directions and wayfinding were unified into a single iconic graphic standard that helps you effortlessly find your way around New York to this day, enabling you to focus on your journey as opposed to focusing on deciphering signs.

OK, that’s cool, but what does all this have to do with science?

Hold on, we’re getting there. A few days later I was, coincidentally, reading up on the history of information and data visualization while working on a project at PLOS visualizing article metrics in various ways, when I stumbled on this little hidden treasure:

... a set of standards and rules for graphic presentation was finally adopted by a joint committee (Joint Committee on Standards for Graphic Presentation, 1914)

Uhm, what? What standards? I have never heard of standards in graphic presentation, i.e. standards for charts and figures. What is all this about? And with this question a journey into the knowledge black hole began.

And what an exhilarating journey it was, and really, still is. Take my hand and let’s go! First, let’s find this document from 1914: http://www.jstor.org/stable/pdfplus/2965153.pdf2. Ideally, I’d like you to send this PDF to your tablet or print it out, make a cup of tea, take a deep breath, jump on the couch and just read it through. It's only 10 pages and even these are heavily illustrated.

This document was produced by a committee of several really smart people from all branches of science and engineering, people and organizations like Willard Brinton (chairman) from the American Society of Mechanical Engineers (of Graphic Presentation fame), the American Mathematical Society, the American Statistical Association, American Genetic Association, American Association for the Advancement of Science, and so on. I hope I’ve made the point that this was a very capable group people coming together for a common cause.

So what does this document contain? It begins with a great thought and rationale for standardizing scientific graphics:

If simple and convenient standards can be found and made generally known, there will be possible a more universal use of graphic methods with a consequent gain to mankind because of the greater speed and accuracy with which complex information may be
imparted and interpreted.

That's exactly what the Graphic Standards Manual achieved for the subway system, unifying how information is conveyed and thereby significantly speeding up and simplifying its use, to the point where you don’t think about how the information is presented, but instead think only about the information itself. And this is the connection between the New York transit authority’s efforts, and the efforts of the scientific community a hundred years ago. With one major, critical distinction: While the Graphics Standards Manual is still successfully being used today, the efforts of the Joint Committee on Standards for Graphic Presentation have been eroded by the sands of time and simply forgotten.

There are a lot of other pearls of wisdom to be found in this work, but there is one in particular that stood out:

If numerical data are not included in the diagram it is desirable to give the data in tabular form accompanying the diagram

This is essentially saying that data should accompany the figure. A hundred years later, we’re still far from applying this simple but powerful rule. Projects like The Content Mine expend a tremendous amount of energy trying to get data back from figures, and even then it’s a very lossy process. All of this could be avoided if we just follow this one simple rule.

It’s clear these people know what they’re talking about and it makes sense to try and find more of their work. After a lot more digging, I found that this was not the last time this committee convened. In Calvin F. Schmid's review of "The Role of Standards in Graphic Presentation"3, it’s mentioned briefly that several more meetings were held:

Since the publication, in 1915, of the report by the original Joint Committee, other committees prepared expanded reports on the standards of graphic presentation in 1936, 1938 and 1960.

Obviously, we need to find these documents as well. This, however, is not easy. In fact, I’ve only managed to find the 1938 report4 online. Consider this also a call for help: If you have any ideas about how to locate the 1936 and 19605 documents (titled "Time-series charts"), please let me know. Additionally, upon further research I discovered there was one more publication, a final revision in 1979 of what was at that point an actual ANSI standard Y15.2M6. This document cannot be found online either, and is available only in select libraries in the States.

But let’s work with what we have: The 1938 revision is an incredible piece of work. It addresses everything you could possibly think of in terms of how to construct your charts. From good composition practices, such as centering a chart around an optical center, to usage of grids, inclusion of a 0 baseline to avoid perceptual distortion, the importance of scale selection, labeling of axes and curves, reference notes, line styles, and so much more. We should try and use this knowledge to improve the presentation of scientific data today.

Like the New York subway system in early 1960s, scientific figures are currently in a very chaotic state There are significant differences in presentation of the same types of data, which undoubtedly results in a loss of efficiency when researchers are interpreting this information.

On top of that, some other common issues are7:

  1. There are too many lines packed on a single line chart.
  2. The charts are needlessly ornamented with things like gradients or 3D effects.
  3. The axis is missing or constructed in a non-intuitive way, or the baseline is missing.
  4. The data is not included in the figure.
  5. An improper chart type is selected.

All of these issues have been addressed in significant detail in the documents mentioned and while a lot of best practices have been established organically, there is simply no excuse for forgetting the vast knowledge that was painstakingly created so many years ago. On top of that, while it was possibly a bigger effort to follow these standards a hundred years ago, when each figure was drawn by hand, it is technically possible to follow these standards with a single click today. In fact, with the technical capabilities we have, it's unforgivable that we still allow figures to be flattened and mushed together into compressed images when they are prepared for publication. But I digress.

Show ... me ... the example!

I had a difficult time finding a figure which would be easy to rebuild using modern technology, purely because the data necessary to recreate figures is almost never available. One bright example which does include data is, not surprisingly, Heather Piwowar's paper on the open data citation advantage. Following some standards found in the documents and using some modern tools, I decided to rebuild Figure 2 of that paper:

This is an interactive figure, which also has data included. If you click on the data link, you will automatically download the data that is presented, in a JSON format.

This is of course a simple example, but there are MANY such simple charts in scientific papers and it is not a significant effort to recreate them following a century old standards.

What can we do next?

As a first step, we should bring this knowledge back to life. Find copies of the reports and digitize them, upload them to archive.org so that they may never disappear again.

Second, given the breadth of the material found and of the material remaining to be found, it will take some time to study all of it. If you are interested in helping with that, drop me a note. The creation of these standards was a community effort and such should be their modernization. Things like interactivity and modern ways to include data should be discussed as well.

Third, the problem of the way we create scientific charts and figures should simply be recognized. The mistakes of flattening each figure and compressing it, mangling the data, converting vector illustrations into raster images - all of those should be recognized and addressed. Only be recognizing that this is actually a problem, and acknowledging that we do have the means to fix this problem, will we be able to progress.

Fourth, we should connect efforts such as http://metricsgraphicsjs.org, http://idl.cs.washington.edu/projects/lyra/, Datawrapper and so many more. Modernizing scientific data presentation is our common goal.


Sometimes it makes sense to listen to the old masters. The issue of graphic presentation is certainly one where we’ve ignored their efforts and we should try and correct that.

It starts with you.


  1. I did not actually ride the subway in the 1960s, but would be happy to learn what the experience was like from anyone who did.
  2. Joint Committee on Standards for Graphic Presentation. Publications of the American Statistical Association, Vol. 14, No. 112 (Dec., 1915), pp. 790-797. Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association. Article DOI: 10.2307/2965153. Article Stable URL: http://www.jstor.org/stable/2965153
  3. Graphic Presentation of Statistical Information: Papers Presented at the 136th Annual Meeting of the American Statistical Association, Social Statistics Section : Session of Graphical Methods for Presenting Statistical Data : Boston, Massachusetts, August 23-26, 1976
  4. American Society of Mechanical Engineers. (1938). Time series charts: a manual of design and construction. American standard. Approved by American standards association, November 1938. New York: Committee on Standard for Graphic Presentation, American Society of Mechanical Engineers.
  5. American Standards Association., & American Society of Mechanical Engineers. (1960). Time-series charts. New York: American Society of Mechanical Engineers.
  6. American National Standards Institute., & American Society of Mechanical Engineers. (1979). Americal national standard: Time-series charts. New York: American Society of Mechanical Engineers.
  7. I’ve made the examples black and white, to further emphasize some of the accessibility issues
Read the whole story
1335 days ago
Share this story

Das BSI hat einen Praxis-Leitfaden für Penetrationstests ...

1 Share
Das BSI hat einen Praxis-Leitfaden für Penetrationstests herausgegeben. Ich bin kein Freund von BSI-Bashing und blogge auch selten über Berufliches, aber dieser Leitfaden verdient Aufmerksamkeit. Zum Kontext: Das BSI ist bisher nicht durch eigene Penetrationstest-Kompetenz aufgefallen, hat keine weiterbringenden Advisories veröffentlicht, keine neuen Bugs gefunden, outsourced Dinge dafür gerne an Drittfirmen.

Die Kern-Forderung des Dokumentes ist, dass man nicht einfach bei irgendwelchen Krautern seine Penetrationstests in Auftrag geben soll, sondern bei zertifizierten Prüfstellen. Nun gibt es bisher keine zertifizierten Prüfstellen, weil völlig unklar ist, wer da überhaupt die Kompetenz haben soll, andere Marktteilnehmer zu zertifizieren. Es gibt natürlich Common Criteria und ISO9000, aber das ist hier mit Zertifizierung nicht gemeint.

Ich für meinen Teil sehe das Problem auch, dass es unseriöse Marktteilnehmer gibt, aber das wird man mit Zertifikaten nicht los. Im Gegenteil, die verkaufen selber üblicherweise Zertifikate. Ich würde sogar soweit gehen, dass man unseriöse Marktteilnehmer relativ zuverlässig daran erkennen kann, dass sie Zertifikate verkaufen.

Hier ist, was das BSI vorschlägt (Sektion 2.2.3).

Anbieter, die IS-Penetrationstests anbieten, sollten möglichst als Prüfstelle zertifiziert sein. Sie sollten nachweislich die Grundsätze des Datenschutzes, der sicheren Datenhaltung und der IT-Sicherheit einhalten und qualifiziertes Personal beschäftigen.
Das ist in der Praxis natürlich Blödsinn. Einhaltung des Datenschutzes weist man nicht nach, wie soll das auch vor dem Einsatz für die Zukunft gehen, sondern man macht einen Vertrag, wo das mit heftigen Strafen belegt wird. Und dann leakt da auch nichts. Bei der sicheren Datenhaltung kann man jetzt natürlich sagen, dass das wie eine gute Idee klingt. Aber in der Praxis steht im Vertrag drin: ihr dürft unseren Quellcode sehen, aber nicht kopieren. Es gibt da keine Daten, die man als Pentester halten würde. Außer natürlich dem Report, den man im Rahmen des Auftrags schreibt. Man archiviert da nicht jahrelang irgendwelche Daten von irgendwelchen Kunden, die bei irgendwelchen Tests vor 10 Jahren angefallen sind. Das tut man in den Vertrag und fertig.

Das mit dem qualifizierten Personal halte ich für am gefährlichsten an der ganzen Chose. Denn da gibt es schlicht keine Metriken. Es gibt natürlich irgendwelche Gruppierungen aus dem Zertifikate-Verkloppen-Umfeld, die auch Zertifikate für Personen verkloppen. Und bestürzenderweise zitiert das BSI auch genau diese Leute in ihren Referenzen am Ende, irgendwelche sinnlosen "ethical hacker"-Pömpel mit genau Null Aussagekraft. Selbst wenn jemand sein Leben lang ehrlich war, kann der ja trotzdem morgen kriminell werden und mit deinen Daten weglaufen. Die Aussagefähigkeit und Messbarkeit ist hier meines Erachtens nicht gegeben. Und insbesondere haftet keiner dieser Zertifikatsherausgeber dafür, wenn einer ihrer Zertifikatsträger sich dann anders als zertifiziert verhält. Zertifikate verkaufen ist ein Geschäftsmodell, kein Kundendienst.

Und so stellt sich auch für das BSI die Frage, was man denn jetzt machen soll, wenn keine Zertifikate vorliegen. Antwort:

Es wird dort verlangt, dass ein IS-Penetrationstester Berufserfahrung in dem Bereich IT-Penetrationstest besitzt und eine technische Ausbildung abgeschlossen hat. Der Projektverantwortliche muss in den letzten acht Jahren mindestens fünf Jahre Berufserfahrung (Vollzeit) im Bereich IT erworben haben, davon sollten mindestens zwei Jahre (Vollzeit) im Bereich Informationssicherheit absolviert worden sein. Zudem sollte der IS-Penetrationstester an mindestens sechs Penetrationstests in den letzten drei Jahren teilgenommen haben. Dies sollte möglichst vom Anbieter über entsprechende Referenzen nachgewiesen werden.
Aus meiner Sicht ist das wildes Hand Waving. Gestikulieren, damit keiner merkt, dass der Kaiser keine Kleider trägt.

In der Praxis kann man das mit den Referenzen vergessen. Viele Kunden wollen nicht als Referenz auftreten. Und selbst wenn Firma XY mit Mitarbeiter A und B einen Pentest bei Firma Z durchgeführt hat, weiß Z doch gar nicht, wer da tatsächlich die Arbeit gemacht hat. Das ist schlicht nicht bewertbar, ob A oder B fit ist. Oder beide. Oder vielleicht keiner von ihnen und sie hatten nur Glück oder ein gutes Tool dabei.

Der Report fällt dann noch durch Vorschläge wie den hier auf (Sektion 3.1.2):

Der Auftraggeber sollte gewährleisten, dass keine Änderungen an den Systemen während der Tests durchgeführt werden. Sollte der Ansprechpartner des Auftraggebers durch Beobachten des IS-Penetrationstests oder Gespräche auf Sicherheitslücken aufmerksam werden, so muss er warten, bis der IS-Penetrationstest abgeschlossen ist, bevor er die Lücke beseitigt, da sonst die Testergebnisse verfälscht werden können.
Bitte was? Der Kunde darf seine Systeme nicht sicherer machen, weil sonst der Test verfälscht werden könnte? Liebes BSI, Ziel eines Pentests ist, dass die Systeme sicherer werden, nicht dass der Test nicht verfälscht wird. Was ist DAS denn für ein Blödsinn?! Und im Übrigen, ich als Pentest-Durchführer bin der Auftragnehmer, ich kann da meinem Auftraggeber keine Vorschriften machen, wie ein Test ordentlich durchgeführt gehört.
Sollte eine derart gravierende Lücke entdeckt werden, dass es unabdingbar ist, diese sofort zu schließen, so sollte der IS-Penetrationstest abgebrochen und zu einem späteren Zeitpunkt weiter durchgeführt werden.
Es gibt keine fachliche Grundlage für diese Empfehlung.

Humoristisch besonders wertvoll ist auch diese Empfehlung (Sektion 4.1.2):

Das BSI empfiehlt, das IS-Penetrationstester nur solche Exploits einsetzen, deren Wirkungsweise sie schon untersucht und getestet haben.
Scheiße, Bernd!! Hätte mir das doch mal vorher jemand gesagt!1!!

Kurz gesagt: Das ist ein Schuss in den Ofen. Ich verstehe, dass das BSI hier gerne aktiv werden würde, denn Zertifikate kosten Geld, womöglich kann das BSI sich hier ähnlich wie das Patentamt zu einem Profit Center entwickeln. Aber den Kunden hilft das nichts. Aus meiner Sicht sind die Empfehlungen alle gesunder Menschenverstand, angereichert mit "kauft uns doch ein paar Zertifikate ab, Zertifikate können wir gut!"

Read the whole story
1335 days ago
Share this story

Statistics doesn’t have to be so hard: simulate!

1 Share

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but a just a little intuition goes a long way to really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:  


The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come. 

(By the way, my favourite keynote from the connference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)

O'Reilly Strata: Statistics Without the Agonizing Pain

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Read the whole story
1366 days ago
Share this story

Das ehemalige Nachrichtenmagazin besucht mit Snowden-Dokumenten ...

1 Share
Das ehemalige Nachrichtenmagazin besucht mit Snowden-Dokumenten eine Internet-über-Satelligen-Provider und guckt sich sich vor laufender Kamera mit den betroffenen Ingenieuren an. Sehr, sehr gruselig. Und die Leute sind in den Dokumenten auch noch namentlich erwähnt. Das muss ja echt unangenehm für die gewesen sein.
Read the whole story
1401 days ago
Share this story

Making Your Code Citable

1 Share

(This article was first published on BioCode's Notes, and kindly contributed to R-bloggers)

Original post from GitHub Guides:

Digital Object Identifiers (DOI) are the backbone of the academic reference and metrics system. If you’re a researcher writing software, this guide will show you how to make the work you share on GitHub citable by archiving one of your GitHub repositories and assigning a DOI with the data archiving tool Zenodo.
ProTip: This tutorial is aimed at researchers who want to cite GitHub repositories in academic literature. Provided you’ve already set up a GitHub repository, this tutorial can be completed without installing any special software. If you haven’t yet created a project on GitHub, start first byuploading your work to a repository.

 Choose your repository
Repositories are the most basic element of GitHub. They’re easiest to imagine as your project’s folder. The first step in creating a DOI is to select the repository you want to archive in Zenodo. To do so, head over to your profile and click the Repositories tab.
Important! Make sure you tell people how they can reuse your work by including a license in your repository. If you don’t know which license is right for you, then take a look at choosealicense.com.

Login to Zenodo

Next, head over to Zenodo and click the Sign In button at the top right of the page, which gives you an option to login with your GitHub account.
Zenodo will redirect you back to GitHub to ask for your permission to share your email address and the ability to configure webhooks on your repositories. Go ahead and click Authorize application to give Zenodo the permissions it needs.

Pick the repository you want to archive

At this point, you’ve authorized Zenodo to configure the repository webhooks needed to allow for archiving and DOI-issuing. To enable this functionality, simply click the On toggle button next to your repository (in this case My-Awesome-Science-Software).

Check repository settings

By enabling archiving in Zenodo, you have set up a new webhook on your repository. Click the settings icon  in your repository, and then click ‘Webhooks & Services’ in the left-hand menu. You should see something like the image below, which shows a new webhook configured to send messages to Zenodo.

Create a new release

By default, Zenodo takes an archive of your GitHub repository each time you create a new Release. To test this out, head back to the main repository view and click on the releases header item.
Unless you’ve created releases for this repository before, you will be asked toCreate a new release. Go ahead and click this button and fill in the new release form.
If this is the first release of your code then you should give it a version number of1.0. Fill in any release notes and click the Publish release button.

Checking everything has worked

Creating a new release will trigger Zenodo into archiving your repository. You can confirm that this process took place by click the Upload tab in your Zenodo profile. You should see a new upload in the right-hand panel.

Minting a DOI

Before Zenodo can issue a DOI for your repository, you will need to provide some information about the GitHub repo that you’ve just archived.
Once you’re happy with the description of your software, click the Submitbutton at the bottom of the Zenodo form, and voilà, you’ve just made a shiny new DOI for your GitHub repo!

Finishing up

Back on your Zenodo GitHub page you should now see your repository listed with a shiny new badge showing your new DOI!
ProTip: If you really want to show off, then right click on the gray and blue DOI image and copy the URL and place it in your README on your GitHub repo.
Last updated May, 2014

To leave a comment for the author, please follow the link and comment on his blog: BioCode's Notes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Read the whole story
1420 days ago
Share this story
Next Page of Stories