Google Open Source Blog

GitHub on BigQuery: Analyze all the code

Wednesday, June 29, 2016

Posted by Felipe Hoffa, Google Developer Advocate

Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. Until now, you've been able to monitor and analyze GitHub's pulse going back to 2011 (thanks to the GitHub Archive project!), and today we're adding the perfect complement to it. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
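
To make the library-author scenario concrete, here is a minimal sketch in Python that builds such a query. It is an illustration under assumptions, not an official sample query: the table name `bigquery-public-data.github_repos.sample_contents`, the columns `sample_repo_name`, `sample_path` and `content`, and the helper `build_usage_query` are all assumed for the example, so check them against the dataset's schema before running anything.

```python
import re

# Assumption: name of the sample extract in the public GitHub dataset.
SAMPLE_CONTENTS = "bigquery-public-data.github_repos.sample_contents"


def build_usage_query(library, table=SAMPLE_CONTENTS, limit=20):
    """Build a standard SQL query listing repos whose Python files import `library`.

    Returns (sql, params). The regular expression is passed as the named
    query parameter @pattern rather than being pasted into the SQL text.
    Column names are assumptions about the sample table's schema.
    """
    # Match "import <library>" or "from <library>" at the start of any line.
    pattern = r"(?m)^\s*(?:import|from)\s+" + re.escape(library) + r"\b"
    sql = (
        "SELECT sample_repo_name, COUNT(*) AS matching_files\n"
        f"FROM `{table}`\n"
        "WHERE sample_path LIKE '%.py'\n"
        "  AND REGEXP_CONTAINS(content, @pattern)\n"
        "GROUP BY sample_repo_name\n"
        "ORDER BY matching_files DESC\n"
        f"LIMIT {int(limit)}"
    )
    return sql, {"pattern": pattern}


sql, params = build_usage_query("requests")
print(sql)
print(params["pattern"])
```

The function only produces the query text and its parameter. To run it you would paste the SQL into the BigQuery console (substituting the pattern) or send it through a BigQuery client with `@pattern` bound as a string parameter. Pointing it at a sample table first keeps the bytes scanned, and so the quota used, small.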

On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.

Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:

With BigQuery everyone gets a terabyte every month to run queries. If you've never tried BigQuery before, follow these getting started instructions.
The contents table has all the non-binary files in GitHub that are less than 1 MB. It's a huge table, with more than 1.5 terabytes of data! This means the monthly terabyte for BigQuery queries won't last long if you want to query this table. To make your life easier, we've created extracts with only a sample of 10% of all files of the most popular projects, as well as another dataset with all the .go, .rb, .js, .php, .py, and .java code. Use them to make your free quota last!
If these tables are not enough, you can always create your own extracts (but you'll be billed for the respective storage). To do so, you could sign up for $300 in Google Cloud Platform credits. These credits could be used to store terabytes (and more) of data in BigQuery.
BigQuery makes it easy to join different datasets. How about ranking coding patterns by the number of stars their projects get? See a related post looking at the Hacker News effect on a project’s GitHub stars.
SQL is not enough? Learn how BigQuery allows you to run arbitrary JavaScript code inside SQL to enable a full range of possibilities.
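
The quota arithmetic behind the advice above is easy to check. The sketch below uses only the two figures from this post (1 terabyte of free queries per month, and a contents table of more than 1.5 terabytes). The function names are made up for illustration, and it assumes the simplest case, where a query reads the whole table; a query that touches fewer columns scans fewer bytes.

```python
FREE_QUOTA_TB = 1.0      # free query allowance per month, as stated in the post
CONTENTS_TABLE_TB = 1.5  # lower bound: the post says "more than 1.5 terabytes"


def full_scans_per_month(table_tb, quota_tb=FREE_QUOTA_TB):
    """How many whole-table scans fit in the monthly free quota."""
    return quota_tb / table_tb


def max_table_size_tb(scans_per_month, quota_tb=FREE_QUOTA_TB):
    """Largest table (in TB) you can fully scan that many times within the quota."""
    return quota_tb / scans_per_month


# Fewer than one full scan of the contents table fits in the free terabyte.
print(f"{full_scans_per_month(CONTENTS_TABLE_TB):.2f} full scans per month")

# To run ten whole-table queries a month, the extract must stay at or under this size.
print(f"{max_table_size_tb(10):.2f} TB per extract")
```

In other words, a single full scan of the contents table already exceeds the free terabyte, which is why the smaller extracts are the place to start.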

To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.

Stay curious!

More statistics from Google Summer of Code 2016

Tuesday, June 28, 2016

Google Summer of Code (GSoC) 2016 is officially at its halfway point. Mentors and students have just completed their midterm evaluations and it’s time for our second stats post. This time we take a closer look at our participating students.

First, we’d like to highlight the universities with the most student participants. Congratulations are due to the International Institute of Information Technology - Hyderabad for claiming the top spot for the third consecutive year!

Country	School	2016 Accepted Students	2015 Accepted Students	12 Year Total
India	International Institute of Information Technology - Hyderabad	50	62	252
Sri Lanka	University of Moratuwa	29	44	320
Romania	University POLITEHNICA of Bucharest	24	14	155
India	Birla Institute of Technology and Science Pilani, Goa Campus	22	15	110
India	Birla Institute of Technology and Science, Pilani Campus	22	18	116
India	Indian Institute of Technology, Bombay	18	13	75
India	Indian Institute of Technology, Kharagpur	15	8	92
India	Indian Institute of Technology, Roorkee	15	8	57
India	Indraprastha Institute of Information Technology Delhi	15	7	27
India	Amrita School of Engineering, Amrita University, Amritapuri Campus	13	5	33
India	Indian Institute of Technology, Guwahati	13	5	38
Cameroon	University of Buea	12	10	26
India	Delhi Technological University	12	9	60
India	Indian Institute of Technology BHU Varanasi	12	12	37
Germany	TU Munich	11	7	45

Next, we are proud to announce that 2016 marks the largest number of female GSoC participants to date: 12% of accepted students are female, up 2.2% from 2015. This is good progress, but we are certain we can do better in the future to diversify our program. The Google Open Source team will continue our outreach to many organizations, such as Grace Hopper and Black Girls Code, to increase this number even more in 2017. If you have any suggestions of organizations we should work with, please let us know in the comments.

Finally, each year we like to look at the majors of students. As expected, the most common area of study for our participants is Computer Science (approximately 78%), but this year we have a wide variety of studies including Linguistics, Law, Music Technology and Psychology. The majority of our students this year are undergraduates (67%), followed by Masters (23%) and then PhD students (9%).

Although reviewing GSoC statistics each year is great fun, we want to stress that being “first place” is not the point of the program. Our goal is to get more and more students involved in creating free and open source software. We hope Google Summer of Code encourages contributions to projects that have the potential to make a difference worldwide. Congratulations to the students from all over the globe and keep up the good work!

By Mary Radomile, Open Source Programs Office

Coding has begun for Google Summer of Code 2016

Monday, May 23, 2016

Today marks the start of coding for the 12th annual Google Summer of Code. With the community bonding period complete, about 1,200 students now begin 12 weeks of writing code for 178 different open source organizations.

We are excited to see the contributions this year’s students will make to the open source community.

For more information on important dates for the program please visit our timeline. Stay tuned as we will highlight some of the new mentoring organizations over the next few months.

Have a great summer and happy coding!

By Josh Simmons, Open Source Programs Office