Datasets

To participate in the challenge, you must create and submit a working, clickable, interactive data visualization that uses the Analytics for Hadoop service on IBM Bluemix. Your visualization must analyze one or more datasets from the curated list below, or other data that meets the dataset requirements described below. The total size of the datasets analyzed must be between 500MB and 7GB.

Dataset requirements:

  • The total size of the datasets analyzed must be between 500MB and 7GB. (Explore, join and find correlations from multiple datasets!)
  • Dataset content must be civic-based, meaning it must relate to a city or region, and/or to the residents or citizens of a city or region. Datasets do not have to be issued by a government entity as long as the dataset content is civic-based. Compliance with this requirement will be determined at the sole discretion of the Administrator and Poster.
  • Datasets must either be publicly available, or you must have a license to use the data.
  • Dataset contents must be in English. [Non-English datasets may be evaluated on a case-by-case basis. IBM has sole discretion to accept or deny such requests. Please submit requests to support@challengepost.com.]
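
The size requirement above can be checked before you upload anything. The following is a minimal sketch, not an official validation tool. It assumes that "MB" and "GB" are decimal units (the requirements do not say which convention applies), and the function names are hypothetical.

```python
import os

# Assumption: "MB" and "GB" are read as decimal units (10**6 and 10**9 bytes).
# The challenge text does not state which convention applies.
MIN_BYTES = 500 * 10**6
MAX_BYTES = 7 * 10**9


def total_size_bytes(paths):
    """Sum the on-disk size of every file listed in `paths`."""
    return sum(os.path.getsize(p) for p in paths)


def within_size_limits(total_bytes, lo=MIN_BYTES, hi=MAX_BYTES):
    """Return True when the combined size falls inside the required range."""
    return lo <= total_bytes <= hi
```

To use it, pass the local paths of every dataset file you plan to analyze to total_size_bytes and hand the result to within_size_limits. If your total sits close to either limit, confirm the unit convention with the organizers.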

  

Quickstart Questions and Data

Good news! We’ve taken the liberty of putting together a few datasets, each paired with a suggested question, in case you’re not sure what data to use or which question to ask.

Crime and socioeconomic indicators: Is there a relationship between the number of crimes in the city of Chicago and socioeconomic indicators (median income, poverty, education, etc.)? Data: City of Chicago Crime Data, Median household income by county, Socioeconomic Indicators

Total 311 complaints and median household income: Can we predict the total 311 complaints per zip code from the median household income in San Francisco? Data: 311 complaints in San Francisco, Median household income by county

Department of Homeless Services (DHS) and unemployment: Is there a correlation between the total number of persons sleeping in shelters each day in NYC and the unemployment rate? Data: DHS daily data, Unemployment Data

Red light violations and crime rate: Can we predict which corners have high red light violations using the crime rate data of each zip code (or county)? Data: Red light violation in Chicago, City of Chicago Crime Data

Pollution, traffic and population of cities: Can a city's population and traffic data explain why it is highly polluted? Data: Pollution, Traffic, Population

US population and GMOs: What correlations can you find between US population estimates and the adoption of genetically engineered crops in the U.S.? Data: US Census Population Estimates, ERS GMO crops

Obesity and food availability: How does commodity availability and consumption in the United States (fruits, vegetables, meat, grains, etc.) relate to obesity? Data: Obesity Data from NY, USDA ERS Commodity Consumption Data
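
Most of the questions above come down to joining two datasets on a shared key (zip code, county, date) and measuring how strongly two columns move together. The sketch below shows that idea in plain Python with a Pearson correlation. It is only an illustration under stated assumptions: the function names are hypothetical, the zip codes and values are made-up placeholders rather than real data, and at challenge-scale data sizes you would run the equivalent logic on the Hadoop service rather than in memory.

```python
import math


def join_on_key(left, right):
    """Inner-join two {key: value} mappings and return paired value lists."""
    shared = sorted(set(left) & set(right))
    return [left[k] for k in shared], [right[k] for k in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values keyed by hypothetical zip codes; not real data.
complaints_by_zip = {"00001": 120, "00002": 90, "00003": 60, "00004": 30}
income_by_zip = {"00001": 40000, "00002": 50000, "00003": 60000, "00005": 70000}

# Only zip codes present in both datasets survive the join.
complaints, incomes = join_on_key(complaints_by_zip, income_by_zip)
r = pearson(complaints, incomes)
print(f"joined rows: {len(complaints)}, Pearson r: {r:.3f}")
```

A value of r near +1 or -1 suggests a strong linear relationship and a value near 0 suggests a weak one. A strong correlation on its own does not show that one indicator causes the other.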

 

IBM-Curated Datasets

You aren’t required to use an IBM-curated dataset, but here are a few options to get you started! Please note that additional datasets may be added throughout the competition.

 

Other open data portals to check out:

 

Data Analysis Prompts

Not sure what question to ask when analyzing your data? The judging criteria include the relevance and potential social impact of the civic insights gathered. When thinking about what your chosen data could reveal, consider what it could tell you about:

  • Making cities smarter
  • Local economics
  • Residents’ quality of life
  • Crime
  • Public health
  • Poverty
  • Education
  • Local environment or “green initiatives”
  • Social equality
  • Work and joblessness
  • Traffic patterns, violations, congestion, and fuel saving

 

Dig a Little Deeper

Still having trouble wrapping your mind around what questions to ask of your chosen datasets? Feel free to ask some big questions!

The Building Blocks of a Sustainable Community
Correlating the quality of an area's infrastructure with the economic well-being of the area could help organizations better understand what is needed to support long-term economic sustainability. From roads to fresh water access to market access for goods, there are many factors to consider. Look beyond just numbers to explore the quality of each major infrastructure element and see what you can find.

Economic Impact of Care Giving
Study demographic and economic data about a metropolitan area and/or region to determine the current and future impact of care giving. Along with household data, data about geriatric care facilities, access to geriatric medical help, and support organizations could be used. Social media could be mined to look at the number of people talking about care giving challenges over time. Specific issues such as Alzheimer's or other ailments could also be incorporated into the project.

 

More Questions?

Want to discuss data or get help vetting your analysis idea with a Hadoop expert? Post your questions on the Discussion Board. For questions about the Big Data for Social Good Challenge process or rules, email Support@ChallengePost.com.