Crash Course in The Data Mine

Our team knows that we’ve missed things in this document. Please feel free to add to it or send suggestions to datamine@purdue.edu

TDM Structure:

Students work on two types of projects. Corporate Partners (CRP) projects, are experiential learning projects that are run by employees at the different companies who meet with the students (mentors).

The second project type are called sponsored research (SPS) projects. These are made up of teams who are hand-picked and work on more advanced goals with their mentors. TDM only has a few SPS projects at any given time. We don’t plan to have an SPS project as part of the Deaf-PODS grant.

TDM is made up of 3 teams:

  • Corporate Partners (CRP): Run all of the relationships with the companies and mentors. Lead TA recruitment and training for the CRP projects. Help to outline SPS contracts.

  • Academic Programs: Coordinate all of the information for classes, classrooms, and registration with Purdue. Run the living learning community (LLC) organizations. Execute the Indiana Data Mine (IDM) and National Data Mine (NDMN) expansion efforts.

  • Data Science (DS): Setup and administer the technical environments that the students use. Advise student teams on questions involving code or data analytics.

The student teams in the summer program will each have a Microsoft Teams environment setup for them. This can be confusing because we will often say “on Teams” or “on Anvil” when talking about the 2 environments that the students primarily use.

Technical Environments:

Note: it’s OK not to understand each of these terms in-depth. Our goal is to help clarify common terms that you will see when interpreting for students.

Students work in the Anvil environment on Purdue’s high-performance computing (HPC) cluster. The system is administered by Research Computing (RCAC) and it uses a front-end system called ACCESS. If students encounter issues with their ACCESS ID, they may sometimes need to submit a ticket to RCAC for help.

Student teams primarily code in two languages: R and Python. They run the code in a Jupyter Notebook on Anvil or in a code development environment on their computer. A very common code development environment offered by Microsoft is called VSCode. Both R and Python use pre-built groups of code that help make certain functions easier. In R these are called Libraries while in Python they are known as Packages.

In a Jupyter Notebook on Anvil the students utilize objects called kernels to run their Python or R code. The most common kernel is called “f2022-s2023”. TDM team will sometimes need to double check the kernels that the students are using to ensure that their code runs in the correct way.

When working in Anvil the students have two primary locations for their code. One is their home directory which is setup for each person and is secured too only them. The other is their project directory. Which is setup by TDM and is secured to all students on the project team. It is often better for students to store code in their project directory. That way other students can use it as well.

When TDM asks about the location of their code we will often ask for the path. This is the full fille path that we can use to navigate to the code and will often look something like “/anvil/projects/tdm/corporate/david”. Spelling and the direction of the slashes will make a difference when working in the Anvil environment.

Teams can sometimes utilize database resources to store and query information about their projects. Most often, TDM will utilize a PostgreSQL database via a took called SQLite. These databases can often work with files that use comma separated values (CSV).

Some teams will also work with dashboarding tools. These tools allow for easy any quick visualization of data from many different sources. The most common dashboarding tools are called Tableau and Power BI.

Another area of research that some teams will utilize is called geospatial information systems (GIS). This focuses on maps, geographies, and analytics within those areas. The most common tool in this space is called ArcGIS and is owned by a company called ESRI.

Analytics Terms:

Many terms focus on data analytics in its different forms. Machine learning, data science, and predictive analytics often focus on using data and statistics to make business decisions. Business intelligence teams also utilize data, but tend to focus more on dashboards and reporting. Deep learning is a different form of machine learning where predictive models utilize huge amounts of data and unique model types in order to try to make better predictions or suggestions.

There are thousands of different analytics terms, and we won’t try to cover all of them here. However, one topic that tends to come up a lot is natural language processing (NLP). NLP is a research field that focuses on using text to understand things like sentiment, topics, or recommendations.

TDM Quick Reference:

  • TDM - The Data Mine

  • CRP - Corporate Partners (either the student project or the team)

  • Mentors - Employees at the different companies who lead the student teams

  • SPS - Sponsored research projects

  • LLC - Living Learning Community

  • IDM - Indiana Data Mine

  • NDMN - National Data Mine

  • Microsoft Teams - Chat and file sharing application

  • Anvil - Supercomputer at Purdue

  • HPC - High-performance computing cluster

  • RCAC - Research Computing team

  • ACCESS - User portal for Anvil (often students reference their ACCESS ID)

  • R - coding language

  • Python - coding language

  • Jupyter Notebook – Application host on Anvil that lets students write and run code

  • VSCode - Free code development environment

  • Libraries - Groups of useful code in R that students can use

  • Packages - Groups of useful code in Python that students can use

  • Kernels - Environments to run Python or R code

  • F2022-s2023 - The most common kernel for students to use

  • Home directory - Location on Anvil that is secured to each user only

  • Project directory - Shared location on Anvil that the full student team can use

  • Path - How to find a file on Anvil: /anvil/projects/tdm/corporate/david.txt

  • Database - Application to store and retrieve data

  • PostgreSQL - Type of database that TDM uses

  • SQLite - Python package that can be used to setup a PostgreSQL database

  • CSV - Comma Separated Value (common file type)

  • Dashboarding tools - Easy to use tools that focus on data visualization and reporting

  • Tableau - Common dashboarding tool

  • Power BI - Common dashboarding tool

  • GIS - Geospatial Information Systems (type of analysis focused on geographical relationships)

  • ArcGIS - Common application for GIS research

  • ESRI - Company that makes ArcGIS

  • Machine learning, data science, predictive analytics - similar names for the practice of using data for research and decision making

  • Business intelligence - Using data for visualization and reporting

  • Deep learning - Using huge amounts of data and unique analytics models for increased accuracy in prediction

  • NLP - Natural Language Process (using text in analysis)