Data Engineering Fundamentals Capture the Flag
This class is a five-day Capture the Flag event for getting to know the basics of systems usage, specifically Linux, git, and ssh. There is also a large section on Python, with an emphasis on data science scripting practices using numpy and pandas in Jupyter notebooks.
This is a self-guided exercise with resources and questions on this site. You, the participant, must look for the answer to the questions through reading documentation, discussing with others, and trying things. Try to avoid searching for answers online in a search engine; the answers can almost always be found in documentation.
Answers can be submitted through an API with the CTF server. Questions will be made available over the course of 5 sessions. Responding correctly to a question gives 1 point, and an additional 0.5 points are awarded for being the first to submit the correct answer to a question. That half point is the flag - be the first to capture it!
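As a quick illustration of the scoring rule, here is a minimal Python sketch. The function name `total_score` and the example counts are hypothetical; only the 1 point per correct answer and the 0.5 point bonus come from the rules above.

```python
def total_score(correct_answers: int, flags_captured: int) -> float:
    """Return a participant's score under the rules described above.

    Each correct answer is worth 1 point; each flag (being the first to
    answer a question correctly) adds 0.5 points.
    """
    if flags_captured > correct_answers:
        raise ValueError("a flag requires a correct answer to the same question")
    return 1.0 * correct_answers + 0.5 * flags_captured


# Hypothetical participant: 10 correct answers, first on 3 of them.
print(total_score(10, 3))  # 11.5
```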
If you're speeding through the questions, consider helping others learn the material. Depending on your background, you may have varied experience with these tools. Get to know the other participants by helping them capture a flag too.
Linux
Linux is an open-source operating system based on Unix. It is a standard choice for development and is the dominant operating system for web servers, cloud computing, and high performance computing, running on 80% of global public servers. There are many different distributions, but they share a common set of tools, notably GNU software. Android, which is built on the Linux kernel, runs on 73% of all mobile devices, so you might be a Linux user already without realizing it!
You most likely don't use Linux as the operating system of your personal computer, however. If you are using one of the 2.5% of personal computers with Linux, you can skip straight to the Submission section.
macOS is also based on Unix, so if you're using macOS, most things should work just as in Linux! A few commands will differ from the course instructions, and the questions will always refer to Linux resources, such as the Linux documentation. It is highly recommended to install Homebrew (https://brew.sh/), which allows package installation from the command line.
Installation on Windows
The easiest way to use Linux on Windows is through the Windows Subsystem for Linux. Installation instructions are here: https://docs.microsoft.com/en-us/windows/wsl/install. Make sure to follow all instructions carefully. If asked to join a "Windows Insiders Program", ignore this. By default, this installs Ubuntu, which is good for this systems class and for all of SDD.
The WSL is similar to a virtual machine inside of Windows, but it integrates with some existing components of Windows. You can access your Windows files from Linux at /mnt/, but you should make sure you're familiar with Linux first.
Submission
All questions will be posted to the CTF github repository. In the second class, we will use git to download this repository locally, and it will be used to host the files and data needed to respond to questions.
The CTF server's IP address is 34.155.94.97. You can see a leaderboard there and it is the address for submitting answers. The first way we'll look at submitting answers is with curl in Linux.
Once you have a Unix-type environment, either native Linux or macOS, or through the WSL, you're ready to submit to the CTF. You will use the curl command; you can verify that you have curl by running which curl in the command line. curl is a tool for transferring data from or to a server. How do you know that? By checking the documentation of curl using man curl. Try it out!
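The same check can be done from Python. The sketch below uses the standard library's `shutil.which`, which searches the directories on your PATH in much the same way as the `which` command. The helper name `find_tool` is only an illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


curl_path = find_tool("curl")
if curl_path is None:
    print("curl was not found on PATH; install it with your package manager")
else:
    print(f"curl is available at {curl_path}")
```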
To respond to a question, send a POST request containing the question number as number, your answer as answer, and your username as user (your username should be your ISAE login, but you can also check it on the leaderboard). For example, the first question asks where the curl executable is (hint: use which). Then use curl:
curl -X POST 'http://34.155.94.97/' \
-d 'number=1' \
-d 'answer=your answer here' \
-d 'user=your username here'
Some of the questions will require access to files called file_a.txt, file_b.txt, and file_c.txt. Those are available on the CTF git repository.
You are ready to start answering questions! If you don't know an answer, check the resources below and read documentation using man.
You can see which questions you have answered by sending a GET request:
curl 'http://34.155.94.97/user/d.wilson'
You can also see which questions still have flags (the bonus points for being the first to answer correctly) with a GET request:
curl 'http://34.155.94.97/answers/'
Python Submission
Note that you can use the requests library to submit responses:
import requests

# Form fields expected by the CTF server
data = {
    "number": "1",       # question number
    "answer": "",        # your answer
    "user": "d.wilson",  # your username
}
r = requests.post("http://34.155.94.97/", data=data)
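If the requests library is not installed, the same requests can be built with Python's standard library. The sketch below is one possible approach, not the course's reference client. The function names are hypothetical, and decoding the response as UTF-8 text is an assumption about what the server returns. Only the server address, the form field names, and the /user/ path come from the examples above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://34.155.94.97"  # CTF server address given above


def build_submission(number, answer, user):
    """Build (but do not send) the POST request for one answer."""
    fields = {"number": str(number), "answer": answer, "user": user}
    body = urlencode(fields).encode("utf-8")  # form-encoded request body
    return Request(f"{BASE_URL}/", data=body, method="POST")


def user_status_url(user):
    """URL listing the questions a user has answered (sent as a GET)."""
    return f"{BASE_URL}/user/{quote(user)}"


def send(request_or_url):
    """Send a prepared request or a plain URL and return the body as text.

    Assumption: the server replies with UTF-8 text.
    """
    with urlopen(request_or_url, timeout=10) as response:
        return response.read().decode("utf-8")


# Inspect the request before sending anything over the network.
req = build_submission(1, "your answer here", "d.wilson")
print(req.get_method(), req.full_url)
print(req.data)
```

Calling `send(req)` submits the answer, and `send(user_status_url("d.wilson"))` fetches that user's answered questions. Building the request separately from sending it makes it easy to check what will be submitted first.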
Bash Resources
- ISAE class on CLI, Linux, and Bash
- Shell class from MIT
- Bash exercises
- More bash exercises
- Short exercises in regular expressions
Linux tools
Now that you're an expert in Linux, let's quickly look at some useful tools. You may need to install some of these using apt, brew, yum, pacman, or whichever package manager you use. Linux comes with many programs installed by default, especially in distributions like Ubuntu; however, the tools in this section will be more useful than the base Linux tools. We'll cover four: apt for package management, top for system monitoring, tmux for terminal management, and vim for file editing. There are great alternatives to all of these programs, but it is worth being familiar with these four.
Linux Resources
- apt manual
- Alternatives to top
- Guide to tmux
- tmux cheat sheet
- Editors from MIT class
- Vim adventures
- tldr, short man pages
Git
Git is a version control system used worldwide for maintaining code, documents, video games, and much more. It has seen wide adoption with servers like Github and Gitlab while being an open-source tool that anyone can install as a client or server. In this class, we will look at repositories hosted on Github, but git is much larger than that and many organizations like ISAE have their own private git server.
Installation
If you're using Ubuntu, chances are you already have git. If not, simply do:
sudo apt install git
These questions concern two repositories: the Machine Learning class in SDD (https://github.com/SupaeroDataScience/machine-learning) and the Seaborn library, a popular graphing library (https://github.com/mwaskom/seaborn). You will need to download both repositories. First choose a directory to host them in, for example ~/SDD/FSD312:
mkdir -p ~/SDD/FSD312
cd ~/SDD/FSD312
and then download them using git clone:
git clone https://github.com/SupaeroDataScience/machine-learning.git
git clone https://github.com/mwaskom/seaborn.git
The commit for all questions on the seaborn repository is 1e6739. Check it out from inside the seaborn directory:
git checkout 1e6739
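After the checkout, you may want to confirm that the repository really is at the expected commit. The Python sketch below runs `git rev-parse HEAD` and compares the result with the abbreviated hash. The function names and the repository path in the usage comment are assumptions based on the directory layout suggested above.

```python
import subprocess


def current_commit(repo_dir):
    """Return the full hash of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git reports an error
    )
    return result.stdout.strip()


def is_at_commit(full_hash, prefix):
    """Check whether a full commit hash starts with an abbreviated hash."""
    return full_hash.lower().startswith(prefix.lower())


# Example usage (assumes the repository was cloned into ~/SDD/FSD312):
#   import os
#   repo = os.path.expanduser("~/SDD/FSD312/seaborn")
#   print(is_at_commit(current_commit(repo), "1e6739"))
```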
Git Resources
- Git course
- Introduction to github
- Github video course
- Learn git branching
- Git SCM book
- Git cheat sheet
Git Exercise
In order to access the server for the next parts of the CTF, you will need to provide your public ssh key. The SSH section has references explaining public-key cryptography, but in general you will make a key pair with a private side and public side. You will give the public side to services like this class or Github to perform secure communication, keeping your private key secret to prove that it is you.
First, start by making a key pair and uploading your public key to Github. This will allow you to use SSH to push, instead of using a personal access token. Create an SSH key and add it to your Github account.
Then, we will use git as a way for you to transfer your public key to the class. We could use another means, like a USB key, email, or a very large QR code, but for this exercise we will use git. First make a fork of the https://github.com/SupaeroDataScience/ctf2024 repository. Then, make a pull request with your key as a file in keys/. Please name your key with your name, like the example keys/dennis-wilson.pub. Be sure to upload only your public key. Do not ever upload your private key to public servers.
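Before committing a key file, a quick sanity check can help avoid uploading the wrong half of the pair. The sketch below is a heuristic, not a full validator. It relies on the fact that an OpenSSH public key is a single line of the form "type base64-data comment", while a private key is a multi-line file beginning with "-----BEGIN". The function name and the list of key types are illustrative, and the list is not exhaustive.

```python
PUBLIC_KEY_TYPES = (
    "ssh-ed25519",
    "ssh-rsa",
    "ecdsa-sha2-nistp256",
    "ecdsa-sha2-nistp384",
    "ecdsa-sha2-nistp521",
)


def looks_like_public_key(text):
    """Heuristic check that text is a single-line OpenSSH public key."""
    stripped = text.strip()
    # Private keys are PEM-style blocks; reject them outright.
    if stripped.startswith("-----BEGIN") or "PRIVATE KEY" in stripped:
        return False
    lines = stripped.splitlines()
    if len(lines) != 1:
        return False
    parts = lines[0].split()
    # Expect at least "<type> <base64 data>", optionally followed by a comment.
    return len(parts) >= 2 and parts[0] in PUBLIC_KEY_TYPES
```

For example, you could read the file you are about to add under keys/ and refuse to commit it if `looks_like_public_key` returns False.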
Once your key is in the repository, you are ready for the SSH and Python portions of the CTF.
SSH
For the ssh section, you will connect to the CTF server to answer questions about the remote environment. Your public key must be uploaded to the git repository above to get access to the server. You will use the corresponding private key to access the server. Your user on the server is ctf and the IP is the same as the CTF webserver: 34.155.94.97.
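As a small sketch of how the user name and address fit together, the Python snippet below builds the ssh command line. The helper name `ssh_command` is hypothetical. The `-i` option is ssh's standard flag for choosing a private key (identity file) and is only needed if your key is not in a default location.

```python
CTF_USER = "ctf"           # user name given above
CTF_HOST = "34.155.94.97"  # server address given above


def ssh_command(key_path=None):
    """Build the argument list for connecting to the CTF server with ssh."""
    command = ["ssh"]
    if key_path is not None:
        command += ["-i", key_path]  # select a specific private key file
    command.append(f"{CTF_USER}@{CTF_HOST}")
    return command


# Print the command so it can be pasted into a terminal.
print(" ".join(ssh_command()))
```

The returned list can also be passed to `subprocess.run` to open the session directly from Python.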
Please note that ISAE-EDU and ethernet block ssh to most servers, including this one and github.com. In order to ssh to the server, you will need to either use the eduroam network or a different network like a mobile hotspot.
SSH Resources
Python
An overview and reminder of the python programming language, with a focus on numpy and pandas manipulation using Jupyter.
Installation
You most likely have python installed on your Linux system, but it is worthwhile to make sure and to upgrade. Python 3.8, 3.9, or 3.10 are all supported.
sudo apt install python3
It is highly recommended to make a virtual environment to manage your python packages. There are three main libraries for virtual environments:
- Virtualenv
- Conda
- Pipenv
Virtualenv is recommended for new users on Linux. Conda, or the platform Anaconda, can be useful on Windows as many packages are built specifically for Windows, but not all packages are available via conda. Pipenv is an exciting project aimed at Python developers, but it adds additional complexity.
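For a first experiment, Python also ships its own `venv` module in the standard library. It is a different tool from the three named above, but it creates the same kind of isolated environment. The sketch below is a minimal example; the function name and the example path in the comment are hypothetical.

```python
import venv
from pathlib import Path


def create_environment(path, with_pip=True):
    """Create a virtual environment at path and return its location."""
    env_dir = Path(path).expanduser()
    # with_pip=True also installs pip into the new environment.
    venv.create(str(env_dir), with_pip=with_pip)
    return env_dir


# Example usage (hypothetical location):
#   create_environment("~/SDD/env")
```

On Linux, the new environment is activated in a shell with `source <path>/bin/activate`, after which `pip install` places packages inside it rather than system-wide.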
Once you have a virtual environment created, please install the following packages for the rest of the Seminars class:
numpy
pandas
scipy
matplotlib
jupyter
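To check that the installation worked, the sketch below lists any of these five packages that cannot be imported from the active environment. It uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is illustrative. It works here because each of these five packages is imported under the same name it is installed with, which is not true of every package.

```python
from importlib.util import find_spec

REQUIRED = ["numpy", "pandas", "scipy", "matplotlib", "jupyter"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are installed.")
```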
The following packages will also be used in SDD:
seaborn
scikit-learn
keras
torch
geos
graphviz
nltk
networkx
statsmodels
pyspark
cython
cma
gym
Jupyter
Jupyter (named after the three original languages in the project: Julia, Python, and R) is a way to use and develop code interactively in the browser. Once you've installed the jupyter package, you can start a Jupyter notebook by running jupyter notebook.
For Windows users, you can run Jupyter in the WSL. As explained in this blog post, you simply need to execute jupyter notebook --no-browser on the WSL and then copy and paste the URL and token generated into a Windows browser.
Some additional packages for improving Jupyter are nbopen, nbdime, and RISE. Be sure to read their documentation before installing to verify that they are relevant to you.
Python Resources
- Python 3 Documentation
- Pip documentation
- Pandas cheatsheet
- Stanford Python and Numpy tutorial
- Python seminar
- Google Colab: Jupyter notebooks on the cloud
- Binder: Also Jupyter notebooks on the cloud, not hosted by Google