Running large-scale deep learning on AWS is a cheap and effective way to learn and to develop models. For a small amount of money you can rent machines with tens of gigabytes of memory, dozens of CPU cores, and multiple GPUs.
The commands below are especially useful if you are new to EC2 or the Linux command line and want to run deep learning scripts in the cloud.
The main contents of this article include:
1) Copy data between the local machine and EC2 instances
2) Run scripts safely for days, weeks, or months at a time
3) Monitor the performance of the process, the system, and the GPU
Note: all commands are run from a Unix-like environment (Linux, OS X, or Cygwin).
Assuming an AWS EC2 instance is already up and running, the examples below use the following settings for convenience:
1) The IP address of the EC2 server is 54.218.86.47
2) The user name is ec2-user
3) The SSH key is located in ~/.ssh/ and the file name is aws-keypair.pem;
4) The scripts being run are Python scripts
1. Log in to the server Before doing anything else, log in to the target server. This is done with the SSH command, using the SSH key stored in ~/.ssh/ under a meaningful file name such as aws-keypair.pem. Log in to the EC2 host with the following command, adjusting the address and username as needed:
ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47
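If SSH refuses the key because the file's permissions are too open, restricting the key file usually resolves it (same path as assumed above):
chmod 600 ~/.ssh/aws-keypair.pem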
2. Copy files to the server Use the SCP command to copy local files to the server. For example, to copy the script.py file to the EC2 server:
scp -i ~/.ssh/aws-keypair.pem script.py ec2-user@54.218.86.47:~/
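To upload an entire directory of data rather than a single file, the -r flag of scp works the same way; a sketch, where the data/ directory name is only an example:
scp -r -i ~/.ssh/aws-keypair.pem data/ ec2-user@54.218.86.47:~/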
3. Run the script as a background process Execute the script in the background on the server so that it ignores hangup signals, ignores standard input, and redirects all output and errors to a log file. This is essential for deep learning models that need long runs.
nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &
In this command, both script.py and script.py.log are located in the /home/ec2-user/ directory. See the references (for example, the Wikipedia articles) for a detailed introduction to nohup and output redirection.
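If a long run may need to be stopped later, it can help to record the process ID when the script is started; a small sketch using the same file names as above:
nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &
echo $! > /home/ec2-user/script.pid
# later, to stop the run:
kill $(cat /home/ec2-user/script.pid)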
4. Execute the script on a specific GPU of the server If the EC2 instance supports it, it is worth running multiple scripts at the same time. For example, if the instance has 4 GPUs, you can run a separate script on each GPU. The sample command is as follows:
CUDA_VISIBLE_DEVICES=0 nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &
If there are 4 GPUs, CUDA_VISIBLE_DEVICES can be set to any value from 0 to 3. This works with Keras on the TensorFlow backend; it has not been tested with Theano.
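As a sketch, four hypothetical scripts could be launched in parallel, one per GPU, each writing to its own log file:
CUDA_VISIBLE_DEVICES=0 nohup python /home/ec2-user/script1.py >/home/ec2-user/script1.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup python /home/ec2-user/script2.py >/home/ec2-user/script2.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=2 nohup python /home/ec2-user/script3.py >/home/ec2-user/script3.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=3 nohup python /home/ec2-user/script4.py >/home/ec2-user/script4.py.log </dev/null 2>&1 &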
5. Monitor the output of the script If the script prints a score or other algorithm results, it is worth monitoring that output in real time. An example is as follows:
tail -f script.py.log
Unfortunately, AWS may close the connection when nothing appears on the screen for a while, so it is often better to use:
Watch "tail script.py.log"
Sometimes the standard output of the Python script does not show up in the log; it is not clear whether this is a Python problem or an EC2 issue.
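One common cause is Python buffering its standard output; if that is the problem, running the interpreter unbuffered with the -u flag usually helps:
nohup python -u /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &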
6. Monitor system and process performance It makes sense to monitor the performance of the EC2 system, especially how much memory is in use and how much is left. For example:
top -M
Or for a specific process ID (PID):
top -p PID -M
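For a quick overview of total, used, and free memory in megabytes, the free command is another option:
free -m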
7. Monitor GPU performance If you execute multiple scripts in parallel on the GPUs, it is a good idea to watch the performance and utilization of each GPU. For example:
Watch "nvidia-smi"
8. Check which scripts are still running It is also worth keeping track of which scripts are still running; in general, the terminal will stay open with:
watch "ps -ef | grep python"
9. Edit files on the server Editing files directly on the server is generally not recommended, unless of course you are familiar with vi:
vi ~/script.py
The usage of vi is not covered here.
10. Download files from the server This is the opposite of uploading a file; for example, to download the png files from the server to the current local directory:
scp -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47:~/*.png .
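If there are many result files, rsync over the same SSH key can be a convenient alternative; a sketch, assuming the results are collected in a hypothetical ~/results/ directory on the server:
rsync -avz -e "ssh -i ~/.ssh/aws-keypair.pem" ec2-user@54.218.86.47:~/results/ ./results/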
Points to note
If you want to run multiple scripts at the same time, it's best to use EC2 with multiple GPUs.
Best to write scripts locally
Output the execution result to a file and download it to the local for analysis
Use the watch command to keep the terminal running
Run commands on the server from the local machine
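For the last point, a single command can be run on the server without opening an interactive session, for example checking the end of the log file (using the address and key assumed above):
ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47 "tail /home/ec2-user/script.py.log"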