#archiveteam-bs 2018-06-21,Thu


Time Nickname Message
00:10 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
00:10 🔗 Mateon1 has joined #archiveteam-bs
00:11 🔗 Jens has quit IRC (Remote host closed the connection)
00:12 🔗 Jens has joined #archiveteam-bs
00:18 🔗 BlueMax has joined #archiveteam-bs
01:42 🔗 godane SketchCow: I'm starting to upload Simply K-Pop episodes I got from YouTube
01:51 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:54 🔗 Sk1d has joined #archiveteam-bs
02:05 🔗 ta9le has quit IRC (Quit: Connection closed for inactivity)
02:56 🔗 adinbied has joined #archiveteam-bs
03:01 🔗 ivan you can read tasks from stdin with something like for line in sys.stdin:
03:01 🔗 ivan you probably want to use something that can write WARCs
03:01 🔗 adinbied @ivan, Ah - didn't know that. Is there any way to do multiple concurrent GET requests or something?
03:01 🔗 ivan just launch more processes working on different tasks
03:01 🔗 ivan also there's a command `parallel` for easily launching subprocesses
03:02 🔗 ivan by default it works like xargs
03:03 🔗 ivan if you make it work that way, perhaps read the ids with sys.argv[1:]
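A minimal sketch of the two approaches ivan describes: reading tasks line by line from stdin, or (xargs/parallel style) taking ids from the command line via sys.argv[1:]. The script name and print statements are illustrative only, not from the conversation.

    # sketch only - pick one of the two input styles
    import sys

    # stdin style:  printf '1\n2\n3\n' | python3 tasks.py
    for line in sys.stdin:
        task_id = line.strip()
        if task_id:
            print("task from stdin:", task_id)

    # argv style (what parallel/xargs pass):  python3 tasks.py 1 2 3
    for task_id in sys.argv[1:]:
        print("task from argv:", task_id)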
03:03 🔗 adinbied Given the API calls are just returning plain text data, I decided not to go with WARC so that it could be processed more easily (importing into a SQL DB or Google BigTables or something). Would there be any overhead/downsides to using WARC?
03:04 🔗 ivan WARC makes sense if you think there might be any value in having the responses available in wayback
03:04 🔗 wp494 has quit IRC (Ping timeout: 633 seconds)
03:04 🔗 wp494 has joined #archiveteam-bs
03:04 🔗 ivan it also captures request and response headers that might be valuable to someone looking for some evidence that the server did actually serve that at that time
03:05 🔗 ivan other than that not much value I can think of
03:05 🔗 adinbied Sorry, I'm still learning the ins and outs of Python - could you either link to some documentation or give an example of how you might go about using the subprocesses and sys.argv?
03:07 🔗 adinbied Yeah, fair point. My main concern is having my API key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in the public
03:08 🔗 ivan OK, no WARC then
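For reference, ivan's earlier "something that can write WARCs" could look roughly like this with the warcio library; this is a sketch of one possible approach, not what adinbied used, and the URL is a placeholder.

    # sketch using warcio's capture_http helper (assumes: pip install warcio requests)
    from warcio.capture_http import capture_http
    import requests  # note: requests must be imported after capture_http for patching to work

    with capture_http('api-responses.warc.gz'):
        # the request and response, including headers, are recorded to the WARC file
        requests.get('https://example.com/api?id=1')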
03:11 🔗 ivan 12 processes churning through a max of 100 ids per process:
03:11 🔗 ivan for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
03:11 🔗 ivan you can of course generate lists of ids and run them on different machines entirely
03:11 🔗 ivan for i in {1..1000}; do echo $i; done > ids
03:11 🔗 ivan cat ids | parallel ...
03:13 🔗 ivan parallel comes from the parallel package on debian (not from moreutils)
03:16 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:16 🔗 ivan to be clear in the python program it would be something like: for id in sys.argv[1:]: to do something with each id
03:17 🔗 adinbied OK, I'm still not understanding how I would implement the parallel command with a scraping python script.
03:17 🔗 adinbied Ah, OK - that makes sense
03:23 🔗 adinbied So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
03:23 🔗 adinbied NVM
03:24 🔗 adinbied I'm dumb
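What adinbied presumably realized: iterate over sys.argv[1:] and build one URL per id, rather than concatenating the whole list. baseurl and afterurl are the names from the log; their values here are made up.

    import sys
    import requests

    baseurl = "https://example.com/api?id="    # hypothetical value
    afterurl = "&format=json"                  # hypothetical value

    for set_id in sys.argv[1:]:                # each argv element is already a string
        fullurl = baseurl + set_id + afterurl  # one URL per id, not str(sys.argv[1:])
        response = requests.get(fullurl)
        # ... process response.text here ...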
03:28 🔗 odemg has joined #archiveteam-bs
03:33 🔗 adinbied OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
03:34 🔗 adinbied and I'm calling it with for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
03:34 🔗 adinbied Yet no requests are being made....
03:36 🔗 adinbied @ivan, sorry for pestering you with questions, just trying to learn from all of this
03:48 🔗 * ivan looks
03:49 🔗 ivan adinbied: what does the first line of parallel --version say?
03:50 🔗 adinbied GNU parallel 20141022
03:51 🔗 ivan can you print(sys.argv[1:]) in your program and see what that says?
03:51 🔗 ivan I guess you saw no errors from python?
03:51 🔗 adinbied No, no errors with python
03:53 🔗 adinbied Here's the output: https://pastebin.com/s7LVMkn2
03:54 🔗 ivan assuming you put that print inside the body of the for loop, I think the parallel and the sys.argv[1:] are working fine
03:56 🔗 adinbied Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly.
03:56 🔗 ivan try some print debugging to see what requests is doing
03:56 🔗 ivan for example printing non-200 status codes
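The sort of print debugging ivan means, as a sketch; fullurl and the requests call are assumed from the earlier discussion, and the URL is a placeholder.

    import sys
    import requests

    fullurl = "https://example.com/api?id=1"  # placeholder; built per id as in the loop above
    response = requests.get(fullurl)
    if response.status_code != 200:
        # surface failures instead of skipping them silently
        print("non-200 for", fullurl, "->", response.status_code, file=sys.stderr)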
04:00 🔗 adinbied Ah, I'm an idiot. The set IDs don't start until 173... Derp. Thanks for your help!
04:01 🔗 ivan I figured it might be that :-)
04:03 🔗 adinbied Are there any optimizations I should make with the -j and -n parameters of parallel?
04:04 🔗 adinbied I.e., is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
04:05 🔗 ReimuHaku has quit IRC (Ping timeout: 252 seconds)
04:06 🔗 ReimuHaku has joined #archiveteam-bs
04:07 🔗 ivan adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
04:08 🔗 ivan then hope you don't get banned for sending that many requests per second
04:09 🔗 ivan if you cause noticeable slowdown someone is going to look at a log
04:09 🔗 ivan (sometimes)
04:12 🔗 ivan I mean you can try a -j as high as you want; you might want to raise that -n 100 too to avoid pointless process churn
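A back-of-the-envelope version of the calculation ivan suggests; every number here is an assumption for illustration, not taken from the actual job.

    import math

    total_ids = 500          # assumed size of the id list
    secs_per_request = 4.5   # assumed, based on the ~4.5 s adinbied reports
    time_budget = 300        # assumed: finish within 5 minutes

    needed_jobs = math.ceil(total_ids * secs_per_request / time_budget)
    print(needed_jobs)       # -> 8 parallel processes for these numbers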
04:16 🔗 ivan TIL https://archive.softwareheritage.org/
04:18 🔗 adinbied Huh, after some trial and error I found that it goes faster with a lower -n value
04:20 🔗 adinbied Seems the sweet spot is -j 12 -n 50. Thanks for your help!
04:21 🔗 ivan that's bizarre, a lower -n would generally increase time because you have to wait for more Python processes to start and load a bunch of bytecode
04:22 🔗 ivan either a fluke or maybe the server is slowing down after many requests on a single connection
04:24 🔗 adinbied When a higher -n value is used (at least for me) the max jobs to run decreases. At -n 50 it's at 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and is loads faster than before
04:27 🔗 ivan how many ids are you giving parallel?
04:28 🔗 adinbied 500 in this case
04:28 🔗 ivan well 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the maximum -j 12
04:29 🔗 adinbied Ahhh, that makes more sense.
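The job-count arithmetic from ivan's last message, spelled out with the numbers from the conversation:

    total_ids = 500
    ids_per_process = 100                       # parallel -n 100
    max_jobs = 12                               # parallel -j 12

    batches = -(-total_ids // ids_per_process)  # ceiling division -> 5
    print(min(batches, max_jobs))               # -> 5, so -j 12 is never saturated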
04:29 🔗 TC01 has quit IRC (Read error: Operation timed out)
04:33 🔗 atomicthu has quit IRC (Ping timeout: 480 seconds)
04:35 🔗 c4rc4s has quit IRC (Ping timeout: 600 seconds)
04:36 🔗 adinbied has quit IRC (Quit: Leaving)
04:45 🔗 TC01 has joined #archiveteam-bs
06:48 🔗 atomicthu has joined #archiveteam-bs
06:51 🔗 davisonio has quit IRC (Ping timeout: 260 seconds)
07:02 🔗 c4rc4s has joined #archiveteam-bs
07:16 🔗 davisonio has joined #archiveteam-bs
08:37 🔗 BlueMax has quit IRC (Leaving)
09:01 🔗 DFJustin has quit IRC (Ping timeout: 260 seconds)
09:10 🔗 DFJustin has joined #archiveteam-bs
09:52 🔗 ta9le has joined #archiveteam-bs
11:21 🔗 cf has quit IRC (Read error: Operation timed out)
11:23 🔗 cf has joined #archiveteam-bs
12:52 🔗 Sk1d has quit IRC (Read error: Operation timed out)
12:53 🔗 w00dsman has joined #archiveteam-bs
12:58 🔗 Sk1d has joined #archiveteam-bs
12:58 🔗 w00dsman has quit IRC (w00dsman)
13:00 🔗 w00dsman has joined #archiveteam-bs
13:16 🔗 w00dsman1 has joined #archiveteam-bs
13:23 🔗 w00dsman has quit IRC (Read error: Operation timed out)
13:23 🔗 w00dsman1 is now known as w00dsman
13:39 🔗 w00dsman1 has joined #archiveteam-bs
13:45 🔗 w00dsman has quit IRC (Read error: Operation timed out)
13:45 🔗 w00dsman1 is now known as w00dsman
13:48 🔗 w00dsman has quit IRC (w00dsman)
15:40 🔗 mls has joined #archiveteam-bs
15:47 🔗 luxim has joined #archiveteam-bs
15:47 🔗 luxim has left
16:14 🔗 w00dsman has joined #archiveteam-bs
16:26 🔗 godane SketchCow: did you mail tapes this week to me?
16:58 🔗 w00dsman has quit IRC (w00dsman)
17:26 🔗 godane SketchCow: https://archive.org/details/disney-adventures-v3i11
17:27 🔗 Sk2d has joined #archiveteam-bs
17:29 🔗 Sk1d has quit IRC (Read error: Operation timed out)
17:29 🔗 Sk2d is now known as Sk1d
18:49 🔗 jschwart has joined #archiveteam-bs
19:12 🔗 godane SketchCow: https://archive.org/details/disney-adventures-v9i5
19:39 🔗 VADemon has joined #archiveteam-bs
20:39 🔗 VADemon_ has joined #archiveteam-bs
20:39 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
20:59 🔗 wp494 has quit IRC (Ping timeout: 255 seconds)
21:00 🔗 wp494 has joined #archiveteam-bs
21:54 🔗 jschwart has quit IRC (Konversation terminated!)
22:10 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
22:10 🔗 Mateon1 has joined #archiveteam-bs
22:44 🔗 phillipsj has quit IRC (Quit: Leaving)
