[00:10] *** Mateon1 has quit IRC (Read error: Operation timed out)
[00:10] *** Mateon1 has joined #archiveteam-bs
[00:11] *** Jens has quit IRC (Remote host closed the connection)
[00:12] *** Jens has joined #archiveteam-bs
[00:18] *** BlueMax has joined #archiveteam-bs
[01:42] SketchCow: i'm starting to upload simply k-pop episodes i got from youtube
[01:51] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:54] *** Sk1d has joined #archiveteam-bs
[02:05] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[02:56] *** adinbied has joined #archiveteam-bs
[03:01] you can read tasks from stdin with something like for line in sys.stdin:
[03:01] you probably want to use something that can write WARCs
[03:01] @ivan, Ah - didn't know that. Is there any way to do multiple concurrent get requests or something
[03:01] just launch more processes working on different tasks
[03:01] also there's a command `parallel` for easily launching subprocesses
[03:02] by default it works like xargs
[03:03] if you make it work that way, perhaps read the ids with sys.argv[1:]
[03:03] Given the API calls are just returning plain text data, I decided not to go with WARC so that it could be processed easier (Importing into a SQL DB or Google BigTables or something). Would there be any overhead/downsides in using WARC?
[03:04] WARC makes sense if you think there might be any value in having the responses available in wayback
[03:04] *** wp494 has quit IRC (Ping timeout: 633 seconds)
[03:04] *** wp494 has joined #archiveteam-bs
[03:04] it also captures request and response headers that might be valuable to someone looking for some evidence that the server did actually serve that at that time
[03:05] other than that not much value I can think of
[03:05] Sorry, I'm still learning the ins and outs of python - could you either link to some documentation or give an example of how you might go about using the sub processes and sys.argv?
[03:07] Yeah, fair point. 
My main concern is having my API Key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in public
[03:08] OK, no WARC then
[03:11] 12 processes churning through a max of 100 ids per process:
[03:11] for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
[03:11] you can of course generate lists of ids and run them on different machines entirely
[03:11] for i in {1..1000}; do echo $i; done > ids
[03:11] cat ids | parallel ...
[03:13] parallel comes from the parallel package on debian (not from moreutils)
[03:16] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:16] to be clear in the python program it would be something like: for id in sys.argv[1:]: to do something with each id
[03:17] OK, I'm still not understanding how I would implement the parallel command with a scraping python script.
[03:17] Ah, OK - that makes sense
[03:23] So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
[03:23] NVM
[03:24] I'm dumb
[03:28] *** odemg has joined #archiveteam-bs
[03:33] OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
[03:34] and I'm calling it with for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
[03:34] Yet no requests are being made....
[03:36] @ivan, sorry for pestering you with questions, just trying to learn from all of this
[03:48] * ivan looks
[03:49] adinbied: what does the first line of parallel --version say?
[03:50] GNU parallel 20141022
[03:51] can you print(sys.argv[1:]) in your program and see what that says?
[03:51] I guess you saw no errors from python?
[03:51] No, no errors with python
[03:53] Here's the output: https://pastebin.com/s7LVMkn2
[03:54] assuming you put that print inside the body of the for loop I think the parallel and the sys.argv[1:] is working fine
[03:56] Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly.
[03:56] try some print debugging to see what requests is doing
[03:56] for example printing non-200 status codes
[04:00] Ah, I'm an idiot. The set IDs don't start until 173.... Derp. Thanks for your help!
[04:01] I figured it might be that :-)
[04:03] Are there any optimizations I should make with the -j and -n parameters of parallel?
[04:04] I.e., is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
[04:05] *** ReimuHaku has quit IRC (Ping timeout: 252 seconds)
[04:06] *** ReimuHaku has joined #archiveteam-bs
[04:07] adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
[04:08] then hope you don't get banned for sending that many requests per second
[04:09] if you cause noticeable slowdown someone is going to look at a log
[04:09] (sometimes)
[04:12] I mean you can try a -j as high as you want; you might want to raise that -n 100 too to avoid pointless process churn
[04:16] TIL https://archive.softwareheritage.org/
[04:18] Huh, after some trial and error I found that it goes faster with a lower n value
[04:20] Seems the sweet spot is -j 12 -n 50. Thanks for your help!
[04:21] that's bizarre, lower -n would generally increase time because you have to wait for more pythons to start and load a bunch of bytecode
[04:22] either a fluke or maybe the server is slowing down after many requests on a single connection
[04:24] When a higher n value is used (at least for me) the Max Jobs to Run decreases. At -n 50 it's at 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and is loads faster than previously
[04:27] how many ids are you giving parallel?
[04:28] 500 in this case
[04:28] well 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the maximum -j 12
[04:29] Ahhh, that makes more sense.
[04:29] *** TC01 has quit IRC (Read error: Operation timed out)
[04:33] *** atomicthu has quit IRC (Ping timeout: 480 seconds)
[04:35] *** c4rc4s has quit IRC (Ping timeout: 600 seconds)
[04:36] *** adinbied has quit IRC (Quit: Leaving)
[04:45] *** TC01 has joined #archiveteam-bs
[06:48] *** atomicthu has joined #archiveteam-bs
[06:51] *** davisonio has quit IRC (Ping timeout: 260 seconds)
[07:02] *** c4rc4s has joined #archiveteam-bs
[07:16] *** davisonio has joined #archiveteam-bs
[08:37] *** BlueMax has quit IRC (Leaving)
[09:01] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[09:10] *** DFJustin has joined #archiveteam-bs
[09:52] *** ta9le has joined #archiveteam-bs
[11:21] *** cf has quit IRC (Read error: Operation timed out)
[11:23] *** cf has joined #archiveteam-bs
[12:52] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:53] *** w00dsman has joined #archiveteam-bs
[12:58] *** Sk1d has joined #archiveteam-bs
[12:58] *** w00dsman has quit IRC (w00dsman)
[13:00] *** w00dsman has joined #archiveteam-bs
[13:16] *** w00dsman1 has joined #archiveteam-bs
[13:23] *** w00dsman has quit IRC (Read error: Operation timed out)
[13:23] *** w00dsman1 is now known as w00dsman
[13:39] *** w00dsman1 has joined #archiveteam-bs
[13:45] *** w00dsman has quit IRC (Read error: Operation timed out)
[13:45] *** w00dsman1 is now known as w00dsman
[13:48] *** w00dsman has quit IRC (w00dsman)
[15:40] *** mls has joined #archiveteam-bs
[15:47] *** luxim has joined #archiveteam-bs
[15:47] *** luxim has left
[16:14] *** w00dsman has joined #archiveteam-bs
[16:26] SketchCow: did you mail tapes this week to me?
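A small sketch of the arithmetic from the 04:28 message, for anyone tuning -j and -n later. The function name effective_jobs is made up for illustration, and this only computes the upper bound described in the chat; it is not a model of what parallel's --eta "Max jobs to run" display reports, which did not match this bound exactly in the numbers quoted at 04:24.

```python
import math


def effective_jobs(num_ids: int, ids_per_process: int, max_jobs: int) -> int:
    """Upper bound on how many processes can be busy at the same time.

    num_ids         -- how many ids are piped into parallel
    ids_per_process -- the value given to -n
    max_jobs        -- the value given to -j
    """
    # Each process receives up to ids_per_process ids, so there are only
    # this many batches of work to hand out in total.
    batches = math.ceil(num_ids / ids_per_process)
    # parallel never runs more than -j jobs, and cannot run more jobs
    # than there are batches.
    return min(max_jobs, batches)


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(f"-n {n}: at most {effective_jobs(500, n, 12)} concurrent processes")
```

With only 500 ids, a smaller -n gives more batches and so lets more of the 12 job slots be used, which fits the speedup seen at -n 50.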
[16:58] *** w00dsman has quit IRC (w00dsman)
[17:26] SketchCow: https://archive.org/details/disney-adventures-v3i11
[17:27] *** Sk2d has joined #archiveteam-bs
[17:29] *** Sk1d has quit IRC (Read error: Operation timed out)
[17:29] *** Sk2d is now known as Sk1d
[18:49] *** jschwart has joined #archiveteam-bs
[19:12] SketchCow: https://archive.org/details/disney-adventures-v9i5
[19:39] *** VADemon has joined #archiveteam-bs
[20:39] *** VADemon_ has joined #archiveteam-bs
[20:39] *** VADemon has quit IRC (Read error: Connection reset by peer)
[20:59] *** wp494 has quit IRC (Ping timeout: 255 seconds)
[21:00] *** wp494 has joined #archiveteam-bs
[21:54] *** jschwart has quit IRC (Konversation terminated!)
[22:10] *** Mateon1 has quit IRC (Read error: Operation timed out)
[22:10] *** Mateon1 has joined #archiveteam-bs
[22:44] *** phillipsj has quit IRC (Quit: Leaving)
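A minimal sketch of the kind of worker script discussed between 03:16 and 03:56: take ids from sys.argv[1:], build one URL per id, and print any non-200 status instead of failing silently. This is not the script from the gist, which is not reproduced in this log. BASE_URL, AFTER_URL, the output file naming and the helper names are all placeholders, and it uses the standard library's urllib rather than whatever HTTP library the original used.

```python
import sys
import urllib.error
import urllib.request

# Placeholder values; the real endpoint is not given in the log.
BASE_URL = "https://api.example.invalid/sets/"
AFTER_URL = "?format=text"


def build_url(item_id: str) -> str:
    # Use one id at a time. str(sys.argv[1:]) would embed the text of the
    # whole list (e.g. "['1', '2']") in the URL, the problem raised at 03:23.
    return BASE_URL + item_id + AFTER_URL


def fetch(url: str, timeout: float = 30.0):
    """Return (status, body); status is None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, b""
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout and similar.
        return None, b""


def main(ids):
    for item_id in ids:
        url = build_url(item_id)
        status, body = fetch(url)
        if status != 200:
            # Print debugging as suggested at 03:56: surface non-200 results.
            print(f"{item_id}: status {status} for {url}", file=sys.stderr)
            continue
        with open(f"{item_id}.txt", "wb") as out:
            out.write(body)


if __name__ == "__main__":
    # parallel -n 100 appends up to 100 ids after the script name.
    main(sys.argv[1:])
```

It would be launched the same way as in the 03:34 message, with parallel supplying the ids as arguments to 'python3 scrape.py'.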