[00:10] *** Mateon1 has quit IRC (Read error: Operation timed out)
[00:10] *** Mateon1 has joined #archiveteam-bs
[00:11] *** Jens has quit IRC (Remote host closed the connection)
[00:12] *** Jens has joined #archiveteam-bs
[00:18] *** BlueMax has joined #archiveteam-bs
[01:42] <godane> SketchCow: i'm starting to upload simply k-pop episodes i got from youtube
[01:51] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:54] *** Sk1d has joined #archiveteam-bs
[02:05] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[02:56] *** adinbied has joined #archiveteam-bs
[03:01] <ivan> you can read tasks from stdin with something like for line in sys.stdin:
[03:01] <ivan> you probably want to use something that can write WARCs
[03:01] <adinbied> @ivan, Ah - didn't know that. Is there any way to do multiple concurrent GET requests or something?
[03:01] <ivan> just launch more processes working on different tasks
[03:01] <ivan> also there's a command `parallel` for easily launching subprocesses
[03:02] <ivan> by default it works like xargs
[03:03] <ivan> if you make it work that way, perhaps read the ids with sys.argv[1:]
[03:03] <adinbied> Given the API calls are just returning plain text data, I decided not to go with WARC so that it could be processed more easily (importing into a SQL DB or Google Bigtable or something). Would there be any overhead/downsides in using WARC?
[03:04] <ivan> WARC makes sense if you think there might be any value in having the responses available in wayback
[03:04] *** wp494 has quit IRC (Ping timeout: 633 seconds)
[03:04] *** wp494 has joined #archiveteam-bs
[03:04] <ivan> it also captures request and response headers that might be valuable to someone looking for some evidence that the server did actually serve that at that time
[03:05] <ivan> other than that not much value I can think of
[03:05] <adinbied> Sorry, I'm still learning the ins and outs of python - could you either link to some documentation or give an example of how you might go about using the sub processes and sys.argv?
[03:07] <adinbied> Yeah, fair point. My main concern is having my API key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in public
[03:08] <ivan> OK, no WARC then
[03:11] <ivan> 12 processes churning through a max of 100 ids per process:
[03:11] <ivan> for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
[03:11] <ivan> you can of course generate lists of ids and run them on different machines entirely
[03:11] <ivan> for i in {1..1000}; do echo $i; done > ids
[03:11] <ivan> cat ids | parallel ...
[03:13] <ivan> parallel comes from the parallel package on debian (not from moreutils)
[03:16] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:16] <ivan> to be clear in the python program it would be something like: for id in sys.argv[1:]: to do something with each id
[03:17] <adinbied> OK, I'm still not understanding how I would implement the parallel command with a scraping python script.
[03:17] <adinbied> Ah, OK - that makes sense
[03:23] <adinbied> So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
[03:23] <adinbied> NVM
[03:24] <adinbied> I'm dumb
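Note: the loop ivan describes, and the list-versus-single-id mix-up mentioned just above, can be sketched roughly as follows. The URL pieces (BASE_URL, AFTER_URL) and the helper name build_urls are hypothetical placeholders, not taken from the actual script; the only part that comes from the discussion is iterating over sys.argv[1:] and building one URL per id.

```python
import sys

# Hypothetical endpoint pieces; the real values live in the script under discussion.
BASE_URL = "https://api.example.com/sets/"
AFTER_URL = "?format=text"


def build_urls(ids):
    """Return one URL per id, rather than one URL containing the whole list."""
    return [BASE_URL + str(set_id) + AFTER_URL for set_id in ids]


if __name__ == "__main__":
    # parallel appends up to -n ids to the command line, so they show up in sys.argv[1:].
    for url in build_urls(sys.argv[1:]):
        print(url)
```

Concatenating str(sys.argv[1:]) directly would embed the text of the whole list in a single URL, which is why each id has to be handled inside the loop.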
[03:28] *** odemg has joined #archiveteam-bs
[03:33] <adinbied> OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
[03:34] <adinbied> and I'm calling it with for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
[03:34] <adinbied> Yet no requests are being made....
[03:36] <adinbied> @ivan, sorry for pestering you with questions, just trying to learn from all of this
[03:48] * ivan looks
[03:49] <ivan> adinbied: what does the first line of parallel --version say?
[03:50] <adinbied> GNU parallel 20141022
[03:51] <ivan> can you print(sys.argv[1:]) in your program and see what that says?
[03:51] <ivan> I guess you saw no errors from python?
[03:51] <adinbied> No, no errors with python
[03:53] <adinbied> Here's the output: https://pastebin.com/s7LVMkn2
[03:54] <ivan> assuming you put that print inside the body of the for loop I think the parallel and the sys.argv[1:] is working fine
[03:56] <adinbied> Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly.
[03:56] <ivan> try some print debugging to see what requests is doing
[03:56] <ivan> for example printing non-200 status codes
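Note: a minimal sketch of the print debugging suggested here, assuming the goal is simply to surface every response that is not a 200. The script in the chat appears to use the requests library; this version uses the standard library's urllib instead so that it runs without extra packages, and the names fetch_status and report_non_200 are made up for illustration. With requests, the equivalent check would be on response.status_code.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for url (a stdlib stand-in for requests.get)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still what we want to see.
        return err.code


def report_non_200(urls, fetch=fetch_status):
    """Print and return every (url, status) pair whose status is not 200."""
    failures = []
    for url in urls:
        status = fetch(url)
        if status != 200:
            print(f"{status} {url}")
            failures.append((url, status))
    return failures
```

Passing the fetch function as a parameter is only a convenience for trying the logic out without touching the network.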
[04:00] <adinbied> Ah, I'm an idiot. The set IDs don't start until 173... Derp. Thanks for your help!
[04:01] <ivan> I figured it might be that :-)
[04:03] <adinbied> Are there any optimizations I should make with the -j and -n parameters of parallel?
[04:04] <adinbied> I.e., is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
[04:05] *** ReimuHaku has quit IRC (Ping timeout: 252 seconds)
[04:06] *** ReimuHaku has joined #archiveteam-bs
[04:07] <ivan> adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
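Note: the calculation ivan suggests could look something like the sketch below. It assumes each process issues its requests one after another and that the server's response time does not change under load, which may well not hold. The function name and the numbers in the usage line are hypothetical, not figures from the chat.

```python
import math


def processes_needed(total_requests, seconds_per_request, budget_seconds):
    """Minimum number of sequential workers needed to finish within the budget."""
    total_work_seconds = total_requests * seconds_per_request
    return math.ceil(total_work_seconds / budget_seconds)


# Hypothetical figures: 100,000 requests at 0.5 s each, to be done within one hour.
print(processes_needed(100_000, 0.5, 3600))
```

The result is a lower bound for -j; it says nothing about how many requests per second the server will tolerate.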
[04:08] <ivan> then hope you don't get banned for sending that many requests per second
[04:09] <ivan> if you cause noticeable slowdown someone is going to look at a log
[04:09] <ivan> (sometimes)
[04:12] <ivan> I mean you can try a -j as high as you want; you might want to raise that -n 100 too to avoid pointless process churn
[04:16] <ivan> TIL https://archive.softwareheritage.org/
[04:18] <adinbied> Huh, after some trial and error I found that it goes faster with a lower n value
[04:20] <adinbied> Seems the sweet spot is -j 12 -n 50. Thanks for your help!
[04:21] <ivan> that's bizarre, lower -n would generally increase time because you have to wait for more pythons to start and load a bunch of bytecode
[04:22] <ivan> either a fluke or maybe the server is slowing down after many requests on a single connection
[04:24] <adinbied> When a higher n value is used (at least for me) the Max Jobs to Run decreases. At -n 50 it's at 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and is loads faster than previously
[04:27] <ivan> how many ids are you giving parallel?
[04:28] <adinbied> 500 in this case
[04:28] <ivan> well 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the maximum -j 12
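Note: ivan's arithmetic can be written out as a small sketch. It assumes parallel fills each job with up to -n ids and never runs more than -j jobs at once, and it ignores any other limits parallel may apply. The function name max_running_jobs is made up for illustration.

```python
import math


def max_running_jobs(num_ids, ids_per_job, job_slots):
    """Upper bound on simultaneous jobs for a given input size, -n and -j."""
    num_jobs = math.ceil(num_ids / ids_per_job)  # how many batches the ids split into
    return min(job_slots, num_jobs)


# The case from the chat: 500 ids, -n 100, -j 12.
print(max_running_jobs(500, 100, 12))  # 5
```

With few ids, raising -n shrinks the number of batches, so fewer of the -j slots can ever be filled.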
[04:29] <adinbied> Ahhh, that makes more sense.
[04:29] *** TC01 has quit IRC (Read error: Operation timed out)
[04:33] *** atomicthu has quit IRC (Ping timeout: 480 seconds)
[04:35] *** c4rc4s has quit IRC (Ping timeout: 600 seconds)
[04:36] *** adinbied has quit IRC (Quit: Leaving)
[04:45] *** TC01 has joined #archiveteam-bs
[06:48] *** atomicthu has joined #archiveteam-bs
[06:51] *** davisonio has quit IRC (Ping timeout: 260 seconds)
[07:02] *** c4rc4s has joined #archiveteam-bs
[07:16] *** davisonio has joined #archiveteam-bs
[08:37] *** BlueMax has quit IRC (Leaving)
[09:01] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[09:10] *** DFJustin has joined #archiveteam-bs
[09:52] *** ta9le has joined #archiveteam-bs
[11:21] *** cf has quit IRC (Read error: Operation timed out)
[11:23] *** cf has joined #archiveteam-bs
[12:52] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:53] *** w00dsman has joined #archiveteam-bs
[12:58] *** Sk1d has joined #archiveteam-bs
[12:58] *** w00dsman has quit IRC (w00dsman)
[13:00] *** w00dsman has joined #archiveteam-bs
[13:16] *** w00dsman1 has joined #archiveteam-bs
[13:23] *** w00dsman has quit IRC (Read error: Operation timed out)
[13:23] *** w00dsman1 is now known as w00dsman
[13:39] *** w00dsman1 has joined #archiveteam-bs
[13:45] *** w00dsman has quit IRC (Read error: Operation timed out)
[13:45] *** w00dsman1 is now known as w00dsman
[13:48] *** w00dsman has quit IRC (w00dsman)
[15:40] *** mls has joined #archiveteam-bs
[15:47] *** luxim has joined #archiveteam-bs
[15:47] *** luxim has left 
[16:14] *** w00dsman has joined #archiveteam-bs
[16:26] <godane> SketchCow: did you mail tapes this week to me?
[16:58] *** w00dsman has quit IRC (w00dsman)
[17:26] <godane> SketchCow: https://archive.org/details/disney-adventures-v3i11
[17:27] *** Sk2d has joined #archiveteam-bs
[17:29] *** Sk1d has quit IRC (Read error: Operation timed out)
[17:29] *** Sk2d is now known as Sk1d
[18:49] *** jschwart has joined #archiveteam-bs
[19:12] <godane> SketchCow: https://archive.org/details/disney-adventures-v9i5
[19:39] *** VADemon has joined #archiveteam-bs
[20:39] *** VADemon_ has joined #archiveteam-bs
[20:39] *** VADemon has quit IRC (Read error: Connection reset by peer)
[20:59] *** wp494 has quit IRC (Ping timeout: 255 seconds)
[21:00] *** wp494 has joined #archiveteam-bs
[21:54] *** jschwart has quit IRC (Konversation terminated!)
[22:10] *** Mateon1 has quit IRC (Read error: Operation timed out)
[22:10] *** Mateon1 has joined #archiveteam-bs
[22:44] *** phillipsj has quit IRC (Quit: Leaving)