#archiveteam-bs 2018-06-21,Thu


***Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
Jens has quit IRC (Remote host closed the connection)
Jens has joined #archiveteam-bs
[00:10]
BlueMax has joined #archiveteam-bs [00:18]
................. (idle for 1h24mn)
godane: SketchCow: I'm starting to upload Simply K-Pop episodes I got from YouTube [01:42]
***Sk1d has quit IRC (Read error: Operation timed out)
Sk1d has joined #archiveteam-bs
[01:51]
ta9le has quit IRC (Quit: Connection closed for inactivity) [02:05]
........... (idle for 51mn)
adinbied has joined #archiveteam-bs [02:56]
ivan: you can read tasks from stdin with something like for line in sys.stdin:
you probably want to use something that can write WARCs
[03:01]
adinbied: @ivan, Ah - didn't know that. Is there any way to do multiple concurrent GET requests or something? [03:01]
ivan: just launch more processes working on different tasks
also there's a command `parallel` for easily launching subprocesses
by default it works like xargs
if you make it work that way, perhaps read the ids with sys.argv[1:]
[03:01]
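
A minimal sketch of the stdin approach mentioned above, assuming each input line is one task id (the fetch side is hypothetical):

    import sys

    for line in sys.stdin:
        task_id = line.strip()
        if not task_id:
            continue
        print("processing", task_id)  # replace with the real work for this id
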
adinbied: Given the API calls are just returning plain text data, I decided not to go with WARC so that it could be processed more easily (importing into a SQL DB or Google BigTables or something). Would there be any overhead/downsides in using WARC? [03:03]
ivan: WARC makes sense if you think there might be any value in having the responses available in wayback [03:04]
***wp494 has quit IRC (Ping timeout: 633 seconds)
wp494 has joined #archiveteam-bs
[03:04]
ivan: it also captures request and response headers that might be valuable to someone looking for evidence that the server actually served that content at that time
other than that, not much value I can think of
[03:04]
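
For reference, if WARC output were wanted, warcio's capture_http helper can record requests traffic; a minimal sketch (the filename and URL are placeholders):

    from warcio.capture_http import capture_http
    import requests  # per warcio's docs, requests must be imported after capture_http

    with capture_http('api-responses.warc.gz'):
        requests.get('https://example.com/api?id=1')
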
adinbied: Sorry, I'm still learning the ins and outs of Python - could you either link to some documentation or give an example of how you might go about using the subprocesses and sys.argv?
Yeah, fair point. My main concern is having my API key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in public
[03:05]
ivan: OK, no WARC then
12 processes churning through a max of 100 ids per process:
for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
you can of course generate lists of ids and run them on different machines entirely
for i in {1..1000}; do echo $i; done > ids
cat ids | parallel ...
parallel comes from the parallel package on Debian (not from moreutils)
[03:08]
***odemg has quit IRC (Ping timeout: 260 seconds) [03:16]
ivan: to be clear, in the Python program it would be something like: for id in sys.argv[1:]: to do something with each id [03:16]
adinbied: OK, I'm still not understanding how I would implement the parallel command with a scraping Python script.
Ah, OK - that makes sense
[03:17]
So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
NVM
I'm dumb
[03:23]
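
What the "NVM" above resolves: sys.argv[1:] is a list, so the URL has to be built from one id at a time inside the loop rather than from the whole list. A sketch, with baseurl and afterurl standing in for the pieces of adinbied's actual script:

    import sys
    import requests

    baseurl = 'https://example.com/api?id='  # placeholder
    afterurl = '&format=text'                # placeholder

    for id in sys.argv[1:]:
        fullurl = baseurl + id + afterurl    # each id is already a string
        response = requests.get(fullurl)
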
***odemg has joined #archiveteam-bs [03:28]
adinbied: OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
and I'm calling it with for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
Yet no requests are being made....
@ivan, sorry for pestering you with questions, just trying to learn from all of this
[03:33]
* ivan looks
adinbied: what does the first line of parallel --version say?
[03:48]
adinbied: GNU parallel 20141022 [03:50]
ivan: can you print(sys.argv[1:]) in your program and see what that says?
I guess you saw no errors from python?
[03:51]
adinbied: No, no errors with Python
Here's the output: https://pastebin.com/s7LVMkn2
[03:51]
ivan: assuming you put that print inside the body of the for loop, I think the parallel and the sys.argv[1:] are working fine [03:54]
adinbied: Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly. [03:56]
ivan: try some print debugging to see what requests is doing
for example printing non-200 status codes
[03:56]
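
A sketch of the print debugging suggested above, layered onto the same loop (baseurl and afterurl as in the earlier placeholder sketch):

    for id in sys.argv[1:]:
        r = requests.get(baseurl + id + afterurl)
        if r.status_code != 200:
            print('non-200 for id', id, '->', r.status_code)
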
adinbied: Ah, I'm an idiot. The set IDs don't start until 173... Derp. Thanks for your help! [04:00]
ivan: I figured it might be that :-) [04:01]
adinbied: Are there any optimizations I should make with the -j and -n parameters of parallel?
i.e. is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
[04:03]
***ReimuHaku has quit IRC (Ping timeout: 252 seconds)
ReimuHaku has joined #archiveteam-bs
[04:05]
ivan: adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
then hope you don't get banned for sending that many requests per second
if you cause noticeable slowdown someone is going to look at a log
(sometimes)
I mean you can try a -j as high as you want; you might want to raise that -n 100 too to avoid pointless process churn
TIL https://archive.softwareheritage.org/
[04:07]
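
A back-of-the-envelope version of the calculation suggested above (the per-request timing is assumed, not measured):

    total_ids = 500          # ids to fetch
    secs_per_request = 1.0   # assumed average response time
    workers = 12             # parallel's -j value

    # rough wall-clock estimate, ignoring interpreter startup churn
    print(total_ids * secs_per_request / workers, 'seconds')  # ~41.7
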
adinbied: Huh, after some trial and error I found that it goes faster with a lower -n value
Seems the sweet spot is -j 12 -n 50. Thanks for your help!
[04:18]
ivan: that's bizarre, a lower -n would generally increase time because you have to wait for more pythons to start and load a bunch of bytecode
either a fluke or maybe the server is slowing down after many requests on a single connection
[04:21]
adinbied: When a higher -n value is used (at least for me), the Max Jobs to Run decreases. At -n 50 it's at 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and it's loads faster than before [04:24]
ivan: how many ids are you giving parallel? [04:27]
adinbied: 500 in this case [04:28]
ivan: well, 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the -j 12 maximum [04:28]
adinbied: Ahhh, that makes more sense. [04:29]
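
The chunk arithmetic behind ivan's point: parallel can only keep as many jobs running as there are chunks of ids, so effective concurrency is min(-j, ceil(ids / -n)). A quick check (the --eta "Max jobs to run" display may estimate slightly differently):

    import math

    ids, j = 500, 12
    for n in (50, 100, 200):
        chunks = math.ceil(ids / n)
        print(f'-n {n}: {chunks} chunks, effective jobs = {min(j, chunks)}')
    # -n 50: 10 chunks, effective jobs = 10
    # -n 100: 5 chunks, effective jobs = 5
    # -n 200: 3 chunks, effective jobs = 3
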
***TC01 has quit IRC (Read error: Operation timed out)
atomicthu has quit IRC (Ping timeout: 480 seconds)
c4rc4s has quit IRC (Ping timeout: 600 seconds)
adinbied has quit IRC (Quit: Leaving)
[04:29]
TC01 has joined #archiveteam-bs [04:45]
......................... (idle for 2h3mn)
atomicthu has joined #archiveteam-bs
davisonio has quit IRC (Ping timeout: 260 seconds)
[06:48]
c4rc4s has joined #archiveteam-bs [07:02]
davisonio has joined #archiveteam-bs [07:16]
................. (idle for 1h21mn)
BlueMax has quit IRC (Leaving) [08:37]
..... (idle for 24mn)
DFJustin has quit IRC (Ping timeout: 260 seconds) [09:01]
DFJustin has joined #archiveteam-bs [09:10]
......... (idle for 42mn)
ta9le has joined #archiveteam-bs [09:52]
