00:10 -- Mateon1 has quit IRC (Read error: Operation timed out)
00:10 -- Mateon1 has joined #archiveteam-bs
00:11 -- Jens has quit IRC (Remote host closed the connection)
00:12 -- Jens has joined #archiveteam-bs
00:18 -- BlueMax has joined #archiveteam-bs
01:42 <godane> SketchCow: I'm starting to upload the Simply K-Pop episodes I got from YouTube
01:51 -- Sk1d has quit IRC (Read error: Operation timed out)
01:54 -- Sk1d has joined #archiveteam-bs
02:05 -- ta9le has quit IRC (Quit: Connection closed for inactivity)
02:56 -- adinbied has joined #archiveteam-bs
03:01 <ivan> you can read tasks from stdin with something like `for line in sys.stdin:`
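A minimal sketch of the stdin loop ivan is describing; the task format and the work step are assumptions for illustration, not anything from the log:

    # worker.py -- read one task id per line from stdin
    import sys

    for line in sys.stdin:
        task_id = line.strip()
        if not task_id:
            continue  # skip blank lines
        # hypothetical work step; a real worker would fetch and save here
        print("processing", task_id)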
03:01 <ivan> you probably want to use something that can write WARCs
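ivan doesn't name a WARC-writing tool here; one option (an assumption, not from the log) is the warcio library, which can record requests traffic into a WARC file:

    from warcio.capture_http import capture_http
    import requests  # warcio requires requests to be imported after capture_http

    # every HTTP request made inside this block is written to the WARC,
    # including request and response headers
    with capture_http('example.warc.gz'):
        requests.get('https://example.com/')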
03:01 <adinbied> @ivan, Ah - didn't know that. Is there any way to do multiple concurrent GET requests or something?
03:01 <ivan> just launch more processes working on different tasks
03:01 <ivan> also there's a command, `parallel`, for easily launching subprocesses
03:02 <ivan> by default it works like xargs
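A toy illustration of that xargs-like default: parallel appends each batch of input lines as arguments to the command (hypothetical example; output order may vary across jobs):

    seq 1 6 | parallel -n 2 echo got
    # got 1 2
    # got 3 4
    # got 5 6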
03:03 <ivan> if you make it work that way, perhaps read the ids with sys.argv[1:]
03:03 <adinbied> Given the API calls just return plain text data, I decided not to go with WARC so that it can be processed more easily (importing into a SQL DB or Google BigTables or something). Would there be any overhead/downsides to using WARC?
03:04 <ivan> WARC makes sense if you think there might be any value in having the responses available in wayback
03:04 -- wp494 has quit IRC (Ping timeout: 633 seconds)
03:04 -- wp494 has joined #archiveteam-bs
03:04 <ivan> it also captures request and response headers, which might be valuable to someone looking for evidence that the server actually served that content at that time
03:05 <ivan> other than that, not much value I can think of
03:05 <adinbied> Sorry, I'm still learning the ins and outs of Python - could you either link to some documentation or give an example of how you might go about using subprocesses and sys.argv?
03:07 <adinbied> Yeah, fair point. My main concern is having my API key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in public
03:08 <ivan> OK, no WARC then
03:11 <ivan> 12 processes churning through a max of 100 ids per process:
03:11 <ivan> for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
03:11 <ivan> you can of course generate lists of ids and run them on different machines entirely
03:11 <ivan> for i in {1..1000}; do echo $i; done > ids
03:11 <ivan> cat ids | parallel ...
03:13 <ivan> parallel comes from the parallel package on Debian (not from moreutils)
03:16 -- odemg has quit IRC (Ping timeout: 260 seconds)
03:16 <ivan> to be clear, in the Python program it would be something like `for id in sys.argv[1:]:` to do something with each id
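Assembled into one sketch, the worker ivan is outlining might look like this; the endpoint, output path, and use of the requests library are assumptions, not adinbied's actual script:

    # scrape.py -- fetch one item per id given on the command line
    import os
    import sys
    import requests

    BASEURL = "https://api.example.com/item/"  # hypothetical endpoint

    os.makedirs("out", exist_ok=True)
    for id in sys.argv[1:]:
        resp = requests.get(BASEURL + id)
        if resp.status_code != 200:
            print("failed:", id, resp.status_code, file=sys.stderr)
            continue
        # hypothetical output layout: one plain-text file per id
        with open("out/" + id + ".txt", "w") as f:
            f.write(resp.text)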
03:17 <adinbied> OK, I'm still not understanding how I would implement the parallel command with a scraping Python script.
03:17 <adinbied> Ah, OK - that makes sense
03:23 <adinbied> So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
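The fix adinbied lands on himself just below: build the URL from one id at a time inside the loop, not from the whole list. baseurl and afterurl are his names from the message above; their values here are placeholders:

    import sys

    baseurl = "https://example.com/api?id="  # placeholder values; the real
    afterurl = "&format=text"                # ones aren't shown in the log

    # sys.argv[1:] is a list of ids; take one id per iteration
    for id in sys.argv[1:]:
        fullurl = baseurl + id + afterurl    # not baseurl + str(sys.argv[1:]) + afterurl
        print(fullurl)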
03:23 <adinbied> NVM
03:24 <adinbied> I'm dumb
03:28 -- odemg has joined #archiveteam-bs
03:33 <adinbied> OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
03:34 <adinbied> and I'm calling it with: for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
03:34 <adinbied> Yet no requests are being made...
03:36 <adinbied> @ivan, sorry for pestering you with questions, just trying to learn from all of this
03:48 * ivan looks
03:49 <ivan> adinbied: what does the first line of parallel --version say?
03:50 <adinbied> GNU parallel 20141022
03:51 <ivan> can you print(sys.argv[1:]) in your program and see what that says?
03:51 <ivan> I guess you saw no errors from Python?
03:51 <adinbied> No, no errors with Python
03:53 <adinbied> Here's the output: https://pastebin.com/s7LVMkn2
03:54 <ivan> assuming you put that print inside the body of the for loop, I think the parallel and the sys.argv[1:] are working fine
03:56 <adinbied> Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly.
03:56 <ivan> try some print debugging to see what requests is doing
03:56 <ivan> for example, printing non-200 status codes
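A sketch of that debugging step, dropped into the request loop from the earlier sketches (the URL shape is still a placeholder):

    import sys
    import requests

    for id in sys.argv[1:]:
        fullurl = "https://example.com/api?id=" + id  # placeholder URL
        resp = requests.get(fullurl)
        print(id, resp.status_code)                   # log every status
        if resp.status_code != 200:
            print("non-200 for", fullurl, file=sys.stderr)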
04:00 <adinbied> Ah, I'm an idiot. The set IDs don't start until 173... Derp. Thanks for your help!
04:01 <ivan> I figured it might be that :-)
04:03 <adinbied> Are there any optimizations I should make with the -j and -n parameters of parallel?
04:04 <adinbied> i.e. is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
04:05 -- ReimuHaku has quit IRC (Ping timeout: 252 seconds)
04:06 -- ReimuHaku has joined #archiveteam-bs
04:07 <ivan> adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
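A worked version of that calculation, using round numbers rather than adinbied's measurements: wall time ≈ number of ids × seconds per request ÷ number of processes, so 500 ids × 1 s each ÷ 12 parallel processes ≈ 42 s, assuming the server keeps up.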
04:08 <ivan> then hope you don't get banned for sending that many requests per second
04:09 <ivan> if you cause a noticeable slowdown, someone is going to look at a log
04:09 <ivan> (sometimes)
04:12 <ivan> I mean you can try a -j as high as you want; you might want to raise that -n 100 too, to avoid pointless process churn
04:16 <ivan> TIL https://archive.softwareheritage.org/
04:18 <adinbied> Huh, after some trial and error I found that it goes faster with a lower n value
04:20 <adinbied> Seems the sweet spot is -j 12 -n 50. Thanks for your help!
04:21 <ivan> that's bizarre, a lower -n would generally increase time because you have to wait for more pythons to start and load a bunch of bytecode
04:22 <ivan> either a fluke, or maybe the server is slowing down after many requests on a single connection
04:24 <adinbied> When a higher n value is used (at least for me), the Max Jobs to Run decreases. At -n 50 it's 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and it's loads faster than previously
04:27 <ivan> how many ids are you giving parallel?
04:28 <adinbied> 500 in this case
04:28 <ivan> well, 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the maximum -j 12
04:29 <adinbied> Ahhh, that makes more sense.
04:29 -- TC01 has quit IRC (Read error: Operation timed out)
04:33 -- atomicthu has quit IRC (Ping timeout: 480 seconds)
04:35 -- c4rc4s has quit IRC (Ping timeout: 600 seconds)
04:36 -- adinbied has quit IRC (Quit: Leaving)
04:45 -- TC01 has joined #archiveteam-bs
06:48 -- atomicthu has joined #archiveteam-bs
06:51 -- davisonio has quit IRC (Ping timeout: 260 seconds)
07:02 -- c4rc4s has joined #archiveteam-bs
07:16 -- davisonio has joined #archiveteam-bs
08:37 -- BlueMax has quit IRC (Leaving)
09:01 -- DFJustin has quit IRC (Ping timeout: 260 seconds)
09:10 -- DFJustin has joined #archiveteam-bs
09:52 -- ta9le has joined #archiveteam-bs
11:21 -- cf has quit IRC (Read error: Operation timed out)
11:23 -- cf has joined #archiveteam-bs
12:52 -- Sk1d has quit IRC (Read error: Operation timed out)
12:53 -- w00dsman has joined #archiveteam-bs
12:58 -- Sk1d has joined #archiveteam-bs
12:58 -- w00dsman has quit IRC (w00dsman)
13:00 -- w00dsman has joined #archiveteam-bs
13:16 -- w00dsman1 has joined #archiveteam-bs
13:23 -- w00dsman has quit IRC (Read error: Operation timed out)
13:23 -- w00dsman1 is now known as w00dsman
13:39 -- w00dsman1 has joined #archiveteam-bs
13:45 -- w00dsman has quit IRC (Read error: Operation timed out)
13:45 -- w00dsman1 is now known as w00dsman
13:48 -- w00dsman has quit IRC (w00dsman)
15:40 -- mls has joined #archiveteam-bs
15:47 -- luxim has joined #archiveteam-bs
15:47 -- luxim has left
16:14 -- w00dsman has joined #archiveteam-bs
16:26 <godane> SketchCow: did you mail tapes to me this week?
16:58 -- w00dsman has quit IRC (w00dsman)
17:26 <godane> SketchCow: https://archive.org/details/disney-adventures-v3i11
17:27 -- Sk2d has joined #archiveteam-bs
17:29 -- Sk1d has quit IRC (Read error: Operation timed out)
17:29 -- Sk2d is now known as Sk1d
18:49 -- jschwart has joined #archiveteam-bs
19:12 <godane> SketchCow: https://archive.org/details/disney-adventures-v9i5
19:39 -- VADemon has joined #archiveteam-bs
20:39 -- VADemon_ has joined #archiveteam-bs
20:39 -- VADemon has quit IRC (Read error: Connection reset by peer)
20:59 -- wp494 has quit IRC (Ping timeout: 255 seconds)
21:00 -- wp494 has joined #archiveteam-bs
21:54 -- jschwart has quit IRC (Konversation terminated!)
22:10 -- Mateon1 has quit IRC (Read error: Operation timed out)
22:10 -- Mateon1 has joined #archiveteam-bs
22:44 -- phillipsj has quit IRC (Quit: Leaving)