#archiveteam-bs 2018-06-21,Thu


***Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
Jens has quit IRC (Remote host closed the connection)
Jens has joined #archiveteam-bs
[00:10]
BlueMax has joined #archiveteam-bs [00:18]
................. (idle for 1h24mn)
godane: SketchCow: I'm starting to upload Simply K-Pop episodes I got from YouTube [01:42]
***Sk1d has quit IRC (Read error: Operation timed out)
Sk1d has joined #archiveteam-bs
[01:51]
ta9le has quit IRC (Quit: Connection closed for inactivity) [02:05]
........... (idle for 51mn)
adinbied has joined #archiveteam-bs [02:56]
ivan: you can read tasks from stdin with something like for line in sys.stdin:
you probably want to use something that can write WARCs
[03:01]
adinbied: @ivan, Ah - didn't know that. Is there any way to do multiple concurrent GET requests or something? [03:01]
ivan: just launch more processes working on different tasks
also there's a command `parallel` for easily launching subprocesses
by default it works like xargs
if you make it work that way, perhaps read the ids with sys.argv[1:]
[03:01]
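
A minimal sketch of the stdin approach mentioned above, assuming each input line is one task id (the fetch side is hypothetical):

    import sys

    for line in sys.stdin:
        task_id = line.strip()
        if not task_id:
            continue
        print("processing", task_id)  # replace with the real work for this id
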
adinbied: Given the API calls are just returning plain text data, I decided not to go with WARC so that it could be processed more easily (importing into a SQL DB or Google BigTables or something). Would there be any overhead/downsides in using WARC? [03:03]
ivan: WARC makes sense if you think there might be any value in having the responses available in wayback [03:04]
***wp494 has quit IRC (Ping timeout: 633 seconds)
wp494 has joined #archiveteam-bs
[03:04]
ivan: it also captures request and response headers that might be valuable to someone looking for evidence that the server actually served that content at that time
other than that, not much value I can think of
[03:04]
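
For reference, if WARC output were wanted, warcio's capture_http helper can record requests traffic; a minimal sketch (the filename and URL are placeholders):

    from warcio.capture_http import capture_http
    import requests  # per warcio's docs, requests must be imported after capture_http

    with capture_http('api-responses.warc.gz'):
        requests.get('https://example.com/api?id=1')
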
adinbied: Sorry, I'm still learning the ins and outs of Python - could you either link to some documentation or give an example of how you might go about using the subprocesses and sys.argv?
Yeah, fair point. My main concern is having my API key be part of the headers - and while there is little risk in this case, I would prefer not to have any keys out in public
[03:05]
ivan: OK, no WARC then
12 processes churning through a max of 100 ids per process:
for i in {1..1000}; do echo $i; done | parallel --will-cite --eta -d '\n' -j 12 -n 100 'python3 -c "import sys; print(sys.argv[1:])"'
you can of course generate lists of ids and run them on different machines entirely
for i in {1..1000}; do echo $i; done > ids
cat ids | parallel ...
parallel comes from the parallel package on Debian (not from moreutils)
[03:08]
***odemg has quit IRC (Ping timeout: 260 seconds) [03:16]
ivan: to be clear, in the Python program it would be something like: for id in sys.argv[1:]: to do something with each id [03:16]
adinbied: OK, I'm still not understanding how I would implement the parallel command with a scraping Python script.
Ah, OK - that makes sense
[03:17]
So how would I get each URL then? Because fullurl = baseurl + str(sys.argv[1:]) + afterurl returns a list instead of a single number in the for loop
NVM
I'm dumb
[03:23]
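
What the "NVM" above resolves: sys.argv[1:] is a list, so the URL has to be built from one id at a time inside the loop rather than from the whole list. A sketch, with baseurl and afterurl standing in for the pieces of adinbied's actual script:

    import sys
    import requests

    baseurl = 'https://example.com/api?id='  # placeholder
    afterurl = '&format=text'                # placeholder

    for id in sys.argv[1:]:
        fullurl = baseurl + id + afterurl    # each id is already a string
        response = requests.get(fullurl)
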
***odemg has joined #archiveteam-bs [03:28]
adinbied: OK, what am I doing wrong here? Python script: https://gist.github.com/adinbied/13635e5afd7a76cec29e467fb145ba72
and I'm calling it with for i in {1..100}; do echo $i; done | parallel --eta -d '\n' -j 12 -n 100 'python3 scrape.py'
Yet no requests are being made....
@ivan, sorry for pestering you with questions, just trying to learn from all of this
[03:33]
* ivan looks
adinbied: what does the first line of parallel --version say?
[03:48]
adinbied: GNU parallel 20141022 [03:50]
ivan: can you print(sys.argv[1:]) in your program and see what that says?
I guess you saw no errors from python?
[03:51]
adinbied: No, no errors with Python
Here's the output: https://pastebin.com/s7LVMkn2
[03:51]
ivan: assuming you put that print inside the body of the for loop, I think the parallel and the sys.argv[1:] are working fine [03:54]
adinbied: Yup, it's in the for loop. I tried print(fullurl) inside the for loop and it returned all of the URLs correctly. [03:56]
ivan: try some print debugging to see what requests is doing
for example printing non-200 status codes
[03:56]
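
A sketch of the print debugging suggested above, layered onto the same loop (baseurl and afterurl as in the earlier placeholder sketch):

    for id in sys.argv[1:]:
        r = requests.get(baseurl + id + afterurl)
        if r.status_code != 200:
            print('non-200 for id', id, '->', r.status_code)
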
adinbied: Ah, I'm an idiot. The set IDs don't start until 173... Derp. Thanks for your help! [04:00]
ivan: I figured it might be that :-) [04:01]
adinbied: Are there any optimizations I should make with the -j and -n parameters of parallel?
i.e. is 12 processes the max I should use? It's averaging about 4.5 sec per process to complete
[04:03]
***ReimuHaku has quit IRC (Ping timeout: 252 seconds)
ReimuHaku has joined #archiveteam-bs
[04:05]
ivan: adinbied: do some calculations and see how much parallelism you need to grab it within your time budget
then hope you don't get banned for sending that many requests per second
if you cause noticeable slowdown someone is going to look at a log
(sometimes)
I mean you can try a -j as high as you want; you might want to raise that -n 100 too to avoid pointless process churn
TIL https://archive.softwareheritage.org/
[04:07]
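
A back-of-the-envelope version of the calculation suggested above (the per-request timing is assumed, not measured):

    total_ids = 500          # ids to fetch
    secs_per_request = 1.0   # assumed average response time
    workers = 12             # parallel's -j value

    # rough wall-clock estimate, ignoring interpreter startup churn
    print(total_ids * secs_per_request / workers, 'seconds')  # ~41.7
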
adinbied: Huh, after some trial and error I found that it goes faster with a lower -n value
Seems the sweet spot is -j 12 -n 50. Thanks for your help!
[04:18]
ivan: that's bizarre, a lower -n would generally increase time because you have to wait for more pythons to start and load a bunch of bytecode
either a fluke or maybe the server is slowing down after many requests on a single connection
[04:21]
adinbied: When a higher -n value is used (at least for me), the Max Jobs to Run decreases. At -n 50 it's at 6 jobs, -n 100 is 4, -n 200 is 2. IDK, it works, and it's loads faster than before [04:24]
ivan: how many ids are you giving parallel? [04:27]
adinbied: 500 in this case [04:28]
ivan: well, 500 ids / 100 ids per process is going to be a maximum of 5 processes instead of the -j 12 maximum [04:28]
adinbied: Ahhh, that makes more sense. [04:29]
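
The chunk arithmetic behind ivan's point: parallel can only keep as many jobs running as there are chunks of ids, so effective concurrency is min(-j, ceil(ids / -n)). A quick check (the --eta "Max jobs to run" display may estimate slightly differently):

    import math

    ids, j = 500, 12
    for n in (50, 100, 200):
        chunks = math.ceil(ids / n)
        print(f'-n {n}: {chunks} chunks, effective jobs = {min(j, chunks)}')
    # -n 50: 10 chunks, effective jobs = 10
    # -n 100: 5 chunks, effective jobs = 5
    # -n 200: 3 chunks, effective jobs = 3
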
***TC01 has quit IRC (Read error: Operation timed out)
atomicthu has quit IRC (Ping timeout: 480 seconds)
c4rc4s has quit IRC (Ping timeout: 600 seconds)
adinbied has quit IRC (Quit: Leaving)
[04:29]
TC01 has joined #archiveteam-bs [04:45]
......................... (idle for 2h3mn)
atomicthu has joined #archiveteam-bs
davisonio has quit IRC (Ping timeout: 260 seconds)
[06:48]
c4rc4s has joined #archiveteam-bs [07:02]
davisonio has joined #archiveteam-bs [07:16]
................. (idle for 1h21mn)
BlueMax has quit IRC (Leaving) [08:37]
..... (idle for 24mn)
DFJustin has quit IRC (Ping timeout: 260 seconds) [09:01]
DFJustin has joined #archiveteam-bs [09:10]
......... (idle for 42mn)
ta9le has joined #archiveteam-bs [09:52]
