#archiveteam-bs 2019-10-26,Sat

Time	Nickname	Message
00:08 ^🔗	xdax	ok so i still can't find television advertising awards shortlists anywhere
00:09 ^🔗	xdax	despite having a hard cover book in front of me that i can cite winners from there's just no page containing all the winners
00:16 ^🔗		godane has joined #archiveteam-bs
00:29 ^🔗		DFJustin has quit IRC (Ping timeout: 745 seconds)
00:49 ^🔗		SynMonger has quit IRC (Quit: Wait, what?)
00:52 ^🔗		SynMonger has joined #archiveteam-bs
01:07 ^🔗	markedL	what's the award called?
01:30 ^🔗	phillipsj	I win! I got 27ms!
01:34 ^🔗		icedice has joined #archiveteam-bs
01:54 ^🔗		killsushi has quit IRC (Quit: Leaving)
02:27 ^🔗		pew has quit IRC (Ping timeout: 252 seconds)
02:33 ^🔗		DFJustin has joined #archiveteam-bs
02:40 ^🔗		pew has joined #archiveteam-bs
03:46 ^🔗		odemgi has joined #archiveteam-bs
03:51 ^🔗		odemgi_ has quit IRC (Read error: Operation timed out)
03:52 ^🔗		qw3rty has joined #archiveteam-bs
03:56 ^🔗		odemg has quit IRC (Ping timeout: 745 seconds)
03:59 ^🔗		qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:00 ^🔗		odemg has joined #archiveteam-bs
04:47 ^🔗		odemgi_ has joined #archiveteam-bs
04:52 ^🔗		odemgi has quit IRC (Read error: Operation timed out)
04:55 ^🔗		qw3rty2 has joined #archiveteam-bs
04:58 ^🔗		odemg has quit IRC (Ping timeout: 745 seconds)
05:02 ^🔗		odemg has joined #archiveteam-bs
05:02 ^🔗		qw3rty has quit IRC (Ping timeout: 745 seconds)
05:26 ^🔗	xdax	markedL: there's a couple
05:27 ^🔗	xdax	https://en.wikipedia.org/wiki/Category:Advertising_awards
05:27 ^🔗	xdax	clios purged everything including their shortlists going back to 2009 and gave 130,000 ads in film reels to a university
05:28 ^🔗	xdax	https://news.iu.edu/stories/2017/12/iub/releases/14-clio-collection.html
05:29 ^🔗	xdax	there's not even records of what they are so we can't get other copies in case something happens
05:29 ^🔗	xdax	there -are- archive sites but they're meant for agencies and charge accordingly and there's no telling the amount or quality of content on them
05:30 ^🔗	xdax	contentid has been picking copies off youtube of anything with licensed music
05:49 ^🔗	xdax	cannes shortlists and content in low quality might be possible with weird non-english searches
07:33 ^🔗		odemgi has joined #archiveteam-bs
07:37 ^🔗		odemgi_ has quit IRC (Read error: Operation timed out)
07:45 ^🔗		d5f4a3622 has quit IRC (Ping timeout: 612 seconds)
07:50 ^🔗	jodizzle	Could we maybe do a mips run for Royal Society PDFs? There are a couple different Royal Society related jobs going right now, including a targeted one for PDFs, but I doubt they're going to finish by the end of the free to access period.
08:29 ^🔗		schbirid has joined #archiveteam-bs
08:31 ^🔗	markedL	tell us about the website layout and number of things you think we should get
08:32 ^🔗		d5f4a3622 has joined #archiveteam-bs
08:41 ^🔗		bluefoo has joined #archiveteam-bs
09:12 ^🔗	jodizzle	markedL: I think a lot of the site has already been grabbed through a couple different jobs. The main question is the article PDFs.
09:13 ^🔗	jodizzle	There's a job for those PDFs running right now, but it keeps getting hit with 403s if you crawl too quickly. So I was suggesting that mips might be a way around that.
09:21 ^🔗		BlueMax has quit IRC (Quit: Leaving)
09:22 ^🔗	markedL	who has the list of URLs for all the PDFs ?
09:24 ^🔗		Hani111 has joined #archiveteam-bs
09:25 ^🔗	jodizzle	Here's the list: https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt
09:26 ^🔗	jodizzle	No guarantee that it is really all of them, of course.
09:27 ^🔗	markedL	every time I click a link in this channel, there's a second after where I think, wait is this going to fill my drive
09:27 ^🔗	jodizzle	Ha, no, it's only a few MiBs
09:33 ^🔗	markedL	yeah, it's a better fit for mips than warrior, unless Fusl passes
09:34 ^🔗		Hani has quit IRC (Ping timeout: 745 seconds)
09:34 ^🔗		Hani111 is now known as Hani
09:34 ^🔗		mls_ has quit IRC (Remote host closed the connection)
09:39 ^🔗		mls_ has joined #archiveteam-bs
09:39 ^🔗		VADemon_ has joined #archiveteam-bs
09:43 ^🔗		VADemon has quit IRC (Ping timeout: 258 seconds)
09:45 ^🔗		VADemon_ has quit IRC (Quit: left4dead)
10:37 ^🔗		manjaro-u has quit IRC (Quit: Konversation terminated!)
10:37 ^🔗		Jamesatja has joined #archiveteam-bs
10:37 ^🔗	markedL	or MIA. I'll have something in an hour
11:09 ^🔗		d5f4a3622 has quit IRC (Ping timeout: 246 seconds)
11:38 ^🔗		Jamesatja has quit IRC (Read error: Connection reset by peer)
12:27 ^🔗	markedL	code's ready, setting up new drives
13:22 ^🔗	Fusl_	if its just a few mbs, throw it at JAA and he'll queue on mips
13:42 ^🔗	markedL	JAA : can you queue this on mips today (saturday) : https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt
14:20 ^🔗	Fusl	its queued
14:20 ^🔗	Fusl	http://103.230.141.2:29000/
14:37 ^🔗	markedL	cool, how much storage does mips have?
14:47 ^🔗	Fusl	Filesystem Size Used Avail Use% Mounted on
14:47 ^🔗	Fusl	/dev/sda2 2.0T 1.1T 901G 56% /
15:12 ^🔗	JAA	jodizzle: I haven't written it down anywhere yet, no. But basically, each list block (i.e. consecutive lines starting with '* ') gets transformed into something like the tables on the ArchiveBot/* pages. The syntax for the individual list entries is the same as for that old bot, e.g. '* URL | note = Something to add'. Other than that, you're completely free how you want to structure the page.
15:15 ^🔗	JAA	jodizzle: Thanks for that, I wanted to look into Royal Society more but didn't have enough time. I can also throw it into qwarc if needed, assuming they don't have rate limits per IP.
15:18 ^🔗	markedL	the list is small enough it'll finish tonight, and mips has a few but rare 403's
15:22 ^🔗	markedL	looking at the wrong field, finish tomorrow mid day
15:22 ^🔗	JAA	Do we want the HTML pages as well?
15:23 ^🔗	markedL	jodizzle ^
15:23 ^🔗	markedL	qwarc could be a fit to yourshot
15:23 ^🔗	JAA	No, it'll crash the server in a matter of seconds.
15:28 ^🔗	markedL	well, it's some non-obvious load profile, I plan on fixing it in any case
15:36 ^🔗	markedL	the highest transaction grab I have going on right now is actually that bitly alias that people said not to try
15:44 ^🔗	markedL	10,000 redirects/min using 25 connections
15:51 ^🔗	JAA	Huh, it doesn't have the normal bit.ly rate limits?
15:57 ^🔗	JAA	You are talking about on.natgeo.com, right?
15:58 ^🔗		bluefoo has quit IRC (Ping timeout: 252 seconds)
16:00 ^🔗	markedL	yeah, but I'm running 10million known ID's. So either there's no limit or the limit is only triggered by 404's
16:01 ^🔗	JAA	Nope, it's just the request rate.
16:01 ^🔗	JAA	On bit.ly and most aliases, that is.
16:16 ^🔗		wyatt8740 has quit IRC (Read error: Operation timed out)
16:28 ^🔗	markedL	has the rate limit for bitly been quantified?
16:29 ^🔗	JAA	On the order of one request per second.
16:29 ^🔗	JAA	If not less.
16:30 ^🔗	markedL	Ok, hmm, I'll throw some 404's in, but after the 301's are done
16:30 ^🔗	JAA	Can you upload a sample of those 10M codes?
16:30 ^🔗	markedL	I don't imagine this is complete, do you already know what's missing: https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61
16:30 ^🔗	markedL	sample is easy, sure.
16:30 ^🔗	JAA	Just 1k random codes or whatever.
16:33 ^🔗	markedL	https://transfer.notkiska.pw/eHhRZ/bitly-natgeo-sample.txt
16:35 ^🔗	JAA	Thanks!
16:42 ^🔗	JAA	Yeah, interesting, I don't seem to get rate limited on those. Also not when using bit.ly instead of on.natgeo.com.
16:42 ^🔗	JAA	I did use a Firefox UA though.
16:42 ^🔗		Stiletto has quit IRC (Ping timeout: 246 seconds)
16:45 ^🔗		Stilett0 has joined #archiveteam-bs
16:48 ^🔗	JAA	Looks like on.natgeo.com is special anyway. It doesn't resolve normal bit.ly shortcodes, and all 404s just redirect to the Nat Geo homepage.
16:49 ^🔗	JAA	But bans from bit.ly do carry over to on.natgeo.com.
16:50 ^🔗		Stilett0 is now known as Stiletto
16:52 ^🔗	JAA	I did manage to get banned on on.natgeo.com as well though after throwing random codes at it.
17:47 ^🔗		tech234a has joined #archiveteam-bs
17:58 ^🔗	markedL	JAA, is this sufficient, I recall there's a record at the end but I don't know what's the minimum here https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61/commits/ea8abb8afbc61b7a8ff5140f58425186b46579fc
18:03 ^🔗	markedL	if the answer is the warc spec really needs to be read, I can relay that instead
18:11 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
18:13 ^🔗		RichardG has joined #archiveteam-bs
18:18 ^🔗		d5f4a3622 has joined #archiveteam-bs
18:19 ^🔗	JAA	markedL: Some tools write the retrieval log at the end of the WARC. The format of that is obviously arbitrary. In general, the more info the better obviously, but I don't see anything terribly wrong with that code.
18:19 ^🔗	JAA	I would suggest moving the version number elsewhere though to ensure it doesn't get forgotten on changes.
18:30 ^🔗	markedL	Thanks, will do. Then for the records, it's hard to mess that up because warcio handles that
18:34 ^🔗	JAA	Yeah, with requests and warcio's capture_http, it should probably work correctly. I've never used or verified it though, in particular with chunked transfer encoding.
18:35 ^🔗	markedL	is chunked supposed to be unchunked/decoded, or preserved bitwise as is, or are both legal?
18:39 ^🔗		killsushi has joined #archiveteam-bs
18:44 ^🔗	JAA	Preserved exactly as sent by the server.
18:45 ^🔗	JAA	The payload digest should in theory be of the decoded body, but I'm not aware of any tool actually following that part of the standard. A systematic investigation into that is still on my todo list though. Cf. https://github.com/webrecorder/warcio/issues/74
18:46 ^🔗	JAA	(In other words, at this point, the standard should likely be changed to reflect that.)
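
The distinction JAA draws above (the record keeps the chunked bytes exactly as sent, while the payload digest would in theory cover the decoded body) can be illustrated with a minimal sketch. It does not use warcio and makes no claim about what warcio or any other tool does; `decode_chunked` and `sha1_label` are hypothetical helper names, the sample bytes are made up, and hex SHA-1 is used purely for illustration.

```python
import hashlib


def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; any trailer fields are ignored."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(raw[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # the zero-size chunk terminates the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


def sha1_label(data: bytes) -> str:
    """Hypothetical helper: label a hex SHA-1 digest of the given bytes."""
    return "sha1:" + hashlib.sha1(data).hexdigest()


# Made-up example body, as it would appear on the wire.
wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
decoded = decode_chunked(wire)

print("digest of bytes as sent:   ", sha1_label(wire))
print("digest of decoded payload: ", sha1_label(decoded))
```

The two digests differ, which is why it matters which of the two byte sequences a tool hashes.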
18:47 ^🔗		omglolba- has joined #archiveteam-bs
18:47 ^🔗	jodizzle	JAA: Royal Society HTML pages should've been grabbed by an archivebot job already. It's unclear to me how often they contain the full article contents (and if they're any different for this free-to-access period), but at least it's something.
18:48 ^🔗		omglolbah has quit IRC (Ping timeout: 258 seconds)
18:48 ^🔗	jodizzle	Interestingly it seemed like there was basically no rate limiting on the HTML versions.
18:48 ^🔗	JAA	jodizzle: Ah ok, good. When I looked at it yesterday, it seemed like they all contain the full article now. I didn't investigate in detail though.
18:50 ^🔗	jodizzle	Yeah, I had these jobs going for a couple days. Sorry, should've mentioned (and asked for mips) earlier. I was hoping that archivebot would be able to get the PDFs more naturally, but nope.
18:52 ^🔗	markedL	the mips job is getting a small number of 403, but is it a lot less than what you get on other systems?
18:53 ^🔗	jodizzle	Yeah, I mean you basically have to crawl real slowly or you get hit with a long ban (not sure how long).
18:54 ^🔗	jodizzle	I'm still playing with it on archivebot.
18:56 ^🔗	jodizzle	Another good thing is that the list of URLs is sorted to prioritize the subset of articles that seem to only be free-to-access for this period.
18:56 ^🔗	jodizzle	So that should help grab the most valuable contents first.
19:02 ^🔗	JAA	Sounds good.
19:02 ^🔗	JAA	Let me know if you want me to requeue the 403s on mips.
19:12 ^🔗	jodizzle	We definitely should, but we can probably wait on it for a little longer.
19:47 ^🔗		manjaro-u has joined #archiveteam-bs
19:59 ^🔗		wyatt8740 has joined #archiveteam-bs
20:09 ^🔗	JAA	jodizzle: Yeah, that makes sense. Perhaps the easiest will actually be to extract the 403s when the job is done and rerun them. Recursion isn't needed, so requeueing them while the job is running isn't necessary (and really shouldn't be done anyway since it messes with all kinds of things). I won't be around for that tomorrow until the late evening (UTC) though. I can look at it then unless Fusl wants
20:09 ^🔗	JAA	to do it earlier.
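
The "extract the 403s when the job is done and rerun them" step could look roughly like the sketch below. The real mips/ArchiveBot log format is not shown in this log, so the input here is an assumed list of `(url, status)` pairs in fetch order; `urls_to_retry` and the example URLs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def urls_to_retry(results: Iterable[Tuple[str, int]], retry_status: int = 403) -> List[str]:
    """Return URLs whose final recorded status equals retry_status.

    Assumes `results` is in fetch order, so a later entry for the same URL
    overrides an earlier one (e.g. a 403 followed by a successful retry).
    """
    final_status: Dict[str, int] = {}
    for url, status in results:
        final_status[url] = status  # dicts keep first-insertion order
    return [url for url, status in final_status.items() if status == retry_status]


# Made-up sample data standing in for a finished job's results.
sample = [
    ("https://example.org/a.pdf", 200),
    ("https://example.org/b.pdf", 403),
    ("https://example.org/c.pdf", 403),
    ("https://example.org/c.pdf", 200),  # c.pdf succeeded on a later attempt
]

for url in urls_to_retry(sample):
    print(url)
```

Running this after the job has finished, rather than while it is still going, matches the point above about not requeueing during the run.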
20:17 ^🔗	JAA	markedL: So my bit.ly ban expired at some point (it did last quite long though), and I can confirm that I can't trigger a ban with existing redirects. I wonder if that is the case for standard bit.ly as well. Might test that at some point.
20:18 ^🔗	markedL	cool, sounds right
20:47 ^🔗		nepeat has quit IRC (Read error: Operation timed out)
20:55 ^🔗		nepeat has joined #archiveteam-bs
20:57 ^🔗		tech234a has quit IRC (Quit: Connection closed for inactivity)
21:02 ^🔗	jodizzle	JAA: One problem is that it's unclear if the free access period ends on the 27th or runs through the 27th.
21:02 ^🔗	jodizzle	If it ends right at the beginning of the 27th, then we're almost out of time. Not much we can do about that, unfortunately.
21:03 ^🔗	jodizzle	Maybe we should crank the concurrency up to 3?
21:08 ^🔗	jodizzle	Hm, might not be necessary, actually. If the list is sorted correctly, then only the first 47,760 URLs are limited-time free-to-access.
21:09 ^🔗		coderobe9 is now known as coderobe
21:09 ^🔗	jodizzle	Ideally we should still go through the whole list, though.
21:28 ^🔗		schbirid has quit IRC (Quit: Leaving)
22:15 ^🔗		BlueMax has joined #archiveteam-bs