#archiveteam-bs 2019-10-26,Sat


Time Nickname Message
00:08 🔗 xdax ok so i still can't find television advertising awards shortlists anywhere
00:09 🔗 xdax despite having a hard cover book in front of me that i can cite winners from there's just no page containing all the winners
00:16 🔗 godane has joined #archiveteam-bs
00:29 🔗 DFJustin has quit IRC (Ping timeout: 745 seconds)
00:49 🔗 SynMonger has quit IRC (Quit: Wait, what?)
00:52 🔗 SynMonger has joined #archiveteam-bs
01:07 🔗 markedL what's the award called?
01:30 🔗 phillipsj I win! I got 27ms!
01:34 🔗 icedice has joined #archiveteam-bs
01:54 🔗 killsushi has quit IRC (Quit: Leaving)
02:27 🔗 pew has quit IRC (Ping timeout: 252 seconds)
02:33 🔗 DFJustin has joined #archiveteam-bs
02:40 🔗 pew has joined #archiveteam-bs
03:46 🔗 odemgi has joined #archiveteam-bs
03:51 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
03:52 🔗 qw3rty has joined #archiveteam-bs
03:56 🔗 odemg has quit IRC (Ping timeout: 745 seconds)
03:59 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:00 🔗 odemg has joined #archiveteam-bs
04:47 🔗 odemgi_ has joined #archiveteam-bs
04:52 🔗 odemgi has quit IRC (Read error: Operation timed out)
04:55 🔗 qw3rty2 has joined #archiveteam-bs
04:58 🔗 odemg has quit IRC (Ping timeout: 745 seconds)
05:02 🔗 odemg has joined #archiveteam-bs
05:02 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
05:26 🔗 xdax markedL: there's a couple
05:27 🔗 xdax https://en.wikipedia.org/wiki/Category:Advertising_awards
05:27 🔗 xdax clios purged everything including their shortlists going back to 2009 and gave 130,000 ads in film reels to a university
05:28 🔗 xdax https://news.iu.edu/stories/2017/12/iub/releases/14-clio-collection.html
05:29 🔗 xdax there's not even records of what they are so we can't get other copies in case something happens
05:29 🔗 xdax there -are- archive sites but they're meant for agencies and charge accordingly and there's no telling the amount or quality of content on them
05:30 🔗 xdax contentid has been picking copies off youtube of anything with licensed music
05:49 🔗 xdax cannes shortlists and content in low quality might be possible with weird non-english searches
07:33 🔗 odemgi has joined #archiveteam-bs
07:37 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
07:45 🔗 d5f4a3622 has quit IRC (Ping timeout: 612 seconds)
07:50 🔗 jodizzle Could we maybe do a mips run for Royal Society PDFs? There are a couple different Royal Society related jobs going right now, including a targeted one for PDFs, but I doubt they're going to finish by the end of the free to access period.
08:29 🔗 schbirid has joined #archiveteam-bs
08:31 🔗 markedL tell us about the website layout and number of things you think we should get
08:32 🔗 d5f4a3622 has joined #archiveteam-bs
08:41 🔗 bluefoo has joined #archiveteam-bs
09:12 🔗 jodizzle markedL: I think a lot of the site has already been grabbed through a couple different jobs. The main question is the article PDFs.
09:13 🔗 jodizzle There's a job for those PDFs running right now, but it keeps getting hit with 403s if you crawl too quickly. So I was suggesting that mips might be a way around that.
09:21 🔗 BlueMax has quit IRC (Quit: Leaving)
09:22 🔗 markedL who has the list of URLs for all the PDFs ?
09:24 🔗 Hani111 has joined #archiveteam-bs
09:25 🔗 jodizzle Here's the list: https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt
09:26 🔗 jodizzle No guarantee that it is really all of them, of course.
09:27 🔗 markedL every time I click a link in this channel, there's a second after where I think, wait is this going to fill my drive
09:27 🔗 jodizzle Ha, no, it's only a few MiBs
09:33 🔗 markedL yeah, it's a better fit for mips than warrior, unless Fusl passes
09:34 🔗 Hani has quit IRC (Ping timeout: 745 seconds)
09:34 🔗 Hani111 is now known as Hani
09:34 🔗 mls_ has quit IRC (Remote host closed the connection)
09:39 🔗 mls_ has joined #archiveteam-bs
09:39 🔗 VADemon_ has joined #archiveteam-bs
09:43 🔗 VADemon has quit IRC (Ping timeout: 258 seconds)
09:45 🔗 VADemon_ has quit IRC (Quit: left4dead)
10:37 🔗 manjaro-u has quit IRC (Quit: Konversation terminated!)
10:37 🔗 Jamesatja has joined #archiveteam-bs
10:37 🔗 markedL or MIA. I'll have something in an hour
11:09 🔗 d5f4a3622 has quit IRC (Ping timeout: 246 seconds)
11:38 🔗 Jamesatja has quit IRC (Read error: Connection reset by peer)
12:27 🔗 markedL code's ready, setting up new drives
13:22 🔗 Fusl_ if it's just a few MBs, throw it at JAA and he'll queue on mips
13:42 🔗 markedL JAA : can you queue this on mips today (saturday) : https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt
14:20 🔗 Fusl it's queued
14:20 🔗 Fusl http://103.230.141.2:29000/
14:37 🔗 markedL cool, how much storage does mips have?
14:47 🔗 Fusl Filesystem Size Used Avail Use% Mounted on
14:47 🔗 Fusl /dev/sda2 2.0T 1.1T 901G 56% /
15:12 🔗 JAA jodizzle: I haven't written it down anywhere yet, no. But basically, each list block (i.e. consecutive lines starting with '* ') gets transformed into something like the tables on the ArchiveBot/* pages. The syntax for the individual list entries is the same as for that old bot, e.g. '* URL | note = Something to add'. Other than that, you're completely free how you want to structure the page.
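
The list syntax JAA describes above can be sketched roughly as follows. This is not the bot's actual code, which is not shown in the log; `parse_list_blocks` and the example URLs are hypothetical, and the sketch assumes that any line not starting with '* ' ends a block and that a URL never contains a '|' character.

```python
def parse_list_blocks(text):
    """Group consecutive '* ' lines into blocks of {url, key: value} entries."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.startswith("* "):
            # Entry syntax: '* URL | note = Something to add' (the note is optional).
            url, sep, rest = line[2:].partition("|")
            entry = {"url": url.strip()}
            if sep:
                key, eq, value = rest.partition("=")
                if eq:
                    entry[key.strip()] = value.strip()
            current.append(entry)
        else:
            # Any other line closes the current list block, if one is open.
            if current:
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks


page = (
    "Some free-form text\n"
    "* https://a.example/ | note = Something to add\n"
    "* https://b.example/\n"
    "\n"
    "* https://c.example/\n"
)
blocks = parse_list_blocks(page)
```

Each inner list would then correspond to one rendered table, with free-form text between the blocks left alone.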
15:15 🔗 JAA jodizzle: Thanks for that, I wanted to look into Royal Society more but didn't have enough time. I can also throw it into qwarc if needed, assuming they don't have rate limits per IP.
15:18 🔗 markedL the list is small enough it'll finish tonight, and mips has a few but rare 403's
15:22 🔗 markedL looking at the wrong field, finish tomorrow mid day
15:22 🔗 JAA Do we want the HTML pages as well?
15:23 🔗 markedL jodizzle ^
15:23 🔗 markedL qwarc could be a fit to yourshot
15:23 🔗 JAA No, it'll crash the server in a matter of seconds.
15:28 🔗 markedL well, it's some non-obvious load profile, I plan on fixing it in any case
15:36 🔗 markedL the highest transaction grab I have going on right now is actually that bitly alias that people said not to try
15:44 🔗 markedL 10,000 redirects/min using 25 connections
15:51 🔗 JAA Huh, it doesn't have the normal bit.ly rate limits?
15:57 🔗 JAA You are talking about on.natgeo.com, right?
15:58 🔗 bluefoo has quit IRC (Ping timeout: 252 seconds)
16:00 🔗 markedL yeah, but I'm running 10million known ID's. So either there's no limit or the limit is only triggered by 404's
16:01 🔗 JAA Nope, it's just the request rate.
16:01 🔗 JAA On bit.ly and most aliases, that is.
16:16 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
16:28 🔗 markedL has the rate limit for bitly been quantified?
16:29 🔗 JAA On the order of one request per second.
16:29 🔗 JAA If not less.
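
To put the figures above side by side, here is a small arithmetic sketch using only the numbers quoted in the log (10 million IDs, 10,000 redirects/min, 25 connections, and roughly one request per second for the usual bit.ly limit). The function name `grab_estimates` is made up for illustration. At the observed rate the run takes about 16.7 hours; at one request per second it would take about 116 days.

```python
def grab_estimates(total_ids, per_minute, connections, slow_rate_per_second=1.0):
    """Return rough duration figures for a fixed-size list of requests."""
    minutes = total_ids / per_minute
    return {
        # Total wall-clock time at the observed aggregate rate.
        "hours_at_observed_rate": minutes / 60,
        # How hard each individual connection is being driven.
        "requests_per_second_per_connection": per_minute / connections / 60,
        # Time needed if the whole list were limited to the slow rate.
        "days_at_slow_rate": total_ids / slow_rate_per_second / 86_400,
    }


estimates = grab_estimates(10_000_000, 10_000, 25)
```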
16:30 🔗 markedL Ok, hmm, I'll throw some 404's in, but after the 301's are done
16:30 🔗 JAA Can you upload a sample of those 10M codes?
16:30 🔗 markedL I don't imagine this is complete; do you already know what's missing? https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61
16:30 🔗 markedL sample is easy, sure.
16:30 🔗 JAA Just 1k random codes or whatever.
16:33 🔗 markedL https://transfer.notkiska.pw/eHhRZ/bitly-natgeo-sample.txt
16:35 🔗 JAA Thanks!
16:42 🔗 JAA Yeah, interesting, I don't seem to get rate limited on those. Also not when using bit.ly instead of on.natgeo.com.
16:42 🔗 JAA I did use a Firefox UA though.
16:42 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
16:45 🔗 Stilett0 has joined #archiveteam-bs
16:48 🔗 JAA Looks like on.natgeo.com is special anyway. It doesn't resolve normal bit.ly shortcodes, and all 404s just redirect to the Nat Geo homepage.
16:49 🔗 JAA But bans from bit.ly do carry over to on.natgeo.com.
16:50 🔗 Stilett0 is now known as Stiletto
16:52 🔗 JAA I did manage to get banned on on.natgeo.com as well though after throwing random codes at it.
17:47 🔗 tech234a has joined #archiveteam-bs
17:58 🔗 markedL JAA, is this sufficient, I recall there's a record at the end but I don't know what's the minimum here https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61/commits/ea8abb8afbc61b7a8ff5140f58425186b46579fc
18:03 🔗 markedL if the answer is the warc spec really needs to be read, I can relay that instead
18:11 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
18:13 🔗 RichardG has joined #archiveteam-bs
18:18 🔗 d5f4a3622 has joined #archiveteam-bs
18:19 🔗 JAA markedL: Some tools write the retrieval log at the end of the WARC. The format of that is obviously arbitrary. In general, the more info the better obviously, but I don't see anything terribly wrong with that code.
18:19 🔗 JAA I would suggest moving the version number elsewhere though to ensure it doesn't get forgotten on changes.
18:30 🔗 markedL Thanks, will do. Then for the records, it's hard to mess that up because warcio handles that
18:34 🔗 JAA Yeah, with requests and warcio's capture_http, it should probably work correctly. I've never used or verified it though, in particular with chunked transfer encoding.
18:35 🔗 markedL is chunked supposed to be unchunked/decoded, or preserved bitwise as is, or are both legal?
18:39 🔗 killsushi has joined #archiveteam-bs
18:44 🔗 JAA Preserved exactly as sent by the server.
18:45 🔗 JAA The payload digest should in theory be of the decoded body, but I'm not aware of any tool actually following that part of the standard. A systematic investigation into that is still on my todo list though. Cf. https://github.com/webrecorder/warcio/issues/74
18:46 🔗 JAA (In other words, at this point, the standard should likely be changed to reflect that.)
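
The distinction JAA draws can be illustrated with a standard-library sketch: the record keeps the chunked bytes exactly as sent, while a digest over the decoded body gives a different value from a digest over those raw bytes. This is not how warcio or any other tool computes its digests; `decode_chunked` and `warc_sha1` are hypothetical helpers, and the decoder assumes well-formed input with no trailers.

```python
import base64
import hashlib


def decode_chunked(raw: bytes) -> bytes:
    """Minimal chunked transfer decoding (chunk extensions ignored, no trailers)."""
    body = b""
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(raw[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            break
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return body


def warc_sha1(data: bytes) -> str:
    """Format a SHA-1 digest in the base32 'sha1:...' style seen in WARC headers."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"   # bytes as sent by the server
decoded = decode_chunked(raw)                     # b"Wikipedia"
digest_of_raw = warc_sha1(raw)
digest_of_decoded = warc_sha1(decoded)
```

Comparing `digest_of_raw` with `digest_of_decoded` for the same response is one way to check which of the two a given tool wrote as the payload digest.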
18:47 🔗 omglolba- has joined #archiveteam-bs
18:47 🔗 jodizzle JAA: Royal Society HTML pages should've been grabbed by an archivebot job already. It's unclear to me how often they contain the full article contents (and if they're any different for this free-to-access period), but at least it's something.
18:48 🔗 omglolbah has quit IRC (Ping timeout: 258 seconds)
18:48 🔗 jodizzle Interestingly it seemed like there was basically no rate limiting on the HTML versions.
18:48 🔗 JAA jodizzle: Ah ok, good. When I looked at it yesterday, it seemed like they all contain the full article now. I didn't investigate in detail though.
18:50 🔗 jodizzle Yeah, I had these jobs going for a couple days. Sorry, should've mentioned (and asked for mips) earlier. I was hoping that archivebot would be able to get the PDFs more naturally, but nope.
18:52 🔗 markedL the mips job is getting a small number of 403, but is it a lot less than what you get on other systems?
18:53 🔗 jodizzle Yeah, I mean you basically have to crawl real slowly or you get hit with a long ban (not sure how long).
18:54 🔗 jodizzle I'm still playing with it on archivebot.
18:56 🔗 jodizzle Another good thing is that the list of URLs is sorted to prioritize the subset of articles that seem to only be free-to-access for this period.
18:56 🔗 jodizzle So that should help grab the most valuable contents first.
19:02 🔗 JAA Sounds good.
19:02 🔗 JAA Let me know if you want me to requeue the 403s on mips.
19:12 🔗 jodizzle We definitely should, but we can probably wait on it for a little longer.
19:47 🔗 manjaro-u has joined #archiveteam-bs
19:59 🔗 wyatt8740 has joined #archiveteam-bs
20:09 🔗 JAA jodizzle: Yeah, that makes sense. Perhaps the easiest will actually be to extract the 403s when the job is done and rerun them. Recursion isn't needed, so requeueing them while the job is running isn't necessary (and really shouldn't be done anyway since it messes with all kinds of things). I won't be around for that tomorrow until the late evening (UTC) though. I can look at it then unless Fusl wants to do it earlier.
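
Extracting the 403s for a rerun might look like the sketch below. The log format is an assumption: the log does not show what the mips job writes, so this assumes one "STATUS URL" pair per line. `urls_with_status` and the example.org URLs are hypothetical.

```python
def urls_with_status(log_lines, status=403):
    """Collect unique URLs whose logged status matches, keeping first-seen order."""
    wanted = str(status)
    seen = set()
    result = []
    for line in log_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        code, url = parts[0], parts[1].strip()
        if code == wanted and url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample_log = [
    "200 https://example.org/doi/pdf/1",
    "403 https://example.org/doi/pdf/2",
    "403 https://example.org/doi/pdf/2",
    "403 https://example.org/doi/pdf/3",
]
retry_list = urls_with_status(sample_log)
```

The resulting list can be written out one URL per line and queued as a fresh job once the first one has finished.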
20:17 🔗 JAA markedL: So my bit.ly ban expired at some point (it did last quite long though), and I can confirm that I can't trigger a ban with existing redirects. I wonder if that is the case for standard bit.ly as well. Might test that at some point.
20:18 🔗 markedL cool, sounds right
20:47 🔗 nepeat has quit IRC (Read error: Operation timed out)
20:55 🔗 nepeat has joined #archiveteam-bs
20:57 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
21:02 🔗 jodizzle JAA: One problem is that it's unclear if the free access period ends on the 27th or runs through the 27th.
21:02 🔗 jodizzle If it ends right at the beginning of the 27th, then we're almost out of time. Not much we can do about that, unfortunately.
21:03 🔗 jodizzle Maybe we should crank the concurrency up to 3?
21:08 🔗 jodizzle Hm, might not be necessary, actually. If the list is sorted correctly, then only the first 47,760 URLs are limited-time free-to-access.
21:09 🔗 coderobe9 is now known as coderobe
21:09 🔗 jodizzle Ideally we should still go through the whole list, though.
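
Since the list is sorted with the limited-time articles first, the time-critical part is just its head. A minimal sketch, assuming the sort order is correct and the file has one URL per line; `split_priority` is a hypothetical helper and 47,760 is the count jodizzle gives above.

```python
def split_priority(urls, priority_count=47_760):
    """Split a pre-sorted URL list into (time-critical head, remaining tail)."""
    return urls[:priority_count], urls[priority_count:]


# Tiny stand-in list; the real input would be the sorted URL file's lines.
example_urls = ["u1", "u2", "u3", "u4", "u5"]
urgent, later = split_priority(example_urls, priority_count=2)
```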
21:28 🔗 schbirid has quit IRC (Quit: Leaving)
22:15 🔗 BlueMax has joined #archiveteam-bs
