Time |
Nickname |
Message |
00:08
🔗
|
xdax |
ok so i still can't find television advertising awards shortlists anywhere |
00:09
🔗
|
xdax |
despite having a hard cover book in front of me that i can cite winners from there's just no page containing all the winners |
00:16
🔗
|
|
godane has joined #archiveteam-bs |
00:29
🔗
|
|
DFJustin has quit IRC (Ping timeout: 745 seconds) |
00:49
🔗
|
|
SynMonger has quit IRC (Quit: Wait, what?) |
00:52
🔗
|
|
SynMonger has joined #archiveteam-bs |
01:07
🔗
|
markedL |
what's the award called? |
01:30
🔗
|
phillipsj |
I win! I got 27ms! |
01:34
🔗
|
|
icedice has joined #archiveteam-bs |
01:54
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
02:27
🔗
|
|
pew has quit IRC (Ping timeout: 252 seconds) |
02:33
🔗
|
|
DFJustin has joined #archiveteam-bs |
02:40
🔗
|
|
pew has joined #archiveteam-bs |
03:46
🔗
|
|
odemgi has joined #archiveteam-bs |
03:51
🔗
|
|
odemgi_ has quit IRC (Read error: Operation timed out) |
03:52
🔗
|
|
qw3rty has joined #archiveteam-bs |
03:56
🔗
|
|
odemg has quit IRC (Ping timeout: 745 seconds) |
03:59
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
04:00
🔗
|
|
odemg has joined #archiveteam-bs |
04:47
🔗
|
|
odemgi_ has joined #archiveteam-bs |
04:52
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
04:55
🔗
|
|
qw3rty2 has joined #archiveteam-bs |
04:58
🔗
|
|
odemg has quit IRC (Ping timeout: 745 seconds) |
05:02
🔗
|
|
odemg has joined #archiveteam-bs |
05:02
🔗
|
|
qw3rty has quit IRC (Ping timeout: 745 seconds) |
05:26
🔗
|
xdax |
markedL: there's a couple |
05:27
🔗
|
xdax |
https://en.wikipedia.org/wiki/Category:Advertising_awards |
05:27
🔗
|
xdax |
clios purged everything including their shortlists going back to 2009 and gave 130,000 ads in film reels to a university |
05:28
🔗
|
xdax |
https://news.iu.edu/stories/2017/12/iub/releases/14-clio-collection.html |
05:29
🔗
|
xdax |
there's not even records of what they are so we can't get other copies in case something happens |
05:29
🔗
|
xdax |
there -are- archive sites but they're meant for agencies and charge accordingly and there's no telling the amount or quality of content on them |
05:30
🔗
|
xdax |
contentid has been picking copies off youtube of anything with licensed music |
05:49
🔗
|
xdax |
cannes shortlists and content in low quality might be possible with weird non-english searches |
07:33
🔗
|
|
odemgi has joined #archiveteam-bs |
07:37
🔗
|
|
odemgi_ has quit IRC (Read error: Operation timed out) |
07:45
🔗
|
|
d5f4a3622 has quit IRC (Ping timeout: 612 seconds) |
07:50
🔗
|
jodizzle |
Could we maybe do a mips run for Royal Society PDFs? There are a couple different Royal Society related jobs going right now, including a targeted one for PDFs, but I doubt they're going to finish by the end of the free to access period. |
08:29
🔗
|
|
schbirid has joined #archiveteam-bs |
08:31
🔗
|
markedL |
tell us about the website layout and number of things you think we should get |
08:32
🔗
|
|
d5f4a3622 has joined #archiveteam-bs |
08:41
🔗
|
|
bluefoo has joined #archiveteam-bs |
09:12
🔗
|
jodizzle |
markedL: I think a lot of the site has already been grabbed through a couple different jobs. The main question is the article PDFs. |
09:13
🔗
|
jodizzle |
There's a job for those PDFs running right now, but it keeps getting hit with 403s if you crawl too quickly. So I was suggesting that mips might be a way around that. |
09:21
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
09:22
🔗
|
markedL |
who has the list of URLs for all the PDFs ? |
09:24
🔗
|
|
Hani111 has joined #archiveteam-bs |
09:25
🔗
|
jodizzle |
Here's the list: https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt |
09:26
🔗
|
jodizzle |
No guarantee that it is really all of them, of course. |
09:27
🔗
|
markedL |
every time I click a link in this channel, there's a second after where I think, wait is this going to fill my drive |
09:27
🔗
|
jodizzle |
Ha, no, it's only a few MiBs |
09:33
🔗
|
markedL |
yeah, it's a better fit for mips than warrior, unless Fusl passes |
09:34
🔗
|
|
Hani has quit IRC (Ping timeout: 745 seconds) |
09:34
🔗
|
|
Hani111 is now known as Hani |
09:34
🔗
|
|
mls_ has quit IRC (Remote host closed the connection) |
09:39
🔗
|
|
mls_ has joined #archiveteam-bs |
09:39
🔗
|
|
VADemon_ has joined #archiveteam-bs |
09:43
🔗
|
|
VADemon has quit IRC (Ping timeout: 258 seconds) |
09:45
🔗
|
|
VADemon_ has quit IRC (Quit: left4dead) |
10:37
🔗
|
|
manjaro-u has quit IRC (Quit: Konversation terminated!) |
10:37
🔗
|
|
Jamesatja has joined #archiveteam-bs |
10:37
🔗
|
markedL |
or MIA. I'll have something in an hour |
11:09
🔗
|
|
d5f4a3622 has quit IRC (Ping timeout: 246 seconds) |
11:38
🔗
|
|
Jamesatja has quit IRC (Read error: Connection reset by peer) |
12:27
🔗
|
markedL |
code's ready, setting up new drives |
13:22
🔗
|
Fusl_ |
if its just a few mbs, throw it at JAA and he'll queue on mips |
13:42
🔗
|
markedL |
JAA : can you queue this on mips today (saturday) : https://transfer.notkiska.pw/fmU4m/royalsocietypublishing_org-articles-pdf-sorted.txt |
14:20
🔗
|
Fusl |
its queued |
14:20
🔗
|
Fusl |
http://103.230.141.2:29000/ |
14:37
🔗
|
markedL |
cool, how much storage does mips have? |
14:47
🔗
|
Fusl |
Filesystem Size Used Avail Use% Mounted on |
14:47
🔗
|
Fusl |
/dev/sda2 2.0T 1.1T 901G 56% / |
15:12
🔗
|
JAA |
jodizzle: I haven't written it down anywhere yet, no. But basically, each list block (i.e. consecutive lines starting with '* ') gets transformed into something like the tables on the ArchiveBot/* pages. The syntax for the individual list entries is the same as for that old bot, e.g. '* URL | note = Something to add'. Other than that, you're completely free how you want to structure the page. |
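A minimal sketch of the list-block syntax JAA describes above, assuming entries are split on ' | ' and extra fields take the form 'key = value'. The function names `parse_entry` and `parse_blocks` are hypothetical and not taken from the actual bot; this only illustrates grouping consecutive '* ' lines into blocks.

```python
from typing import Dict, List


def parse_entry(line: str) -> Dict[str, str]:
    """Parse one '* URL | key = value' line into a dict (assumed format)."""
    body = line[2:].strip()  # drop the leading '* '
    parts = [p.strip() for p in body.split(" | ")]
    entry = {"url": parts[0]}
    for part in parts[1:]:
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = part.partition("=")
        entry[key.strip()] = value.strip()
    return entry


def parse_blocks(text: str) -> List[List[Dict[str, str]]]:
    """Group consecutive lines starting with '* ' into list blocks."""
    blocks: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("* "):
            current.append(parse_entry(line))
        elif current:
            # Any other line ends the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

Each returned block would correspond to one rendered table; everything outside the blocks is free-form page content.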
15:15
🔗
|
JAA |
jodizzle: Thanks for that, I wanted to look into Royal Society more but didn't have enough time. I can also throw it into qwarc if needed, assuming they don't have rate limits per IP. |
15:18
🔗
|
markedL |
the list is small enough it'll finish tonight, and mips has a few but rare 403's |
15:22
🔗
|
markedL |
I was looking at the wrong field; it'll finish tomorrow midday |
15:22
🔗
|
JAA |
Do we want the HTML pages as well? |
15:23
🔗
|
markedL |
jodizzle ^ |
15:23
🔗
|
markedL |
qwarc could be a fit to yourshot |
15:23
🔗
|
JAA |
No, it'll crash the server in a matter of seconds. |
15:28
🔗
|
markedL |
well, it's some non-obvious load profile, I plan on fixing it in any case |
15:36
🔗
|
markedL |
the highest transaction grab I have going on right now is actually that bitly alias that people said not to try |
15:44
🔗
|
markedL |
10,000 redirects/min using 25 connections |
15:51
🔗
|
JAA |
Huh, it doesn't have the normal bit.ly rate limits? |
15:57
🔗
|
JAA |
You are talking about on.natgeo.com, right? |
15:58
🔗
|
|
bluefoo has quit IRC (Ping timeout: 252 seconds) |
16:00
🔗
|
markedL |
yeah, but I'm running 10million known ID's. So either there's no limit or the limit is only triggered by 404's |
16:01
🔗
|
JAA |
Nope, it's just the request rate. |
16:01
🔗
|
JAA |
On bit.ly and most aliases, that is. |
16:16
🔗
|
|
wyatt8740 has quit IRC (Read error: Operation timed out) |
16:28
🔗
|
markedL |
has the rate limit for bitly been quantified? |
16:29
🔗
|
JAA |
On the order of one request per second. |
16:29
🔗
|
JAA |
If not less. |
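For a sense of scale, here is a rough back-of-the-envelope calculation using only the figures quoted above (10,000 redirects/min, 25 connections, 10 million IDs, and a limit on the order of one request per second). The variable names are illustrative, and the one-request-per-second figure is JAA's estimate, not a measured value.

```python
# Figures quoted in the conversation above.
redirects_per_min = 10_000
connections = 25
total_ids = 10_000_000
bitly_limit_per_sec = 1.0  # "on the order of one request per second"

# Observed aggregate and per-connection request rates.
observed_per_sec = redirects_per_min / 60
per_connection_per_sec = observed_per_sec / connections

# Time to cover all IDs at the observed rate versus at the usual limit.
hours_at_observed = total_ids * 60 / redirects_per_min / 3600
days_at_limit = total_ids / bitly_limit_per_sec / 86_400

print(f"observed: {observed_per_sec:.1f} req/s total, "
      f"{per_connection_per_sec:.2f} req/s per connection")
print(f"10M IDs: {hours_at_observed:.1f} h at observed rate, "
      f"{days_at_limit:.1f} days at 1 req/s")
```

The gap between roughly 17 hours and several months is why the apparent absence of the normal limit on this alias is notable.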
16:30
🔗
|
markedL |
Ok, hmm, I'll throw some 404's in, but after the 301's are done |
16:30
🔗
|
JAA |
Can you upload a sample of those 10M codes? |
16:30
🔗
|
markedL |
I don't imagine this is complete, do you already know what's missing: https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61 |
16:30
🔗
|
markedL |
sample is easy, sure. |
16:30
🔗
|
JAA |
Just 1k random codes or whatever. |
16:33
🔗
|
markedL |
https://transfer.notkiska.pw/eHhRZ/bitly-natgeo-sample.txt |
16:35
🔗
|
JAA |
Thanks! |
16:42
🔗
|
JAA |
Yeah, interesting, I don't seem to get rate limited on those. Also not when using bit.ly instead of on.natgeo.com. |
16:42
🔗
|
JAA |
I did use a Firefox UA though. |
16:42
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |
16:45
🔗
|
|
Stilett0 has joined #archiveteam-bs |
16:48
🔗
|
JAA |
Looks like on.natgeo.com is special anyway. It doesn't resolve normal bit.ly shortcodes, and all 404s just redirect to the Nat Geo homepage. |
16:49
🔗
|
JAA |
But bans from bit.ly do carry over to on.natgeo.com. |
16:50
🔗
|
|
Stilett0 is now known as Stiletto |
16:52
🔗
|
JAA |
I did manage to get banned on on.natgeo.com as well though after throwing random codes at it. |
17:47
🔗
|
|
tech234a has joined #archiveteam-bs |
17:58
🔗
|
markedL |
JAA, is this sufficient, I recall there's a record at the end but I don't know what's the minimum here https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/61/commits/ea8abb8afbc61b7a8ff5140f58425186b46579fc |
18:03
🔗
|
markedL |
if the answer is the warc spec really needs to be read, I can relay that instead |
18:11
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
18:13
🔗
|
|
RichardG has joined #archiveteam-bs |
18:18
🔗
|
|
d5f4a3622 has joined #archiveteam-bs |
18:19
🔗
|
JAA |
markedL: Some tools write the retrieval log at the end of the WARC. The format of that is obviously arbitrary. In general, the more info the better obviously, but I don't see anything terribly wrong with that code. |
18:19
🔗
|
JAA |
I would suggest moving the version number elsewhere though to ensure it doesn't get forgotten on changes. |
18:30
🔗
|
markedL |
Thanks, will do. Then for the records, it's hard to mess that up because warcio handles that |
18:34
🔗
|
JAA |
Yeah, with requests and warcio's capture_http, it should probably work correctly. I've never used or verified it though, in particular with chunked transfer encoding. |
18:35
🔗
|
markedL |
is chunked supposed to be unchunked/decoded, or preserved bitwise as is, or are both legal? |
18:39
🔗
|
|
killsushi has joined #archiveteam-bs |
18:44
🔗
|
JAA |
Preserved exactly as sent by the server. |
18:45
🔗
|
JAA |
The payload digest should in theory be of the decoded body, but I'm not aware of any tool actually following that part of the standard. A systematic investigation into that is still on my todo list though. Cf. https://github.com/webrecorder/warcio/issues/74 |
18:46
🔗
|
JAA |
(In other words, at this point, the standard should likely be changed to reflect that.) |
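To illustrate the distinction JAA draws, here is a minimal sketch that decodes a chunked body and computes a digest over both the bytes as sent and the decoded payload. It is not how warcio or any other tool does it; `decode_chunked` and `sha1_base32` are hypothetical helpers, chunk extensions and trailers are ignored, and the 'sha1:' plus Base32 label format is assumed because it is what WARC files commonly carry.

```python
import base64
import hashlib


def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (extensions and trailers ignored)."""
    out = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(raw[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            break  # the zero-size chunk terminates the body
        out += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def sha1_base32(data: bytes) -> str:
    """Return a 'sha1:<base32>' label for the given bytes."""
    digest = hashlib.sha1(data).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


# Body exactly as sent by a server using chunked transfer encoding.
raw_body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
decoded_body = decode_chunked(raw_body)

digest_as_sent = sha1_base32(raw_body)
digest_decoded = sha1_base32(decoded_body)
```

The record body would keep `raw_body` untouched; the two digests differ, which is exactly the ambiguity discussed in the linked warcio issue.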
18:47
🔗
|
|
omglolba- has joined #archiveteam-bs |
18:47
🔗
|
jodizzle |
JAA: Royal Society HTML pages should've been grabbed by an archivebot job already. It's unclear to me how often they contain the full article contents (and if they're any different for this free-to-access period), but at least it's something. |
18:48
🔗
|
|
omglolbah has quit IRC (Ping timeout: 258 seconds) |
18:48
🔗
|
jodizzle |
Interestingly it seemed like there was basically no rate limiting on the HTML versions. |
18:48
🔗
|
JAA |
jodizzle: Ah ok, good. When I looked at it yesterday, it seemed like they all contain the full article now. I didn't investigate in detail though. |
18:50
🔗
|
jodizzle |
Yeah, I had these jobs going for a couple days. Sorry, should've mentioned (and asked for mips) earlier. I was hoping that archivebot would be able to get the PDFs more naturally, but nope. |
18:52
🔗
|
markedL |
the mips job is getting a small number of 403, but is it a lot less than what you get on other systems? |
18:53
🔗
|
jodizzle |
Yeah, I mean you basically have to crawl real slowly or you get hit with a long ban (not sure how long). |
18:54
🔗
|
jodizzle |
I'm still playing with it on archivebot. |
18:56
🔗
|
jodizzle |
Another good thing is that the list of URLs is sorted to prioritize the subset of articles that seem to only be free-to-access for this period. |
18:56
🔗
|
jodizzle |
So that should help grab the most valuable contents first. |
19:02
🔗
|
JAA |
Sounds good. |
19:02
🔗
|
JAA |
Let me know if you want me to requeue the 403s on mips. |
19:12
🔗
|
jodizzle |
We definitely should, but we can probably wait on it for a little longer. |
19:47
🔗
|
|
manjaro-u has joined #archiveteam-bs |
19:59
🔗
|
|
wyatt8740 has joined #archiveteam-bs |
20:09
🔗
|
JAA |
jodizzle: Yeah, that makes sense. Perhaps the easiest will actually be to extract the 403s when the job is done and rerun them. Recursion isn't needed, so requeueing them while the job is running isn't necessary (and really shouldn't be done anyway since it messes with all kinds of things). I won't be around for that tomorrow until the late evening (UTC) though. I can look at it then unless Fusl wants to do it earlier. |
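A rough sketch of the "extract the 403s and rerun them" step. The real job log format is not shown in this conversation, so this assumes a hypothetical log with one 'STATUS URL' pair per line; `urls_needing_retry` is an illustrative name, not part of any existing tool.

```python
from typing import Dict, Iterable, List


def urls_needing_retry(log_lines: Iterable[str], status: int = 403) -> List[str]:
    """Return URLs whose most recent logged status equals `status`.

    Assumes each line looks like '403 https://example.org/x' (hypothetical
    format). Lines that do not match are skipped.
    """
    last_status: Dict[str, int] = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Later lines overwrite earlier ones, so a URL that eventually
        # succeeded is not requeued.
        last_status[fields[1]] = int(fields[0])
    return [url for url, code in last_status.items() if code == status]
```

The resulting list could then be submitted as a fresh job once the original one has finished, which avoids touching the queue of the running job.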
20:17
🔗
|
JAA |
markedL: So my bit.ly ban expired at some point (it did last quite long though), and I can confirm that I can't trigger a ban with existing redirects. I wonder if that is the case for standard bit.ly as well. Might test that at some point. |
20:18
🔗
|
markedL |
cool, sounds right |
20:47
🔗
|
|
nepeat has quit IRC (Read error: Operation timed out) |
20:55
🔗
|
|
nepeat has joined #archiveteam-bs |
20:57
🔗
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
21:02
🔗
|
jodizzle |
JAA: One problem is that it's unclear if the free access period ends on the 27th or runs through the 27th. |
21:02
🔗
|
jodizzle |
If it ends right at the beginning of the 27th, then we're almost out of time. Not much we can do about that, unfortunately. |
21:03
🔗
|
jodizzle |
Maybe we should crank the concurrency up to 3? |
21:08
🔗
|
jodizzle |
Hm, might not be necessary, actually. If the list is sorted correctly, then only the first 47,760 URLs are limited-time free-to-access. |
21:09
🔗
|
|
coderobe9 is now known as coderobe |
21:09
🔗
|
jodizzle |
Ideally we should still go through the whole list, though. |
21:28
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:15
🔗
|
|
BlueMax has joined #archiveteam-bs |