#archiveteam-bs 2018-05-29,Tue


Time Nickname Message
00:40 🔗 ta9le has quit IRC (Quit: Connection closed for inactivity)
00:46 🔗 ndiddylap has joined #archiveteam-bs
02:07 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
02:07 🔗 Mateon1 has joined #archiveteam-bs
02:14 🔗 ndiddy_ has joined #archiveteam-bs
02:18 🔗 ndiddylap has quit IRC (Read error: Operation timed out)
02:19 🔗 ndiddylap has joined #archiveteam-bs
02:21 🔗 apache2 has quit IRC (Remote host closed the connection)
02:21 🔗 apache2 has joined #archiveteam-bs
02:22 🔗 ndiddy_ has quit IRC (Read error: Operation timed out)
02:24 🔗 Tenebrae has quit IRC (Ping timeout: 260 seconds)
02:25 🔗 Tenebrae has joined #archiveteam-bs
02:25 🔗 plue has quit IRC (Ping timeout: 260 seconds)
02:26 🔗 plue has joined #archiveteam-bs
02:50 🔗 SketchCow OK
02:50 🔗 SketchCow I tend to upload when they stop growing for a while, if that matters
02:50 🔗 SketchCow But I'm fine
03:17 🔗 qw3rty117 has joined #archiveteam-bs
03:23 🔗 qw3rty116 has quit IRC (Read error: Operation timed out)
03:41 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:53 🔗 odemg has joined #archiveteam-bs
03:59 🔗 sep332 has quit IRC (Read error: Operation timed out)
04:32 🔗 ndiddylap has quit IRC (Read error: Operation timed out)
05:06 🔗 Pixi has quit IRC (Quit: Pixi)
05:06 🔗 Pixi has joined #archiveteam-bs
06:20 🔗 schbirid has joined #archiveteam-bs
07:15 🔗 schbirid has quit IRC (Quit: Leaving)
08:21 🔗 SmileyG_ has joined #archiveteam-bs
08:24 🔗 SmileyG has quit IRC (Ping timeout: 260 seconds)
09:44 🔗 godane SketchCow: ok
09:44 🔗 godane any word from Mank?
10:23 🔗 ta9le has joined #archiveteam-bs
11:32 🔗 lindalap JAA: It wasn't long ago that Jagex sold their business to the Chinese, or something. It's been downhill from there. They also increased the cost of their subscription recently.
11:33 🔗 JAA I see.
11:47 🔗 lindalap http://www.runescape.com/robots.txt
11:47 🔗 lindalap I guess I have no words
11:56 🔗 JAA lol
12:46 🔗 lindalap I almost forgot. Jagex is also closing Ace of Spades in a few days.
12:46 🔗 lindalap Ace of Spades, RuneScape Classic and FunOrb. Whee.
12:56 🔗 C4K3 has quit IRC (Read error: Operation timed out)
13:33 🔗 BlueMax has quit IRC (Leaving)
13:52 🔗 C4K3 has joined #archiveteam-bs
14:11 🔗 ndiddylap has joined #archiveteam-bs
14:13 🔗 balrog has quit IRC (Bye)
14:21 🔗 balrog has joined #archiveteam-bs
14:21 🔗 swebb sets mode: +o balrog
14:33 🔗 Frogging lmao this robots.txt
14:43 🔗 tyzoid Frogging: Where?
14:43 🔗 Frogging http://www.runescape.com/robots.txt
14:43 🔗 tyzoid oh, for runescape, lol
14:43 🔗 Frogging yeah
14:44 🔗 tyzoid damn, keep running into this: error uploading at-00156.warc: We encountered an internal error. Please try again. - uploadItem.py
14:47 🔗 tyzoid worked after four retries :/
14:48 🔗 balrog has quit IRC (Bye)
14:50 🔗 balrog has joined #archiveteam-bs
14:50 🔗 swebb sets mode: +o balrog
14:55 🔗 arkiver tyzoid: where are you uploading these?
14:55 🔗 arkiver which item
14:57 🔗 tyzoid arkiver: https://archive.org/download/tyzoid-acidplanet-audio
14:57 🔗 tyzoid I'm going through and re-uploading the ones that failed to upload before
14:58 🔗 arkiver ah you're using the new wget
14:58 🔗 arkiver it writes different WARC headers than normal
14:58 🔗 tyzoid whatever is latest on ubuntu 18.04, I assume it's new
14:58 🔗 arkiver https://archive.org/download/tyzoid-acidplanet-audio/tyzoid-acidplanet-audio.cdx.idx
14:58 🔗 DragonMon Hi
14:59 🔗 arkiver lines from the IDX
14:59 🔗 arkiver <http)/av.acidplanet.com.s3-us-west-2.amazonaws.com/ap/0288000/287820-6604-296096.asf> 20180527113823 tyzoid-acidplanet-audio.cdx.gz 563522 183423
14:59 🔗 arkiver note the <http
14:59 🔗 arkiver that is wrong
14:59 🔗 DragonMon is there a good reason why tarballs are excluded from most of the https://kernel.org archives?
14:59 🔗 arkiver we encountered this issue before, IA can't handle it (yet), I'll raise an issue
14:59 🔗 tyzoid Sounds good
15:00 🔗 DragonMon the point of archiving that website is so there are copies of older Linux Kernel sources
15:00 🔗 arkiver (this is not related to your uploading issue)
15:00 🔗 tyzoid DragonMon: It's backed by git, so you can always git checkout to the tag
15:01 🔗 DragonMon tyzoid: yes but is there a web archive of the git repo?
15:01 🔗 DragonMon lol
15:01 🔗 tyzoid DragonMon: Yes: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
15:02 🔗 DragonMon not what I meant
15:02 🔗 DragonMon I'm sure Linux is one of the most archived things around but shouldn't archive.org include the source tarballs?
15:02 🔗 tyzoid you can go all the way back if you want: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-2.6.13-rc3.tar.gz
15:03 🔗 tyzoid just grab all the snapshots
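
A minimal sketch of what "just grab all the snapshots" could look like, assuming the cgit snapshot URL pattern from the example above; the tag list here is a hypothetical subset, and a real run would enumerate tags first (e.g. via git ls-remote --tags).

```python
# Sketch: fetch kernel release snapshots from cgit, as suggested above.
# The URL pattern follows the example linked in the chat; the tag list is
# a hypothetical subset for illustration.
import requests

SNAPSHOT_URL = ("https://git.kernel.org/pub/scm/linux/kernel/git/"
                "torvalds/linux.git/snapshot/linux-{tag}.tar.gz")
tags = ["2.6.13-rc3", "4.17-rc7"]  # hypothetical subset

for tag in tags:
    resp = requests.get(SNAPSHOT_URL.format(tag=tag), stream=True, timeout=60)
    resp.raise_for_status()
    with open(f"linux-{tag}.tar.gz", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)
```
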
15:03 🔗 tyzoid arkiver: Yeah, no problem. I just wrapped it in a loop to retry on nonzero return status
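
The retry wrapper tyzoid mentions could look roughly like the sketch below; the uploadItem.py invocation and its arguments are assumptions, only the retry-on-nonzero-exit logic is the point.

```python
# Sketch: retry an upload command until it exits 0, as described above.
# The exact uploadItem.py invocation is an assumption.
import subprocess
import time

def upload_with_retries(warc_path, max_attempts=5, delay=30):
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["python", "uploadItem.py", warc_path])
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed (exit {result.returncode}), retrying")
        time.sleep(delay)
    return False

upload_with_retries("at-00156.warc")
```
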
15:04 🔗 DragonMon tyzoid: but why couldn't archive.org also get a copy of those?
15:04 🔗 DragonMon the download links are broken if navigated in archive.org
15:06 🔗 moufu_ is now known as moufu
15:08 🔗 JAA Ew, that uri definition problem from the WARC specification again.
15:08 🔗 arkiver yeah
15:08 🔗 arkiver raised an issue, shouldn't be too hard to fix in the derive process
15:09 🔗 JAA So they changed wget to comply with WARC 1.0 strictly instead of moving to 1.1?
15:09 🔗 JAA ... or just ignoring it, since all other tools don't include the angle brackets anyway.
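
For context, the angle brackets come from the way WARC 1.0 defines WARC-Target-URI; a tolerant consumer can simply strip them when present. A minimal sketch of that normalisation (this is not IA's actual derive fix, which was only just reported):

```python
# Sketch: normalise a WARC-Target-URI value that newer wget wraps in angle
# brackets (per a strict reading of WARC 1.0) to the bare form other tools write.
def normalize_target_uri(value: str) -> str:
    value = value.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    return value

assert normalize_target_uri("<http://example.org/>") == "http://example.org/"
assert normalize_target_uri("http://example.org/") == "http://example.org/"
```
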
15:10 🔗 tyzoid DragonMon: seems like it's grabbing it "200 https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.9.103.tar.xz"
15:10 🔗 DragonMon tyzoid: I see an older archiveteam archive from a few days ago on archive.org and the links are broken
15:11 🔗 JAA DragonMon: Link?
15:11 🔗 DragonMon hang on
15:12 🔗 tyzoid JAA: http://web.archive.org/web/20180529073551/https://git.kernel.org/torvalds/t/linux-4.17-rc7.tar.gz
15:12 🔗 DragonMon https://web.archive.org/web/20180521085957/https://www.kernel.org/
15:12 🔗 tyzoid link clicked on from http://web.archive.org/web/20180529073551/https://www.kernel.org/
15:13 🔗 DragonMon right
15:13 🔗 JAA DragonMon: Hmm, nobody grabbed kernel.org on that date directly. Most likely, it was just a link to kernel.org from another site.
15:13 🔗 JAA ArchiveBot grabs external links, but it doesn't recurse on them for obvious reasons.
15:14 🔗 DragonMon hmm
15:14 🔗 JAA So if you grab example.org and example.org/kernel.html has a link to kernel.org, it'll grab kernel.org but not any links on it.
15:14 🔗 DragonMon tyzoid: so it should show up on the most recent grab?
15:14 🔗 tyzoid idk
15:14 🔗 tyzoid perhaps
15:15 🔗 DragonMon it's strange if it doesn't.... It's open source, it's meant to be saved and shared
15:15 🔗 JAA It won't: https://cdn.kernel.org/robots.txt
15:15 🔗 tyzoid JAA: I would imagine that the link shouldn't be broken, though, it'll get the nearest 20x response in time to the current archive
15:15 🔗 tyzoid unless it's not archived at all
15:15 🔗 JAA Correct.
15:15 🔗 DragonMon I wonder what the idea is behind that, it seems odd
15:16 🔗 JAA But in this case, it would never work because cdn.kernel.org blocks access for robots.
15:16 🔗 JAA Probably to prevent unnecessary traffic from crawlers.
15:16 🔗 DragonMon I see that, but why limit grabs like that? DDoS maybe?
15:16 🔗 arkiver JAA: tyzoid: well it's raised, I don't expect it to be too hard to fix, will keep you informed :)
15:16 🔗 DragonMon hmm
15:16 🔗 JAA git.kernel.org has the same thing.
15:16 🔗 tyzoid arkiver: Thanks
15:16 🔗 DragonMon git* I get
15:17 🔗 JAA I wouldn't be surprised if the people over at kernel.org would be willing to add an exception for ia_archiver to robots.txt.
15:17 🔗 tyzoid JAA: Yeah, http://web.archive.org/web/*/https://cdn.kernel.org/pub/linux/kernel/v3.x/* isn't turning up any results.
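
The same check can be run against the Wayback CDX API instead of the web UI; a rough sketch, assuming the public cdx/search endpoint with a prefix match:

```python
# Sketch: ask the Wayback CDX API whether any captures exist under a URL prefix.
import requests

def captures_exist(prefix: str) -> bool:
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": prefix, "matchType": "prefix",
                "output": "json", "limit": 5},
        timeout=60,
    )
    resp.raise_for_status()
    if not resp.text.strip():
        return False
    rows = resp.json()
    return len(rows) > 1  # first row is the field header

print(captures_exist("cdn.kernel.org/pub/linux/kernel/v3.x/"))
```
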
15:18 🔗 JAA Yeah... Try this: https://web.archive.org/save/https://cdn.kernel.org/pub/linux/kernel/v4.x/
15:18 🔗 tyzoid "Page cannot be displayed due to robots.txt"
15:18 🔗 tyzoid https://cdn.kernel.org/robots.txt
15:19 🔗 JAA Exactly.
15:19 🔗 tyzoid kernel.org has no such restriction
15:19 🔗 tyzoid though since we've grabbed it via archivebot, it should appear, right?
15:19 🔗 JAA Well, the Wayback Machine will still block access to it.
15:19 🔗 JAA But the data will be there in the WARCs.
15:19 🔗 tyzoid ah, right.
15:20 🔗 tyzoid darn
15:20 🔗 JAA And hopefully IA will some day finally remove that robots.txt handling.
15:20 🔗 DragonMon nah, people who don't want their stuff online will go to IA nagging for their content to be removed
15:21 🔗 DragonMon which is ironic I think
15:21 🔗 tyzoid I would imagine that the wayback machine falls under fair use law in the US anyway
15:23 🔗 DragonMon I'm about to fire off a email to the Linux Foundation. What do they need to add to allow ia_archiver?
15:25 🔗 JAA User-agent: ia_archiver
15:25 🔗 JAA Disallow:
15:26 🔗 JAA I think they'd have to add that before the general disallow rule.
15:27 🔗 tyzoid not after?
15:29 🔗 tyzoid JAA: https://moz.com/learn/seo/robotstxt makes it seem like order doesn't matter
15:30 🔗 JAA I'm not really sure to be honest.
15:31 🔗 DragonMon ok well they should know, someone there set up a flag for the Google bot
15:32 🔗 JAA The original spec doesn't really mention anything about it: http://www.robotstxt.org/orig.html
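
One way to sanity-check whether rule order matters is to feed both orderings to a parser; a quick sketch with Python's urllib.robotparser (how IA's own crawler interprets robots.txt may differ):

```python
# Sketch: see how a standard robots.txt parser treats an ia_archiver exception
# placed before vs. after the general disallow. IA's crawler may differ.
from urllib.robotparser import RobotFileParser

exception_first = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

exception_last = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

for name, text in [("exception first", exception_first),
                   ("exception last", exception_last)]:
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    print(name, rp.can_fetch("ia_archiver", "https://cdn.kernel.org/pub/"))
```
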
15:32 🔗 DragonMon Email has been sent, I'll see what they respond with
15:32 🔗 JAA Sweet
15:34 🔗 DragonMon helpdesk@rt.linuxfoundation.org For all issues with Linux Foundation websites or systems, including questions about Linux.com email addresses.
15:34 🔗 tyzoid JAA: I'm cleaning up my archivebot box from the acid grab, so things should start moving a bit better for the archivebot
15:34 🔗 JAA By the way: https://archive.org/details/git-history-of-linux
15:35 🔗 tyzoid sweet
15:35 🔗 tyzoid I wonder how they did that without changing commit IDs
15:36 🔗 JAA Considering that git-filter-branch was used according to the description, it probably did change the commit IDs.
15:37 🔗 tyzoid yeah, it says something about git graft, though I'll have to read up more on it
15:37 🔗 JAA Yeah, looks like those three parts are sort-of merged together without affecting the history.
15:38 🔗 JAA "Graft points or grafts enable two otherwise different lines of development to be joined together. It works by letting users record fake ancestry information for commits. This way you can make git pretend the set of parents a commit has is different from what was recorded when the commit was created." https://git.wiki.kernel.org/index.php/GraftPoint
15:38 🔗 JAA Never heard of it before. Very interesting.
15:54 🔗 DragonMon whelp I quick fired the email to the wrong place, sent a new one to webmaster@kernel.org
15:54 🔗 DragonMon someone did get back to me from Linux Foundation pointing me to that email
16:21 🔗 wp494 has quit IRC (Ping timeout: 633 seconds)
16:22 🔗 wp494 has joined #archiveteam-bs
16:23 🔗 svchfoo1 sets mode: +o wp494
16:28 🔗 DragonMon JAA: tyzoid https://i.imgur.com/EINMXfh.png this is their reply
16:29 🔗 tyzoid DragonMon: No, we are referring to the tarballs on cdn.kernel.org
16:29 🔗 tyzoid https://cdn.kernel.org/robots.txt
16:29 🔗 tyzoid which are the release tarballs
16:29 🔗 tyzoid and those are located at https://cdn.kernel.org/pub
16:30 🔗 tyzoid and the kernel mirror denies all bots too (where www.kernel.org/pub redirects to)
16:30 🔗 tyzoid https://mirrors.edge.kernel.org/robots.txt
16:37 🔗 DragonMon tyzoid: ok I bounced another email "When Internet Archive goes to archive that website https://kernel.org/pub it gets redirected to https://mirrors.edge.kernel.org/pub which has a restriction https://mirrors.edge.kernel.org/robots.txt Can that be fixed so anything under the sub folder pub can be archived?"
16:38 🔗 tyzoid Well, we don't necessarily want to mirror the entire software mirror
16:38 🔗 tyzoid It's really the stuff under cdn.kernel.org which we're after
16:39 🔗 DragonMon Well, they are saying that /pub is supposed to be restriction-free
16:39 🔗 DragonMon hmm
16:40 🔗 DragonMon I should have been more clear to not restrict the source code tarballs
16:55 🔗 DragonMon "<tyzoid> DragonMon: if you're concerned about those release tarballs, we got 'em when we grabbed kernel.org" tyzoid: where would they be available then?
16:56 🔗 JAA Soon, at an archive near you. ;-)
16:56 🔗 JAA There's a delay of some hours to few days until the archives from ArchiveBot end up on IA.
16:57 🔗 JAA Until the robots.txt is fixed, you'll have to access the files directly from the WARCs, not through the Wayback Machine.
16:57 🔗 DragonMon so the collections and not as part of the main archive.org. So the links will be broken
16:58 🔗 JAA The links will be broken because of robots.txt, not because of where the archives are stored.
16:58 🔗 JAA ArchiveBot WARCs do get ingested into the Wayback Machine, but robots.txt prevents the access for these specific URLs.
16:58 🔗 jschwart has joined #archiveteam-bs
16:59 🔗 DragonMon wait, that seems somehow worse. I thought the files got chucked because of robots.txt. So there's potentially tons of data archive.org has but cannot make readily accessible?
17:00 🔗 JAA Yep
17:00 🔗 DragonMon oh damn
17:03 🔗 JAA Better to have the data somewhere behind a (complete or partial) block than to not have it at all. But yeah, it's not optimal.
17:04 🔗 arkiver but for example https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ works?
17:04 🔗 arkiver only not able to save stuff through live wayback
17:04 🔗 JAA Yeah, looks like there was no robots.txt in the past: https://web.archive.org/web/20170407001344/http://cdn.kernel.org/robots.txt
17:05 🔗 arkiver afaik IA doesn't care when robots.txt was created, it's about the latest one
17:05 🔗 JAA Yeah, I was about to ask that. That's my experience as well.
17:06 🔗 arkiver IA recently changed something with robots.txt, not sure what exactly; could be viewing only and not saving through the live web
17:06 🔗 arkiver (remember all the angry people)
17:06 🔗 arkiver so if you want it archived do it through archivebot :)
17:08 🔗 arkiver it could be a bug that https://web.archive.org/web/*/http://cdn.kernel.org:80/pub/linux/kernel/* doesn't list anything while https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ exists
17:08 🔗 arkiver will report that too
17:09 🔗 DragonMon hmm is https://mirrors.edge.kernel.org/pub not equal to http://cdn.kernel.org:80/pub/linux/kernel/v4.x/
17:10 🔗 DragonMon I mean are the two pub folders the same content?
17:10 🔗 DragonMon seems like it is
17:17 🔗 tyzoid DragonMon: What I was saying is that https://www.kernel.org/pub redirects to mirrors.edge.kernel.org/pub
17:17 🔗 tyzoid www.kernel.org/pub is the url they mentioned in their email as specifically allowing bots, which after the redirect does not
17:18 🔗 DragonMon right ok, I did let them know. I haven't gotten a reply yet
17:23 🔗 schbirid has joined #archiveteam-bs
17:27 🔗 DragonMon tyzoid: JAA arkiver https://i.imgur.com/776k3T0.png They are updating it
17:28 🔗 tyzoid DragonMon: sweet!
17:31 🔗 DragonMon So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate?
17:32 🔗 tyzoid I'm not sure how the archive handles duplicate content
17:32 🔗 DragonMon match files that are identical across other archives and uploads to reduce overall data size
17:32 🔗 DragonMon hmm
17:33 🔗 DragonMon if not, that's a crazy amount of data that got uploaded between when a site had a restrictive robots.txt and when it didn't
17:33 🔗 DragonMon any site that did this
17:38 🔗 DragonMon geez, I hope archive.org does something about data deduplication. Otherwise, where are they getting the cash for this?
17:39 🔗 tyzoid Donations mostly
17:39 🔗 tyzoid Hard disk space is relatively cheap
17:40 🔗 Valentine has quit IRC (Quit: Addio, adieu, adios, aloha, arrivederci, auf Wiedersehen, au revoir, bye, bye-bye, cheerio, cheers, farewell, good)
17:40 🔗 DragonMon the amount of money and time I spend for my measly 4 TB of personal data on 8TB total of drives... doesn't hold a candle to this
17:41 🔗 tyzoid Economies of Scale work to their benefit here
17:42 🔗 tyzoid They can order hard drives in bulk, saving money
17:42 🔗 tyzoid and they have very efficient storage systems, which don't require a ton of power
17:42 🔗 tyzoid i.e. 4u racks of hard drives
17:42 🔗 tyzoid http://archive.org/web/petabox.php
17:44 🔗 DragonMon I wonder what the failure rate is
17:44 🔗 tyzoid quite low, in general
17:44 🔗 DragonMon how often they need to replace a drive
17:44 🔗 Meroje I'd guess it's similar to backblaze stats
17:45 🔗 tyzoid https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/
17:45 🔗 DragonMon Has anyone tried to take them down?
17:45 🔗 tyzoid IA? I'm sure
17:45 🔗 tyzoid But being a nonprofit library means you have allies
17:46 🔗 DragonMon Hackers sure... but I'm talking about physical attacks
17:46 🔗 tyzoid oh, idk
17:46 🔗 tyzoid doubt it, though
17:47 🔗 tyzoid DragonMon: I'd expect about ~3-4% failure rate per year, estimating on the high side
17:47 🔗 DragonMon of all the crazy crap going on in the world, libraries and projects like archive.org seem to be constants
17:47 🔗 tyzoid And I would probably expect to refresh the physical servers about once every 6-8 years
17:47 🔗 DragonMon things you can rely on
17:48 🔗 DragonMon in recent history that is
17:48 🔗 tyzoid and with the IA.BAK, we can hope that it'll continue
17:48 🔗 DragonMon you never really hear about libraries getting attacked
17:48 🔗 SimpBrain has quit IRC (Read error: Operation timed out)
17:49 🔗 SimpBrain has joined #archiveteam-bs
17:49 🔗 tyzoid DragonMon: IIRC the library of Alexandria was intentionally burned down while at war.
17:50 🔗 DragonMon it's why I said 'recent history'
17:50 🔗 Valentine has joined #archiveteam-bs
17:50 🔗 DragonMon lol
17:51 🔗 DragonMon tyzoid: fixed
17:51 🔗 DragonMon https://mirrors.edge.kernel.org/robots.txt
17:51 🔗 DragonMon should I run another archive request?
17:51 🔗 tyzoid DragonMon: Mosul public library by Isis in 2015, Libraries in Anbar Province by Isis in 2014, Mosul private libraries by Isis in 2014, National Archives of Bosnia and Herzegovina by rioters in 2014...
17:52 🔗 tyzoid need I go on?
17:52 🔗 tyzoid https://en.wikipedia.org/wiki/List_of_destroyed_libraries
17:52 🔗 tyzoid DragonMon: The previous archivebot grab should have gotten most things. You can try, if you want.
17:53 🔗 DragonMon tyzoid: I guess I missed
17:53 🔗 lindalap ams.edge.kernel.org still has User-Agent: * Disallow: /
17:53 🔗 tyzoid lindalap: I don't think that should be a problem
17:54 🔗 DragonMon tyzoid: wouldn't the last archive done include the old robots.txt?
17:59 🔗 DragonMon tyzoid: I tried manually saving a link using archive.org itself and it's still complaining about robots.txt
17:59 🔗 arkiver IA does no deduplication
17:59 🔗 arkiver <DragonMon>So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate?
18:00 🔗 arkiver well no deduplication of WARCs in items
18:00 🔗 DragonMon arkiver: so say website-fun.com/this.png was IDENTICAL to twitter.com/this.png would it still get duplicated?
18:00 🔗 arkiver yes
18:01 🔗 arkiver i'm against deduplicating it right now too
18:01 🔗 DragonMon yea it might cause some confusion if something gets corrupted
18:01 🔗 arkiver note: not necessarily IA opinion
18:01 🔗 JAA ArchiveBot should deduplicate within one job, but that's broken at the moment.
18:01 🔗 JAA Not across jobs though.
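
The within-job deduplication JAA describes is usually keyed on the payload digest, with repeats written as small revisit records instead of full responses; a rough sketch of the idea (not ArchiveBot's actual code):

```python
# Sketch of digest-keyed dedup within one job: the first response with a given
# payload digest is stored in full, later identical payloads would become
# revisit records. Not ArchiveBot's actual implementation.
import hashlib

seen = {}  # payload digest -> URI of the first capture in this job

def classify(uri: str, payload: bytes) -> str:
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        return f"write revisit record pointing at {seen[digest]}"
    seen[digest] = uri
    return "write full response record"

print(classify("http://a.example/logo.png", b"\x89PNG..."))
print(classify("http://b.example/logo.png", b"\x89PNG..."))  # identical payload
```
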
18:01 🔗 arkiver so WARCs currently hash the payload using SHA1
18:02 🔗 arkiver which can cause collisions with the earlier demonstrated attack
18:02 🔗 arkiver causing stuff to be 'deduplicated'/deleted from the wayback machine if done successfully in certain circumstances
18:03 🔗 arkiver that is different WARC payloads with the same SHA1
18:03 🔗 tyzoid arkiver: Didn't Google have a patch for SHA-1 that returned a different hash if a bad input was detected?
18:03 🔗 arkiver no idea
18:03 🔗 JAA That sounds like an awful idea.
18:03 🔗 arkiver but that is just trying to get rid of the symptoms
18:04 🔗 tyzoid yeah, sha3 ftw
18:04 🔗 JAA Or SHA-2.
18:05 🔗 DragonMon will archive.org eventually 'sync' and unblock data once it processes the new robots.txt? Because it's still giving me an error about robots.txt after the change
18:05 🔗 tyzoid IIRC yes
18:05 🔗 JAA Yes, that's what should happen.
18:05 🔗 tyzoid problem is that the links go to cdn.kernel.org
18:05 🔗 DragonMon alright cool
18:05 🔗 DragonMon oh
18:05 🔗 tyzoid so we're still at square one
18:05 🔗 DragonMon erm
18:06 🔗 arkiver square one of what
18:06 🔗 tyzoid arkiver: Go to kernel.org, and hover over any of the download links
18:06 🔗 arkiver yeah
18:06 🔗 tyzoid it'll show that the link goes to cdn.kernel.org/pub
18:06 🔗 arkiver yeah
18:06 🔗 tyzoid or git.kernel.org/pub
18:06 🔗 JAA I thought they're going to update the cdn.kernel.org robots.txt?
18:06 🔗 tyzoid which are still blocked
18:06 🔗 tyzoid JAA: They changed mirrors.edge.kernel.org/robots.txt
18:06 🔗 JAA Hmm, why not the CDN?
18:07 🔗 arkiver I think it was already demonstrated that they are downloadable from the wayback machine if saved?
18:07 🔗 arkiver just save them through archivebot if you want them to be saved
18:07 🔗 tyzoid JAA: From what I can tell, the CDN is generated from cgit on the web
18:07 🔗 tyzoid which they claim puts strain on their systems if they allow robots
18:07 🔗 JAA tyzoid: https://i.imgur.com/EINMXfh.png was only about git.kernel.org.
18:07 🔗 arkiver so I'd say we're not at square one? The answer would be archivebot, and downloading from the wayback machine seems to work, at least for that page we checked
18:08 🔗 arkiver https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/
18:08 🔗 tyzoid JAA: cdn.kernel.org looks to be the same as git.kernel.org
18:09 🔗 tyzoid arkiver: Links are still broken on kernel.org homepage, though
18:09 🔗 JAA Give it time...
18:09 🔗 tyzoid We'll see
18:09 🔗 DragonMon should I email about cdn.kernel.org?
18:09 🔗 tyzoid I'm not convinced it'll be fixed
18:09 🔗 tyzoid but we can wait
18:09 🔗 tyzoid it's not like kernel.org is going anywhere soon
18:10 🔗 DragonMon "I'm seeing a similar issue with https://cdn.kernel.org/robots.txt if anything gets redirected there Internet Archive will still have issues grabbing from https://cdn.kernel.org/pub OR https://kernel.org/pub"
18:11 🔗 DragonMon I know kernel.org isn't going anywhere and I'd be surprised if their team doesn't have backups of backups buried under backups stuck in time capsules of backups somewhere. But openwrt.org recently had major corruption of their site and forums due to hardware failure
18:11 🔗 DragonMon mostly forums iirc
18:12 🔗 DragonMon https://forum.openwrt.org/ -- "The OpenWrt forum is currently offline due to a hardware problem on the hosting machine."
18:18 🔗 arkiver right
18:33 🔗 fie has quit IRC (Read error: Operation timed out)
18:45 🔗 fie has joined #archiveteam-bs
19:15 🔗 JAA arkiver: Are you aware of any efforts to replace SHA-1 in WARCs? I guess since the specification allows for any algorithm to be used, it's simply a matter of coordinating with the different authors of WARC-related software?
19:15 🔗 arkiver I'm not aware of any efforts like that
19:19 🔗 DragonMon has quit IRC (Read error: Connection reset by peer)
19:21 🔗 JAA Hmm, looks like the spec doesn't allow for multiple headers of the same type (except for WARC-Concurrent-To), so having multiple digests for the same record for backwards compatibility won't be possible (unless the spec gets modified).
19:22 🔗 JAA Oh well, I think there are more pressing issues with WARC, like adding a way to store SSL certificates.
19:22 🔗 JAA a standardised way*
19:23 🔗 arkiver everything is decided here https://github.com/iipc/warc-specifications
19:23 🔗 arkiver but it's slow and taking long and all that
19:23 🔗 JAA Yeah
19:24 🔗 arkiver however I think we are allowed to add our own random fields too, and we can store SSL stuff as for example resource records
19:24 🔗 JAA The response is also frequently "implementation first please".
19:24 🔗 arkiver i think we can do that?
19:26 🔗 JAA For sure.
19:26 🔗 arkiver we can find a good way to store SSL and other stuff (DNS?) and make an issue there. If the responses are not too negative I think we can just start using it. Then there is more of a reason for them to accept it if it's already in billions of records.
19:26 🔗 arkiver Of course only if the responses are not totally negative towards the idea
19:26 🔗 jrwr You could double up on it arkiver
19:27 🔗 jrwr have SHA1 header and then a SHA256 header
19:27 🔗 JAA Nope, the spec doesn't allow for that.
19:27 🔗 JAA "WARC named fields of the same type shall not be repeated in the same WARC record"
19:27 🔗 jrwr what about using a new header field
19:27 🔗 jrwr I understand not having more than one
19:28 🔗 jrwr but as metadata
19:28 🔗 JAA That would be possible, but ugly.
19:28 🔗 jrwr best way to ensure compatibility for now
19:28 🔗 arkiver JAA: what do you think of that idea?
19:28 🔗 arkiver not sure if it's a good approach
19:31 🔗 arkiver JAA: one thing I have been thinking about a lot and what I really really want in there is torrents support
19:31 🔗 arkiver and/or magnets
19:31 🔗 arkiver especially with webtorrents that are sometimes used to load stuff like images and videos
19:32 🔗 arkiver JAA: We could start working with archiveteam on drafting ideas for archiving torrents and the stuff that is downloaded with them
19:32 🔗 jrwr wait, im looking at the spec
19:32 🔗 jrwr WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
19:32 🔗 jrwr WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
19:32 🔗 jrwr so it has a sha1 prefix
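
That algorithm prefix is what makes swapping digests possible in principle: the header value names its own algorithm. A small sketch of producing both forms (WARC tools conventionally Base32-encode SHA-1; labelling a SHA-256 value the same way is an assumption, not an agreed convention):

```python
# Sketch: produce algorithm-labelled digests in the WARC header style.
# Base32 encoding matches common SHA-1 practice; doing the same for SHA-256
# is an assumption here, not an agreed-upon convention.
import base64
import hashlib

def labelled_digest(payload: bytes, algorithm: str = "sha1") -> str:
    digest = hashlib.new(algorithm, payload).digest()
    return f"{algorithm}:{base64.b32encode(digest).decode('ascii')}"

payload = b"example payload"
print("WARC-Payload-Digest:", labelled_digest(payload, "sha1"))
print("WARC-Payload-Digest:", labelled_digest(payload, "sha256"))
```
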
19:33 🔗 JAA arkiver: I like the idea of adding SSL certificates in resource records or similar. DNS would fit into response records; Heritrix already does that actually.
19:34 🔗 DragonMon has joined #archiveteam-bs
19:34 🔗 arkiver JAA: hmm didn't know that about heritrix, pretty nice
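
One possible shape for the certificate idea, sketched with warcio (a library not mentioned here); the 'resource' record type follows the suggestion above, but the chosen target URI and content type are assumptions, since no convention exists yet:

```python
# Sketch: store a PEM certificate chain as a WARC resource record, as floated
# above. The target URI and content type are assumptions; there is no agreed
# convention for this yet.
from io import BytesIO
from warcio.warcwriter import WARCWriter

pem_chain = b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"

with open("certs.warc.gz", "wb") as fh:
    writer = WARCWriter(fh, gzip=True)
    record = writer.create_warc_record(
        "https://example.org/",              # page the chain was observed for
        "resource",
        payload=BytesIO(pem_chain),
        warc_content_type="application/x-pem-file",
    )
    writer.write_record(record)
```
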
19:34 🔗 JAA jrwr: That's right. Unfortunately, whatever you do, it won't be backwards compatible.
19:34 🔗 JAA Well, except your solution but that's really ugly in my opinion.
19:34 🔗 tyzoid JAA / arkiver: I'm in favor of it, if we can find a way to trace the content back up to a trusted certificate
19:35 🔗 tyzoid IIRC, that'd require storing the session secret in the warc file
19:35 🔗 JAA tyzoid: Even that won't help since the content is encrypted symmetrically.
19:35 🔗 arkiver aaand how about the webtorrents :)
19:35 🔗 tyzoid JAA: The content is symmetrically encrypted with a key that's agreed upon (usually) by Diffie-Hellman
19:36 🔗 tyzoid Which is asymmetric
19:36 🔗 JAA tyzoid: Yeah, but the client (which writes the WARC) could modify the content at will without affecting the key.
19:36 🔗 DragonMon tyzoid: JAA ok so there was some issue with configuration, they are updating https://cdn.kernel.org/robots.txt now -- https://i.imgur.com/CswwPlL.png
19:37 🔗 tyzoid Right. I'd need to look at the protocol more in depth, but I believe there's a way to be able to store enough data to verify the message
19:37 🔗 JAA tyzoid: Yeah, *if* the right cipher is used it might work.
19:37 🔗 tyzoid JAA: Luckily, the client controls the cipher used
19:38 🔗 JAA To a degree, yeah. But it still needs to be compatible enough to grab everything.
19:38 🔗 tyzoid JAA: As long as we've got a wide enough range of supported ciphers, the server will pick one they prefer. If that doesn't work, we can just fall back to what we've got now
19:39 🔗 JAA Yeah, true.
19:39 🔗 JAA DragonMon: Excellent, thanks.
19:39 🔗 tyzoid Yes, glad that's sorted ^
20:40 🔗 schbirid has quit IRC (Quit: Leaving)
21:15 🔗 plue has quit IRC (Quit: leaving)
21:19 🔗 plue has joined #archiveteam-bs
22:11 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
22:12 🔗 BlueMax has joined #archiveteam-bs
23:39 🔗 Despatche has joined #archiveteam-bs
