#archiveteam-bs 2018-05-29,Tue


Time	Nickname	Message
00:40 ^🔗		ta9le has quit IRC (Quit: Connection closed for inactivity)
00:46 ^🔗		ndiddylap has joined #archiveteam-bs
02:07 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
02:07 ^🔗		Mateon1 has joined #archiveteam-bs
02:14 ^🔗		ndiddy_ has joined #archiveteam-bs
02:18 ^🔗		ndiddylap has quit IRC (Read error: Operation timed out)
02:19 ^🔗		ndiddylap has joined #archiveteam-bs
02:21 ^🔗		apache2 has quit IRC (Remote host closed the connection)
02:21 ^🔗		apache2 has joined #archiveteam-bs
02:22 ^🔗		ndiddy_ has quit IRC (Read error: Operation timed out)
02:24 ^🔗		Tenebrae has quit IRC (Ping timeout: 260 seconds)
02:25 ^🔗		Tenebrae has joined #archiveteam-bs
02:25 ^🔗		plue has quit IRC (Ping timeout: 260 seconds)
02:26 ^🔗		plue has joined #archiveteam-bs
02:50 ^🔗	SketchCow	OK
02:50 ^🔗	SketchCow	I tend to upload when they stop growing for a while, if that matters
02:50 ^🔗	SketchCow	But I'm fine
03:17 ^🔗		qw3rty117 has joined #archiveteam-bs
03:23 ^🔗		qw3rty116 has quit IRC (Read error: Operation timed out)
03:41 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
03:53 ^🔗		odemg has joined #archiveteam-bs
03:59 ^🔗		sep332 has quit IRC (Read error: Operation timed out)
04:32 ^🔗		ndiddylap has quit IRC (Read error: Operation timed out)
05:06 ^🔗		Pixi has quit IRC (Quit: Pixi)
05:06 ^🔗		Pixi has joined #archiveteam-bs
06:20 ^🔗		schbirid has joined #archiveteam-bs
07:15 ^🔗		schbirid has quit IRC (Quit: Leaving)
08:21 ^🔗		SmileyG_ has joined #archiveteam-bs
08:24 ^🔗		SmileyG has quit IRC (Ping timeout: 260 seconds)
09:44 ^🔗	godane	SketchCow: ok
09:44 ^🔗	godane	any word from Mank?
10:23 ^🔗		ta9le has joined #archiveteam-bs
11:32 ^🔗	lindalap	JAA: It wasn't long ago when Jagex sold their business to the Chinese, or something. It's been downhill from there. They also increased costs of their subscription recently.
11:33 ^🔗	JAA	I see.
11:47 ^🔗	lindalap	http://www.runescape.com/robots.txt
11:47 ^🔗	lindalap	I guess I have no words
11:56 ^🔗	JAA	lol
12:46 ^🔗	lindalap	I almost forgot. Jagex is also closing Ace of Spades in a few days.
12:46 ^🔗	lindalap	Ace of Spades, RuneScape Classic and FunOrb. Whee.
12:56 ^🔗		C4K3 has quit IRC (Read error: Operation timed out)
13:33 ^🔗		BlueMax has quit IRC (Leaving)
13:52 ^🔗		C4K3 has joined #archiveteam-bs
14:11 ^🔗		ndiddylap has joined #archiveteam-bs
14:13 ^🔗		balrog has quit IRC (Bye)
14:21 ^🔗		balrog has joined #archiveteam-bs
14:21 ^🔗		swebb sets mode: +o balrog
14:33 ^🔗	Frogging	lmao this robots.txt
14:43 ^🔗	tyzoid	Frogging: Where?
14:43 ^🔗	Frogging	http://www.runescape.com/robots.txt
14:43 ^🔗	tyzoid	oh, for runescape, lol
14:43 ^🔗	Frogging	yeah
14:44 ^🔗	tyzoid	damn, keep running into this: error uploading at-00156.warc: We encountered an internal error. Please try again. - uploadItem.py
14:47 ^🔗	tyzoid	worked after four retries :/
14:48 ^🔗		balrog has quit IRC (Bye)
14:50 ^🔗		balrog has joined #archiveteam-bs
14:50 ^🔗		swebb sets mode: +o balrog
14:55 ^🔗	arkiver	tyzoid: where are you uploading these?
14:55 ^🔗	arkiver	which item
14:57 ^🔗	tyzoid	arkiver: https://archive.org/download/tyzoid-acidplanet-audio
14:57 ^🔗	tyzoid	I'm going through and re-uploading the ones that failed to upload before
14:58 ^🔗	arkiver	ah you're using the new wget
14:58 ^🔗	arkiver	it writes different WARC headers than normal
14:58 ^🔗	tyzoid	whatever is latest on ubuntu 18.04, I assume it's new
14:58 ^🔗	arkiver	https://archive.org/download/tyzoid-acidplanet-audio/tyzoid-acidplanet-audio.cdx.idx
14:58 ^🔗	DragonMon	Hi
14:59 ^🔗	arkiver	lines from the IDX
14:59 ^🔗	arkiver	<http)/av.acidplanet.com.s3-us-west-2.amazonaws.com/ap/0288000/287820-6604-296096.asf> 20180527113823 tyzoid-acidplanet-audio.cdx.gz 563522 183423
14:59 ^🔗	arkiver	note the <http
14:59 ^🔗	arkiver	that is wrong
14:59 ^🔗	DragonMon	is there a good reason why tarballs are excluded from most of the https://kernel.org archives?
14:59 ^🔗	arkiver	we encountered this issue before, IA can't handle it (yet), I'll raise an issue
14:59 ^🔗	tyzoid	Sounds good
15:00 ^🔗	DragonMon	the point of archiving that website is so there are copies of older Linux Kernel sources
15:00 ^🔗	arkiver	(this is not related to your uploading issue)
15:00 ^🔗	tyzoid	DragonMon: It's backed by git, so you can always git checkout to the tag
15:01 ^🔗	DragonMon	tyzoid: yes but is there a web archive of the git repo?
15:01 ^🔗	DragonMon	lol
15:01 ^🔗	tyzoid	DragonMon: Yes: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
15:02 ^🔗	DragonMon	not what I meant
15:02 ^🔗	DragonMon	I'm sure Linux is one of the most archived things around but shouldn't archive.org include the source tarballs?
15:02 ^🔗	tyzoid	you can go all the way back if you want: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-2.6.13-rc3.tar.gz
15:03 ^🔗	tyzoid	just grab all the snapshots
15:03 ^🔗	tyzoid	arkiver: Yeah, no problem. I just wrapped it in a loop to retry on nonzero return status
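The retry loop tyzoid describes (re-run the upload until it exits with status 0) could look roughly like the sketch below. The function name, the attempt limit and the delay are assumptions for illustration. The real uploadItem.py arguments are not shown in this log, so a stand-in command is used.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=0.0):
    """Run cmd until it exits with status 0.

    Returns the number of attempts used, or None if every attempt
    exited with a nonzero status.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Pause before the next try; skipped after the final failure.
            time.sleep(delay_seconds)
    return None


# Stand-in commands (hypothetical): a child Python process that exits 0 or 1.
DEMO_OK = [sys.executable, "-c", "raise SystemExit(0)"]
DEMO_FAIL = [sys.executable, "-c", "raise SystemExit(1)"]
```

In practice the stand-in would be replaced by the actual upload invocation, and a nonzero delay would probably be kinder to the remote service.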
15:04 ^🔗	DragonMon	tyzoid: but why couldn't archive.org also get a copy of those?
15:04 ^🔗	DragonMon	the download links are broken if navigated in archive.org
15:06 ^🔗		moufu_ is now known as moufu
15:08 ^🔗	JAA	Ew, that uri definition problem from the WARC specification again.
15:08 ^🔗	arkiver	yeah
15:08 ^🔗	arkiver	raised an issue, shouldn't be too hard to fix in the derive process
15:09 ^🔗	arkiver	proces*
15:09 ^🔗	JAA	So they changed wget to comply with WARC 1.0 strictly instead of moving to 1.1?
15:09 ^🔗	JAA	... or just ignoring it, since all other tools don't include the angle brackets anyway.
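A minimal sketch of the kind of normalisation a reader or derive step might apply to the bracketed URIs discussed above: strip one surrounding pair of angle brackets from a WARC-Target-URI value and leave unbracketed values alone. The function name is hypothetical and this is not IA's actual fix.

```python
def normalize_target_uri(value):
    """Return the URI without a single surrounding <...> pair, if present."""
    value = value.strip()
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        # Bracketed form, as written by the newer wget mentioned above.
        return value[1:-1]
    # Unbracketed form, as written by most other tools.
    return value
```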
15:10 ^🔗	tyzoid	DragonMon: seems like it's grabbing it "200 https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.9.103.tar.xz"
15:10 ^🔗	DragonMon	tyzoid: I see an older archiveteam archive from a few days ago on archive.org and the links are broken
15:11 ^🔗	JAA	DragonMon: Link?
15:11 ^🔗	DragonMon	hang on
15:12 ^🔗	tyzoid	JAA: http://web.archive.org/web/20180529073551/https://git.kernel.org/torvalds/t/linux-4.17-rc7.tar.gz
15:12 ^🔗	DragonMon	https://web.archive.org/web/20180521085957/https://www.kernel.org/
15:12 ^🔗	tyzoid	link clicked on from http://web.archive.org/web/20180529073551/https://www.kernel.org/
15:13 ^🔗	DragonMon	right
15:13 ^🔗	JAA	DragonMon: Hmm, nobody grabbed kernel.org on that date directly. Most likely, it was just a link to kernel.org from another site.
15:13 ^🔗	JAA	ArchiveBot grabs external links, but it doesn't recurse on them for obvious reasons.
15:14 ^🔗	DragonMon	hmm
15:14 ^🔗	JAA	So if you grab example.org and example.org/kernel.html has a link to kernel.org, it'll grab kernel.org but not any links on it.
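The rule JAA describes (external links are fetched, but their own links are not followed) can be illustrated with a toy crawl over an in-memory link graph. This is only a sketch of the rule, not ArchiveBot's code; the function name and the link graph are made up for the example.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(seed_url, links):
    """Return URLs in fetch order.

    links maps a URL to the URLs found on that page (a toy stand-in
    for real fetching and link extraction).
    """
    seed_host = urlsplit(seed_url).hostname
    fetched = []
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        fetched.append(url)
        if urlsplit(url).hostname != seed_host:
            # External page: it is grabbed, but its links are not queued.
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


TOY_LINKS = {
    "http://example.org/": ["http://example.org/kernel.html"],
    "http://example.org/kernel.html": ["https://kernel.org/"],
    "https://kernel.org/": ["https://cdn.kernel.org/pub/"],
}
```

With this graph, a crawl seeded at http://example.org/ fetches the kernel.org front page but never reaches the cdn.kernel.org link on it.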
15:14 ^🔗	DragonMon	tyzoid: so it should show up on the most recent grab?
15:14 ^🔗	tyzoid	idk
15:14 ^🔗	tyzoid	perhaps
15:15 ^🔗	DragonMon	it's strange if it doesn't.... It's open source, it's meant to be saved and shared
15:15 ^🔗	JAA	It won't: https://cdn.kernel.org/robots.txt
15:15 ^🔗	tyzoid	JAA: I would imagine that the link shouldn't be broken, though, it'll get the nearest 20x response in time to the current archive
15:15 ^🔗	tyzoid	unless it's not archived at all
15:15 ^🔗	JAA	Correct.
15:15 ^🔗	DragonMon	I wonder what the idea is behind that, it seems odd
15:16 ^🔗	JAA	But in this case, it wouldn't work ever because cdn.kernel.org blocks the access to robots.
15:16 ^🔗	JAA	Probably to prevent unnecessary traffic from crawlers.
15:16 ^🔗	DragonMon	I see that but why limit grabs like that? ddos maybe?
15:16 ^🔗	arkiver	JAA: tyzoid: well it's raised, I don't expect it to be too hard to fix, will keep you informed :)
15:16 ^🔗	DragonMon	hmm
15:16 ^🔗	JAA	git.kernel.org has the same thing.
15:16 ^🔗	tyzoid	arkiver: Thanks
15:16 ^🔗	DragonMon	git* I get
15:17 ^🔗	JAA	I wouldn't be surprised if the people over at kernel.org would be willing to add an exception for ia_archiver to robots.txt.
15:17 ^🔗	tyzoid	JAA: Yeah, http://web.archive.org/web//https://cdn.kernel.org/pub/linux/kernel/v3.x/ isn't turning up any results.
15:18 ^🔗	JAA	Yeah... Try this: https://web.archive.org/save/https://cdn.kernel.org/pub/linux/kernel/v4.x/
15:18 ^🔗	tyzoid	"Page cannot be displayed due to robots.txt"
15:18 ^🔗	tyzoid	https://cdn.kernel.org/robots.txt
15:19 ^🔗	JAA	Exactly.
15:19 ^🔗	tyzoid	kernel.org has no such restriction
15:19 ^🔗	tyzoid	though since we've grabbed it via archivebot, it should appear, right?
15:19 ^🔗	JAA	Well, the Wayback Machine will still block access to it.
15:19 ^🔗	JAA	But the data will be there in the WARCs.
15:19 ^🔗	tyzoid	ah, right.
15:20 ^🔗	tyzoid	darn
15:20 ^🔗	JAA	And hopefully IA will some day finally remove that robots.txt handling.
15:20 ^🔗	DragonMon	nah, people who don't want their stuff online will go to IA nagging for their content to be removed
15:21 ^🔗	DragonMon	which is ironic I think
15:21 ^🔗	tyzoid	I would imagine that the wayback machine falls under fair use law in the US anyway
15:23 ^🔗	DragonMon	I'm about to fire off an email to the Linux Foundation. What do they need to add to allow ia_archiver?
15:25 ^🔗	JAA	User-agent: ia_archiver
15:25 ^🔗	JAA	Disallow:
15:26 ^🔗	JAA	I think they'd have to add that before the general disallow rule.
15:27 ^🔗	tyzoid	not after?
15:29 ^🔗	tyzoid	JAA: https://moz.com/learn/seo/robotstxt makes it seem like order doesn't matter
15:30 ^🔗	JAA	I'm not really sure to be honest.
15:31 ^🔗	DragonMon	ok well they should know, someone there set up a flag for the Google bot
15:32 ^🔗	JAA	The original spec doesn't really mention anything about it: http://www.robotstxt.org/orig.html
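One way to check the ordering question for a specific parser is to try both orders. The sketch below uses Python's standard urllib.robotparser, which is only one implementation. Whether the Internet Archive's own robots.txt handling behaves the same way is not established here. The two robots.txt bodies, the "otherbot" agent name and the helper name are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# ia_archiver exception placed before the general disallow rule.
SPECIFIC_FIRST = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

# Same rules with the exception placed after the general disallow rule.
SPECIFIC_LAST = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

URL = "https://cdn.kernel.org/pub/linux/kernel/v4.x/"


def allowed(robots_txt, user_agent, url):
    """Parse robots_txt from a string and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, body in (("first", SPECIFIC_FIRST), ("last", SPECIFIC_LAST)):
        print(name, allowed(body, "ia_archiver", URL), allowed(body, "otherbot", URL))
```

This parser prefers the group that names the agent over the `*` group regardless of position, which agrees with the moz.com reading linked above.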
15:32 ^🔗	DragonMon	Email has been sent, I'll see what they respond with
15:32 ^🔗	JAA	Sweet
15:34 ^🔗	DragonMon	helpdesk@rt.linuxfoundation.org For all issues with Linux Foundation websites or systems, including questions about Linux.com email addresses.
15:34 ^🔗	tyzoid	JAA: I'm cleaning up my archivebot box from the acid grab, so things should start moving a bit better for the archivebot
15:34 ^🔗	JAA	By the way: https://archive.org/details/git-history-of-linux
15:35 ^🔗	tyzoid	sweet
15:35 ^🔗	tyzoid	I wonder how they did that without changing commit IDs
15:36 ^🔗	JAA	Considering that git-filter-branch was used according to the description, it probably did change the commit IDs.
15:37 ^🔗	tyzoid	yeah, it says something about git graft, though I'll have to read up more on it
15:37 ^🔗	JAA	Yeah, looks like those three parts are sort-of merged together without affecting the history.
15:38 ^🔗	JAA	"Graft points or grafts enable two otherwise different lines of development to be joined together. It works by letting users record fake ancestry information for commits. This way you can make git pretend the set of parents a commit has is different from what was recorded when the commit was created." https://git.wiki.kernel.org/index.php/GraftPoint
15:38 ^🔗	JAA	Never heard of it before. Very interesting.
15:54 ^🔗	DragonMon	whelp I quick fired the email to the wrong place, sent a new one to webmaster@kernel.org
15:54 ^🔗	DragonMon	someone did get back to me from Linux Foundation pointing me to that email
16:21 ^🔗		wp494 has quit IRC (Ping timeout: 633 seconds)
16:22 ^🔗		wp494 has joined #archiveteam-bs
16:23 ^🔗		svchfoo1 sets mode: +o wp494
16:28 ^🔗	DragonMon	JAA: tyzoid https://i.imgur.com/EINMXfh.png this is their reply
16:29 ^🔗	tyzoid	DragonMon: No, we are referring to the tarballs on cdn.kernel.org
16:29 ^🔗	tyzoid	https://cdn.kernel.org/robots.txt
16:29 ^🔗	tyzoid	which are the release tarballs
16:29 ^🔗	tyzoid	and those are located at https://cdn.kernel.org/pub
16:30 ^🔗	tyzoid	and the kernel mirror denies all bots too (where www.kernel.org/pub redirects to)
16:30 ^🔗	tyzoid	https://mirrors.edge.kernel.org/robots.txt
16:37 ^🔗	DragonMon	tyzoid: ok I bounced another email "When Internet Archive goes to archive that website https://kernel.org/pub it gets redirected to https://mirrors.edge.kernel.org/pub which has a restriction https://mirrors.edge.kernel.org/robots.txt Can that be fixed so anything under the sub folder pub can be archived?"
16:38 ^🔗	tyzoid	Well, we don't necessarily want to mirror the entire software mirror
16:38 ^🔗	tyzoid	It's really the stuff under cdn.kernel.org which we're after
16:39 ^🔗	DragonMon	Well, they are saying that /pub is supposed to be restriction free
16:39 ^🔗	DragonMon	hmm
16:40 ^🔗	DragonMon	I should have been more clear to not restrict the source code tarballs
16:55 ^🔗	DragonMon	"<tyzoid> DragonMon: if you're concerned about those release tarballs, we got 'em when we grabbed kernel.org" tyzoid: where would they be available then?
16:56 ^🔗	JAA	Soon, at an archive near you. ;-)
16:56 ^🔗	JAA	There's a delay of some hours to a few days until the archives from ArchiveBot end up on IA.
16:57 ^🔗	JAA	Until the robots.txt is fixed, you'll have to access the files directly from the WARCs, not through the Wayback Machine.
16:57 ^🔗	DragonMon	so the collections and not as a part of the main archive.org. So the links will be broken
16:58 ^🔗	JAA	The links will be broken because of robots.txt, not because of where the archives are stored.
16:58 ^🔗	JAA	ArchiveBot WARCs do get ingested into the Wayback Machine, but robots.txt prevents the access for these specific URLs.
16:58 ^🔗		jschwart has joined #archiveteam-bs
16:59 ^🔗	DragonMon	wait, that seems somehow worse. I thought the files got chucked because of robots.txt. So there's potentially tons of data archive.org has but cannot make readily easy to access?
17:00 ^🔗	JAA	Yep
17:00 ^🔗	DragonMon	oh damn
17:03 ^🔗	JAA	Better to have the data somewhere behind a (complete or partial) block than to not have it at all. But yeah, it's not optimal.
17:04 ^🔗	arkiver	but for example https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ works?
17:04 ^🔗	arkiver	only not able to save stuff through live wayback
17:04 ^🔗	JAA	Yeah, looks like there was no robots.txt in the past: https://web.archive.org/web/20170407001344/http://cdn.kernel.org/robots.txt
17:05 ^🔗	arkiver	afaik IA doesn't care when robots.txt was created, it's about the latest one
17:05 ^🔗	JAA	Yeah, I was about to ask that. That's my experience as well.
17:06 ^🔗	arkiver	IA recently changed something with robots.txt, not sure what exactly; it could be viewing only and not saving through live
17:06 ^🔗	arkiver	(remember all the angry people)
17:06 ^🔗	arkiver	so if you want it archived do it through archivebot :)
17:08 ^🔗	arkiver	it could be a bug that https://web.archive.org/web//http://cdn.kernel.org:80/pub/linux/kernel/ doesn't list anything while https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ exists
17:08 ^🔗	arkiver	will report that too
17:09 ^🔗	DragonMon	hmm is https://mirrors.edge.kernel.org/pub not equal to http://cdn.kernel.org:80/pub/linux/kernel/v4.x/
17:10 ^🔗	DragonMon	I mean are the two pub folders the same content?
17:10 ^🔗	DragonMon	seems like it is
17:17 ^🔗	tyzoid	DragonMon: What I was saying is that https://www.kernel.org/pub redirects to mirrors.edge.kernel.org/pub
17:17 ^🔗	tyzoid	www.kernel.org/pub is the url they mentioned in their email as specifically allowing bots, which after the redirect does not
17:18 ^🔗	DragonMon	right ok, I did let them know. I haven't gotten a reply yet
17:23 ^🔗		schbirid has joined #archiveteam-bs
17:27 ^🔗	DragonMon	tyzoid: JAA arkiver https://i.imgur.com/776k3T0.png They are updating it
17:28 ^🔗	tyzoid	DragonMon: sweet!
17:31 ^🔗	DragonMon	So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate?
17:32 ^🔗	tyzoid	I'm not sure how the archive handles duplicate content
17:32 ^🔗	DragonMon	match files that are identical from other archives and uploads to reduce overall data size
17:32 ^🔗	DragonMon	hmm
17:33 ^🔗	DragonMon	if not that's a crazy amount of data that got uploaded when a site had restrictive robots.txt to when it didn't
17:33 ^🔗	DragonMon	any site that did this
17:38 ^🔗	DragonMon	geez I hope archive.org does something for data duplication. Otherwise where are they getting the cash for this?
17:39 ^🔗	tyzoid	Donations mostly
17:39 ^🔗	tyzoid	Hard disk space is relatively cheap
17:40 ^🔗		Valentine has quit IRC (Quit: Addio, adieu, adios, aloha, arrivederci, auf Wiedersehen, au revoir, bye, bye-bye, cheerio, cheers, farewell, good)
17:40 ^🔗	DragonMon	the amount of money and time I spend for my measly 4 TB of personal data on 8TB total of drives... doesn't hold a candle to this
17:41 ^🔗	tyzoid	Economies of Scale work to their benefit here
17:42 ^🔗	tyzoid	They can order hard drives in bulk, saving money
17:42 ^🔗	tyzoid	and they have very efficient storage systems, which don't require a ton of power
17:42 ^🔗	tyzoid	i.e. 4u racks of hard drives
17:42 ^🔗	tyzoid	http://archive.org/web/petabox.php
17:44 ^🔗	DragonMon	I wonder what the failure rate is
17:44 ^🔗	tyzoid	quite low, in general
17:44 ^🔗	DragonMon	how often they need to replace a drive
17:44 ^🔗	Meroje	I'd guess it's similar to backblaze stats
17:45 ^🔗	tyzoid	https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/
17:45 ^🔗	DragonMon	Has anyone tried to take them down?
17:45 ^🔗	tyzoid	IA? I'm sure
17:45 ^🔗	tyzoid	But being a nonprofit library means you have allies
17:46 ^🔗	DragonMon	Hackers sure... but I'm talking about physical attacks
17:46 ^🔗	tyzoid	oh, idk
17:46 ^🔗	tyzoid	doubt it, though
17:47 ^🔗	tyzoid	DragonMon: I'd expect about ~3-4% failure rate per year, estimating on the high side
17:47 ^🔗	DragonMon	of all the crazy crap going on in the world, libraries and projects like archive.org seem to be constants
17:47 ^🔗	tyzoid	And I would probably expect to refresh the physical servers about once every 6-8 years
17:47 ^🔗	DragonMon	things you can rely on
17:48 ^🔗	DragonMon	in recent history that is
17:48 ^🔗	tyzoid	and with the IA.BAK, we can hope that it'll continue
17:48 ^🔗	DragonMon	you never really hear about libraries getting attacked
17:48 ^🔗		SimpBrain has quit IRC (Read error: Operation timed out)
17:49 ^🔗		SimpBrain has joined #archiveteam-bs
17:49 ^🔗	tyzoid	DragonMon: IIRC the library of Alexandria was intentionally burned down while at war.
17:50 ^🔗	DragonMon	it's why I said 'recent history
17:50 ^🔗	DragonMon	'
17:50 ^🔗		Valentine has joined #archiveteam-bs
17:50 ^🔗	DragonMon	lol
17:51 ^🔗	DragonMon	tyzoid: fixed
17:51 ^🔗	DragonMon	https://mirrors.edge.kernel.org/robots.txt
17:51 ^🔗	DragonMon	should I run another archive request?
17:51 ^🔗	tyzoid	DragonMon: Mosul public library by Isis in 2015, Libraries in Anbar Province by Isis in 2014, Mosul private libraries by Isis in 2014, National Archives of Bosnia and Herzegovina by rioters in 2014...
17:52 ^🔗	tyzoid	need I go on?
17:52 ^🔗	tyzoid	https://en.wikipedia.org/wiki/List_of_destroyed_libraries
17:52 ^🔗	tyzoid	DragonMon: The previous archivebot grab should have gotten most things. You can try, if you want.
17:53 ^🔗	DragonMon	tyzoid: I guess I missed
17:53 ^🔗	lindalap	ams.edge.kernel.org still has User-Agent: * Disallow: /
17:53 ^🔗	tyzoid	lindalap: I don't think that should be a problem
17:54 ^🔗	DragonMon	tyzoid: wouldn't the last archive done include the old robots.txt?
17:59 ^🔗	DragonMon	tyzoid: I tried manually saving a link using archive.org itself and it's still complaining about robots.txt
17:59 ^🔗	arkiver	IA does no deduplication
17:59 ^🔗	arkiver	<DragonMon>So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate?
18:00 ^🔗	arkiver	well no deduplication of WARCs in items
18:00 ^🔗	DragonMon	arkiver: so say website-fun.com/this.png was IDENTICAL to twitter.com/this.png would it still get duplicated?
18:00 ^🔗	arkiver	yes
18:01 ^🔗	arkiver	i'm against deduplicating it right now too
18:01 ^🔗	DragonMon	yea it might cause some confusion if something gets corrupted
18:01 ^🔗	arkiver	note: not necessarily IA opinion
18:01 ^🔗	JAA	ArchiveBot should deduplicate within one job, but that's broken at the moment.
18:01 ^🔗	JAA	Not across jobs though.
18:01 ^🔗	arkiver	so WARCs currently hash the payload using SHA1
18:02 ^🔗	arkiver	which can cause collision with the earlier demonstrated attack
18:02 ^🔗	arkiver	causing stuff to be 'deduplicated'/deleted from the wayback machine if done successfully in certain circumstances
18:03 ^🔗	arkiver	that is different WARC payloads with the same SHA1
18:03 ^🔗	tyzoid	arkiver: Didn't google have a patch for sha1 that returned a different hash if a bad input is detected?
18:03 ^🔗	arkiver	no idea
18:03 ^🔗	JAA	That sounds like an awful idea.
18:03 ^🔗	arkiver	but that is just trying to get rid of symptoms
18:04 ^🔗	tyzoid	yeah, sha3 ftw
18:04 ^🔗	JAA	Or SHA-2.
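For reference, a payload digest of the kind arkiver mentions can be computed as sketched below, with the algorithm left as a parameter so SHA-1 could be swapped for a SHA-2 variant. The `label:base32` layout follows the digest examples quoted from the spec elsewhere in this log. The function name is hypothetical, and reusing hashlib's algorithm names as labels is an assumption, not something the spec is known to mandate.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1"):
    """Return 'algorithm:BASE32DIGEST' for the given payload bytes."""
    raw = hashlib.new(algorithm, payload).digest()
    return f"{algorithm}:{base64.b32encode(raw).decode('ascii')}"


if __name__ == "__main__":
    body = b"example payload"
    print(payload_digest(body))            # SHA-1, as WARC writers commonly use today
    print(payload_digest(body, "sha256"))  # SHA-2 alternative
```

Two different payloads that collide under SHA-1 would produce the same string from the first call, which is the deduplication risk described above. A SHA-256 digest does not have a known practical collision attack.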
18:05 ^🔗	DragonMon	will archive.org eventually 'sync' and unblock data once it processes the new robots.txt? Because it's still giving me an error about robots.txt after the change
18:05 ^🔗	tyzoid	IIRC yes
18:05 ^🔗	JAA	Yes, that's what should happen.
18:05 ^🔗	tyzoid	problem is that the links go to cdn.kernel.org
18:05 ^🔗	DragonMon	alright cool
18:05 ^🔗	DragonMon	oh
18:05 ^🔗	tyzoid	so we're still at square one
18:05 ^🔗	DragonMon	erm
18:06 ^🔗	arkiver	square one of what
18:06 ^🔗	tyzoid	arkiver: Go to kernel.org, and hover over any of the download links
18:06 ^🔗	arkiver	yeah
18:06 ^🔗	tyzoid	it'll show that the link goes to cdn.kernel.org/pub
18:06 ^🔗	arkiver	yeah
18:06 ^🔗	tyzoid	or git.kernel.org/pub
18:06 ^🔗	JAA	I thought they're going to update the cdn.kernel.org robots.txt?
18:06 ^🔗	tyzoid	which are still blocked
18:06 ^🔗	tyzoid	JAA: They changed mirrors.edge.kernel.org/robots.txt
18:06 ^🔗	JAA	Hmm, why not the CDN?
18:07 ^🔗	arkiver	I think it was already demonstrated that they are downloadable from the wayback machine if saved?
18:07 ^🔗	arkiver	just save them through archivebot if you want them to be saved
18:07 ^🔗	tyzoid	JAA: From what I can tell, the CDN is generated from cgit on the web
18:07 ^🔗	tyzoid	which they claim would put strain on their systems if robots were allowed
18:07 ^🔗	JAA	tyzoid: https://i.imgur.com/EINMXfh.png was only about git.kernel.org.
18:07 ^🔗	arkiver	so I'd say not square one? answer would be archivebot and downloading from wayback seems to work, at least for that page that we checked
18:08 ^🔗	arkiver	https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/
18:08 ^🔗	tyzoid	JAA: cdn.kernel.org looks to be the same as git.kernel.org
18:09 ^🔗	tyzoid	arkiver: Links are still broken on kernel.org homepage, though
18:09 ^🔗	JAA	Give it time...
18:09 ^🔗	tyzoid	We'll see
18:09 ^🔗	DragonMon	should I email about cdn.kernel.org?
18:09 ^🔗	tyzoid	I'm not convinced it'll be fixed
18:09 ^🔗	tyzoid	but we can wait
18:09 ^🔗	tyzoid	it's not like kernel.org is going anywhere soon
18:10 ^🔗	DragonMon	"
18:10 ^🔗	DragonMon	I'm seeing a similar issue with https://cdn.kernel.org/robots.txt if anything gets redirected there Internet Archive will still have issues grabbing from https://cdn.kernel.org/pub OR https://kernel.org/pub"
18:11 ^🔗	DragonMon	I know kernel.org isn't going anywhere and I'd be surprised if their team doesn't have backups of backups buried under backups stuck in time capsules of backups somewhere. But openwrt.org recently had major corruption of their site and forums due to hardware failure
18:11 ^🔗	DragonMon	mostly forums iirc
18:12 ^🔗	DragonMon	https://forum.openwrt.org/ -- "The OpenWrt forum is currently offline due to a hardware problem on the hosting machine."
18:18 ^🔗	arkiver	right
18:33 ^🔗		fie has quit IRC (Read error: Operation timed out)
18:45 ^🔗		fie has joined #archiveteam-bs
19:15 ^🔗	JAA	arkiver: Are you aware of any efforts to replace SHA-1 in WARCs? I guess since the specification allows for any algorithm to be used, it's simply a matter of coordinating with the different authors of WARC-related software?
19:15 ^🔗	arkiver	I'm not aware of any efforts like that
19:19 ^🔗		DragonMon has quit IRC (Read error: Connection reset by peer)
19:21 ^🔗	JAA	Hmm, looks like the spec doesn't allow for multiple headers of the same type (except for WARC-Concurrent-To), so having multiple digests for the same record for backwards compatibility won't be possible (unless the spec gets modified).
19:22 ^🔗	JAA	Oh well, I think there are more pressing issues with WARC, like adding a way to store SSL certificates.
19:22 ^🔗	JAA	a standardised way*
19:23 ^🔗	arkiver	everything is decided here https://github.com/iipc/warc-specifications
19:23 ^🔗	arkiver	but it's slow and taking long and all that
19:23 ^🔗	JAA	Yeah
19:24 ^🔗	arkiver	however I think we are allowed to add our own random fields too, and we can store SSL stuff as for example resource records
19:24 ^🔗	JAA	The response is also frequently "implementation first please".
19:24 ^🔗	arkiver	i think we can do that?
19:26 ^🔗	JAA	For sure.
19:26 ^🔗	arkiver	we can find a good way to store SSL and other stuff (DNS?) and make an issue there. If the responses are not too negative I think we can just start using it. Then there is more of a reason for them to accept it if it's already in billions of records.
19:26 ^🔗	arkiver	Of course only if the responses are not totally negative towards the idea
19:26 ^🔗	jrwr	You could double up on it arkiver
19:27 ^🔗	jrwr	have SHA1 header and then a SHA256 header
19:27 ^🔗	JAA	Nope, the spec doesn't allow for that.
19:27 ^🔗	JAA	"WARC named fields of the same type shall not be repeated in the same WARC record"
19:27 ^🔗	jrwr	what about using a new header field
19:27 ^🔗	jrwr	I understand not having more than one
19:28 ^🔗	jrwr	but as metadata
19:28 ^🔗	JAA	That would be possible, but ugly.
19:28 ^🔗	jrwr	best way to ensure compatibility for now
19:28 ^🔗	arkiver	JAA: what do you think of that idea?
19:28 ^🔗	arkiver	not sure if it's a good approach
19:31 ^🔗	arkiver	JAA: one thing I have been thinking about a lot and what I really really want in there is torrents support
19:31 ^🔗	arkiver	and/or magnets
19:31 ^🔗	arkiver	especially with webtorrents that are sometimes used to load stuff like images and videos
19:32 ^🔗	arkiver	JAA: We could start working with archiveteam on drafting ideas for archiving torrents and the stuff that is downloaded with them
19:32 ^🔗	jrwr	wait, im looking at the spec
19:32 ^🔗	jrwr	WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
19:32 ^🔗	jrwr	WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
19:32 ^🔗	jrwr	so it has a sha1 prefix
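Because the digest value carries its own algorithm label, a reader can branch on that label instead of assuming SHA-1. A small sketch of splitting the value follows; the function name is hypothetical.

```python
def split_digest(value):
    """Split 'label:encoded' into (label, encoded); raise ValueError if malformed."""
    label, sep, encoded = value.strip().partition(":")
    if not sep or not label or not encoded:
        raise ValueError(f"not a labelled digest: {value!r}")
    # Lower-case the label so 'SHA1' and 'sha1' compare equal.
    return label.lower(), encoded
```

Applied to the first header value quoted above, this yields the label `sha1` and the base32 string after the colon. A reader that only understands `sha1` could then skip verification for any other label rather than fail.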
19:33 ^🔗	JAA	arkiver: I like the idea of adding SSL certificates in resource records or similar. DNS would fit into response records; Heritrix already does that actually.
19:34 ^🔗		DragonMon has joined #archiveteam-bs
19:34 ^🔗	arkiver	JAA: hmm didn't know that about heritrix, pretty nice
19:34 ^🔗	JAA	jrwr: That's right. Unfortunately, whatever you do, it won't be backwards compatible.
19:34 ^🔗	JAA	Well, except your solution but that's really ugly in my opinion.
19:34 ^🔗	tyzoid	JAA / arkiver: I'm in favor of it, if we can find a way to be able to trace the content back up to a trusted certificate
19:35 ^🔗	tyzoid	IIRC, that'd require storing the session secret in the warc file
19:35 ^🔗	JAA	tyzoid: Even that won't help since the content is encrypted symmetrically.
19:35 ^🔗	arkiver	aaand how about the webtorrents :)
19:35 ^🔗	tyzoid	JAA: The content is symmetrically encrypted with a key that's agreed upon (usually by) Diffie Helmen
19:36 ^🔗	tyzoid	hellman*
19:36 ^🔗	tyzoid	Which is asymmetric
19:36 ^🔗	JAA	tyzoid: Yeah, but the client (which writes the WARC) could modify the content at will without affecting the key.
19:36 ^🔗	DragonMon	tyzoid: JAA ok so there was some issue with configuration, they are updating https://cdn.kernel.org/robots.txt now -- https://i.imgur.com/CswwPlL.png
19:37 ^🔗	tyzoid	Right. I'd need to look at the protocol more in depth, but I believe there's a way to be able to store enough data to verify the message
19:37 ^🔗	JAA	tyzoid: Yeah, if the right cipher is used it might work.
19:37 ^🔗	tyzoid	JAA: Luckily, the client controls the cipher used
19:38 ^🔗	JAA	To a degree, yeah. But it still needs to be compatible enough to grab everything.
19:38 ^🔗	tyzoid	JAA: As long as we've got a wide enough range of supported ciphers, the server will pick one they prefer. If that doesn't work, we can just fall back to what we've got now
19:39 ^🔗	JAA	Yeah, true.
19:39 ^🔗	JAA	DragonMon: Excellent, thanks.
19:39 ^🔗	tyzoid	Yes, glad that's sorted ^
20:40 ^🔗		schbirid has quit IRC (Quit: Leaving)
21:15 ^🔗		plue has quit IRC (Quit: leaving)
21:19 ^🔗		plue has joined #archiveteam-bs
22:11 ^🔗		jschwart has quit IRC (Quit: Konversation terminated!)
22:12 ^🔗		BlueMax has joined #archiveteam-bs
23:39 ^🔗		Despatche has joined #archiveteam-bs
