Time |
Nickname |
Message |
00:40
🔗
|
|
ta9le has quit IRC (Quit: Connection closed for inactivity) |
00:46
🔗
|
|
ndiddylap has joined #archiveteam-bs |
02:07
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
02:07
🔗
|
|
Mateon1 has joined #archiveteam-bs |
02:14
🔗
|
|
ndiddy_ has joined #archiveteam-bs |
02:18
🔗
|
|
ndiddylap has quit IRC (Read error: Operation timed out) |
02:19
🔗
|
|
ndiddylap has joined #archiveteam-bs |
02:21
🔗
|
|
apache2 has quit IRC (Remote host closed the connection) |
02:21
🔗
|
|
apache2 has joined #archiveteam-bs |
02:22
🔗
|
|
ndiddy_ has quit IRC (Read error: Operation timed out) |
02:24
🔗
|
|
Tenebrae has quit IRC (Ping timeout: 260 seconds) |
02:25
🔗
|
|
Tenebrae has joined #archiveteam-bs |
02:25
🔗
|
|
plue has quit IRC (Ping timeout: 260 seconds) |
02:26
🔗
|
|
plue has joined #archiveteam-bs |
02:50
🔗
|
SketchCow |
OK |
02:50
🔗
|
SketchCow |
I tend to upload when they stop growing for a while, if that matters |
02:50
🔗
|
SketchCow |
But I'm fine |
03:17
🔗
|
|
qw3rty117 has joined #archiveteam-bs |
03:23
🔗
|
|
qw3rty116 has quit IRC (Read error: Operation timed out) |
03:41
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
03:53
🔗
|
|
odemg has joined #archiveteam-bs |
03:59
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
04:32
🔗
|
|
ndiddylap has quit IRC (Read error: Operation timed out) |
05:06
🔗
|
|
Pixi has quit IRC (Quit: Pixi) |
05:06
🔗
|
|
Pixi has joined #archiveteam-bs |
06:20
🔗
|
|
schbirid has joined #archiveteam-bs |
07:15
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
08:21
🔗
|
|
SmileyG_ has joined #archiveteam-bs |
08:24
🔗
|
|
SmileyG has quit IRC (Ping timeout: 260 seconds) |
09:44
🔗
|
godane |
SketchCow: ok |
09:44
🔗
|
godane |
any word from Mank? |
10:23
🔗
|
|
ta9le has joined #archiveteam-bs |
11:32
🔗
|
lindalap |
JAA: It wasn't long ago when Jagex sold their business to the Chinese, or something. It's been downhill from there. They also increased costs of their subscription recently. |
11:33
🔗
|
JAA |
I see. |
11:47
🔗
|
lindalap |
http://www.runescape.com/robots.txt |
11:47
🔗
|
lindalap |
I guess I have no words |
11:56
🔗
|
JAA |
lol |
12:46
🔗
|
lindalap |
I almost forgot. Jagex is also closing Ace of Spades in a few days. |
12:46
🔗
|
lindalap |
Ace of Spades, RuneScape Classic and FunOrb. Whee. |
12:56
🔗
|
|
C4K3 has quit IRC (Read error: Operation timed out) |
13:33
🔗
|
|
BlueMax has quit IRC (Leaving) |
13:52
🔗
|
|
C4K3 has joined #archiveteam-bs |
14:11
🔗
|
|
ndiddylap has joined #archiveteam-bs |
14:13
🔗
|
|
balrog has quit IRC (Bye) |
14:21
🔗
|
|
balrog has joined #archiveteam-bs |
14:21
🔗
|
|
swebb sets mode: +o balrog |
14:33
🔗
|
Frogging |
lmao this robots.txt |
14:43
🔗
|
tyzoid |
Frogging: Where? |
14:43
🔗
|
Frogging |
http://www.runescape.com/robots.txt |
14:43
🔗
|
tyzoid |
oh, for runescape, lol |
14:43
🔗
|
Frogging |
yeah |
14:44
🔗
|
tyzoid |
damn, keep running into this: error uploading at-00156.warc: We encountered an internal error. Please try again. - uploadItem.py |
14:47
🔗
|
tyzoid |
worked after four retries :/ |
14:48
🔗
|
|
balrog has quit IRC (Bye) |
14:50
🔗
|
|
balrog has joined #archiveteam-bs |
14:50
🔗
|
|
swebb sets mode: +o balrog |
14:55
🔗
|
arkiver |
tyzoid: where are you uploading these? |
14:55
🔗
|
arkiver |
which item |
14:57
🔗
|
tyzoid |
arkiver: https://archive.org/download/tyzoid-acidplanet-audio |
14:57
🔗
|
tyzoid |
I'm going through and re-uploading the ones that failed to upload before |
14:58
🔗
|
arkiver |
ah you're using the new wget |
14:58
🔗
|
arkiver |
it writes different WARC headers than normal |
14:58
🔗
|
tyzoid |
whatever is latest on ubuntu 18.04, I assume it's new |
14:58
🔗
|
arkiver |
https://archive.org/download/tyzoid-acidplanet-audio/tyzoid-acidplanet-audio.cdx.idx |
14:58
🔗
|
DragonMon |
Hi |
14:59
🔗
|
arkiver |
lines from the IDX |
14:59
🔗
|
arkiver |
<http)/av.acidplanet.com.s3-us-west-2.amazonaws.com/ap/0288000/287820-6604-296096.asf> 20180527113823 tyzoid-acidplanet-audio.cdx.gz 563522 183423 |
14:59
🔗
|
arkiver |
note the <http |
14:59
🔗
|
arkiver |
that is wrong |
14:59
🔗
|
DragonMon |
is there a good reason why tarballs are excluded from most of the https://kernel.org archives? |
14:59
🔗
|
arkiver |
we encountered this issue before, IA can't handle it (yet), I'll raise an issue |
14:59
🔗
|
tyzoid |
Sounds good |
15:00
🔗
|
DragonMon |
the point of archiving that website is so there are copies of older Linux Kernel sources |
15:00
🔗
|
arkiver |
(this is not related to your uploading issue) |
15:00
🔗
|
tyzoid |
DragonMon: It's backed by git, so you can always git checkout to the tag |
15:01
🔗
|
DragonMon |
tyzoid: yes but is there a web archive of the git repo? |
15:01
🔗
|
DragonMon |
lol |
15:01
🔗
|
tyzoid |
DragonMon: Yes: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ |
15:02
🔗
|
DragonMon |
not what I meant |
15:02
🔗
|
DragonMon |
I'm sure Linux is one of the most archived things around but shouldn't archive.org include the source tarballs? |
15:02
🔗
|
tyzoid |
you can go all the way back if you want: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-2.6.13-rc3.tar.gz |
15:03
🔗
|
tyzoid |
just grab all the snapshots |
15:03
🔗
|
tyzoid |
arkiver: Yeah, no problem. I just wrapped it in a loop to retry on nonzero return status |
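The retry loop tyzoid describes could look roughly like the sketch below. It is not the actual wrapper around uploadItem.py: the function name `run_with_retries`, the attempt limit, and the delay are all assumptions made for illustration, and the command run at the bottom is a harmless stand-in.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=0.0):
    """Run cmd until it exits with status 0 or the attempts run out.

    Returns the number of attempts used; raises RuntimeError if every
    attempt ended with a nonzero exit status.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Nonzero exit status: pause briefly before the next try.
            time.sleep(delay_seconds)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in command (hypothetical); a real call would pass the upload
    # script and its arguments instead.
    attempts = run_with_retries([sys.executable, "-c", "raise SystemExit(0)"])
    print(f"succeeded after {attempts} attempt(s)")
```

Capping the number of attempts keeps a persistent server-side error from looping forever.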
15:04
🔗
|
DragonMon |
tyzoid: but why couldn't archive.org also get a copy of those? |
15:04
🔗
|
DragonMon |
the download links are broken if navigated in archive.org |
15:06
🔗
|
|
moufu_ is now known as moufu |
15:08
🔗
|
JAA |
Ew, that uri definition problem from the WARC specification again. |
15:08
🔗
|
arkiver |
yeah |
15:08
🔗
|
arkiver |
raised an issue, shouldn't be too hard to fix in the derive process |
15:09
🔗
|
arkiver |
proces* |
15:09
🔗
|
JAA |
So they changed wget to comply with WARC 1.0 strictly instead of moving to 1.1? |
15:09
🔗
|
JAA |
... or just ignoring it, since all other tools don't include the angle brackets anyway. |
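A reader that wants to tolerate both styles of WARC-Target-URI (with and without angle brackets) could normalize the value before indexing. This is only a sketch of that idea, not what IA's derive process does; `normalize_target_uri` is a hypothetical helper and the URL is a placeholder.

```python
def normalize_target_uri(value):
    """Drop one pair of surrounding angle brackets from a WARC-Target-URI value."""
    value = value.strip()
    if value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


if __name__ == "__main__":
    # Placeholder URL, written the way a strict WARC 1.0 writer might emit it.
    print(normalize_target_uri("<http://example.org/file.asf>"))
    # The bare form is passed through unchanged.
    print(normalize_target_uri("http://example.org/file.asf"))
```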
15:10
🔗
|
tyzoid |
DragonMon: seems like it's grabbing it "200 https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.9.103.tar.xz" |
15:10
🔗
|
DragonMon |
tyzoid: I see an older archiveteam archive from a few days ago on archive.org and the links are broken |
15:11
🔗
|
JAA |
DragonMon: Link? |
15:11
🔗
|
DragonMon |
hang on |
15:12
🔗
|
tyzoid |
JAA: http://web.archive.org/web/20180529073551/https://git.kernel.org/torvalds/t/linux-4.17-rc7.tar.gz |
15:12
🔗
|
DragonMon |
https://web.archive.org/web/20180521085957/https://www.kernel.org/ |
15:12
🔗
|
tyzoid |
link clicked on from http://web.archive.org/web/20180529073551/https://www.kernel.org/ |
15:13
🔗
|
DragonMon |
right |
15:13
🔗
|
JAA |
DragonMon: Hmm, nobody grabbed kernel.org on that date directly. Most likely, it was just a link to kernel.org from another site. |
15:13
🔗
|
JAA |
ArchiveBot grabs external links, but it doesn't recurse on them for obvious reasons. |
15:14
🔗
|
DragonMon |
hmm |
15:14
🔗
|
JAA |
So if you grab example.org and example.org/kernel.html has a link to kernel.org, it'll grab kernel.org but not any links on it. |
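The scope rule JAA describes can be sketched with a toy crawler over an in-memory link map. This is not ArchiveBot's code: `crawl`, `LINKS`, and the tarball URL are made up for illustration, and a real crawler compares more than the hostname.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "https://example.org/": ["https://example.org/kernel.html"],
    "https://example.org/kernel.html": ["https://kernel.org/"],
    "https://kernel.org/": ["https://cdn.kernel.org/pub/linux.tar.xz"],
}


def crawl(seed_url, links):
    """Return the set of URLs fetched when starting from seed_url."""
    seed_host = urlsplit(seed_url).hostname
    fetched = set()
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        if url in fetched:
            continue
        fetched.add(url)  # every queued URL is fetched once
        if urlsplit(url).hostname != seed_host:
            continue  # external page: grabbed, but its links are not followed
        queue.extend(links.get(url, []))
    return fetched


if __name__ == "__main__":
    for url in sorted(crawl("https://example.org/", LINKS)):
        print(url)
```

In this toy run the kernel.org front page is fetched because example.org links to it, while the tarball linked from kernel.org is never queued.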
15:14
🔗
|
DragonMon |
tyzoid: so it should show up on the most recent grab? |
15:14
🔗
|
tyzoid |
idk |
15:14
🔗
|
tyzoid |
perhaps |
15:15
🔗
|
DragonMon |
it's strange if it doesn't.... It's open source, it's meant to be saved and shared |
15:15
🔗
|
JAA |
It won't: https://cdn.kernel.org/robots.txt |
15:15
🔗
|
tyzoid |
JAA: I would imagine that the link shouldn't be broken, though, it'll get the nearest 20x response in time to the current archive |
15:15
🔗
|
tyzoid |
unless it's not archived at all |
15:15
🔗
|
JAA |
Correct. |
15:15
🔗
|
DragonMon |
I wonder what the idea is behind that, it seems odd |
15:16
🔗
|
JAA |
But in this case, it wouldn't work ever because cdn.kernel.org blocks the access to robots. |
15:16
🔗
|
JAA |
Probably to prevent unnecessary traffic from crawlers. |
15:16
🔗
|
DragonMon |
I see that but why limit grabs like that? ddos maybe? |
15:16
🔗
|
arkiver |
JAA: tyzoid: well it's raised, I don't expect it to be too hard to fix, will keep you informed :) |
15:16
🔗
|
DragonMon |
hmm |
15:16
🔗
|
JAA |
git.kernel.org has the same thing. |
15:16
🔗
|
tyzoid |
arkiver: Thanks |
15:16
🔗
|
DragonMon |
git* I get |
15:17
🔗
|
JAA |
I wouldn't be surprised if the people over at kernel.org would be willing to add an exception for ia_archiver to robots.txt. |
15:17
🔗
|
tyzoid |
JAA: Yeah, http://web.archive.org/web/*/https://cdn.kernel.org/pub/linux/kernel/v3.x/* isn't turning up any results. |
15:18
🔗
|
JAA |
Yeah... Try this: https://web.archive.org/save/https://cdn.kernel.org/pub/linux/kernel/v4.x/ |
15:18
🔗
|
tyzoid |
"Page cannot be displayed due to robots.txt" |
15:18
🔗
|
tyzoid |
https://cdn.kernel.org/robots.txt |
15:19
🔗
|
JAA |
Exactly. |
15:19
🔗
|
tyzoid |
kernel.org has no such restriction |
15:19
🔗
|
tyzoid |
though since we've grabbed it via archivebot, it should appear, right? |
15:19
🔗
|
JAA |
Well, the Wayback Machine will still block access to it. |
15:19
🔗
|
JAA |
But the data will be there in the WARCs. |
15:19
🔗
|
tyzoid |
ah, right. |
15:20
🔗
|
tyzoid |
darn |
15:20
🔗
|
JAA |
And hopefully IA will some day finally remove that robots.txt handling. |
15:20
🔗
|
DragonMon |
nah, people who don't want their stuff online will go to IA nagging for their content to be removed |
15:21
🔗
|
DragonMon |
which is ironic I think |
15:21
🔗
|
tyzoid |
I would imagine that the wayback machine falls under fair use law in the US anyway |
15:23
🔗
|
DragonMon |
I'm about to fire off an email to the Linux Foundation. What do they need to add to allow ia_archiver? |
15:25
🔗
|
JAA |
User-agent: ia_archiver |
15:25
🔗
|
JAA |
Disallow: |
15:26
🔗
|
JAA |
I think they'd have to add that before the general disallow rule. |
15:27
🔗
|
tyzoid |
not after? |
15:29
🔗
|
tyzoid |
JAA: https://moz.com/learn/seo/robotstxt makes it seem like order doesn't matter |
15:30
🔗
|
JAA |
I'm not really sure to be honest. |
15:31
🔗
|
DragonMon |
ok well they should know, someone there setup a flag for the Google bot |
15:32
🔗
|
JAA |
The original spec doesn't really mention anything about it: http://www.robotstxt.org/orig.html |
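One way to probe the ordering question is to feed both layouts to a parser and compare the answers. The sketch below uses Python's standard `urllib.robotparser`; it only shows how that one parser behaves and says nothing certain about how IA's crawler or the Wayback Machine evaluates robots.txt. The two robots.txt bodies and the `SomeOtherBot` name are assumptions built from the lines JAA suggested.

```python
from urllib.robotparser import RobotFileParser

# Exception for ia_archiver placed before the general rule.
EXCEPTION_FIRST = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

# Same groups, with the exception placed after the general rule.
EXCEPTION_LAST = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

URL = "https://cdn.kernel.org/pub/linux/kernel/v4.x/"


def allowed(robots_txt, user_agent, url):
    """Ask Python's stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, body in (("first", EXCEPTION_FIRST), ("last", EXCEPTION_LAST)):
        print(name, allowed(body, "ia_archiver", URL), allowed(body, "SomeOtherBot", URL))
```

With this parser the group order makes no difference, because the `*` group is kept aside as a fallback and checked only after the named groups.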
15:32
🔗
|
DragonMon |
Email has been sent, I'll see what they respond with |
15:32
🔗
|
JAA |
Sweet |
15:34
🔗
|
DragonMon |
helpdesk@rt.linuxfoundation.org For all issues with Linux Foundation websites or systems, including questions about Linux.com email addresses. |
15:34
🔗
|
tyzoid |
JAA: I'm cleaning up my archivebot box from the acid grab, so things should start moving a bit better for the archivebot |
15:34
🔗
|
JAA |
By the way: https://archive.org/details/git-history-of-linux |
15:35
🔗
|
tyzoid |
sweet |
15:35
🔗
|
tyzoid |
I wonder how they did that without changing commit IDs |
15:36
🔗
|
JAA |
Considering that git-filter-branch was used according to the description, it probably did change the commit IDs. |
15:37
🔗
|
tyzoid |
yeah, it says something about git graft, though I'll have to read up more on it |
15:37
🔗
|
JAA |
Yeah, looks like those three parts are sort-of merged together without affecting the history. |
15:38
🔗
|
JAA |
"Graft points or grafts enable two otherwise different lines of development to be joined together. It works by letting users record fake ancestry information for commits. This way you can make git pretend the set of parents a commit has is different from what was recorded when the commit was created." https://git.wiki.kernel.org/index.php/GraftPoint |
15:38
🔗
|
JAA |
Never heard of it before. Very interesting. |
15:54
🔗
|
DragonMon |
whelp I quick fired the email to the wrong place, sent a new one to webmaster@kernel.org |
15:54
🔗
|
DragonMon |
someone did get back to me from Linux Foundation pointing me to that email |
16:21
🔗
|
|
wp494 has quit IRC (Ping timeout: 633 seconds) |
16:22
🔗
|
|
wp494 has joined #archiveteam-bs |
16:23
🔗
|
|
svchfoo1 sets mode: +o wp494 |
16:28
🔗
|
DragonMon |
JAA: tyzoid https://i.imgur.com/EINMXfh.png this is their reply |
16:29
🔗
|
tyzoid |
DragonMon: No, we are referring to the tarballs on cdn.kernel.org |
16:29
🔗
|
tyzoid |
https://cdn.kernel.org/robots.txt |
16:29
🔗
|
tyzoid |
which are the release tarballs |
16:29
🔗
|
tyzoid |
and those are located at https://cdn.kernel.org/pub |
16:30
🔗
|
tyzoid |
and the kernel mirror denies all bots too (where www.kernel.org/pub redirects to) |
16:30
🔗
|
tyzoid |
https://mirrors.edge.kernel.org/robots.txt |
16:37
🔗
|
DragonMon |
tyzoid: ok I bounced another email "When Internet Archive goes to archive that website https://kernel.org/pub it gets redirected to https://mirrors.edge.kernel.org/pub which has a restriction https://mirrors.edge.kernel.org/robots.txt Can that be fixed so anything under the sub folder pub can be archived?" |
16:38
🔗
|
tyzoid |
Well, we don't necessarily want to mirror the entire software mirror |
16:38
🔗
|
tyzoid |
It's really the stuff under cdn.kernel.org which we're after |
16:39
🔗
|
DragonMon |
Well, they are saying that /pub is supposed to be restriction free |
16:39
🔗
|
DragonMon |
hmm |
16:40
🔗
|
DragonMon |
I should have been more clear to not restrict the source code tarballs |
16:55
🔗
|
DragonMon |
"<tyzoid> DragonMon: if you're concerned about those release tarballs, we got 'em when we grabbed kernel.org" tyzoid: where would they be available then? |
16:56
🔗
|
JAA |
Soon, at an archive near you. ;-) |
16:56
🔗
|
JAA |
There's a delay of some hours to a few days until the archives from ArchiveBot end up on IA. |
16:57
🔗
|
JAA |
Until the robots.txt is fixed, you'll have to access the files directly from the WARCs, not through the Wayback Machine. |
16:57
🔗
|
DragonMon |
so they'll be in the collections and not a part of the main archive.org. So the links will be broken |
16:58
🔗
|
JAA |
The links will be broken because of robots.txt, not because of where the archives are stored. |
16:58
🔗
|
JAA |
ArchiveBot WARCs do get ingested into the Wayback Machine, but robots.txt prevents the access for these specific URLs. |
16:58
🔗
|
|
jschwart has joined #archiveteam-bs |
16:59
🔗
|
DragonMon |
wait, that seems somehow worse. I thought the files got chucked because of robots.txt. So there's potentially tons of data archive.org has but cannot make readily easy to access? |
17:00
🔗
|
JAA |
Yep |
17:00
🔗
|
DragonMon |
oh damn |
17:03
🔗
|
JAA |
Better to have the data somewhere behind a (complete or partial) block than to not have it at all. But yeah, it's not optimal. |
17:04
🔗
|
arkiver |
but for example https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ works? |
17:04
🔗
|
arkiver |
only not able to save stuff through live wayback |
17:04
🔗
|
JAA |
Yeah, looks like there was no robots.txt in the past: https://web.archive.org/web/20170407001344/http://cdn.kernel.org/robots.txt |
17:05
🔗
|
arkiver |
afaik IA doesn't care when robots.txt was created, it's about the latest one |
17:05
🔗
|
JAA |
Yeah, I was about to ask that. That's my experience as well. |
17:06
🔗
|
arkiver |
IA recently changed something with robots.txt, not sure what exactly; it could be that it allows viewing only and not saving through live |
17:06
🔗
|
arkiver |
(remember all the angry people) |
17:06
🔗
|
arkiver |
so if you want it archived do it through archivebot :) |
17:08
🔗
|
arkiver |
it could be a bug that https://web.archive.org/web/*/http://cdn.kernel.org:80/pub/linux/kernel/* doesn't list anything while https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ exists |
17:08
🔗
|
arkiver |
will report that too |
17:09
🔗
|
DragonMon |
hmm is https://mirrors.edge.kernel.org/pub not equal to http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ |
17:10
🔗
|
DragonMon |
I mean are the two pub folders the same content? |
17:10
🔗
|
DragonMon |
seems like it is |
17:17
🔗
|
tyzoid |
DragonMon: What I was saying is that https://www.kernel.org/pub redirects to mirrors.edge.kernel.org/pub |
17:17
🔗
|
tyzoid |
www.kernel.org/pub is the url they mentioned in their email as specifically allowing bots, which after the redirect does not |
17:18
🔗
|
DragonMon |
right ok, I did let them know. I haven't gotten a reply yet |
17:23
🔗
|
|
schbirid has joined #archiveteam-bs |
17:27
🔗
|
DragonMon |
tyzoid: JAA arkiver https://i.imgur.com/776k3T0.png They are updating it |
17:28
🔗
|
tyzoid |
DragonMon: sweet! |
17:31
🔗
|
DragonMon |
So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate? |
17:32
🔗
|
tyzoid |
I'm not sure how the archive handles duplicate content |
17:32
🔗
|
DragonMon |
match files that are identical from other archives and uploads to reduce overall data size |
17:32
🔗
|
DragonMon |
hmm |
17:33
🔗
|
DragonMon |
if not that's a crazy amount of data that got uploaded when a site had restrictive robots.txt to when it didn't |
17:33
🔗
|
DragonMon |
any site that did this |
17:38
🔗
|
DragonMon |
geez I hope archive.org does something for data duplication. Otherwise where are they getting the cash for this? |
17:39
🔗
|
tyzoid |
Donations mostly |
17:39
🔗
|
tyzoid |
Hard disk space is relatively cheap |
17:40
🔗
|
|
Valentine has quit IRC (Quit: Addio, adieu, adios, aloha, arrivederci, auf Wiedersehen, au revoir, bye, bye-bye, cheerio, cheers, farewell, good) |
17:40
🔗
|
DragonMon |
the amount of money and time I spend for my measly 4 TB of personal data on 8TB total of drives... doesn't hold a candle to this |
17:41
🔗
|
tyzoid |
Economies of Scale work to their benefit here |
17:42
🔗
|
tyzoid |
They can order hard drives in bulk, saving money |
17:42
🔗
|
tyzoid |
and they have very efficient storage systems, which don't require a ton of power |
17:42
🔗
|
tyzoid |
i.e. 4u racks of hard drives |
17:42
🔗
|
tyzoid |
http://archive.org/web/petabox.php |
17:44
🔗
|
DragonMon |
I wonder what the failure rate is |
17:44
🔗
|
tyzoid |
quite low, in general |
17:44
🔗
|
DragonMon |
how often they need to replace a drive |
17:44
🔗
|
Meroje |
I'd guess it's similar to backblaze stats |
17:45
🔗
|
tyzoid |
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ |
17:45
🔗
|
DragonMon |
Has anyone tried to take them down? |
17:45
🔗
|
tyzoid |
IA? I'm sure |
17:45
🔗
|
tyzoid |
But being a nonprofit library means you have allies |
17:46
🔗
|
DragonMon |
Hackers sure... but I'm talking about physical attacks |
17:46
🔗
|
tyzoid |
oh, idk |
17:46
🔗
|
tyzoid |
doubt it, though |
17:47
🔗
|
tyzoid |
DragonMon: I'd expect about ~3-4% failure rate per year, estimating on the high side |
17:47
🔗
|
DragonMon |
of all the crazy crap going on in the world, libraries and projects like archive.org seem to be constants |
17:47
🔗
|
tyzoid |
And I would probably expect to refresh the physical servers about once every 6-8 years |
17:47
🔗
|
DragonMon |
things you can rely on |
17:48
🔗
|
DragonMon |
in recent history that is |
17:48
🔗
|
tyzoid |
and with the IA.BAK, we can hope that it'll continue |
17:48
🔗
|
DragonMon |
you never really hear about libraries getting attacked |
17:48
🔗
|
|
SimpBrain has quit IRC (Read error: Operation timed out) |
17:49
🔗
|
|
SimpBrain has joined #archiveteam-bs |
17:49
🔗
|
tyzoid |
DragonMon: IIRC the library of Alexandria was intentionally burned down while at war. |
17:50
🔗
|
DragonMon |
it's why I said 'recent history |
17:50
🔗
|
DragonMon |
' |
17:50
🔗
|
|
Valentine has joined #archiveteam-bs |
17:50
🔗
|
DragonMon |
lol |
17:51
🔗
|
DragonMon |
tyzoid: fixed |
17:51
🔗
|
DragonMon |
https://mirrors.edge.kernel.org/robots.txt |
17:51
🔗
|
DragonMon |
should I run another archive request? |
17:51
🔗
|
tyzoid |
DragonMon: Mosul public library by Isis in 2015, Libraries in Anbar Province by Isis in 2014, Mosul private libraries by Isis in 2014, National Archives of Bosnia and Herzegovina by rioters in 2014... |
17:52
🔗
|
tyzoid |
need I go on? |
17:52
🔗
|
tyzoid |
https://en.wikipedia.org/wiki/List_of_destroyed_libraries |
17:52
🔗
|
tyzoid |
DragonMon: The previous archivebot grab should have gotten most things. You can try, if you want. |
17:53
🔗
|
DragonMon |
tyzoid: I guess I missed |
17:53
🔗
|
lindalap |
ams.edge.kernel.org still has User-Agent: * Disallow: / |
17:53
🔗
|
tyzoid |
lindalap: I don't think that should be a problem |
17:54
🔗
|
DragonMon |
tyzoid: wouldn't the last archive done include the old robots.txt? |
17:59
🔗
|
DragonMon |
tyzoid: I tried manually saving a link using archive.org itself and it's still complaining about robots.txt |
17:59
🔗
|
arkiver |
IA does no deduplication |
17:59
🔗
|
arkiver |
<DragonMon>So if all these archives are getting uploaded as WARC files, is it possible to de-duplicate? |
18:00
🔗
|
arkiver |
well no deduplication of WARCs in items |
18:00
🔗
|
DragonMon |
arkiver: so say website-fun.com/this.png was IDENTICAL to twitter.com/this.png would it still get duplicated? |
18:00
🔗
|
arkiver |
yes |
18:01
🔗
|
arkiver |
i'm against deduplicating it right now too |
18:01
🔗
|
DragonMon |
yea it might cause some confusion if something gets corrupted |
18:01
🔗
|
arkiver |
note: not necessarily IA opinion |
18:01
🔗
|
JAA |
ArchiveBot should deduplicate within one job, but that's broken at the moment. |
18:01
🔗
|
JAA |
Not across jobs though. |
18:01
🔗
|
arkiver |
so WARCs currently hash the payload using SHA1 |
18:02
🔗
|
arkiver |
which can cause collision with the earlier demonstrated attack |
18:02
🔗
|
arkiver |
causing stuff to be 'deduplicated'/deleted from the wayback machine if done successfully in certain circumstances |
18:03
🔗
|
arkiver |
that is different WARC payloads with the same SHA1 |
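To make the concern concrete, here is a minimal sketch of digest-keyed deduplication. It is not how IA or ArchiveBot handle records; `payload_digest` and `deduplicate` are hypothetical helpers, and the two URLs are the ones from DragonMon's example. Any record whose digest has already been seen is kept only as a pointer to the first copy, which is exactly why two different payloads with the same SHA-1 would be a problem.

```python
import base64
import hashlib


def payload_digest(payload):
    """SHA-1 of the payload, Base32-encoded, in the 'sha1:...' label form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def deduplicate(records):
    """Split (url, payload) pairs into stored payloads and digest-only duplicates."""
    seen = {}        # digest -> URL of the first record that carried it
    stored = []      # (url, digest) pairs whose payload is kept
    duplicates = []  # (url, digest, original_url) pairs that only point back
    for url, payload in records:
        digest = payload_digest(payload)
        if digest in seen:
            duplicates.append((url, digest, seen[digest]))
        else:
            seen[digest] = url
            stored.append((url, digest))
    return stored, duplicates


if __name__ == "__main__":
    image = b"identical image bytes"  # placeholder payload
    stored, duplicates = deduplicate([
        ("http://website-fun.com/this.png", image),
        ("http://twitter.com/this.png", image),
    ])
    print(len(stored), "stored,", len(duplicates), "deduplicated")
```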
18:03
🔗
|
tyzoid |
arkiver: Didn't google have a patch for sha1 that returned a different hash if a bad input is detected? |
18:03
🔗
|
arkiver |
no idea |
18:03
🔗
|
JAA |
That sounds like an awful idea. |
18:03
🔗
|
arkiver |
but that is just trying to get rid of symptoms |
18:04
🔗
|
tyzoid |
yeah, sha3 ftw |
18:04
🔗
|
JAA |
Or SHA-2. |
18:05
🔗
|
DragonMon |
will archive.org eventually 'sync' and unblock data once it processes the new robots.txt? Because it's still giving me an error about robots.txt after the change |
18:05
🔗
|
tyzoid |
IIRC yes |
18:05
🔗
|
JAA |
Yes, that's what should happen. |
18:05
🔗
|
tyzoid |
problem is that the links go to cdn.kernel.org |
18:05
🔗
|
DragonMon |
alright cool |
18:05
🔗
|
DragonMon |
oh |
18:05
🔗
|
tyzoid |
so we're still at square one |
18:05
🔗
|
DragonMon |
erm |
18:06
🔗
|
arkiver |
square one of what |
18:06
🔗
|
tyzoid |
arkiver: Go to kernel.org, and hover over any of the download links |
18:06
🔗
|
arkiver |
yeah |
18:06
🔗
|
tyzoid |
it'll show that the link goes to cdn.kernel.org/pub |
18:06
🔗
|
arkiver |
yeah |
18:06
🔗
|
tyzoid |
or git.kernel.org/pub |
18:06
🔗
|
JAA |
I thought they're going to update the cdn.kernel.org robots.txt? |
18:06
🔗
|
tyzoid |
which are still blocked |
18:06
🔗
|
tyzoid |
JAA: They changed mirrors.edge.kernel.org/robots.txt |
18:06
🔗
|
JAA |
Hmm, why not the CDN? |
18:07
🔗
|
arkiver |
I think it was already demonstrated that they are downloadable from the wayback machine if saved? |
18:07
🔗
|
arkiver |
just save them through archivebot if you want them to be saved |
18:07
🔗
|
tyzoid |
JAA: From what I can tell, the CDN is generated from cgit on the web |
18:07
🔗
|
tyzoid |
which they claim would put strain on their systems if robots were allowed |
18:07
🔗
|
JAA |
tyzoid: https://i.imgur.com/EINMXfh.png was only about git.kernel.org. |
18:07
🔗
|
arkiver |
so I'd say not square one? answer would be archivebot and downloading from wayback seems to work, at least for that page that we checked |
18:08
🔗
|
arkiver |
https://web.archive.org/web/20170405222346/http://cdn.kernel.org:80/pub/linux/kernel/v4.x/ |
18:08
🔗
|
tyzoid |
JAA: cdn.kernel.org looks to be the same as git.kernel.org |
18:09
🔗
|
tyzoid |
arkiver: Links are still broken on kernel.org homepage, though |
18:09
🔗
|
JAA |
Give it time... |
18:09
🔗
|
tyzoid |
We'll see |
18:09
🔗
|
DragonMon |
should I email about cdn.kernel.org? |
18:09
🔗
|
tyzoid |
I'm not convinced it'll be fixed |
18:09
🔗
|
tyzoid |
but we can wait |
18:09
🔗
|
tyzoid |
it's not like kernel.org is going anywhere soon |
18:10
🔗
|
DragonMon |
" |
18:10
🔗
|
DragonMon |
I'm seeing a similar issue with https://cdn.kernel.org/robots.txt if anything gets redirected there Internet Archive will still have issues grabbing from https://cdn.kernel.org/pub OR https://kernel.org/pub" |
18:11
🔗
|
DragonMon |
I know kernel.org isn't going anywhere and I'd be surprised if their team doesn't have backups of backups buried under backups stuck in time capsules of backups somewhere. But openwrt.org recently had major corruption of their site and forums due to hardware failure |
18:11
🔗
|
DragonMon |
mostly forums iirc |
18:12
🔗
|
DragonMon |
https://forum.openwrt.org/ -- "The OpenWrt forum is currently offline due to a hardware problem on the hosting machine." |
18:18
🔗
|
arkiver |
right |
18:33
🔗
|
|
fie has quit IRC (Read error: Operation timed out) |
18:45
🔗
|
|
fie has joined #archiveteam-bs |
19:15
🔗
|
JAA |
arkiver: Are you aware of any efforts to replace SHA-1 in WARCs? I guess since the specification allows for any algorithm to be used, it's simply a matter of coordinating with the different authors of WARC-related software? |
19:15
🔗
|
arkiver |
I'm not aware of any efforts like that |
19:19
🔗
|
|
DragonMon has quit IRC (Read error: Connection reset by peer) |
19:21
🔗
|
JAA |
Hmm, looks like the spec doesn't allow for multiple headers of the same type (except for WARC-Concurrent-To), so having multiple digests for the same record for backwards compatibility won't be possible (unless the spec gets modified). |
19:22
🔗
|
JAA |
Oh well, I think there are more pressing issues with WARC, like adding a way to store SSL certificates. |
19:22
🔗
|
JAA |
a standardised way* |
19:23
🔗
|
arkiver |
everything is decided here https://github.com/iipc/warc-specifications |
19:23
🔗
|
arkiver |
but it's slow and taking long and all that |
19:23
🔗
|
JAA |
Yeah |
19:24
🔗
|
arkiver |
however I think we are allowed to add our own random fields too, and we can store SSL stuff as for example resource records |
19:24
🔗
|
JAA |
The response is also frequently "implementation first please". |
19:24
🔗
|
arkiver |
i think we can do that? |
19:26
🔗
|
JAA |
For sure. |
19:26
🔗
|
arkiver |
we can find a good way to store SSL and other stuff (DNS?) and make an issue there. If the responses are not too negative I think we can just start using it. Then there is more of a reason for them to accept it if it's already in billions of records. |
19:26
🔗
|
arkiver |
Of course only if the responses are not totally negative towards the idea |
19:26
🔗
|
jrwr |
You could double up on it arkiver |
19:27
🔗
|
jrwr |
have SHA1 header and then a SHA256 header |
19:27
🔗
|
JAA |
Nope, the spec doesn't allow for that. |
19:27
🔗
|
JAA |
"WARC named fields of the same type shall not be repeated in the same WARC record" |
19:27
🔗
|
jrwr |
what about using a new header field |
19:27
🔗
|
jrwr |
I understand not having more than one |
19:28
🔗
|
jrwr |
but as metadata |
19:28
🔗
|
JAA |
That would be possible, but ugly. |
19:28
🔗
|
jrwr |
best way to ensure compatibility for now |
19:28
🔗
|
arkiver |
JAA: what do you think of that idea? |
19:28
🔗
|
arkiver |
not sure if it's a good approach |
19:31
🔗
|
arkiver |
JAA: one thing I have been thinking about a lot and what I really really want in there is torrents support |
19:31
🔗
|
arkiver |
and/or magnets |
19:31
🔗
|
arkiver |
especially with webtorrents that are sometimes used to load stuff like images and videos |
19:32
🔗
|
arkiver |
JAA: We could start working with archiveteam on drafting ideas for archiving torrents and the stuff that is downloaded with them |
19:32
🔗
|
jrwr |
wait, im looking at the spec |
19:32
🔗
|
jrwr |
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2 |
19:32
🔗
|
jrwr |
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2 |
19:32
🔗
|
jrwr |
so it has a sha1 prefix |
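The labelled form jrwr quotes (algorithm name, colon, Base32 digest) can be produced for other algorithms as well. The sketch below assumes the same Base32 encoding would be kept for a `sha256:` label; that is an assumption for illustration, not something the log or any particular tool confirms. `digest_label` is a hypothetical helper.

```python
import base64
import hashlib


def digest_label(data, algorithm="sha1"):
    """Return 'algorithm:BASE32DIGEST' for the given bytes."""
    raw = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(raw).decode('ascii')}"


if __name__ == "__main__":
    payload = b"example payload"  # placeholder bytes
    print("WARC-Payload-Digest:", digest_label(payload, "sha1"))
    # Same header with a stronger hash; note a record may carry it only once.
    print("WARC-Payload-Digest:", digest_label(payload, "sha256"))
```

Because the algorithm is named in the value itself, a reader can tell which hash was used without a second header field.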
19:33
🔗
|
JAA |
arkiver: I like the idea of adding SSL certificates in resource records or similar. DNS would fit into response records; Heritrix already does that actually. |
19:34
🔗
|
|
DragonMon has joined #archiveteam-bs |
19:34
🔗
|
arkiver |
JAA: hmm didn't know that about heritrix, pretty nice |
19:34
🔗
|
JAA |
jrwr: That's right. Unfortunately, whatever you do, it won't be backwards compatible. |
19:34
🔗
|
JAA |
Well, except your solution but that's really ugly in my opinion. |
19:34
🔗
|
tyzoid |
JAA / arkiver: I'm in favor of it, if we can find a way to be able to trace the content back up to a trusted certificate |
19:35
🔗
|
tyzoid |
IIRC, that'd require storing the session secret in the warc file |
19:35
🔗
|
JAA |
tyzoid: Even that won't help since the content is encrypted symmetrically. |
19:35
🔗
|
arkiver |
aaand how about the webtorrents :) |
19:35
🔗
|
tyzoid |
JAA: The content is symmetrically encrypted with a key that's agreed upon (usually by) Diffie Helmen |
19:36
🔗
|
tyzoid |
hellman* |
19:36
🔗
|
tyzoid |
Which is asymmetric |
19:36
🔗
|
JAA |
tyzoid: Yeah, but the client (which writes the WARC) could modify the content at will without affecting the key. |
19:36
🔗
|
DragonMon |
tyzoid: JAA ok so there was some issue with configuration, they are updating https://cdn.kernel.org/robots.txt now -- https://i.imgur.com/CswwPlL.png |
19:37
🔗
|
tyzoid |
Right. I'd need to look at the protocol more in depth, but I believe there's a way to be able to store enough data to verify the message |
19:37
🔗
|
JAA |
tyzoid: Yeah, *if* the right cipher is used it might work. |
19:37
🔗
|
tyzoid |
JAA: Luckily, the client controls the cipher used |
19:38
🔗
|
JAA |
To a degree, yeah. But it still needs to be compatible enough to grab everything. |
19:38
🔗
|
tyzoid |
JAA: As long as we've got a wide enough range of supported ciphers, the server will pick one they prefer. If that doesn't work, we can just fall back to what we've got now |
19:39
🔗
|
JAA |
Yeah, true. |
19:39
🔗
|
JAA |
DragonMon: Excellent, thanks. |
19:39
🔗
|
tyzoid |
Yes, glad that's sorted ^ |
20:40
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:15
🔗
|
|
plue has quit IRC (Quit: leaving) |
21:19
🔗
|
|
plue has joined #archiveteam-bs |
22:11
🔗
|
|
jschwart has quit IRC (Quit: Konversation terminated!) |
22:12
🔗
|
|
BlueMax has joined #archiveteam-bs |
23:39
🔗
|
|
Despatche has joined #archiveteam-bs |