#archiveteam-bs 2017-11-04,Sat

***schbirid has quit IRC (Ping timeout: 256 seconds) [00:05]
schbirid has joined #archiveteam-bs [00:16]
.... (idle for 19mn)
SketchCowgodane: looking good (the tapes) [00:35]
.... (idle for 18mn)
godanei still have a ton to upload [00:53]
i uploaded 603 items so far this month: https://archive.org/details/@chris85?and[]=addeddate%3A2017-11&sort=-publicdate
that collection is like a slow sink drip for government docs
i know for a fact there are over 500,000 IDs used with DTIC
[01:01]
........... (idle for 54mn)
***superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye) [01:58]
.... (idle for 17mn)
SketchCow has quit IRC (Read error: Operation timed out)
SketchCow has joined #archiveteam-bs
swebb sets mode: +o SketchCow
[02:15]
..... (idle for 23mn)
schbirid has quit IRC (Ping timeout: 255 seconds) [02:42]
drumstick has quit IRC (Read error: Operation timed out)
drumstick has joined #archiveteam-bs
schbirid has joined #archiveteam-bs
[02:51]
.......... (idle for 46mn)
Pixi has joined #archiveteam-bs [03:40]
.... (idle for 19mn)
MadArchiv has joined #archiveteam-bs [03:59]
MadArchivI finally managed to get site-grab to work on my computer (somehow), I'll take some much-needed sleep for now and [04:01]
CerynDamn. He must have needed it. :) [04:05]
MadArchiv(Sigh, I *really* need to stop pressing the Enter key by accident) ...and then I'll archive the completed webcomics from Hiveworks tomorrow [04:05]
***Pixi has quit IRC (Quit: Pixi) [04:06]
MadArchivCeryn: You bet it, pal [04:07]
CerynAre you uploading to Archive.org? [04:07]
MadArchivOf course. [04:07]
CerynAre dumps of a site generally just uploaded occasionally, no deduplication or continuation from the last batch or anything? [04:08]
MadArchivThese are webcomic sites, so I suppose there are no dumps at all. I could be wrong though. [04:09]
CerynI mean dumps in the general sense, data dumps, whatever format the data has. [04:09]
MadArchivYou mean the ones I would get after I crawl the site? I'm new to this, and I'm also pretty computer-illiterate, so I'm learning this stuff as I go along.
Wait, are you still talking about webcomic sites?
[04:11]
CerynHeh okay. I'm new to archiving but I'm rather computer-literate.
My question was on how Archive.org handles when a site or something is uploaded, then uploaded again later and so on, in several copies.
[04:13]
***qw3rty4 has joined #archiveteam-bs [04:14]
CerynIs each copy just stored separately? Or is the data deduplicated (i.e. identical parts of the data will only be stored once, then the other copies will refer to the master copy of those parts)? Or do future uploads only add what is new compared to the last upload?
I guess these questions take some knowledge of Archive.org's workings to answer.
[04:16]
MadArchivFrom what I've seen, they just take in different copies of the same thing as if they were different items. [04:17]
CerynYeah. It's by far the easiest to handle, but it's also the least efficient. [04:17]
MadArchivYup [04:17]
CerynThough of course they could be doing things under the hood. [04:17]
MadArchivLike they do with excluded sites. [04:18]
***qw3rty3 has quit IRC (Read error: Operation timed out) [04:19]
BlueMaxim has quit IRC (Quit: Leaving) [04:29]
MadArchivBy the way, I'm downloading
Goddamnit, I did it again, just wait a bit so I can finish writing my post
[04:33]
***MadArchiv has quit IRC (Read error: Operation timed out)
MadArchiv has joined #archiveteam-bs
[04:41]
MadArchivAlright, so, (I kinda forgot what I was going to write in the first place so let's just get to the point) do you know of any *other* webcomics I or, ideally, we should archive? Once I'm done with the Hiveworks ones, I mean. [04:45]
CerynI don't. I've never been into comics. I'm guessing others know some, however. [04:46]
MadArchivHmmm, alright. By the way, do you think (and this is a legitimate question) that if we're gonna make a manual list of comics that should be saved, IRC would be the best place for it? I was thinking about putting it on Reddit since it'd be more accessible. [04:49]
CerynIRC definitely isn't the place for such a list. When you want to paste stuff on IRC you usually paste it to a paste bin service (e.g. pastebin.com) and just post the link.
If you want it to be a joint effort then a reddit thread would probably be good. If the project was rather large in scope it seems maybe you'd organise a group here, make your own IRC channel and start a wiki page or something.
[04:51]
phillipsjCeryn, duplicate uploads can have varying quality as well. In the spring I was considering doing a bunch of YouTube channels in DVD quality: it allows broader coverage for the same space. But it's kinda pointless if the same channel was uploaded in HD already. [04:54]
Cerynphillipsj: Right. Except new videos. They'd make sense to upload. [04:55]
phillipsjYouTube seems to throttle you a lot if you try downloading faster than a human can watch though. [04:55]
Cerynphillipsj: But it is not common practice, then, to extend data you have already uploaded? People don't generally keep up to date mirrors and sync them with the Archive?
Okay. Hm. I guess you can parallelise Youtube videos though? At least to some extent.
[04:56]
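For what it's worth, fanning out with xargs is a common way to parallelise youtube-dl; a minimal sketch, where urls.txt is a hypothetical list of video URLs and the rate cap is a guess at staying under the throttling phillipsj describes:

    # run four youtube-dl processes at once, each capped at 1 MB/s
    xargs -n 1 -P 4 youtube-dl --limit-rate 1M < urls.txt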
phillipsjCeryn, never bothered to upload anything because they were begging for money to store the stuff. [04:56]
CerynOkay.
I mostly plan to archive for my own storage, but it seems I might as well upload a copy to the Archive too.
[04:57]
phillipsjOther things came up as well. I put the machine I was using in storage because YouTube was becoming too distracting for me. (Have IRL stuff to do) [04:57]
CerynOkay.
Archive must have so much data bloat. Stuff they could optimise away.
[04:58]
phillipsjI have a stack of blank DVD+Rs slowly rotting away. [04:58]
CerynHeh. Very slowly. [04:59]
phillipsjWas thinking of using them for my local copy. [04:59]
CerynHow come you want to store on DVDs as opposed to HDDs?
For actual DVD player use?
[05:00]
phillipsjI expect the DVDs to last longer. [05:01]
CerynYeah. It just seems so cumbersome. To be able to store only 4.7 GB or however much it is. [05:01]
phillipsjThey are also slightly cheaper per GB I believe. (But possibly not worth the inconvenience) [05:02]
***BlueMaxim has joined #archiveteam-bs [05:02]
CerynIf you want to scale then no, definitely not worth it. Because you'd have to check them periodically just to know your data was intact. [05:02]
phillipsjI think the sketch of a plan is to try to use my remaining disks, and if it goes well, maybe buy more. [05:03]
CerynHeh. It seems all of our problems can be solved by "buying more disks".
Works every time.
[05:03]
phillipsjI originally bought them for back-ups, but the back-up verification failed at the restore step.
New back-up plan is server with doubly-redundant ZFS and ECC RAM.
[05:04]
Ceryn:P nice.
Do you have any stats on how likely normal RAM is to screw you over?
[05:05]
phillipsjBonus points if I encrypt that data in transit and at rest (by turning the server off).
Not off-hand, but if I am going to the trouble of redundancy in the face of a disk failure, I don't want bad RAM to mess up my data.
[05:06]
CerynYou can lukscrypt each raw drive, open them, and then set up a ZFS pool on the encrypted volumes.
That's what I've done, at least. Seems to work pretty well.
(I'm assuming ZFS on Linux.)
[05:07]
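A minimal sketch of the layering Ceryn describes, assuming ZFS on Linux; the device paths, mapper names, and pool name are hypothetical:

    # encrypt each raw drive, then open the decrypted mappings
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksFormat /dev/sdc
    cryptsetup open /dev/sdb crypt0
    cryptsetup open /dev/sdc crypt1
    # build the ZFS pool on the encrypted volumes
    zpool create tank mirror /dev/mapper/crypt0 /dev/mapper/crypt1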
phillipsjI like FreeBSD, not that I am good at actually configuring it. [05:08]
CerynI suppose. It just seems the entire setup becomes significantly more expensive if you need hardware that supports ECC RAM.
Okay.
[05:08]
phillipsjMy "workstation" currently has mirrored, striped ZFs across 4 disks (non-ECC RAM though). Boots with a simulated controller failure (cable unplugged). [05:10]
hook54321JAA: Are you still grabbing the Catalonia cameras? [05:11]
phillipsjScared to try a scrub without a proper back-up though. [05:12]
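For reference, a scrub only reads the pool and repairs blocks from redundancy where it can; a sketch with a hypothetical pool name:

    zpool scrub tank
    zpool status -v tank    # shows scrub progress and any errors found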
Ceryn, I had to roll back a non-ECC memory upgrade on one of my machines: dropped a module on the carpet, and it started manifesting problems about a month later. ECC is nice in that it tells you when it has a problem. [05:17]
Cerynphillipsj: When does it tell you this? During boot? [05:17]
phillipsjWhen testing my server and forgetting to plug in a CPU fan, I got memory errors/corrections logged to dmesg. Those (fully buffered) modules also log their temperature. [05:19]
CerynOkay. So it would take some work keeping yourself updated with the status. [05:21]
phillipsjphillipsj was planning to install cowling + exhaust fan for better cooling.
Ceryn, an uncorrectable error halts the machine unless you have mirroring enabled.
[05:21]
CerynOkay. [05:22]
phillipsjI cheaped out on the server, so it is taking a lot of my time to make sure it is stable :P [05:23]
CerynHaha yeah. That's a huge trade-off. [05:24]
phillipsjCan't believe I missed the PSU fan grinding before purchase (second hand, obviously). Was able to replace it with a slower speed fan of the same dimensions (but the server runs close to the de-rated (based on the difference in fan power draw) power load). [05:27]
....... (idle for 31mn)
***MadArchiv has quit IRC (Remote host closed the connection) [05:58]
.............. (idle for 1h6mn)
drumstick has quit IRC (Ping timeout: 248 seconds) [07:04]
............ (idle for 56mn)
drumstick has joined #archiveteam-bs [08:00]
................ (idle for 1h19mn)
BlueMaxim has quit IRC (Quit: Leaving) [09:19]
............ (idle for 59mn)
jschwart has quit IRC (Read error: Operation timed out) [10:18]
schbirid has quit IRC (Ping timeout: 255 seconds) [10:32]
schbirid has joined #archiveteam-bs [10:37]
...... (idle for 26mn)
godaneSketchCow: i'm breaking up the So Graham Norton tape cause it's 2 episodes
also what's funny is episode S01E25 is before S01E18
and it's not So Graham Norton but V Graham Norton
[11:03]
i'm doing the BalanceBall Fitness tape [11:13]
so you're getting max bitrate with this BalanceBall Fitness tape [11:25]
***drumstick has quit IRC (Ping timeout: 248 seconds) [11:26]
Stilett0 has quit IRC (Read error: Operation timed out) [11:38]
godanefun fact: the cover of the tape says Beginner's Workout but the title says Total Body Workout
the tape label also says Beginner's Workout
so i'm going with that for the label
[11:52]
***Stilett0 has joined #archiveteam-bs [12:01]
godanei found another tv tape
i'm going to do it at 6000k instead of 10000k cause it's a tv recording
[12:01]
***Stilett0 is now known as Stiletto [12:03]
godaneanyways i made screenshots with 6000k and 10000k
and it looks the same so i think 6000k is ok
SketchCow: here are the images: https://imgur.com/a/5QRap
top one is the 6000k one
bottom one is the 10000k one
[12:06]
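godane doesn't say which encoder he uses; assuming an ffmpeg-style pipeline, a comparison like the one behind those screenshots might come from something like this (file names hypothetical):

    ffmpeg -i capture.mkv -b:v 6000k -c:a copy tape-6000k.mp4
    ffmpeg -i capture.mkv -b:v 10000k -c:a copy tape-10000k.mp4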
................... (idle for 1h31mn)
so i found another duplicate tape
it was a down for love promo tape
[13:39]
***superkuh has joined #archiveteam-bs [13:40]
godaneanyways this tape has the last 2 episodes of Felicity for Season 2 [13:41]
...... (idle for 25mn)
***icedice has joined #archiveteam-bs [14:06]
..... (idle for 23mn)
godaneso i may have a partial Charmed recording on TNT [14:29]
SketchCowgodane: I appreciate your best approach, godane [14:31]
godanethis tape is going to have 8 minutes of black in it
with audio
cause there is a very bad tape section between the end of the Felicity recording and the Charmed recording
also this bad section starts before the end of the Felicity recording
luckily for us there is a bit of over-recording
either way i will break up the Felicity and Charmed parts at the 02:08:00 mark
[14:37]
***Mateon1 has quit IRC (Ping timeout: 255 seconds)
Mateon1 has joined #archiveteam-bs
[14:48]
odemg has quit IRC (Ping timeout: 248 seconds) [14:58]
jtn2_ has quit IRC (Quit: restarting for irssi security update)
jtn2 has joined #archiveteam-bs
[15:03]
...... (idle for 28mn)
dd0a13f37 has joined #archiveteam-bs [15:33]
dd0a13f37Is there anything like Library Genesis but for newspapers?
They link to magzdb.org, but it's in Russian and seems like it's broken
[15:33]
***icedice has quit IRC (Quit: Leaving) [15:46]
schbiridKaz: check out https://pypi.python.org/pypi/fake-useragent anyways ;) [15:51]
Kazschbirid: will probably drop that in when we get blocked again.. UAs we're using are from as far back as Chrome 40 [15:59]
dd0a13f37Could you please add googlebot user agents?
A lot of sites with paywalls give unrestricted access to google
[16:06]
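A hedged sketch of both ideas, assuming the grab shells out to wget (the target URL is a placeholder); fake-useragent's UserAgent().random is its documented way to pull a random real-world UA string:

    pip install fake-useragent
    UA="$(python -c 'from fake_useragent import UserAgent; print(UserAgent().random)')"
    wget --user-agent="$UA" https://example.com/article
    # or present as Googlebot, which some paywalled sites wave through
    wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article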
***jschwart has joined #archiveteam-bs [16:10]
dd0a13f37Did they remove the old addons from AMO yet?
It looks different
[16:14]
JAAhook54321: Yes, those cam grabs are still running.
dd0a13f37: They will be removed in June.
[16:22]
dd0a13f37Figured out a way to grab them all [16:26]
JAAWe're on it already. [16:28]
dd0a13f37Wiki page is off though
>The total number of addons should be approximately 20,000.
there are 760k .xpi files
[16:28]
JAAThere's an ArchiveBot which has been running for a while (over 2 months), and I think Somebody2 did something as well.
There's a difference between "number of addons" and "number of .xpi files".
The latter includes different platforms and previous versions.
[16:29]
dd0a13f37500k addons though
>459,938 add-ons found
The job for !a https://addons.mozilla.org/ is a slow approach
you can just make a list https://addons.mozilla.org/firefox/downloads/file/760000/ for !ao
[16:30]
JAAI think that's what Somebody2 did (outside of ArchiveBot). [16:32]
dd0a13f37ah okay, doesn't say so on the wiki page [16:32]
JAAHowever, this doesn't grab older versions or different platforms.
Yes, the wiki is often not exactly up to date.
[16:32]
dd0a13f37It does.
look here
https://addons.mozilla.org/en-US/firefox/addon/weather-extension/versions/
Hover over "Add to Firefox" links
[16:33]
JAAHuh, I see.
Do different platforms get individual IDs as well?
[16:33]
dd0a13f371 id = 1 xpi [16:34]
JAAI see. [16:37]
dd0a13f37If the file exists, you get 302, if not, 404
Should I !ao it?
[16:38]
JAAYou're right that !a AMO is not exactly efficient. However, it does also archive various data around the actual addons: descriptions, screenshots, reviews, collections. Plus it provides a browsable interface.
No, let's not.
As mentioned, Somebody2 has done something similar already (not sure what he did *exactly*) and the !a AMO job is also pretty far. But we could do that sometime next year, shortly before they purge all legacy addons.
760k URLs should be pretty quick anyway.
[16:38]
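A sketch of the list-based approach dd0a13f37 proposes: enumerate the numeric file IDs, then rely on the 302/404 distinction he mentions (the ID ceiling comes from the 760k figure above; the output file name is hypothetical):

    seq 1 760000 | sed 's|.*|https://addons.mozilla.org/firefox/downloads/file/&/|' > xpi-urls.txt
    # spot-check one ID: a HEAD request returns 302 if the .xpi exists, 404 if not
    curl -sI -o /dev/null -w '%{http_code}\n' 'https://addons.mozilla.org/firefox/downloads/file/760000/'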
CerynFor reference, you get 86,400 requests per day at one request per second. [16:44]
JAAYeah, obviously. [16:45]
CerynAnd apropos, is there a general crawling rate you prefer to avoid getting rate-limited on sites? I know Reddit only allows a request every 2 seconds.
:)
[16:46]
JAAWe've been hammering them with five connections and a very low delay for weeks now.
They do?
Isn't that for the API?
[16:46]
CerynOh, yes.
Huh. Maybe they don't do that for web scraping. But I'm pretty sure they don't want you to query more often.
[16:46]
JAAThey didn't seem to care when I grabbed a number of subreddits through ArchiveBot a while ago (after Charlottesville). [16:48]
CerynCool. How much did you grab? The entire thing? Did it work out well? [16:48]
dd0a13f37Large sites probably don't care, IMO it's better to start extremely high (e.g. max out your bandwidth) and see if you get blocked [16:49]
CerynHm. Hopefully the block would be very temporary, then. [16:50]
dd0a13f37Just switch IPs [16:51]
JAANo, not the entire thing, just select subreddits, in particular far-right ones.
Stuff like /r/EuropeanNationalism etc.
Some of them got banned recently.
[16:51]
CerynYeah, I meant the entire subreddits. Cool.
Oh? Did you expect that to happen?
[16:51]
JAAWell yeah, as far as it let me.
I think you can only get the last 1000 posts for a particular subreddit the normal way.
For anything older, you have to use the search with a special syntax.
Yeah, I wasn't at all surprised that they finally closed those shitholes.
They've been giving them bad press.
That seems like the only thing they care about.
[16:51]
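The special search syntax JAA refers to is presumably Reddit's cloudsearch timestamp filter (an assumption; the subreddit is the one named above and the epoch range is a placeholder):

    # posts submitted between two Unix timestamps, newest first
    curl 'https://www.reddit.com/r/EuropeanNationalism/search.json?q=timestamp:1483228800..1491004800&syntax=cloudsearch&restrict_sr=on&sort=new&limit=100'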
CerynOh okay. Didn't know that. [16:54]
dd0a13f37What about archiving voat? [16:54]
JAABy the way, there's a full Reddit archive available somewhere also. [16:54]
CerynDo you have any idea how many of the posts you managed to get, then? 90+%? (And how many is that?) [16:55]
JAAThey're continuously grabbing all comments etc.
No clue.
[16:55]
CerynOh sweet! [16:55]
JAANote that these grabs can't get all comments in large threads.
(That full archive I just mentioned should contain those.)
[16:55]
dd0a13f37If it's already fully archived, why bother? [16:56]
CerynWhy not? You don't follow links to further discussion? [16:56]
JAAdd0a13f37: Yeah, I've been thinking about grabbing Voat. I don't have time to set up something proper (i.e. using the API) currently though. [16:56]
dd0a13f37Ceryn: What do you mean? [16:57]
JAAAnd the reason is that that full archive is not easily accessible. I haven't looked at it in detail, but I think it's a database.
You can't browse it in the Wayback Machine, for example.
[16:57]
dd0a13f37Couldn't you generate html pages from the database? [16:57]
JAASure
The data is all there.
But the average user or journalist isn't going to do that.
[16:57]
Ceryndd0a13f37: Asking why full comment trees aren't available in his grab of subreddits. [16:57]
dd0a13f37That seems like most of the projects anyway. What's the point of archiving something, only for it to get darked and public 70+ years later? [16:58]
CerynI'm very interested in having a look at their Reddit database. Maybe it'll be good enough so I won't have to archive what I'm interested in. [16:58]
dd0a13f37Or even better, darking historically important content for political reasons [16:59]
CerynPersonally my archiving interest is just archiving for my own sake. I want to have it. And if I have it, I don't mind sharing it.
Generally, I think having the content available 70+ years later is part of the idea.
Obviously no one here wants data darkened.
[16:59]
JAACeryn: For one, those grabs ignored the per-comment links. But even if you grab those, you still don't handle the "load more comments" stuff. So yeah, it's not easily possible to archive an entire thread (unless you use the API to generate links to each comment in the thread or something like that). [17:01]
CerynOkay. Thanks for the clarification. [17:02]
dd0a13f37Sure, you want it available in 70 years, but if it's not available the 69 years before that, what's the point? To be able to pride yourself in that it's "theoretically" archived, even though you can't do anything useful with it? [17:02]
***Pixi has joined #archiveteam-bs [17:03]
JAAI believe you might be able to get access to it in certain circumstances. Also, laws can change, and if copyright finally gets the reform it so desperately needs, it might be possible for IA to undark it. [17:03]
Ceryndd0a13f37: So, if a data collection is darkened because somebody says you mustn't have it, should you delete it?
dd0a13f37: Or should you preemptively decide not to store anything because it might get darkened?
[17:03]
dd0a13f37JAA: And it can't go the other way around? They have to delete something, and whoops, it's gone since nobody could mirror it [17:04]
CerynDarkening sucks. But I like that the data is there. If someone really needs it, I expect it is possible to get it anyway. [17:04]
dd0a13f37Ceryn: If IA is the only one who has it, that's what happens, in practice. [17:04]
JAAdd0a13f37: I don't think anyone can force them to actually delete it. [17:04]
dd0a13f37Now, no. What about in 30 years? [17:05]
JAAHence Internet Archive Canada and the mirror in Alexandria. [17:05]
CerynHow often is something darkened? Is it really that much of it? [17:05]
dd0a13f37Ceryn: All the IS content, for example
That probably isn't mirrored in many places, since it's so sensitive
[17:05]
CerynAnd if the data is widely desirable, then peer to peer sharing will help keep it alive and distributed too. [17:05]
dd0a13f37Most countries have laws against even touching it [17:06]
CerynWhich IS content? [17:06]
dd0a13f37Their videos [17:06]
CerynOh. Okay. [17:06]
JAAFine to possess in many jurisdictions, as far as I know.
Distributing is a different thing, obviously.
[17:06]
dd0a13f37And since they're too incompetent to use P2P, that won't save it either. [17:07]
Ceryndd0a13f37: For me, in 30 years or whatever, I want to be able to peruse all the things I found interesting or nostalgic or worth saving at any point.
dd0a13f37: So, for me, even data I cannot share is worth storing. Assuming I want it.
[17:08]
dd0a13f37But for the 30 years leading up to that, it basically doesn't exist. [17:08]
CerynWell, whenever I want to see it I can. At any point. [17:09]
JAAYou two are talking about different things. Ceryn means that he can keep his own copy. dd0a13f37 is talking about IA having and distributing it. [17:09]
dd0a13f37Sure, if you store it, that is. But archiving something at IA only for it to sit in a data center for 70 years is utterly pointless [17:09]
CerynIf I am aware that others want it, I can share it most of the time. Sometimes laws don't agree with sharing it. But usually it's doable anyway.
JAA's right.
[17:09]
omglolbahHave to admit, I was somewhat concerned having my pipeline scrape the far-right stuff.... assume I'm in a registry now :p [17:10]
Ceryndd0a13f37: I think it loses much of its value if it's inaccessible to all but IA for 70 years. [17:10]
dd0a13f37Sure, doable, but if IA is the only place that has it, it can get traced back to them if it "leaks" [17:10]
Ceryndd0a13f37: BUT. I think it's very valuable to have it after 70 years and to the end of time. [17:10]
dd0a13f37In a sense, yes, but it's still quite pointless. A darknet archive, boy would that be something [17:11]
omglolbahnot sure why it would be pointless to have copies for future study? [17:11]
JAAIPFS? [17:11]
dd0a13f37IPFS is neither darknet nor production ready [17:11]
JAA*shrug* [17:12]
dd0a13f37bittorrent over i2p is better, just needs a nice frontend
omglolbah: Not entirely pointless, but it's one hell of a delayed gratification
[17:12]
omglolbahomglolbah peers over at the national archives of Norway where shit in runes sits in storage for study
all about time-scales <.<
[17:13]
dd0a13f37https://ia801504.us.archive.org/6/items/asaad2/asaad2.mp4 here is an example of content that will probably not be recovered
only reencodes available in public
Not illegal in the US, just IA randomly deciding to censor it
http://jihadology.net/2017/11/03/new-video-message-from-the-islamic-state-lions-of-the-battle-2-wilayat-%e1%b9%a3ala%e1%b8%a5-al-din/
[17:17]
JAAAre you sure about that? I found an *almost* identical file (just about 60 bytes bigger) within minutes...
Found another one which is 12 bytes bigger.
[17:22]
Ceryndd0a13f37: So, to be clear, the entire argument is "what is the point of IA continuing to store things that have been darkened", right?
Because in all other cases the data is just accessible.
Or not stored and lost.
[17:25]
JAAHmm, found a very ... interesting site while searching for that video. [17:28]
***odemg has joined #archiveteam-bs [17:29]
JAAPretty sure this is run by ISIS. [17:30]
dd0a13f37: The only differences between those video files I found are appended NULs, by the way. Probably fools many simple filters. [17:35]
joepie91dd0a13f37: the primary purpose of IA is preservation; access is just a means to that end
dd0a13f37: from that perspective, it absolutely makes sense to keep something sitting in a datacenter for 70 years if the alternative is total loss
also, periodic reminder that IPFS is neither an archival nor a storage medium, it's a distribution medium
[17:38]
JAASure, but distribution is exactly what was being discussed above. [17:41]
***tuluu has quit IRC (Read error: Operation timed out)
tuluu has joined #archiveteam-bs
[17:48]
pizzaiolo has joined #archiveteam-bs [17:57]
SketchCowWhat's happening here [17:59]
CerynSketchCow: A philosophical discussion on the merits of hoarding: If the data cannot be seen, how do you know it exists? [18:13]
godaneso i found a tape of Empire Falls but it's on dvd: https://www.amazon.com/Empire-Falls-Various/dp/B0009W5IMO
unless there is a reason to digitize the hbo airing, it's been skipped
[18:24]
i'm digitizing tape 1 of Universal vs eric corley
deposition of robert schumann
[18:32]
......... (idle for 41mn)
dd0a13f37JAA: They have an official tool to append NULs, upload to different mirror sites, etc. But the activity died down after Raqqa was liberated.
Ceryn: No, the point I'm trying to make is "what's the point in archiving something if it only gets immediately darked"
[19:16]
Ceryndd0a13f37: You can't know it's going to be darked, can you? [19:17]
dd0a13f37Then we could just shut down newsgrabber etc, wait for them to release their archives in 100 years, yet that's no good solution
Copyrighted content will be, and so will anything that offends their political sensibilities
[19:17]
CerynRight. So IA does not solve availability in the forseeable future for darkened things. It does, however, solve long term data preservation in that case. [19:19]
dd0a13f37Yeah, and then you might as well just bury hard drives in the ground and wait for archeologists to find them.
Libgen is doing a much better job of archiving and distributing knowledge, which seems to be the goal here (a database dump of a site isn't good enough since you need to be able to browse it too)
[19:19]
CerynThe issue doesn't have anything to do with IA, really, does it? It's about other parties disallowing distribution of data.
Sure you could do something, but it probably wouldn't be legal.
[19:20]
dd0a13f37I'm just pointing out that there's a contradiction. [19:21]
CerynI haven't read the IA manifest (yet). joepie91 states they primarily aim to preserve. In which case it makes sense for them to do what they do. [19:21]
dd0a13f37On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips. [19:22]
joepie91dd0a13f37: considering that I am completely unable to download anything from libgen due to country blocks, that might be a premature conclusion
(libgen doing better at distribution)
[19:22]
dd0a13f37What's the point of preservation if the data won't be available within a reasonable time span? [19:22]
joepie91they take a different approach, more legally shaky, with different tradeoffs [19:22]
dd0a13f37joepie91: install torbrowser [19:23]
joepie91you are wholly missing the point here [19:23]
CerynPreservation by its very nature is not about near future needs. [19:23]
joepie91you're stuck on a One True Vision of how you believe archival and distribution should work, without understanding the legal, political, technical, social implications of that approach, and without understanding that a *variety of approaches* is the correct solution here
which is already what we have
[19:23]
dd0a13f37No, I'm not. The content is more available if you need to spend 5 minutes downloading Tor once than if you need to wait 70 years. [19:24]
joepie91which means that different outlets take different approaches with different tradeoffs
dd0a13f37: those two are effectively the same thing for 99.99% of the population
seriously, take a step outside of your own perspective sometimes and understand the wider effects of different approaches
this is getting really tiring
[19:24]
dd0a13f37Just curious, can you access http://93.174.95.27/ ? [19:25]
joepie91nope, empty response [19:25]
dd0a13f37Fair point. But libgen still does make a larger amount of knowledge available to a larger amount of people, and for less resources. [19:25]
joepie91"to a larger amount of people" - this is absolutely false
"a larger amount of knowledge" - this is also very likely false
having to deal with legal complications limits the scalability of an archive
it's no different from how it's more difficult to move to a new house if you have an attic full of stuff you want to keep
the more stuff you need to keep around, the more difficult it is to move and respond to new situations
[19:26]
dd0a13f37If we count in sci-hub, I'm not so sure. It does have a lot of users in academia. [19:27]
joepie91for a reliable, long-term archive - ie. not something that is existing by virtue of currently not being regulated out of existence like libgen - you do not want to create legal problems where there are none
the only reason libgen is still around is because legislation and enforcement haven't been standardized between countries
[19:27]
dd0a13f37archive.org has a larger amount of knowledge, but libgen probably disseminates a larger amount of knowledge/hour [19:27]
joepie91this gap is closing increasingly more
dd0a13f37: what metrics are you basing that on?
[19:28]
dd0a13f37Libgen's issues are only technological. They could easily switch over to the darknet. [19:28]
joepie91no, they couldn't.
let me guess, you're an I2P user?
[19:28]
dd0a13f37Why not? They already have an I2P site. An onion would be trivial
Nope, only Tor.
[19:28]
joepie91so here's the thing, I've had I2P proponents try to argue this with me for literally 7-8 years now
"they could just move to I2P"
[19:29]
dd0a13f37What's wrong with I2P? [19:29]
joepie91the reality is that the barrier to install the necessary software is far too high for the average user, and that moving to a non-clearnet site means you lose 99% of your readership
believing that you can move to a darknet site without being poorly accessible is delusional
[19:29]
Frogging1% availability is still greater than 0% [19:29]
joepie91darknet and clearnet sites absolutely ARE NOT equivalent from an accessibility perspective [19:30]
CerynYou're solving different needs. [19:30]
dd0a13f37Having the main servers behind Tor/similar is not the same thing as only being accessible from Tor/similar. [19:30]
joepie91Frogging: sure, so set up an alternative archive with sketchier data on a darknet site
problem solved
this is what I said about variety of tactics
[19:30]
dd0a13f37pinkapp.io, for example. Or like gettor does. [19:30]
Froggingmakes sense [19:30]
joepie91my problem is with people trying to argue that EVERYTHING needs to make a certain set of tradeoffs, the same one
this includes "the IA shouldn't dark X"
"the IA should make an I2P site"
etc.
[19:30]
Froggingso I don't know what the argument is about if we all agree that multiple methods are viable and each has pitfalls [19:30]
joepie91(yes, I know it's called an eepsite, but not everybody here will)
Frogging: the discussion here started because dd0a13f37 is of the opinion that IA is unnecessarily darking things, it seems
[19:30]
dd0a13f37No, not really. Although IS darking is completely unnecessary, since it doesn't violate US law, that's beside the point. [19:31]
joepie91which seems to be a recurring discussion that never goes anywhere and just produces a lot of noise [19:31]
dd0a13f37The point is that there's a contradiction. On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips. [19:32]
joepie91dd0a13f37: I don't see a contradiction there. [19:32]
FroggingIA will accept either, but database dumps are incompatible with wayback [19:32]
dd0a13f37You're okay with something being inaccessible except for some arcane procedure involving possessing research credentials, sending an e-mail to IA, and having luck. On the other hand, you're not okay with something being inaccessible except for browsing a database dump, despite that "The data is all there", since "the average user or journalist isn't going to do that"
You don't see a minor contradiction there?
[19:34]
Froggingthe IA also benefits greatly from being a recognized legitimate entity [19:34]
dd0a13f37That's true. LG is limited scalability-wise from resources. [19:34]
joepie91dd0a13f37: you're conflating 'temporarily inaccessible' with 'permanently inaccessible'
dd0a13f37: something being darked does not mean it will not ever be public again
it is a temporary measure
[19:34]
dd0a13f37Within my lifetime, yes. [19:35]
joepie91(that is why it's not deleted)
it is temporary nevertheless
[19:35]
dd0a13f37Something in a database dump isn't permanently inaccessible either. [19:35]
joepie91dd0a13f37: if you can actually hear me out for a second [19:35]
dd0a13f37I can browse it just fine, could run a local instance of reddit and make a siterip. [19:36]
joepie91joepie91 sighs [19:36]
dd0a13f37Alright. [19:36]
joepie91the reason siterips are preferred over databases is because that concerns *permanent accessibility* -- you cannot reliably reproduce a website's operation from just a DB dump unless you have literally all of the components and infrastructure involved
nor is there a generic way to, given a database and site source code, make it accessible
this means that there is a cost to accessibility of raw data that many people will not pay
[19:36]
Froggingexact reproducability [19:36]
joepie91this will be perpetually true
not temporarily
this doesn't mean that the raw data *shouldn't* be archived, just that there should be a more accessible option
ie. a siterip
[19:36]
dd0a13f37That is true. However, in the case of reddit, the source code is public. And you could make a siterip from a dump if you're sure that the templates are correct. [19:37]
joepie91dd0a13f37: and that is an immense cost you've described there that many people are literally incapable of paying. [19:37]
Froggingstill not sure what you're arguing because nobody said db dumps weren't allowed [19:37]
dd0a13f37If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
Frogging: sure, but a goal is apparently to archive the pages
If something is darked for 1000 years, is it still "temporary"?
[19:38]
Froggingwhat does dumps vs siterips have to do with darking? [19:39]
joepie91[20:38] <dd0a13f37> If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
what?
[19:39]
dd0a13f37Dumps - inaccessible, siterips - accessible; dumps bad, siterips good; darked - inaccessible, siterips - accessible; both acceptable [19:39]
Froggingdumps aren't necessarily inaccessible, and siterips aren't necessarily accessible [19:40]
dd0a13f37Presuming they are, then. [19:40]
Froggingthese are separate issues
why would we presume that? it isn't true
[19:40]
dd0a13f37It's still a contradiction.
In general, it is.
[19:40]
joepie91dd0a13f37: it only looks like a contradiction because you're intentionally ignoring nuance that we've already pointed out
and I am getting tired of this discussion repeatedly clogging up the channel, to be frank
this is absolutely not in the least constructive
[19:40]
dd0a13f37joepie91: If I would generate a siterip from a local issue of reddit and a dump, and then upload the generated pages to WB, if we disregard the fact that the metadata might be off (e.g. page X wasn't fetched from reddit on date X but rather my local instance of reddit on date X, running the same software and DB), don't we in practice have the same thing as a 'proper' siterip, but without being limited by ratelimits? [19:41]
CerynYou could make a wiki page explaining the what and why and refer to that. :) [19:41]
Froggingthere's nothing to explain because the whole argument is based on premises that make no sense and aren't true [19:42]
joepie91dd0a13f37: but we don't "disregard" that, and that still requires work, and I've told you all these things before and you need to stop twisting arguments to make them sound like a contradiction
right, precisely what Frogging said
this is a non-discussion
[19:42]
CerynApparently there's a root of confusion somewhere, if people keep instigating similar discussions. [19:42]
dd0a13f37Well, they're not the exact same thing, and I'm not claiming that either. But they are similar. A siterip (uploaded to WB) is more accessible than a dump though, that's just the way it is. [19:42]
joepie91Ceryn: most people don't... [19:43]
CerynOkay. [19:43]
joepie91dd0a13f37: what are you actually trying to accomplish with this discussion? [19:43]
dd0a13f37I've only 'instigated' this discussion once, please don't conflate me with other people who aren't me.
joepie91: I'm just trying to figure out what's the deal with the contradiction, how accessibility is super important in some cases and utterly unimportant in others, as long as it's "temporary".
[19:43]
joepie91dd0a13f37: except all of these premises are wrong and there is no contradiction, as we have repeatedly told you
so why are you continuing the discussion along the same lines that we've already pointed out are false?
[19:44]
Froggingdd0a13f37: I think IA would like to have everything accessible all the time, but it's not always legally possible. they don't dark things just because they feel like it [19:45]
dd0a13f37I'm not discussing the IA's actions here, I'm discussing what's the point in uploading to IA and not somewhere else if we know IA'll just dark it. However, they do dark things just because they feel like it, the IS videos are a prime example of this. They're constitutionally protected speech.
joepie91: How are the premises wrong? You've said it yourself that browsing a database dump is inconvenient and wayback is convenient.
[19:46]
joepie91oh for fucks sake [19:46]
Froggingyeah, that's already been answered. people can upload to more than one place. there's no policy that says otherwise [19:47]
joepie91dd0a13f37: the premises are wrong because you keep misrepresenting the points being made and/or the reality of things to the point of inaccuracy, and it is utterly pointless to continue discussing any of this with you because you just keep piling on more presumptions and misrepresentations and whatnot
if you don't trust IA to keep something available, upload a copy elsewhere
and with that I hope we can conclude the discussion
[19:47]
dd0a13f37Well okay, that's a fair point.
There was a discussion about google's rate limits earlier. Would it be possible to use startpage or some similar proxy to bypass this? They're not nearly as strict.
[19:48]
FroggingI'm not sure how startpage interfaces with google, but anecdotally I've noticed that the results aren't always the same or as numerous on Startpage
so if that matters, use caution when assuming it's a transparent proxy to google
[19:53]
JAALet me just point you to Searx. [19:53]
dd0a13f37That could be true. Still better than nothing though. [19:54]
JAAAnd also YaCy.
Not sure about the quality of search results on the latter.
[19:54]
dd0a13f37How come searx doesn't get rate limited? [19:54]
Frogginggoogle's ratelimiting is really touchy. I only share an IP with 3 people and I get the captcha at least once a day [19:54]
dd0a13f37Well, then I could just use bing (or friends, yahoo/ddg) [19:55]
JAANot sure about the rate limits on Searx, to be honest.
But it does aggregate results from various engines (including Google, Bing, Yahoo), so that probably helps.
[19:56]
dd0a13f37I'm just wondering. How does it avoid getting rate limited by the backend search engines? [19:57]
***arkhive has joined #archiveteam-bs [20:03]
arkhiveI'm trying to save go90 Original videos. I'm having trouble figuring out how to save web videos. HTML5 or Flash. I'm a bit of a noob. Can someone help me?
go90 is a streaming service (free) from Verizon. they have dumped millions into it and are losing tons of money. laying off employees. So i think it's time to Ctrl-C Ctrl-V
[20:04]
dd0a13f37try youtube-dl
if that fails, open inspect element, network tab, play a video, and see if there's a pattern in the requests it makes and if you can figure them out from the url
>Sorry!
>go90™ Mobile TV Network is only available in the US right now.
[20:08]
godaneyoutube-dl will not work with go90 [20:23]
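Assuming godane is right that youtube-dl can't handle go90, the network-tab route dd0a13f37 outlines usually ends at an HLS manifest; if go90 serves HLS (an assumption), the grab might look like this (URLs and file name hypothetical):

    # find the master .m3u8 in the browser's network tab while a video plays, then:
    ffmpeg -i 'https://cdn.example.com/path/master.m3u8' -c copy episode.mkv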
***TheLovina has quit IRC (Read error: Operation timed out)
TheLovina has joined #archiveteam-bs
TheLovina has quit IRC (Read error: Connection reset by peer)
[20:36]
............. (idle for 1h2mn)
atrocity has quit IRC (Ping timeout: 246 seconds) [21:42]
ranmaarkhive: get a jetpack and sub to VZW for a month?
is there anything of value even on go90?
speaking as a recent employee of VZW
arkhive: maybe download NOX or Bluestacks (or virtualize Android if you can), install the app, sniff the traffic to see if they're calling https or http gets?
oh yeah, go90 isn't bandwidth-free on prepaid
[21:44]
......... (idle for 44mn)
***dd0a13f37 has quit IRC (Quit: Connection closed for inactivity) [22:32]
.... (idle for 19mn)
drumstick has joined #archiveteam-bs
Asparagir has joined #archiveteam-bs
[22:51]
BlueMaxim has joined #archiveteam-bs [23:00]
atrocity has joined #archiveteam-bs [23:11]
..... (idle for 24mn)
jschwart has quit IRC (Quit: Konversation terminated!) [23:35]
