[00:05] *** schbirid has quit IRC (Ping timeout: 256 seconds)
[00:16] *** schbirid has joined #archiveteam-bs
[00:35] <SketchCow> godane: looking good (the tapes)
[00:53] <godane> i still have a ton to upload
[01:01] <godane> i uploaded 603 items so far this month: https://archive.org/details/@chris85?and[]=addeddate%3A2017-11&sort=-publicdate
[01:03] <godane> that collection is like slow sink drip for government docs
[01:04] <godane> i know for a fact there is over 500000 ids used with dtic
[01:58] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[02:15] *** SketchCow has quit IRC (Read error: Operation timed out)
[02:19] *** SketchCow has joined #archiveteam-bs
[02:19] *** swebb sets mode: +o SketchCow
[02:42] *** schbirid has quit IRC (Ping timeout: 255 seconds)
[02:51] *** drumstick has quit IRC (Read error: Operation timed out)
[02:52] *** drumstick has joined #archiveteam-bs
[02:54] *** schbirid has joined #archiveteam-bs
[03:40] *** Pixi has joined #archiveteam-bs
[03:59] *** MadArchiv has joined #archiveteam-bs
[04:01] <MadArchiv> I finally managed to get site-grab to work on my computer (somehow), I'll take some much-needed sleep for now and
[04:05] <Ceryn> Damn. He must have needed it. :)
[04:05] <MadArchiv> (Sigh, I *really* need to stop pressing the Enter key by accident) ...and then I'll archiving the completed webcomics from Hiveworks tomorrow
[04:06] *** Pixi has quit IRC (Quit: Pixi)
[04:07] <MadArchiv> Ceryn: You bet it, pal
[04:07] <Ceryn> Are you uploading to Archive.org?
[04:07] <MadArchiv> Of course.
[04:08] <Ceryn> Are dumps of a site generally just uploaded occasionally, no deduplication or continuation from the last batch or anything?
[04:09] <MadArchiv> These are webcomic sites, so I suppose there are no dumps at all. I could be wrong though.
[04:09] <Ceryn> I mean dumps in the general sense, data dumps, whatever format the data has.
[04:11] <MadArchiv> You mean the ones I would get after I crawl the site? I'm new to this, and I'm also pretty computer-illiterate, so I'm learning this stuff as I go along.
[04:12] <MadArchiv> Wait, are you still talking about webcomic sites?
[04:13] <Ceryn> Heh okay. I'm new to archiving but I'm rather computer-literate.
[04:14] <Ceryn> My question was on how Archive.org handles when a site or something is uploaded, then uploaded again later and so on, in several copies.
[04:14] *** qw3rty4 has joined #archiveteam-bs
[04:16] <Ceryn> Is each copy just stored separately? Or is the data deduplicated (i.e. identical parts of the data will only be stored once, then the other copies will refer to the master copy of those parts)? Or do future uploads only add what is new compared to the last upload?
[04:16] <Ceryn> I guess these questions take some knowledge of Archive.org's workings to answer.
[04:17] <MadArchiv> From what I've seen, they just take in different copies of the same thing as if they were different items.
[04:17] <Ceryn> Yeah. It's by far the easiest to handle, but it's also the least efficient.
[04:17] <MadArchiv> Yup
[04:17] <Ceryn> Though of course they could be doing things under the hood.
[04:18] <MadArchiv> Like they do with excluded sites.
[04:19] *** qw3rty3 has quit IRC (Read error: Operation timed out)
[04:29] *** BlueMaxim has quit IRC (Quit: Leaving)
[04:33] <MadArchiv> By the way, I'm dowloading
[04:35] <MadArchiv> Goddamnit, I did it again, just wait a bit so I can finish writing my post
[04:41] *** MadArchiv has quit IRC (Read error: Operation timed out)
[04:42] *** MadArchiv has joined #archiveteam-bs
[04:45] <MadArchiv> Alright, so, (I kinda forgot what I was going to write in the first place so let's just get to the point) do you know of any *other* webcomics I or, ideally, we should archive? Once I'm done with the Hiveworks ones, I mean.
[04:46] <Ceryn> I don't. I've never been into comics. I'm guessing others know some, however.
[04:49] <MadArchiv> Hmmm, alright. Do you think (and this is a legitimate question, by the way) that if we're gonna make a manual list of comics that should be saved then IRC would be the best place for it? I was thinking about putting it on Reddit since it'd more accessible.
[04:50] <MadArchiv> Grammar correction: add a "by the way" before "do you think"
[04:51] <Ceryn> IRC definitely isn't the place for such a list. When you want to paste stuff on IRC you usually paste it to a paste bin service (e.g. pastebin.com) and just post the link.
[04:52] <Ceryn> If you want it to be a joint effort then a reddit thread would probably be good. If the project was rather large in scope it seems maybe you'd organise a group here, make your own IRC channel and start a wiki page or something.
[04:54] <phillipsj> Ceryn, duplicate uploads can have varying quality as well. In the spring I was considering doing a bunch of YouTube channels in DVD quality: allows broader coverage for the same space. But is kinda pointless if the same channel was uploaded in HD already.
[04:55] <Ceryn> phillipsj: Right. Except new videos. They'd make sense to upload.
[04:55] <phillipsj> Youtube seems the thottle you a lot if you try downloading faster than a human can watch though.
[04:56] <Ceryn> phillipsj: But it is not common practice, then, to extend data you have already uploaded? People don't generally keep up to date mirrors and sync them with the Archive?
[04:56] <Ceryn> Okay. Hm. I guess you can parallelise Youtube videos though? At least to some extent.
[04:56] <phillipsj> Ceryn, never bothered to upload anything becuase they were begging for money to store the stuff.
[04:57] <Ceryn> Okay.
[04:57] <Ceryn> I mostly plan to archive for my own storage, but it seems I might as well upload a copy to the Archive too.
[04:57] <phillipsj> Other things came up as well. I put the machine I was suing in storage because Youtube was becomming to distracting for me. (Have IRL stuff to do)
[04:58] <phillipsj> *using
[04:58] <Ceryn> Okay.
[04:58] <Ceryn> Archive must have so much data bloat. Stuff they could optimise away.
[04:58] <phillipsj> I have a stack of blank DVD+Rs slowly rotting away.
[04:59] <Ceryn> Heh. Very slowly.
[04:59] <phillipsj> Was thinking of using them of my local copy.
[04:59] <phillipsj> *for
[05:00] <Ceryn> How come you want to store on DVDs as opposed to HDDs?
[05:01] <Ceryn> For actual DVD player use?
[05:01] <phillipsj> I expect the DVDs to last longer.
[05:01] <Ceryn> Yeah. It just seems so cumbersome. To be able to store only 4.7 GB or however much it is.
[05:02] <phillipsj> They are also slightly cheaper per GB I blieve. (But possibly not worth the inconvenience)
[05:02] *** BlueMaxim has joined #archiveteam-bs
[05:02] <Ceryn> If you want to scale then no, definitely not worth it. Because you'd have to check them periodically just to know your data was intact.
[05:03] <phillipsj> I think th espetch of a plan is to try to use my remaing disks, and if it goes well, maybe buy more.
[05:03] <phillipsj> *sketch
[05:03] <Ceryn> Heh. It seems all of our problems can be solved by "buying more disks".
[05:04] <Ceryn> Works every time.
[05:04] <phillipsj> I origanally bough them for back-ups, but the back-up verification failed at the restore step.
[05:05] <phillipsj> New back-up plan is server with doubly-redundant ZFS and ECC RAM.
[05:05] <Ceryn> :P nice.
[05:05] <Ceryn> Do you have any stats on how likely normal RAM is to screw you over?
[05:06] <phillipsj> Bonus points if I encrypt that data in transit and at rest (by turing the server off).
[05:07] <phillipsj> Not off-hand, but if I am going to the trouble of redundancy in the face of a disk failure, I don't want bad RAM to mess up my data.
[05:07] <Ceryn> You can lukscrypt each raw drive, open them, and then set up a ZFS pool on the encrypted volumes.
[05:07] <Ceryn> That's what I've done, at least. Seems to work pretty well.
[05:07] <Ceryn> (I'm assuming ZFS on Linux.)
[05:08] <phillipsj> I like FreeBSD, not that I am good at actaully configuring it.
[05:08] <Ceryn> I suppose. It just seems the entire setup becomes significantly more expensive if you need hardware that supports ECC RAM.
[05:09] <Ceryn> Okay.
[05:10] <phillipsj> My "workstation" currently has mirrored, striped ZFs across 4 disks (non-ECC RAM though). Boots with a simulated controller failure (cable unplugged).
[05:11] <hook54321> JAA: Are you still grabbing the Catalonia cameras?
[05:12] <phillipsj> Scared to try a scrub without a proper back-up though.
[05:17] <phillipsj> Ceryn, I had to roll-back a non-ECC mememory upgrade on one of my machines: dropped a module on the carpet, and it started manifesting problems about a month later. ECC is nice in that it tells you when it has a problem.
[05:17] <Ceryn> phillipsj: When does it tell you this? During boot?
[05:19] <phillipsj> When testing my server and forgetting to plug in a CPU fan, I got memory errors/corrections logged to dmesg. Those (fully buffered) modules also log their temperature as well.
[05:21] <Ceryn> Okay. So it would take some work keeping yourself updated with the status.
[05:21] * phillipsj was planning to install cowling + exhaust fan for better colling.
[05:22] <phillipsj> Ceryn, an uncorrectable error halts the machine unless you have mirroring enabled.
[05:22] <phillipsj> s/colling/cooling/
[05:22] <Ceryn> Okay.
[05:23] <phillipsj> I cheaped out on the server, so it it taking a lot of my time top make sure it is stable :P
[05:24] <Ceryn> Haha yeah. That's a huge trade-off.
[05:27] <phillipsj> Can't beleive I missed the PSU fan grinding before purchase (second hand, obviously). Was able to rpelace it with a slower speed fan of the same dimensions (but server runs close to the de-rated (based on difference in fan power draw) power load).
[05:58] *** MadArchiv has quit IRC (Remote host closed the connection)
[07:04] *** drumstick has quit IRC (Ping timeout: 248 seconds)
[08:00] *** drumstick has joined #archiveteam-bs
[09:19] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:18] *** jschwart has quit IRC (Read error: Operation timed out)
[10:32] *** schbirid has quit IRC (Ping timeout: 255 seconds)
[10:37] *** schbirid has joined #archiveteam-bs
[11:03] <godane> SketchCow: i'm breaking up the So Graham Norton tape cause its 2 episode
[11:03] <godane> *episodes
[11:04] <godane> also whats funny is episode S01E25 is before S01E18
[11:04] <godane> and its not So Graham Norton but V Graham Norton
[11:13] <godane> i'm dong BalanceBall Fitness tape
[11:25] <godane> so your getting max bitrate with this BalanceBall Fitness tape
[11:26] *** drumstick has quit IRC (Ping timeout: 248 seconds)
[11:38] *** Stilett0 has quit IRC (Read error: Operation timed out)
[11:52] <godane> fun fact: cover of tape say Beginner's Workout but title say Total Body Workout
[11:53] <godane> if the tape label said Beginner's Workout
[11:53] <godane> so i'm going for that for label
[12:01] *** Stilett0 has joined #archiveteam-bs
[12:01] <godane> i found another tv tape
[12:02] <godane> i'm going to do it at 6000k instead of 10000k cause its a tv recording
[12:03] *** Stilett0 is now known as Stiletto
[12:06] <godane> anyways i made screenshots with 6000k and 10000k
[12:06] <godane> and it looks the same so i think 6000k is ok
[12:08] <godane> SketchCow: here are the images: https://imgur.com/a/5QRap
[12:08] <godane> top one is the 6000k one
[12:08] <godane> bottom one is 10000k one
[13:39] <godane> so i found another duplicate tape
[13:39] <godane> it was down for love promo tape
[13:40] *** superkuh has joined #archiveteam-bs
[13:41] <godane> anyways this tape has the last 2 episodes of Felicity for Season 2
[14:06] *** icedice has joined #archiveteam-bs
[14:29] <godane> so i may have partial Charm recording on TNT
[14:31] <SketchCow> godane: I appreciate your best approach, godane
[14:37] <godane> this tapes is going to have 8 minutes of black in it
[14:37] <godane> with audio
[14:38] <godane> cause there is very bad tape between the end of felicity section and Charmed recording
[14:40] <godane> also this bad sections start before the end of felicity recording
[14:41] <godane> luckly for use there is bit of over recording
[14:41] <godane> either way i will break up the felicity and charmed part at 02:08:00 mark
[14:48] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[14:49] *** Mateon1 has joined #archiveteam-bs
[14:58] *** odemg has quit IRC (Ping timeout: 248 seconds)
[15:03] *** jtn2_ has quit IRC (Quit: restarting for irssi security update)
[15:05] *** jtn2 has joined #archiveteam-bs
[15:33] *** dd0a13f37 has joined #archiveteam-bs
[15:33] <dd0a13f37> Is there anything like Library Genesis but for newspapers?
[15:34] <dd0a13f37> They link to magzdb.org, but it's in Russian and seems like it's broken
[15:46] *** icedice has quit IRC (Quit: Leaving)
[15:51] <schbirid> Kaz: check out https://pypi.python.org/pypi/fake-useragent anyways ;)
[15:59] <Kaz> schbirid: will probably drop that in when we get blocked again.. UAs we're using are from as far back as Chrome 40
[16:06] <dd0a13f37> Could you please add googlebot user agents?
[16:06] <dd0a13f37> A lot of sites with paywalls give unrestricted access to google
[16:10] *** jschwart has joined #archiveteam-bs
[16:14] <dd0a13f37> Did they remove the old addons from AMO yet?
[16:14] <dd0a13f37> It looks different
[16:22] <JAA> hook54321: Yes, those cam grabs are still running.
[16:22] <JAA> dd0a13f37: They will be removed in June.
[16:26] <dd0a13f37> Figured out a way to grab them all
[16:28] <JAA> We're on it already.
[16:28] <dd0a13f37> Wiki page is off though
[16:28] <dd0a13f37> >The total number of addons should be approximately 20,000.
[16:28] <dd0a13f37> there are 760k .xpi files
[16:29] <JAA> There's an ArchiveBot which has been running for a while (over 2 months), and I think Somebody2 did something as well.
[16:29] <JAA> There's a difference between "number of addons" and "number of .xpi files".
[16:30] <JAA> The latter includes different platforms and previous versions.
[16:30] <dd0a13f37> 500k addons though
[16:31] <dd0a13f37> >459,938 add-ons found
[16:31] <dd0a13f37> The job for !a https://addons.mozilla.org/ is a slow approach
[16:32] <dd0a13f37> you can just make a list https://addons.mozilla.org/firefox/downloads/file/760000/ for !ao
[16:32] <JAA> I think that's what Somebody2 did (outside of ArchiveBot).
[16:32] <dd0a13f37> ah okay, doesn't say so on the wiki page
[16:32] <JAA> However, this doesn't grab older versions or different platforms.
[16:32] <JAA> Yes, the wiki is often not exactly up to date.
[16:33] <dd0a13f37> It does.
[16:33] <dd0a13f37> look here
[16:33] <dd0a13f37> https://addons.mozilla.org/en-US/firefox/addon/weather-extension/versions/
[16:33] <dd0a13f37> Hover over "Add to Firefox" links
[16:33] <JAA> Huh, I see.
[16:34] <JAA> Do different platforms get individual IDs as well?
[16:34] <dd0a13f37> 1 id = 1 xpi
[16:37] <JAA> I see.
[16:38] <dd0a13f37> If the file exists, you get 302, if not, 404
[16:38] <dd0a13f37> Should I !ao it? 
[16:38] <JAA> You're right that !a AMO is not exactly efficient. However, it does also archive various data around the actual addons: descriptions, screenshots, reviews, collections. Plus it provides a browsable interface.
[16:39] <JAA> No, let's not.
[16:40] <JAA> As mentioned, Somebody2 has done something similar already (not sure what he did *exactly*) and the !a AMO job is also pretty far. But we could do that sometime next year, shortly before they purge all legacy addons.
[16:40] <JAA> 760k URLs should be pretty quick anyway.
[16:44] <Ceryn> For reference, you get 86400 requests per hour at one request per second.
[16:45] <JAA> Yeah, obviously.
[16:46] <Ceryn> And apropos, is there a general crawling rate you prefer to avoid getting rate-limited on sites? I know Reddit only allows a request every 2 seconds.
[16:46] <Ceryn> :)
[16:46] <JAA> We've been hammering them with five connections and a very low delay for weeks now.
[16:46] <JAA> They do?
[16:46] <JAA> Isn't that for the API?
[16:46] <Ceryn> Oh, yes.
[16:47] <Ceryn> Huh. Maybe they don't do that for web scraping. But I'm pretty sure they don't want you to query more often.
[16:48] <JAA> They didn't seem to care when I grabbed a number of subreddits through ArchiveBot a while ago (after Charlottesville).
[16:48] <Ceryn> Cool. How much did you grab? The entire thing? Did it work out well?
[16:49] <dd0a13f37> Large sites probably don't care, IMO it's better to start extremely high (e.g. max out your bandwidth) and see if you get blocked
[16:50] <Ceryn> Hm. Hopefully the block would be very temporary, then.
[16:51] <dd0a13f37> Just switch IPs
[16:51] <JAA> No, not the entire thing, just select subreddits, in particular far-right ones.
[16:51] <JAA> Stuff like /r/EuropeanNationalism etc.
[16:51] <JAA> Some of them got banned recently.
[16:51] <Ceryn> Yeah, I meant the entire subreddits. Cool.
[16:51] <Ceryn> Oh? Did you expect that to happen?
[16:51] <JAA> Well yeah, as far as it let me.
[16:52] <JAA> I think you can only get the last 1000 posts for a particular subreddit the normal way.
[16:52] <JAA> For anything older, you have to use the search with a special syntax.
[16:52] <JAA> Yeah, I wasn't at all surprised that they finally closed those shitholes.
[16:52] <JAA> They've been giving them bad press.
[16:53] <JAA> That seems like the only thing they care about.
[16:54] <Ceryn> Oh okay. Didn't know that.
[16:54] <dd0a13f37> What about archiving voat?
[16:54] <JAA> By the way, there's a full Reddit archive available somewhere also.
[16:55] <Ceryn> Do you have any idea how many of the posts you managed to get, then? 90+%? (And how many is that?)
[16:55] <JAA> They're continuously grabbing all comments etc.
[16:55] <JAA> No clue.
[16:55] <Ceryn> Oh sweet!
[16:55] <JAA> Note that these grabs can't get all comments in large threads.
[16:55] <JAA> (That full archive I just mentioned should contain those.)
[16:56] <dd0a13f37> If it's already fully archived, why bother?
[16:56] <Ceryn> Why not? You don't follow links to further discussion?
[16:56] <JAA> dd0a13f37: Yeah, I've been thinking about grabbing Voat. I don't have time to set something proper (i.e. using the API) up currently though.
[16:57] <dd0a13f37> Ceryn: What do you mean?
[16:57] <JAA> And the reason is that that full archive is not easily accessible. I haven't looked at it in detail, but I think it's a database.
[16:57] <JAA> You can't browse it in the Wayback Machine, for example.
[16:57] <dd0a13f37> Couldn't you generate html pages from the database?
[16:57] <JAA> Sure
[16:57] <JAA> The data is all there.
[16:57] <JAA> But the average user or journalist isn't going to do that.
[16:57] <Ceryn> dd0a13f37: Asking why full comment trees aren't available in his grab of subreddits.
[16:58] <dd0a13f37> That seems like most of the projects anyway. What's the point of archiving something, only for it to get darked and public 70+ years later?
[16:58] <Ceryn> I'm very interested in having a look at their Reddit database. Maybe it'll be good enough so I won't have to archive what I'm interested in.
[16:59] <dd0a13f37> Or even better, darking historically important content for political reasons
[16:59] <Ceryn> Personally my archiving interest is just archiving for my own sake. I want to have it. And if I have it, I don't mind sharing it.
[17:00] <Ceryn> Generally, I think having the content available 70+ years later is part of the idea.
[17:01] <Ceryn> Obviously no one here wants data darkened.
[17:01] <JAA> Ceryn: For one, those grabs ignored the per-comment links. But even if you grab those, you still don't handle the "load more comments" stuff. So yeah, it's not easily possible to archive an entire thread (unless you use the API to generate links to each comment in the thread or something like that).
[17:02] <Ceryn> Okay. Thanks for the clarification.
[17:02] <dd0a13f37> Sure, you want it available in 70 years, but if it's not available the 69 years before that, what's the point? To be able to pride yourself in that it's "theoretically" archived, even though you can't do anything useful with it?
[17:03] *** Pixi has joined #archiveteam-bs
[17:03] <JAA> I believe you might be able to get access to it in certain circumstances. Also, laws can change, and if copyright finally gets the reform it so desperately needs, it might be possible for IA to undark it.
[17:03] <Ceryn> dd0a13f37: So, if a data collection is darkened because somebody says you mustn't have it, should you delete it?
[17:03] <Ceryn> dd0a13f37: Or should you preemptively decide not to store anything because it might get darkened?
[17:04] <dd0a13f37> JAA: And it can't go the other way around? They have to delete something, and whoops, it's gone since nobody could mirror it
[17:04] <Ceryn> Darkening sucks. But I like that the data is there. If someone really needs it, I expect it is possible to get it anyway.
[17:04] <dd0a13f37> Ceryn: If IA is the only one who has it, that's what happens, in practice.
[17:04] <JAA> dd0a13f37: I don't think anyone can force them to actually delete it.
[17:05] <dd0a13f37> Now, no. What about in 30 years?
[17:05] <JAA> Hence Internet Archive Canada and the mirror in Alexandria.
[17:05] <Ceryn> How often is something darkened? Is it really that much of it?
[17:05] <dd0a13f37> Ceryn: All the IS content, for example
[17:05] <dd0a13f37> That probably isn't mirrored in many places, since it's so sensitive
[17:05] <Ceryn> And if the data is widely desirable, then peer to peer sharing will help keep it alive and distributed too.
[17:06] <dd0a13f37> Most countries have laws against even touching it
[17:06] <Ceryn> Which IS content?
[17:06] <dd0a13f37> Their videos
[17:06] <Ceryn> Oh. Okay.
[17:06] <JAA> Fine to possess in many jurisdictions, as far as I know.
[17:07] <JAA> Distributing is a different thing, obviously.
[17:07] <dd0a13f37> And since they're too incompetent to use P2P, that won't save it either.
[17:08] <Ceryn> dd0a13f37: For me, in 30 years or whatever, I want to be able to peruse all the things I found interesting or nostalgic or worth saving at any point.
[17:08] <Ceryn> dd0a13f37: So, for me, even data I cannot share is worth storing. Assuming I want it.
[17:08] <dd0a13f37> But for the 30 years leading up to that, it basically doesn't exist.
[17:09] <Ceryn> Well, whenever I want to see it I can. At any point.
[17:09] <JAA> You two are talking about different things. Ceryn means that he can keep his own copy. dd0a13f37 is talking about IA having and distributing it.
[17:09] <dd0a13f37> Sure, if you store it, that is. But archiving something at IA only for it to sit in a data center for 70 years is utterly pointless
[17:09] <Ceryn> If I am aware that others want it, I can share it most of the time. Sometimes laws don't agree with sharing it. But usually it's doable anyway.
[17:09] <Ceryn> JAA's right.
[17:10] <omglolbah> Have to admit, I was somewhat concerned having my pipeline scrape the far-right stuff.... assume I'm in a registry now :p
[17:10] <Ceryn> dd0a13f37: I think it loses much of its value if it's inaccessible to all but IA for 70 years.
[17:10] <dd0a13f37> Sure, doable, but if IA is the only place that has it, it can get traced back to them if it "leaks"
[17:10] <Ceryn> dd0a13f37: BUT. I think it's very valuable to have it after 70 years and to the end of time.
[17:11] <dd0a13f37> In a sense, yes, but it's still quite pointless. A darknet archive, boy would that be something
[17:11] <omglolbah> not sure why it would be pointless to have copies for future study?
[17:11] <JAA> IPFS?
[17:11] <dd0a13f37> IPFS is neither darknet nor production ready
[17:12] <JAA> *shrug*
[17:12] <dd0a13f37> bittorrent over i2p is better, just needs a nice frontend
[17:12] <dd0a13f37> omglolbah: Not entirely pointless, but it's one hell of a delayed gratification
[17:13] * omglolbah peers over at the national archives of Norway where shit in runes sits in storage for study
[17:13] <omglolbah> all about time-scales <.<
[17:17] <dd0a13f37> https://ia801504.us.archive.org/6/items/asaad2/asaad2.mp4 here is an example of content that will probably not be recovered
[17:17] <dd0a13f37> only reencodes available in public
[17:18] <dd0a13f37> Not illegal in the US, just IA randomly deciding to censor it
[17:18] <dd0a13f37> http://jihadology.net/2017/11/03/new-video-message-from-the-islamic-state-lions-of-the-battle-2-wilayat-%e1%b9%a3ala%e1%b8%a5-al-din/
[17:22] <JAA> Are you sure about that? I found an *almost* identical file (just about 60 bytes bigger) within minutes...
[17:24] <JAA> Found another one which is 12 bytes bigger.
[17:25] <Ceryn> dd0a13f37: So, to be clear, the entire argument is "what is the point of IA continuing to store things that have been darkened", right?
[17:25] <Ceryn> Because in all other cases the data is just accessible.
[17:25] <Ceryn> Or not stored and lost.
[17:28] <JAA> Hmm, found a very ... interesting site while searching for that video.
[17:29] *** odemg has joined #archiveteam-bs
[17:30] <JAA> Pretty sure this is run by ISIS.
[17:35] <JAA> dd0a13f37: The only differences between those video files I found are appended NULs, by the way. Probably fools many simple filters.
[17:38] <joepie91> dd0a13f37: the primary purpose of IA is preservation; access is just a means to that end
[17:38] <joepie91> dd0a13f37: from that perspective, it absolutely makes sense to keep something sitting in a datacenter for 70 years if the alternative is total loss
[17:39] <joepie91> also, periodic reminder that IPFS is neither an archival nor a storage medium, it's a distribution medium
[17:41] <JAA> Sure, but distribution is exactly what was being discussed above.
[17:48] *** tuluu has quit IRC (Read error: Operation timed out)
[17:50] *** tuluu has joined #archiveteam-bs
[17:57] *** pizzaiolo has joined #archiveteam-bs
[17:59] <SketchCow> What's happening here
[18:13] <Ceryn> SketchCow: A philosophical discussion on the merits of hoarding: If the data cannot be seen, how do you know it exists?
[18:24] <godane> so i found a tape of Empire Falls but its on dvd: https://www.amazon.com/Empire-Falls-Various/dp/B0009W5IMO
[18:24] <godane> in less there is reason to digitize the hbo airing its been skipped
[18:32] <godane> i'm digitizing tape 1 of Universal vs eric corley
[18:35] <godane> deposition of robert schumann
[19:16] <dd0a13f37> JAA: They have an official tool to append NULs, upload to different mirror sites, etc. But the activity died down after Raqqa was liberated.
[19:17] <dd0a13f37> Ceryn: No, the point I'm trying to make is "what's the point in archiving something if it only gets immediately darked"
[19:17] <Ceryn> dd0a13f37: You can't know it's going to be darked, can you?
[19:17] <dd0a13f37> Then we could just shut down newsgrabber etc, wait for them to release their archives in 100 years, yet that's no good solution
[19:18] <dd0a13f37> Copyrighted content will, and if it offends their political sensibilities it will
[19:19] <Ceryn> Right. So IA does not solve availability in the forseeable future for darkened things. It does, however, solve long term data preservation in that case.
[19:19] <dd0a13f37> Yeah, and then you might as well just bury hard drives in the ground and wait for archeologists to find them.
[19:20] <dd0a13f37> Libgen is doing a much better job of archiving and distributing knowledge, which seems to be the goal here (a database dump of a site isn't good enough since you need to be able to browse it too)
[19:20] <Ceryn> The issue doesn't have anything to do with IA, really, does it? It's about other parties disallowing distribution of data.
[19:20] <Ceryn> Sure you could do something, but it probably wouldn't be legal.
[19:21] <dd0a13f37> I'm just pointing out that there's a contradiction.
[19:21] <Ceryn> I haven't read the IA manifest (yet). joepie91 states they primarily aim to preserve. In which case it makes sense for them to do what they do.
[19:22] <dd0a13f37> On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips.
[19:22] <joepie91> dd0a13f37: considering that I am completely unable to download anything from libgen due to country blocks, that might be a premature conclusion
[19:22] <joepie91> (libgen doing better at distribution)
[19:22] <dd0a13f37> What's the point of preservation if the data won't be available within a reasonable time span?
[19:22] <joepie91> they take a different approach, more legally shaky, with different tradeoffs
[19:23] <dd0a13f37> joepie91: install torbrowser
[19:23] <joepie91> you are wholly missing the point here
[19:23] <Ceryn> Preservation by its very nature is not about near future needs.
[19:23] <joepie91> you're stuck on a One True Vision of how you believe archival and distribution should work, without understanding the legal, political, technical, social implications of that approach, and without understanding that a *variety of approaches* is the correct solution here
[19:24] <joepie91> which is already what we have
[19:24] <dd0a13f37> No, I'm not. The content is more available if you need to spend 5 minutes downloading Tor once than if you need to wait 70 years.
[19:24] <joepie91> which means that different outlets take different approaches with different tradeoffs
[19:24] <joepie91> dd0a13f37: those two are effectively the same thing for 99.99% of the population
[19:25] <joepie91> seriously, take a step outside of your own perspective sometimes and understand the wider effects of different approaches
[19:25] <joepie91> this is getting really tiring
[19:25] <dd0a13f37> Just curious, can you access http://93.174.95.27/ ?
[19:25] <joepie91> nope, empty response
[19:25] <dd0a13f37> Fair point. But libgen still does make a larger amount of knowledge available to a larger amount of people, and for less resources.
[19:26] <joepie91> "to a larger amount of people" - this is absolutely false
[19:26] <joepie91> "a larger amount of knowledge" - this is also very likely false
[19:26] <joepie91> having to deal with legal complications limits the scalability of an archive
[19:26] <joepie91> it's no different from how it's more difficult to move to a new house if you have an attic full of stuff you want to keep
[19:27] <joepie91> the more stuff you need to keep around, the more difficult it is to move and respond to new situations
[19:27] <dd0a13f37> If we count in sci-hub, I'm not so sure. It does have a lot of users in academia.
[19:27] <joepie91> for a reliable, long-term archive - ie. not something that is existing by virtue of currently not being regulated out of existence like libgen - you do not want to create legal problems where there are none
[19:27] <joepie91> the only reason libgen is still around is because legislation and enforcement haven't been standardized between countries
[19:27] <dd0a13f37> archive.org has a larger amount of knowledge, but libgen probably disseminates a larger amount of knowledge/hour
[19:28] <joepie91> this gap is closing increasingly more
[19:28] <joepie91> dd0a13f37: what metrics are you basing that on?
[19:28] <dd0a13f37> Libgen's issues are only technological. They could easily switch over to the darknet.
[19:28] <joepie91> no, they couldn't.
[19:28] <joepie91> let me guess, you're an I2P user?
[19:28] <dd0a13f37> Why not? They already have an I2P site. An onion would be trivial
[19:28] <dd0a13f37> Nope, only Tor.
[19:29] <joepie91> so here's the thing, I've had I2P proponents try to argue this with me for literally 7-8 years now
[19:29] <joepie91> "they could just move to I2P"
[19:29] <dd0a13f37> What's wrong with I2P?
[19:29] <joepie91> the reality is that the barrier to install the necessary software is far too high for the average user, and that moving to a non-clearnet site means you lose 99% of your readership
[19:29] <joepie91> believing that you can move to a darknet site without being poorly accessible is delusional
[19:29] <Frogging> 1% availability is still greater than 0%
[19:30] <joepie91> darknet and clearnet sites absolutely ARE NOT equivalent from an accessibility perspective
[19:30] <Ceryn> You're solving different needs.
[19:30] <dd0a13f37> Having the main servers behind Tor/similar is not the same thing as only being accessible from Tor/similar.
[19:30] <joepie91> Frogging: sure, so set up an alternative archive with sketchier data on a darknet site
[19:30] <joepie91> problem solved
[19:30] <joepie91> this is what I said about variety of tactics
[19:30] <dd0a13f37> pinkapp.io, for example. Or like gettor does.
[19:30] <Frogging> makes sense
[19:30] <joepie91> my problem is with people trying to argue that EVERYTHING needs to make a certain set of tradeoffs, the same one
[19:30] <joepie91> this includes "the IA shouldn't dark X"
[19:30] <joepie91> "the IA should make an I2P site"
[19:30] <joepie91> etc.
[19:30] <Frogging> so I don't know what the argument is about if we all agree that multiple methods are viable and each has pitfalls
[19:30] <joepie91> (yes, I know it's called an eepsite, but not everybody here will)
[19:31] <joepie91> Frogging: the discussion here started because dd0a13f37 is of the opinion that IA is unnecessarily darking things, it seems
[19:31] <dd0a13f37> No, not really. Although IS darking is completely unnecessary, since it doesn't violate US law, that's beside the point.
[19:31] <joepie91> which seems to be a recurring discussion that never goes anywhere and just produces a lot of noise
[19:32] <dd0a13f37> The point is that there's a contradiction. On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips.
[19:32] <joepie91> dd0a13f37: I don't see a contradiction there.
[19:32] <Frogging> IA will accept either, but database dumps are incompatible with wayback
[19:34] <dd0a13f37> You're okay with something being inaccessible except for some arcane procedure involving possessing research credentials, sending an e-mail to IA, and having luck. On the other hand, you're not okay with something being inaccessible except for browsing a database dump, despite that "The data is all there", since "the average user or journalist isn't going to do that"
[19:34] <dd0a13f37> You don't see a minor contradiction there?
[19:34] <Frogging> the IA also benefits greatly from being a recognized legitimate entity
[19:34] <dd0a13f37> That's true. LG is limited scalability-wise from resources.
[19:34] <joepie91> dd0a13f37: you're conflating 'temporarily inaccessible' with 'permanently inaccessible'
[19:35] <joepie91> dd0a13f37: something being darked does not mean it will not ever be public again
[19:35] <joepie91> it is a temporary measure
[19:35] <dd0a13f37> Within my lifetime, yes.
[19:35] <joepie91> (that is why it's not deleted)
[19:35] <joepie91> it is temporary nevertheless
[19:35] <dd0a13f37> Something in a database dump isn't permanently inaccessible either.
[19:35] <joepie91> dd0a13f37: if you can actually hear me out for a second
[19:36] <dd0a13f37> I can browse it just fine, could run a local instance of reddit and make a siterip.
[19:36] * joepie91 sighs
[19:36] <dd0a13f37> Alright.
[19:36] <joepie91> the reason siterips are preferred over databases is because that concerns *permanent accessibility* -- you cannot reliably reproduce a website's operation from just a DB dump unless you have literally all of the components and infrastructure involved
[19:36] <joepie91> nor is there a generic way to, given a database and site source code, make it accessible
[19:36] <joepie91> this means that there is a cost to accessibility of raw data that many people will not pay
[19:36] <Frogging> exact reproducability
[19:36] <joepie91> this will be perpetually true
[19:36] <joepie91> not temporarily
[19:37] <joepie91> this doesn't mean that the raw data *shouldn't* be archived, just that there should be a more accessible option
[19:37] <joepie91> ie. a siterip
[19:37] <dd0a13f37> That is true. However, in the case of reddit, the source code is public. And you could make a siterip from a dump if you're sure that the templates are correct.
[19:37] <joepie91> dd0a13f37: and that is an immense cost you've described there that many people are literally incapable of paying.
[19:37] <Frogging> still not sure what you're arguing because nobody said db dumps weren't allowed
[19:38] <dd0a13f37> If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
[19:38] <dd0a13f37> Frogging: sure, but a goal is apparently to archive the pages
[19:38] <dd0a13f37> If something is darked for 1000 years, is it still "temporary"?
[19:39] <Frogging> what does dumps vs siterips have to do with darking?
[19:39] <joepie91> [20:38] <dd0a13f37> If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
[19:39] <joepie91> what?
[19:39] <dd0a13f37> Dumps - inaccessible, siterips - accessible; dumps bad, siterips good; darked - inaccessible, siterips - accessible; both acceptable
[19:40] <Frogging> dumps aren't necessarily inaccessible, and siterips aren't necessarily accessible
[19:40] <dd0a13f37> Presuming they are, then.
[19:40] <Frogging> these are separate issues
[19:40] <Frogging> why would we presume that? it isn't true
[19:40] <dd0a13f37> It's still a contradiction.
[19:40] <dd0a13f37> In general, it is.
[19:40] <joepie91> dd0a13f37: it only looks like a contradiction because you're intentionally ignoring nuance that we've already pointed out
[19:41] <joepie91> and I am getting tired of this discussion repeatedly clogging up the channel, to be frank
[19:41] <joepie91> this is absolutely not in the least constructiove
[19:41] <joepie91> constructive*
[19:41] <dd0a13f37> joepie91: If I would generate a siterip from a local issue of reddit and a dump, and then upload the generated pages to WB, if we disregard the fact that the metadata might be off (e.g. page X wasn't fetched from reddit on date X but rather my local instance of reddit on date X, running the same software and DB), don't we in practice have the same thing as a 'proper' siterip, but without being limited by ratelimits?
[19:41] <Ceryn> You could make a wiki page explaining the what and why and refer to that. :)
[19:42] <Frogging> there's nothing to explain because the whole argument is based on premises that make no sense and aren't true
[19:42] <joepie91> dd0a13f37: but we don't "disregard" that, and that still requires work, and I've told you all these things before and you need to stop twisting arguments to make them sound like a contradiction
[19:42] <joepie91> right, precisely what Frogging said
[19:42] <joepie91> this is a non-discussion
[19:42] <Ceryn> Apparently there's a root of confusion somewhere, if people keep instigating similar discussions.
[19:42] <dd0a13f37> Well, they're not the exact same thing, and I'm not claiming that either. But they are similar. A siterip (uploaded to WB) is more accessible than a dump though, that's just the way it is.
[19:43] <joepie91> Ceryn: most people don't...
[19:43] <Ceryn> Okay.
[19:43] <joepie91> dd0a13f37: what are you actually trying to accomplish with this discussion?
[19:43] <dd0a13f37> I've only 'instigated' this discussion once, please don't conflate me with other people who aren't me.
[19:44] <dd0a13f37> joepie91: I'm just trying to figure out what's the deal with the contradiction, how accessibility is super important in some cases and utterly unimportant in others, as long as it's "temporary".
[19:44] <joepie91> dd0a13f37: except all of these premises are wrong and there is no contradiction, as we have repeatedly told you
[19:44] <joepie91> so why are you continuing the discussion along the same lines that we've already pointed out are false?
[19:45] <Frogging> dd0a13f37: I think IA would like to have everything accessible all the time, but it's not always legally possible. they don't dark things just because they feel like it
[19:46] <dd0a13f37> I'm not discussing the IA's actions here, I'm discussing what's the point in uploading to IA and not somewhere else if we know IA'll just dark it. However, they do dark things just because they feel like it, the IS videos are a prime example of this. They're constitutionally protected speech.
[19:46] <dd0a13f37> joepie91: How are the premises wrong? You've said it yourself that browsing a database dump is inconvenient and wayback is convenient.
[19:46] <joepie91> oh for fucks sake
[19:47] <Frogging> yeah, that's already been answered. people can upload to more than one place. there's no policy that says otherwise
[19:47] <joepie91> dd0a13f37: the premises are wrong because you keep misrepresenting the points being made and/or the reality of things to the point of inaccuracy, and it is utterly pointless to continue discussing any of this with you because you just keep piling on more presumptions and misrepresentatioins and whatnot
[19:47] <joepie91> if you don't trust IA to keep something available, upload a copy elsewhere
[19:47] <joepie91> and with that I hope we can conclude the discussion
[19:48] <dd0a13f37> Well okay, that's a fair point.
[19:52] <dd0a13f37> There was a discussion about google's rate limits earlier. Would it be possible to use startpage or some similar proxy to bypass this? They're not nearly as strict.
[19:53] <Frogging> I'm not sure how startpage interfaces with google, but anecdotally I've noticed that the results aren't always the same or as numerous on Startpage
[19:53] <Frogging> so if that matters, use caution when assuming it's a transparent proxy to google
[19:53] <JAA> Let me just point you to Searx.
[19:54] <dd0a13f37> That could be true. Still better than nothing though.
[19:54] <JAA> And also YaCy.
[19:54] <JAA> Not sure about the quality of search results on the latter.
[19:54] <dd0a13f37> How come searx doesn't get rate limited?
[19:54] <Frogging> google's ratelimiting is really touchy. I only share an IP with 3 people and I get the captcha at least once a day
[19:55] <dd0a13f37> Well, then I could just use bing (or friends, yahoo/ddg)
[19:56] <JAA> Not sure about the rate limits on Searx, to be honest.
[19:56] <JAA> But it does aggregate results from various engines (including Google, Bing, Yahoo), so that probably helps.
[19:57] <dd0a13f37> I'm just wondering. How does it avoid getting rate limited by the backend search engines?
[20:03] *** arkhive has joined #archiveteam-bs
[20:04] <arkhive> I'm trying to save go90 Original videos. I'm having trouble figuring out how to save web videos. HTML5 or Flash. I'm a bit of a noob. Can someone help me?
[20:05] <arkhive> go90 is a streaming service(free) from Verizon. they have dumped millions into it and are losing tons of money. laying off employees. So i think it's time to Ctrl-C Ctrl-V
[20:08] <dd0a13f37> try youtube-dl
[20:08] <dd0a13f37> if that fails, open inspect element, network tab, play a video, and see if there's a pattern in the requests it make and if you can figure them out from the url
[20:11] <dd0a13f37> >Sorry!
[20:11] <dd0a13f37> >go90™ Mobile TV Network is only available in the US right now.
[20:23] <godane> youtube-dl will not work with go90
[20:36] *** TheLovina has quit IRC (Read error: Operation timed out)
[20:37] *** TheLovina has joined #archiveteam-bs
[20:40] *** TheLovina has quit IRC (Read error: Connection reset by peer)
[21:42] *** atrocity has quit IRC (Ping timeout: 246 seconds)
[21:44] <ranma> arkhive: get a jetpack and sub to VZW for a month?
[21:44] <ranma> is there anything of value even on go90?
[21:45] <ranma> speaking as a recent employee of VZW
[21:47] <ranma> arkhive: maybe download NOX or Bluestacks (or virtualize Android if you can), install the app, sniff the traffic to see if they're calling https or http gets?
[21:48] <ranma> oh yeah, go90 isn't bandwidth-free on prepaid
[22:32] *** dd0a13f37 has quit IRC (Quit: Connection closed for inactivity)
[22:51] *** drumstick has joined #archiveteam-bs
[22:54] *** Asparagir has joined #archiveteam-bs
[23:00] *** BlueMaxim has joined #archiveteam-bs
[23:11] *** atrocity has joined #archiveteam-bs
[23:35] *** jschwart has quit IRC (Quit: Konversation terminated!)