#archiveteam-bs 2017-11-04,Sat


Time Nickname Message
00:05 🔗 schbirid has quit IRC (Ping timeout: 256 seconds)
00:16 🔗 schbirid has joined #archiveteam-bs
00:35 🔗 SketchCow godane: looking good (the tapes)
00:53 🔗 godane i still have a ton to upload
01:01 🔗 godane i uploaded 603 items so far this month: https://archive.org/details/@chris85?and[]=addeddate%3A2017-11&sort=-publicdate
01:03 🔗 godane that collection is like a slow sink drip for government docs
01:04 🔗 godane i know for a fact there are over 500,000 ids used with dtic
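(The per-month count godane quotes can be pulled programmatically from archive.org's advancedsearch API; a minimal Python sketch, where the uploader query value is an assumption to be adjusted for the actual account:)

    import requests

    # Count items an account added in November 2017 via IA's advancedsearch
    # API. 'uploader:(chris85)' is a guess; IA normally keys this field on
    # the uploader's e-mail address.
    params = {
        'q': 'uploader:(chris85) AND addeddate:[2017-11-01 TO 2017-11-30]',
        'rows': 0,           # we only want the hit count, not the rows
        'output': 'json',
    }
    r = requests.get('https://archive.org/advancedsearch.php', params=params)
    print(r.json()['response']['numFound'], 'items added this month')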
01:58 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
02:15 🔗 SketchCow has quit IRC (Read error: Operation timed out)
02:19 🔗 SketchCow has joined #archiveteam-bs
02:19 🔗 swebb sets mode: +o SketchCow
02:42 🔗 schbirid has quit IRC (Ping timeout: 255 seconds)
02:51 🔗 drumstick has quit IRC (Read error: Operation timed out)
02:52 🔗 drumstick has joined #archiveteam-bs
02:54 🔗 schbirid has joined #archiveteam-bs
03:40 🔗 Pixi has joined #archiveteam-bs
03:59 🔗 MadArchiv has joined #archiveteam-bs
04:01 🔗 MadArchiv I finally managed to get site-grab to work on my computer (somehow), I'll take some much-needed sleep for now and
04:05 🔗 Ceryn Damn. He must have needed it. :)
04:05 🔗 MadArchiv (Sigh, I *really* need to stop pressing the Enter key by accident) ...and then I'll be archiving the completed webcomics from Hiveworks tomorrow
04:06 🔗 Pixi has quit IRC (Quit: Pixi)
04:07 🔗 MadArchiv Ceryn: You bet it, pal
04:07 🔗 Ceryn Are you uploading to Archive.org?
04:07 🔗 MadArchiv Of course.
04:08 🔗 Ceryn Are dumps of a site generally just uploaded occasionally, no deduplication or continuation from the last batch or anything?
04:09 🔗 MadArchiv These are webcomic sites, so I suppose there are no dumps at all. I could be wrong though.
04:09 🔗 Ceryn I mean dumps in the general sense, data dumps, whatever format the data has.
04:11 🔗 MadArchiv You mean the ones I would get after I crawl the site? I'm new to this, and I'm also pretty computer-illiterate, so I'm learning this stuff as I go along.
04:12 🔗 MadArchiv Wait, are you still talking about webcomic sites?
04:13 🔗 Ceryn Heh okay. I'm new to archiving but I'm rather computer-literate.
04:14 🔗 Ceryn My question was on how Archive.org handles when a site or something is uploaded, then uploaded again later and so on, in several copies.
04:14 🔗 qw3rty4 has joined #archiveteam-bs
04:16 🔗 Ceryn Is each copy just stored separately? Or is the data deduplicated (i.e. identical parts of the data will only be stored once, then the other copies will refer to the master copy of those parts)? Or do future uploads only add what is new compared to the last upload?
04:16 🔗 Ceryn I guess these questions take some knowledge of Archive.org's workings to answer.
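(For illustration: the deduplication scheme Ceryn describes is content-addressed storage — chunks are keyed by their hash, so identical data is stored once and later uploads only add chunks not seen before. A toy Python sketch, not how archive.org actually stores items:)

    import hashlib

    store = {}          # digest -> chunk; the single "master copy" of each part
    CHUNK = 1 << 20     # fixed 1 MiB chunks, for simplicity

    def dedup_store(path):
        """Record a file as a list of chunk digests; duplicate chunks cost nothing."""
        digests = []
        with open(path, 'rb') as f:
            while chunk := f.read(CHUNK):
                d = hashlib.sha256(chunk).hexdigest()
                store.setdefault(d, chunk)   # stored only on first sight
                digests.append(d)
        return digests   # re-uploading identical data adds no new chunks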
04:17 🔗 MadArchiv From what I've seen, they just take in different copies of the same thing as if they were different items.
04:17 🔗 Ceryn Yeah. It's by far the easiest to handle, but it's also the least efficient.
04:17 🔗 MadArchiv Yup
04:17 🔗 Ceryn Though of course they could be doing things under the hood.
04:18 🔗 MadArchiv Like they do with excluded sites.
04:19 🔗 qw3rty3 has quit IRC (Read error: Operation timed out)
04:29 🔗 BlueMaxim has quit IRC (Quit: Leaving)
04:33 🔗 MadArchiv By the way, I'm dowloading
04:35 🔗 MadArchiv Goddamnit, I did it again, just wait a bit so I can finish writing my post
04:41 🔗 MadArchiv has quit IRC (Read error: Operation timed out)
04:42 🔗 MadArchiv has joined #archiveteam-bs
04:45 🔗 MadArchiv Alright, so, (I kinda forgot what I was going to write in the first place so let's just get to the point) do you know of any *other* webcomics I or, ideally, we should archive? Once I'm done with the Hiveworks ones, I mean.
04:46 🔗 Ceryn I don't. I've never been into comics. I'm guessing others know some, however.
04:49 🔗 MadArchiv Hmmm, alright. By the way, do you think (and this is a legitimate question) that if we're gonna make a manual list of comics that should be saved, IRC would be the best place for it? I was thinking about putting it on Reddit since it'd be more accessible.
04:51 🔗 Ceryn IRC definitely isn't the place for such a list. When you want to paste stuff on IRC you usually paste it to a paste bin service (e.g. pastebin.com) and just post the link.
04:52 🔗 Ceryn If you want it to be a joint effort then a reddit thread would probably be good. If the project was rather large in scope it seems maybe you'd organise a group here, make your own IRC channel and start a wiki page or something.
04:54 🔗 phillipsj Ceryn, duplicate uploads can have varying quality as well. In the spring I was considering doing a bunch of YouTube channels in DVD quality: allows broader coverage for the same space. But is kinda pointless if the same channel was uploaded in HD already.
04:55 🔗 Ceryn phillipsj: Right. Except new videos. They'd make sense to upload.
04:55 🔗 phillipsj Youtube seems to throttle you a lot if you try downloading faster than a human can watch though.
04:56 🔗 Ceryn phillipsj: But it is not common practice, then, to extend data you have already uploaded? People don't generally keep up to date mirrors and sync them with the Archive?
04:56 🔗 Ceryn Okay. Hm. I guess you can parallelise Youtube videos though? At least to some extent.
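(On that note, youtube-dl — itself Python — exposes both knobs: a per-download rate limit to stay under "watching speed" throttling, and easy parallelism across channels. A sketch with placeholder channel URLs, capping at 480p as a stand-in for the "DVD quality" idea above:)

    # pip install youtube_dl
    from concurrent.futures import ThreadPoolExecutor
    import youtube_dl

    opts = {
        'ratelimit': 500 * 1024,        # ~500 KiB/s per download, roughly human speed
        'format': 'best[height<=480]',  # cap quality for broader coverage per GB
    }

    def grab(channel_url):
        with youtube_dl.YoutubeDL(opts) as ydl:
            ydl.download([channel_url])

    channels = ['https://www.youtube.com/user/example']   # placeholders
    with ThreadPoolExecutor(max_workers=3) as pool:       # modest parallelism
        list(pool.map(grab, channels))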
04:56 🔗 phillipsj Ceryn, never bothered to upload anything because they were begging for money to store the stuff.
04:57 🔗 Ceryn Okay.
04:57 🔗 Ceryn I mostly plan to archive for my own storage, but it seems I might as well upload a copy to the Archive too.
04:57 🔗 phillipsj Other things came up as well. I put the machine I was using in storage because Youtube was becoming too distracting for me. (Have IRL stuff to do)
04:58 🔗 Ceryn Okay.
04:58 🔗 Ceryn Archive must have so much data bloat. Stuff they could optimise away.
04:58 🔗 phillipsj I have a stack of blank DVD+Rs slowly rotting away.
04:59 🔗 Ceryn Heh. Very slowly.
04:59 🔗 phillipsj Was thinking of using them for my local copy.
05:00 🔗 Ceryn How come you want to store on DVDs as opposed to HDDs?
05:01 🔗 Ceryn For actual DVD player use?
05:01 🔗 phillipsj I expect the DVDs to last longer.
05:01 🔗 Ceryn Yeah. It just seems so cumbersome. To be able to store only 4.7 GB or however much it is.
05:02 🔗 phillipsj They are also slightly cheaper per GB I believe. (But possibly not worth the inconvenience)
05:02 🔗 BlueMaxim has joined #archiveteam-bs
05:02 🔗 Ceryn If you want to scale then no, definitely not worth it. Because you'd have to check them periodically just to know your data was intact.
05:03 🔗 phillipsj I think the sketch of a plan is to try to use my remaining disks, and if it goes well, maybe buy more.
05:03 🔗 Ceryn Heh. It seems all of our problems can be solved by "buying more disks".
05:04 🔗 Ceryn Works every time.
05:04 🔗 phillipsj I originally bought them for back-ups, but the back-up verification failed at the restore step.
05:05 🔗 phillipsj New back-up plan is server with doubly-redundant ZFS and ECC RAM.
05:05 🔗 Ceryn :P nice.
05:05 🔗 Ceryn Do you have any stats on how likely normal RAM is to screw you over?
05:06 🔗 phillipsj Bonus points if I encrypt that data in transit and at rest (by turning the server off).
05:07 🔗 phillipsj Not off-hand, but if I am going to the trouble of redundancy in the face of a disk failure, I don't want bad RAM to mess up my data.
05:07 🔗 Ceryn You can lukscrypt each raw drive, open them, and then set up a ZFS pool on the encrypted volumes.
05:07 🔗 Ceryn That's what I've done, at least. Seems to work pretty well.
05:07 🔗 Ceryn (I'm assuming ZFS on Linux.)
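(A sketch of that layering, driving cryptsetup and zpool from Python; the device names are placeholders and luksFormat destroys whatever is on them, so treat this as an order-of-operations outline rather than a ready-made script:)

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    drives = ['/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd']   # placeholders!

    # 1. LUKS-format each raw drive, then open it as a mapped device.
    for i, dev in enumerate(drives):
        run('cryptsetup', 'luksFormat', dev)        # prompts for a passphrase
        run('cryptsetup', 'open', dev, f'crypt{i}')

    # 2. Build the pool on the *encrypted* volumes. raidz2 gives the double
    #    redundancy mentioned above (survives two disk failures).
    mapped = [f'/dev/mapper/crypt{i}' for i in range(len(drives))]
    run('zpool', 'create', 'tank', 'raidz2', *mapped)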
05:08 🔗 phillipsj I like FreeBSD, not that I am good at actually configuring it.
05:08 🔗 Ceryn I suppose. It just seems the entire setup becomes significantly more expensive if you need hardware that supports ECC RAM.
05:09 🔗 Ceryn Okay.
05:10 🔗 phillipsj My "workstation" currently has mirrored, striped ZFS across 4 disks (non-ECC RAM though). Boots with a simulated controller failure (cable unplugged).
05:11 🔗 hook54321 JAA: Are you still grabbing the Catalonia cameras?
05:12 🔗 phillipsj Scared to try a scrub without a proper back-up though.
05:17 🔗 phillipsj Ceryn, I had to roll-back a non-ECC memory upgrade on one of my machines: dropped a module on the carpet, and it started manifesting problems about a month later. ECC is nice in that it tells you when it has a problem.
05:17 🔗 Ceryn phillipsj: When does it tell you this? During boot?
05:19 🔗 phillipsj When testing my server and forgetting to plug in a CPU fan, I got memory errors/corrections logged to dmesg. Those (fully buffered) modules also log their temperature as well.
05:21 🔗 Ceryn Okay. So it would take some work keeping yourself updated with the status.
05:21 🔗 * phillipsj was planning to install cowling + exhaust fan for better cooling.
05:22 🔗 phillipsj Ceryn, an uncorrectable error halts the machine unless you have mirroring enabled.
05:22 🔗 Ceryn Okay.
05:23 🔗 phillipsj I cheaped out on the server, so it is taking a lot of my time to make sure it is stable :P
05:24 🔗 Ceryn Haha yeah. That's a huge trade-off.
05:27 🔗 phillipsj Can't believe I missed the PSU fan grinding before purchase (second hand, obviously). Was able to replace it with a slower-speed fan of the same dimensions (but the server runs close to the de-rated (based on the difference in fan power draw) power load).
05:58 🔗 MadArchiv has quit IRC (Remote host closed the connection)
07:04 🔗 drumstick has quit IRC (Ping timeout: 248 seconds)
08:00 🔗 drumstick has joined #archiveteam-bs
09:19 🔗 BlueMaxim has quit IRC (Quit: Leaving)
10:18 🔗 jschwart has quit IRC (Read error: Operation timed out)
10:32 🔗 schbirid has quit IRC (Ping timeout: 255 seconds)
10:37 🔗 schbirid has joined #archiveteam-bs
11:03 🔗 godane SketchCow: i'm breaking up the So Graham Norton tape cause it's 2 episodes
11:04 🔗 godane also what's funny is episode S01E25 is before S01E18
11:04 🔗 godane and it's not So Graham Norton but V Graham Norton
11:13 🔗 godane i'm doing the BalanceBall Fitness tape
11:25 🔗 godane so you're getting max bitrate with this BalanceBall Fitness tape
11:26 🔗 drumstick has quit IRC (Ping timeout: 248 seconds)
11:38 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
11:52 🔗 godane fun fact: the cover of the tape says Beginner's Workout but the title says Total Body Workout
11:53 🔗 godane the tape label says Beginner's Workout though
11:53 🔗 godane so i'm going with that for the label
12:01 🔗 Stilett0 has joined #archiveteam-bs
12:01 🔗 godane i found another tv tape
12:02 🔗 godane i'm going to do it at 6000k instead of 10000k cause it's a tv recording
12:03 🔗 Stilett0 is now known as Stiletto
12:06 🔗 godane anyways i made screenshots with 6000k and 10000k
12:06 🔗 godane and it looks the same so i think 6000k is ok
12:08 🔗 godane SketchCow: here are the images: https://imgur.com/a/5QRap
12:08 🔗 godane top one is the 6000k one
12:08 🔗 godane bottom one is the 10000k one
13:39 🔗 godane so i found another duplicate tape
13:39 🔗 godane it was a down for love promo tape
13:40 🔗 superkuh has joined #archiveteam-bs
13:41 🔗 godane anyways this tape has the last 2 episodes of Felicity for Season 2
14:06 🔗 icedice has joined #archiveteam-bs
14:29 🔗 godane so i may have a partial Charmed recording on TNT
14:31 🔗 SketchCow godane: I appreciate your best approach, godane
14:37 🔗 godane this tape is going to have 8 minutes of black in it
14:37 🔗 godane with audio
14:38 🔗 godane cause there is very bad tape between the end of the felicity section and the Charmed recording
14:40 🔗 godane also this bad section starts before the end of the felicity recording
14:41 🔗 godane luckily for us there is a bit of over-recording
14:41 🔗 godane either way i will break up the felicity and charmed parts at the 02:08:00 mark
14:48 🔗 Mateon1 has quit IRC (Ping timeout: 255 seconds)
14:49 🔗 Mateon1 has joined #archiveteam-bs
14:58 🔗 odemg has quit IRC (Ping timeout: 248 seconds)
15:03 🔗 jtn2_ has quit IRC (Quit: restarting for irssi security update)
15:05 🔗 jtn2 has joined #archiveteam-bs
15:33 🔗 dd0a13f37 has joined #archiveteam-bs
15:33 🔗 dd0a13f37 Is there anything like Library Genesis but for newspapers?
15:34 🔗 dd0a13f37 They link to magzdb.org, but it's in Russian and seems like it's broken
15:46 🔗 icedice has quit IRC (Quit: Leaving)
15:51 🔗 schbirid Kaz: check out https://pypi.python.org/pypi/fake-useragent anyways ;)
15:59 🔗 Kaz schbirid: will probably drop that in when we get blocked again.. UAs we're using are from as far back as Chrome 40
16:06 🔗 dd0a13f37 Could you please add googlebot user agents?
16:06 🔗 dd0a13f37 A lot of sites with paywalls give unrestricted access to google
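(For reference, the library schbirid linked is a one-liner to use, and mixing in a Googlebot UA as dd0a13f37 suggests is a small extension; a sketch, where the Googlebot string is the one Google publishes and the 20% share is arbitrary:)

    # pip install fake-useragent
    import random
    from fake_useragent import UserAgent

    ua = UserAgent()
    GOOGLEBOT = ('Mozilla/5.0 (compatible; Googlebot/2.1; '
                 '+http://www.google.com/bot.html)')

    def pick_user_agent(googlebot_share=0.2):
        """Return a random real-browser UA, occasionally posing as Googlebot."""
        return GOOGLEBOT if random.random() < googlebot_share else ua.random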
16:10 🔗 jschwart has joined #archiveteam-bs
16:14 🔗 dd0a13f37 Did they remove the old addons from AMO yet?
16:14 🔗 dd0a13f37 It looks different
16:22 🔗 JAA hook54321: Yes, those cam grabs are still running.
16:22 🔗 JAA dd0a13f37: They will be removed in June.
16:26 🔗 dd0a13f37 Figured out a way to grab them all
16:28 🔗 JAA We're on it already.
16:28 🔗 dd0a13f37 Wiki page is off though
16:28 🔗 dd0a13f37 >The total number of addons should be approximately 20,000.
16:28 🔗 dd0a13f37 there are 760k .xpi files
16:29 🔗 JAA There's an ArchiveBot which has been running for a while (over 2 months), and I think Somebody2 did something as well.
16:29 🔗 JAA There's a difference between "number of addons" and "number of .xpi files".
16:30 🔗 JAA The latter includes different platforms and previous versions.
16:30 🔗 dd0a13f37 500k addons though
16:31 🔗 dd0a13f37 >459,938 add-ons found
16:31 🔗 dd0a13f37 The job for !a https://addons.mozilla.org/ is a slow approach
16:32 🔗 dd0a13f37 you can just make a list https://addons.mozilla.org/firefox/downloads/file/760000/ for !ao
16:32 🔗 JAA I think that's what Somebody2 did (outside of ArchiveBot).
16:32 🔗 dd0a13f37 ah okay, doesn't say so on the wiki page
16:32 🔗 JAA However, this doesn't grab older versions or different platforms.
16:32 🔗 JAA Yes, the wiki is often not exactly up to date.
16:33 🔗 dd0a13f37 It does.
16:33 🔗 dd0a13f37 look here
16:33 🔗 dd0a13f37 https://addons.mozilla.org/en-US/firefox/addon/weather-extension/versions/
16:33 🔗 dd0a13f37 Hover over "Add to Firefox" links
16:33 🔗 JAA Huh, I see.
16:34 🔗 JAA Do different platforms get individual IDs as well?
16:34 🔗 dd0a13f37 1 id = 1 xpi
16:37 🔗 JAA I see.
16:38 🔗 dd0a13f37 If the file exists, you get 302, if not, 404
16:38 🔗 dd0a13f37 Should I !ao it?
16:38 🔗 JAA You're right that !a AMO is not exactly efficient. However, it does also archive various data around the actual addons: descriptions, screenshots, reviews, collections. Plus it provides a browsable interface.
16:39 🔗 JAA No, let's not.
16:40 🔗 JAA As mentioned, Somebody2 has done something similar already (not sure what he did *exactly*) and the !a AMO job is also pretty far. But we could do that sometime next year, shortly before they purge all legacy addons.
16:40 🔗 JAA 760k URLs should be pretty quick anyway.
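(A sketch of that enumeration, using the 302/404 signal dd0a13f37 describes to turn the file-ID space into a URL list; sequential and unthrottled here only for brevity:)

    import requests

    BASE = 'https://addons.mozilla.org/firefox/downloads/file/{}/'

    def existing_xpi_urls(max_id=760000):
        """Yield download URLs whose file ID exists: 302 = exists, 404 = not."""
        with requests.Session() as s:
            for file_id in range(1, max_id + 1):
                r = s.head(BASE.format(file_id), allow_redirects=False)
                if r.status_code == 302:
                    yield BASE.format(file_id)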
16:44 🔗 Ceryn For reference, you get 86400 requests per day at one request per second.
16:45 🔗 JAA Yeah, obviously.
16:46 🔗 Ceryn And apropos, is there a general crawling rate you prefer to avoid getting rate-limited on sites? I know Reddit only allows a request every 2 seconds.
16:46 🔗 Ceryn :)
16:46 🔗 JAA We've been hammering them with five connections and a very low delay for weeks now.
16:46 🔗 JAA They do?
16:46 🔗 JAA Isn't that for the API?
16:46 🔗 Ceryn Oh, yes.
16:47 🔗 Ceryn Huh. Maybe they don't do that for web scraping. But I'm pretty sure they don't want you to query more often.
16:48 🔗 JAA They didn't seem to care when I grabbed a number of subreddits through ArchiveBot a while ago (after Charlottesville).
16:48 🔗 Ceryn Cool. How much did you grab? The entire thing? Did it work out well?
16:49 🔗 dd0a13f37 Large sites probably don't care, IMO it's better to start extremely high (e.g. max out your bandwidth) and see if you get blocked
16:50 🔗 Ceryn Hm. Hopefully the block would be very temporary, then.
16:51 🔗 dd0a13f37 Just switch IPs
16:51 🔗 JAA No, not the entire thing, just select subreddits, in particular far-right ones.
16:51 🔗 JAA Stuff like /r/EuropeanNationalism etc.
16:51 🔗 JAA Some of them got banned recently.
16:51 🔗 Ceryn Yeah, I meant the entire subreddits. Cool.
16:51 🔗 Ceryn Oh? Did you expect that to happen?
16:51 🔗 JAA Well yeah, as far as it let me.
16:52 🔗 JAA I think you can only get the last 1000 posts for a particular subreddit the normal way.
16:52 🔗 JAA For anything older, you have to use the search with a special syntax.
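(The special syntax JAA refers to is, as of 2017, Reddit's cloudsearch timestamp filter; a sketch that pages backwards through a subreddit by shrinking the timestamp window, sidestepping the 1000-post listing cap:)

    import time
    import requests

    HEADERS = {'User-Agent': 'archive-sketch/0.1'}   # placeholder UA

    def subreddit_permalinks(sub, start_ts, end_ts):
        url = f'https://www.reddit.com/r/{sub}/search.json'
        while start_ts < end_ts:
            params = {'q': f'timestamp:{start_ts}..{end_ts}',
                      'syntax': 'cloudsearch', 'restrict_sr': 'on',
                      'sort': 'new', 'limit': 100}
            posts = requests.get(url, params=params, headers=HEADERS,
                                 timeout=30).json()['data']['children']
            if not posts:
                break
            for p in posts:
                yield 'https://www.reddit.com' + p['data']['permalink']
            end_ts = int(posts[-1]['data']['created_utc']) - 1   # slide window back
            time.sleep(2)   # the one-request-per-2-seconds courtesy limit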
16:52 🔗 JAA Yeah, I wasn't at all surprised that they finally closed those shitholes.
16:52 🔗 JAA They've been giving them bad press.
16:53 🔗 JAA That seems like the only thing they care about.
16:54 🔗 Ceryn Oh okay. Didn't know that.
16:54 🔗 dd0a13f37 What about archiving voat?
16:54 🔗 JAA By the way, there's a full Reddit archive available somewhere also.
16:55 🔗 Ceryn Do you have any idea how many of the posts you managed to get, then? 90+%? (And how many is that?)
16:55 🔗 JAA They're continuously grabbing all comments etc.
16:55 🔗 JAA No clue.
16:55 🔗 Ceryn Oh sweet!
16:55 🔗 JAA Note that these grabs can't get all comments in large threads.
16:55 🔗 JAA (That full archive I just mentioned should contain those.)
16:56 🔗 dd0a13f37 If it's already fully archived, why bother?
16:56 🔗 Ceryn Why not? You don't follow links to further discussion?
16:56 🔗 JAA dd0a13f37: Yeah, I've been thinking about grabbing Voat. I don't have time to set something proper (i.e. using the API) up currently though.
16:57 🔗 dd0a13f37 Ceryn: What do you mean?
16:57 🔗 JAA And the reason is that that full archive is not easily accessible. I haven't looked at it in detail, but I think it's a database.
16:57 🔗 JAA You can't browse it in the Wayback Machine, for example.
16:57 🔗 dd0a13f37 Couldn't you generate html pages from the database?
16:57 🔗 JAA Sure
16:57 🔗 JAA The data is all there.
16:57 🔗 JAA But the average user or journalist isn't going to do that.
16:57 🔗 Ceryn dd0a13f37: Asking why full comment trees aren't available in his grab of subreddits.
16:58 🔗 dd0a13f37 That seems like most of the projects anyway. What's the point of archiving something, only for it to get darked and made public 70+ years later?
16:58 🔗 Ceryn I'm very interested in having a look at their Reddit database. Maybe it'll be good enough so I won't have to archive what I'm interested in.
16:59 🔗 dd0a13f37 Or even better, darking historically important content for political reasons
16:59 🔗 Ceryn Personally my archiving interest is just archiving for my own sake. I want to have it. And if I have it, I don't mind sharing it.
17:00 🔗 Ceryn Generally, I think having the content available 70+ years later is part of the idea.
17:01 🔗 Ceryn Obviously no one here wants data darkened.
17:01 🔗 JAA Ceryn: For one, those grabs ignored the per-comment links. But even if you grab those, you still don't handle the "load more comments" stuff. So yeah, it's not easily possible to archive an entire thread (unless you use the API to generate links to each comment in the thread or something like that).
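(A sketch of that last approach: pull the thread's JSON once, walk the comment tree, and hand the crawler one permalink per comment. The unexpanded "more" stubs would still need Reddit's /api/morechildren endpoint, which this skips:)

    import requests

    def comment_permalinks(thread_url):
        """Yield a per-comment link for every comment already present in the tree."""
        data = requests.get(thread_url.rstrip('/') + '.json',
                            headers={'User-Agent': 'archive-sketch/0.1'},
                            timeout=30).json()
        stack = data[1]['data']['children']        # data[0] is the post itself
        while stack:
            node = stack.pop()
            if node['kind'] != 't1':               # 'more' stubs are skipped here
                continue
            yield 'https://www.reddit.com' + node['data']['permalink']
            replies = node['data'].get('replies')
            if replies:                            # empty string when no replies
                stack.extend(replies['data']['children'])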
17:02 🔗 Ceryn Okay. Thanks for the clarification.
17:02 🔗 dd0a13f37 Sure, you want it available in 70 years, but if it's not available the 69 years before that, what's the point? To be able to pride yourself on the fact that it's "theoretically" archived, even though you can't do anything useful with it?
17:03 🔗 Pixi has joined #archiveteam-bs
17:03 🔗 JAA I believe you might be able to get access to it in certain circumstances. Also, laws can change, and if copyright finally gets the reform it so desperately needs, it might be possible for IA to undark it.
17:03 🔗 Ceryn dd0a13f37: So, if a data collection is darkened because somebody says you mustn't have it, should you delete it?
17:03 🔗 Ceryn dd0a13f37: Or should you preemptively decide not to store anything because it might get darkened?
17:04 🔗 dd0a13f37 JAA: And it can't go the other way around? They have to delete something, and whoops, it's gone since nobody could mirror it
17:04 🔗 Ceryn Darkening sucks. But I like that the data is there. If someone really needs it, I expect it is possible to get it anyway.
17:04 🔗 dd0a13f37 Ceryn: If IA is the only one who has it, that's what happens, in practice.
17:04 🔗 JAA dd0a13f37: I don't think anyone can force them to actually delete it.
17:05 🔗 dd0a13f37 Now, no. What about in 30 years?
17:05 🔗 JAA Hence Internet Archive Canada and the mirror in Alexandria.
17:05 🔗 Ceryn How often is something darkened? Is it really that much of it?
17:05 🔗 dd0a13f37 Ceryn: All the IS content, for example
17:05 🔗 dd0a13f37 That probably isn't mirrored in many places, since it's so sensitive
17:05 🔗 Ceryn And if the data is widely desirable, then peer to peer sharing will help keep it alive and distributed too.
17:06 🔗 dd0a13f37 Most countries have laws against even touching it
17:06 🔗 Ceryn Which IS content?
17:06 🔗 dd0a13f37 Their videos
17:06 🔗 Ceryn Oh. Okay.
17:06 🔗 JAA Fine to possess in many jurisdictions, as far as I know.
17:07 🔗 JAA Distributing is a different thing, obviously.
17:07 🔗 dd0a13f37 And since they're too incompetent to use P2P, that won't save it either.
17:08 🔗 Ceryn dd0a13f37: For me, in 30 years or whatever, I want to be able to peruse all the things I found interesting or nostalgic or worth saving at any point.
17:08 🔗 Ceryn dd0a13f37: So, for me, even data I cannot share is worth storing. Assuming I want it.
17:08 🔗 dd0a13f37 But for the 30 years leading up to that, it basically doesn't exist.
17:09 🔗 Ceryn Well, whenever I want to see it I can. At any point.
17:09 🔗 JAA You two are talking about different things. Ceryn means that he can keep his own copy. dd0a13f37 is talking about IA having and distributing it.
17:09 🔗 dd0a13f37 Sure, if you store it, that is. But archiving something at IA only for it to sit in a data center for 70 years is utterly pointless
17:09 🔗 Ceryn If I am aware that others want it, I can share it most of the time. Sometimes laws don't agree with sharing it. But usually it's doable anyway.
17:09 🔗 Ceryn JAA's right.
17:10 🔗 omglolbah Have to admit, I was somewhat concerned having my pipeline scrape the far-right stuff.... assume I'm in a registry now :p
17:10 🔗 Ceryn dd0a13f37: I think it loses much of its value if it's inaccessible to all but IA for 70 years.
17:10 🔗 dd0a13f37 Sure, doable, but if IA is the only place that has it, it can get traced back to them if it "leaks"
17:10 🔗 Ceryn dd0a13f37: BUT. I think it's very valuable to have it after 70 years and to the end of time.
17:11 🔗 dd0a13f37 In a sense, yes, but it's still quite pointless. A darknet archive, boy would that be something
17:11 🔗 omglolbah not sure why it would be pointless to have copies for future study?
17:11 🔗 JAA IPFS?
17:11 🔗 dd0a13f37 IPFS is neither darknet nor production ready
17:12 🔗 JAA *shrug*
17:12 🔗 dd0a13f37 bittorrent over i2p is better, just needs a nice frontend
17:12 🔗 dd0a13f37 omglolbah: Not entirely pointless, but it's one hell of a delayed gratification
17:13 🔗 * omglolbah peers over at the national archives of Norway where shit in runes sits in storage for study
17:13 🔗 omglolbah all about time-scales <.<
17:17 🔗 dd0a13f37 https://ia801504.us.archive.org/6/items/asaad2/asaad2.mp4 here is an example of content that will probably not be recovered
17:17 🔗 dd0a13f37 only reencodes available in public
17:18 🔗 dd0a13f37 Not illegal in the US, just IA randomly deciding to censor it
17:18 🔗 dd0a13f37 http://jihadology.net/2017/11/03/new-video-message-from-the-islamic-state-lions-of-the-battle-2-wilayat-%e1%b9%a3ala%e1%b8%a5-al-din/
17:22 🔗 JAA Are you sure about that? I found an *almost* identical file (just about 60 bytes bigger) within minutes...
17:24 🔗 JAA Found another one which is 12 bytes bigger.
17:25 🔗 Ceryn dd0a13f37: So, to be clear, the entire argument is "what is the point of IA continuing to store things that have been darkened", right?
17:25 🔗 Ceryn Because in all other cases the data is just accessible.
17:25 🔗 Ceryn Or not stored and lost.
17:28 🔗 JAA Hmm, found a very ... interesting site while searching for that video.
17:29 🔗 odemg has joined #archiveteam-bs
17:30 🔗 JAA Pretty sure this is run by ISIS.
17:35 🔗 JAA dd0a13f37: The only differences between those video files I found are appended NULs, by the way. Probably fools many simple filters.
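(A quick way to confirm that two copies differ only by that padding:)

    def same_modulo_nul_padding(path_a, path_b):
        """True if two files are identical once trailing NUL bytes are stripped."""
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            return a.read().rstrip(b'\x00') == b.read().rstrip(b'\x00')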
17:38 🔗 joepie91 dd0a13f37: the primary purpose of IA is preservation; access is just a means to that end
17:38 🔗 joepie91 dd0a13f37: from that perspective, it absolutely makes sense to keep something sitting in a datacenter for 70 years if the alternative is total loss
17:39 🔗 joepie91 also, periodic reminder that IPFS is neither an archival nor a storage medium, it's a distribution medium
17:41 🔗 JAA Sure, but distribution is exactly what was being discussed above.
17:48 🔗 tuluu has quit IRC (Read error: Operation timed out)
17:50 🔗 tuluu has joined #archiveteam-bs
17:57 🔗 pizzaiolo has joined #archiveteam-bs
17:59 🔗 SketchCow What's happening here
18:13 🔗 Ceryn SketchCow: A philosophical discussion on the merits of hoarding: If the data cannot be seen, how do you know it exists?
18:24 🔗 godane so i found a tape of Empire Falls but it's on dvd: https://www.amazon.com/Empire-Falls-Various/dp/B0009W5IMO
18:24 🔗 godane unless there is a reason to digitize the hbo airing it's been skipped
18:32 🔗 godane i'm digitizing tape 1 of Universal vs eric corley
18:35 🔗 godane deposition of robert schumann
19:16 🔗 dd0a13f37 JAA: They have an official tool to append NULs, upload to different mirror sites, etc. But the activity died down after Raqqa was liberated.
19:17 🔗 dd0a13f37 Ceryn: No, the point I'm trying to make is "what's the point in archiving something if it only gets immediately darked"
19:17 🔗 Ceryn dd0a13f37: You can't know it's going to be darked, can you?
19:17 🔗 dd0a13f37 Then we could just shut down newsgrabber etc, wait for them to release their archives in 100 years, yet that's no good solution
19:18 🔗 dd0a13f37 Copyrighted content will, and if it offends their political sensibilities it will
19:19 🔗 Ceryn Right. So IA does not solve availability in the forseeable future for darkened things. It does, however, solve long term data preservation in that case.
19:19 🔗 dd0a13f37 Yeah, and then you might as well just bury hard drives in the ground and wait for archeologists to find them.
19:20 🔗 dd0a13f37 Libgen is doing a much better job of archiving and distributing knowledge, which seems to be the goal here (a database dump of a site isn't good enough since you need to be able to browse it too)
19:20 🔗 Ceryn The issue doesn't have anything to do with IA, really, does it? It's about other parties disallowing distribution of data.
19:20 🔗 Ceryn Sure you could do something, but it probably wouldn't be legal.
19:21 🔗 dd0a13f37 I'm just pointing out that there's a contradiction.
19:21 🔗 Ceryn I haven't read the IA manifest (yet). joepie91 states they primarily aim to preserve. In which case it makes sense for them to do what they do.
19:22 🔗 dd0a13f37 On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips.
19:22 🔗 joepie91 dd0a13f37: considering that I am completely unable to download anything from libgen due to country blocks, that might be a premature conclusion
19:22 🔗 joepie91 (libgen doing better at distribution)
19:22 🔗 dd0a13f37 What's the point of preservation if the data won't be available within a reasonable time span?
19:22 🔗 joepie91 they take a different approach, more legally shaky, with different tradeoffs
19:23 🔗 dd0a13f37 joepie91: install torbrowser
19:23 🔗 joepie91 you are wholly missing the point here
19:23 🔗 Ceryn Preservation by its very nature is not about near future needs.
19:23 🔗 joepie91 you're stuck on a One True Vision of how you believe archival and distribution should work, without understanding the legal, political, technical, social implications of that approach, and without understanding that a *variety of approaches* is the correct solution here
19:24 🔗 joepie91 which is already what we have
19:24 🔗 dd0a13f37 No, I'm not. The content is more available if you need to spend 5 minutes downloading Tor once than if you need to wait 70 years.
19:24 🔗 joepie91 which means that different outlets take different approaches with different tradeoffs
19:24 🔗 joepie91 dd0a13f37: those two are effectively the same thing for 99.99% of the population
19:25 🔗 joepie91 seriously, take a step outside of your own perspective sometimes and understand the wider effects of different approaches
19:25 🔗 joepie91 this is getting really tiring
19:25 🔗 dd0a13f37 Just curious, can you access http://93.174.95.27/ ?
19:25 🔗 joepie91 nope, empty response
19:25 🔗 dd0a13f37 Fair point. But libgen still does make a larger amount of knowledge available to a larger amount of people, and for less resources.
19:26 🔗 joepie91 "to a larger amount of people" - this is absolutely false
19:26 🔗 joepie91 "a larger amount of knowledge" - this is also very likely false
19:26 🔗 joepie91 having to deal with legal complications limits the scalability of an archive
19:26 🔗 joepie91 it's no different from how it's more difficult to move to a new house if you have an attic full of stuff you want to keep
19:27 🔗 joepie91 the more stuff you need to keep around, the more difficult it is to move and respond to new situations
19:27 🔗 dd0a13f37 If we count in sci-hub, I'm not so sure. It does have a lot of users in academia.
19:27 🔗 joepie91 for a reliable, long-term archive - ie. not something that is existing by virtue of currently not being regulated out of existence like libgen - you do not want to create legal problems where there are none
19:27 🔗 joepie91 the only reason libgen is still around is because legislation and enforcement haven't been standardized between countries
19:27 🔗 dd0a13f37 archive.org has a larger amount of knowledge, but libgen probably disseminates a larger amount of knowledge/hour
19:28 🔗 joepie91 this gap is closing increasingly more
19:28 🔗 joepie91 dd0a13f37: what metrics are you basing that on?
19:28 🔗 dd0a13f37 Libgen's issues are only technological. They could easily switch over to the darknet.
19:28 🔗 joepie91 no, they couldn't.
19:28 🔗 joepie91 let me guess, you're an I2P user?
19:28 🔗 dd0a13f37 Why not? They already have an I2P site. An onion would be trivial
19:28 🔗 dd0a13f37 Nope, only Tor.
19:29 🔗 joepie91 so here's the thing, I've had I2P proponents try to argue this with me for literally 7-8 years now
19:29 🔗 joepie91 "they could just move to I2P"
19:29 🔗 dd0a13f37 What's wrong with I2P?
19:29 🔗 joepie91 the reality is that the barrier to install the necessary software is far too high for the average user, and that moving to a non-clearnet site means you lose 99% of your readership
19:29 🔗 joepie91 believing that you can move to a darknet site without being poorly accessible is delusional
19:29 🔗 Frogging 1% availability is still greater than 0%
19:30 🔗 joepie91 darknet and clearnet sites absolutely ARE NOT equivalent from an accessibility perspective
19:30 🔗 Ceryn You're solving different needs.
19:30 🔗 dd0a13f37 Having the main servers behind Tor/similar is not the same thing as only being accessible from Tor/similar.
19:30 🔗 joepie91 Frogging: sure, so set up an alternative archive with sketchier data on a darknet site
19:30 🔗 joepie91 problem solved
19:30 🔗 joepie91 this is what I said about variety of tactics
19:30 🔗 dd0a13f37 pinkapp.io, for example. Or like gettor does.
19:30 🔗 Frogging makes sense
19:30 🔗 joepie91 my problem is with people trying to argue that EVERYTHING needs to make a certain set of tradeoffs, the same one
19:30 🔗 joepie91 this includes "the IA shouldn't dark X"
19:30 🔗 joepie91 "the IA should make an I2P site"
19:30 🔗 joepie91 etc.
19:30 🔗 Frogging so I don't know what the argument is about if we all agree that multiple methods are viable and each has pitfalls
19:30 🔗 joepie91 (yes, I know it's called an eepsite, but not everybody here will)
19:31 🔗 joepie91 Frogging: the discussion here started because dd0a13f37 is of the opinion that IA is unnecessarily darking things, it seems
19:31 🔗 dd0a13f37 No, not really. Although IS darking is completely unnecessary, since it doesn't violate US law, that's beside the point.
19:31 🔗 joepie91 which seems to be a recurring discussion that never goes anywhere and just produces a lot of noise
19:32 🔗 dd0a13f37 The point is that there's a contradiction. On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips.
19:32 🔗 joepie91 dd0a13f37: I don't see a contradiction there.
19:32 🔗 Frogging IA will accept either, but database dumps are incompatible with wayback
19:34 🔗 dd0a13f37 You're okay with something being inaccessible except for some arcane procedure involving possessing research credentials, sending an e-mail to IA, and having luck. On the other hand, you're not okay with something being inaccessible except for browsing a database dump, despite that "The data is all there", since "the average user or journalist isn't going to do that"
19:34 🔗 dd0a13f37 You don't see a minor contradiction there?
19:34 🔗 Frogging the IA also benefits greatly from being a recognized legitimate entity
19:34 🔗 dd0a13f37 That's true. LG is limited scalability-wise from resources.
19:34 🔗 joepie91 dd0a13f37: you're conflating 'temporarily inaccessible' with 'permanently inaccessible'
19:35 🔗 joepie91 dd0a13f37: something being darked does not mean it will not ever be public again
19:35 🔗 joepie91 it is a temporary measure
19:35 🔗 dd0a13f37 Within my lifetime, yes.
19:35 🔗 joepie91 (that is why it's not deleted)
19:35 🔗 joepie91 it is temporary nevertheless
19:35 🔗 dd0a13f37 Something in a database dump isn't permanently inaccessible either.
19:35 🔗 joepie91 dd0a13f37: if you can actually hear me out for a second
19:36 🔗 dd0a13f37 I can browse it just fine, could run a local instance of reddit and make a siterip.
19:36 🔗 * joepie91 sighs
19:36 🔗 dd0a13f37 Alright.
19:36 🔗 joepie91 the reason siterips are preferred over databases is because that concerns *permanent accessibility* -- you cannot reliably reproduce a website's operation from just a DB dump unless you have literally all of the components and infrastructure involved
19:36 🔗 joepie91 nor is there a generic way to, given a database and site source code, make it accessible
19:36 🔗 joepie91 this means that there is a cost to accessibility of raw data that many people will not pay
19:36 🔗 Frogging exact reproducability
19:36 🔗 joepie91 this will be perpetually true
19:36 🔗 joepie91 not temporarily
19:37 🔗 joepie91 this doesn't mean that the raw data *shouldn't* be archived, just that there should be a more accessible option
19:37 🔗 joepie91 ie. a siterip
19:37 🔗 dd0a13f37 That is true. However, in the case of reddit, the source code is public. And you could make a siterip from a dump if you're sure that the templates are correct.
19:37 🔗 joepie91 dd0a13f37: and that is an immense cost you've described there that many people are literally incapable of paying.
19:37 🔗 Frogging still not sure what you're arguing because nobody said db dumps weren't allowed
19:38 🔗 dd0a13f37 If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
19:38 🔗 dd0a13f37 Frogging: sure, but a goal is apparently to archive the pages
19:38 🔗 dd0a13f37 If something is darked for 1000 years, is it still "temporary"?
19:39 🔗 Frogging what does dumps vs siterips have to do with darking?
19:39 🔗 joepie91 [20:38] <dd0a13f37> If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
19:39 🔗 joepie91 what?
19:39 🔗 dd0a13f37 Dumps - inaccessible, siterips - accessible; dumps bad, siterips good; darked - inaccessible, siterips - accessible; both acceptable
19:40 🔗 Frogging dumps aren't necessarily inaccessible, and siterips aren't necessarily accessible
19:40 🔗 dd0a13f37 Presuming they are, then.
19:40 🔗 Frogging these are separate issues
19:40 🔗 Frogging why would we presume that? it isn't true
19:40 🔗 dd0a13f37 It's still a contradiction.
19:40 🔗 dd0a13f37 In general, it is.
19:40 🔗 joepie91 dd0a13f37: it only looks like a contradiction because you're intentionally ignoring nuance that we've already pointed out
19:41 🔗 joepie91 and I am getting tired of this discussion repeatedly clogging up the channel, to be frank
19:41 🔗 joepie91 this is absolutely not in the least constructive
19:41 🔗 dd0a13f37 joepie91: If I would generate a siterip from a local instance of reddit and a dump, and then upload the generated pages to WB, if we disregard the fact that the metadata might be off (e.g. page X wasn't fetched from reddit on date X but rather my local instance of reddit on date X, running the same software and DB), don't we in practice have the same thing as a 'proper' siterip, but without being limited by ratelimits?
19:41 🔗 Ceryn You could make a wiki page explaining the what and why and refer to that. :)
19:42 🔗 Frogging there's nothing to explain because the whole argument is based on premises that make no sense and aren't true
19:42 🔗 joepie91 dd0a13f37: but we don't "disregard" that, and that still requires work, and I've told you all these things before and you need to stop twisting arguments to make them sound like a contradiction
19:42 🔗 joepie91 right, precisely what Frogging said
19:42 🔗 joepie91 this is a non-discussion
19:42 🔗 Ceryn Apparently there's a root of confusion somewhere, if people keep instigating similar discussions.
19:42 🔗 dd0a13f37 Well, they're not the exact same thing, and I'm not claiming that either. But they are similar. A siterip (uploaded to WB) is more accessible than a dump though, that's just the way it is.
19:43 🔗 joepie91 Ceryn: most people don't...
19:43 🔗 Ceryn Okay.
19:43 🔗 joepie91 dd0a13f37: what are you actually trying to accomplish with this discussion?
19:43 🔗 dd0a13f37 I've only 'instigated' this discussion once, please don't conflate me with other people who aren't me.
19:44 🔗 dd0a13f37 joepie91: I'm just trying to figure out what's the deal with the contradiction, how accessibility is super important in some cases and utterly unimportant in others, as long as it's "temporary".
19:44 🔗 joepie91 dd0a13f37: except all of these premises are wrong and there is no contradiction, as we have repeatedly told you
19:44 🔗 joepie91 so why are you continuing the discussion along the same lines that we've already pointed out are false?
19:45 🔗 Frogging dd0a13f37: I think IA would like to have everything accessible all the time, but it's not always legally possible. they don't dark things just because they feel like it
19:46 🔗 dd0a13f37 I'm not discussing the IA's actions here, I'm discussing what's the point in uploading to IA and not somewhere else if we know IA'll just dark it. However, they do dark things just because they feel like it, the IS videos are a prime example of this. They're constitutionally protected speech.
19:46 🔗 dd0a13f37 joepie91: How are the premises wrong? You've said it yourself that browsing a database dump is inconvenient and wayback is convenient.
19:46 🔗 joepie91 oh for fucks sake
19:47 🔗 Frogging yeah, that's already been answered. people can upload to more than one place. there's no policy that says otherwise
19:47 🔗 joepie91 dd0a13f37: the premises are wrong because you keep misrepresenting the points being made and/or the reality of things to the point of inaccuracy, and it is utterly pointless to continue discussing any of this with you because you just keep piling on more presumptions and misrepresentations and whatnot
19:47 🔗 joepie91 if you don't trust IA to keep something available, upload a copy elsewhere
19:47 🔗 joepie91 and with that I hope we can conclude the discussion
19:48 🔗 dd0a13f37 Well okay, that's a fair point.
19:52 🔗 dd0a13f37 There was a discussion about google's rate limits earlier. Would it be possible to use startpage or some similar proxy to bypass this? They're not nearly as strict.
19:53 🔗 Frogging I'm not sure how startpage interfaces with google, but anecdotally I've noticed that the results aren't always the same or as numerous on Startpage
19:53 🔗 Frogging so if that matters, use caution when assuming it's a transparent proxy to google
19:53 🔗 JAA Let me just point you to Searx.
19:54 🔗 dd0a13f37 That could be true. Still better than nothing though.
19:54 🔗 JAA And also YaCy.
19:54 🔗 JAA Not sure about the quality of search results on the latter.
19:54 🔗 dd0a13f37 How come searx doesn't get rate limited?
19:54 🔗 Frogging google's ratelimiting is really touchy. I only share an IP with 3 people and I get the captcha at least once a day
19:55 🔗 dd0a13f37 Well, then I could just use bing (or friends, yahoo/ddg)
19:56 🔗 JAA Not sure about the rate limits on Searx, to be honest.
19:56 🔗 JAA But it does aggregate results from various engines (including Google, Bing, Yahoo), so that probably helps.
19:57 🔗 dd0a13f37 I'm just wondering. How does it avoid getting rate limited by the backend search engines?
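(Presumably because it fans each query out across several engines, and there are many public instances, so no single backend sees much traffic from any one place. Querying a self-hosted instance looks roughly like this; format=json has to be enabled in the instance's settings.yml, and the instance URL is a placeholder:)

    import requests

    def searx_search(query, instance='https://searx.example.org'):
        """Ask a Searx instance; results are aggregated from several engines."""
        r = requests.get(f'{instance}/search',
                         params={'q': query, 'format': 'json'}, timeout=30)
        r.raise_for_status()
        return [(hit['title'], hit['url']) for hit in r.json()['results']]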
20:03 🔗 arkhive has joined #archiveteam-bs
20:04 🔗 arkhive I'm trying to save go90 Original videos. I'm having trouble figuring out how to save web videos. HTML5 or Flash. I'm a bit of a noob. Can someone help me?
20:05 🔗 arkhive go90 is a streaming service(free) from Verizon. they have dumped millions into it and are losing tons of money. laying off employees. So i think it's time to Ctrl-C Ctrl-V
20:08 🔗 dd0a13f37 try youtube-dl
20:08 🔗 dd0a13f37 if that fails, open inspect element, network tab, play a video, and see if there's a pattern in the requests it makes and if you can figure them out from the url
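(If the network tab turns up an HLS playlist — a .m3u8 URL is typical for streaming sites — a crude fallback is to fetch it and concatenate the segments. Everything below is hypothetical; go90's actual URL scheme wasn't inspected:)

    import requests
    from urllib.parse import urljoin

    def grab_hls(playlist_url, out_path):
        """Download an HLS media playlist by concatenating its segments.
        playlist_url is whatever the browser's network tab revealed.
        Assumes a media playlist; a master playlist lists variants instead."""
        text = requests.get(playlist_url, timeout=30).text
        segments = [l for l in text.splitlines() if l and not l.startswith('#')]
        with open(out_path, 'wb') as out:
            for seg in segments:
                out.write(requests.get(urljoin(playlist_url, seg),
                                       timeout=30).content)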
20:11 🔗 dd0a13f37 >Sorry!
20:11 🔗 dd0a13f37 >go90™ Mobile TV Network is only available in the US right now.
20:23 🔗 godane youtube-dl will not work with go90
20:36 🔗 TheLovina has quit IRC (Read error: Operation timed out)
20:37 🔗 TheLovina has joined #archiveteam-bs
20:40 🔗 TheLovina has quit IRC (Read error: Connection reset by peer)
21:42 🔗 atrocity has quit IRC (Ping timeout: 246 seconds)
21:44 🔗 ranma arkhive: get a jetpack and sub to VZW for a month?
21:44 🔗 ranma is there anything of value even on go90?
21:45 🔗 ranma speaking as a recent employee of VZW
21:47 🔗 ranma arkhive: maybe download NOX or Bluestacks (or virtualize Android if you can), install the app, sniff the traffic to see if they're calling https or http gets?
21:48 🔗 ranma oh yeah, go90 isn't bandwidth-free on prepaid
22:32 🔗 dd0a13f37 has quit IRC (Quit: Connection closed for inactivity)
22:51 🔗 drumstick has joined #archiveteam-bs
22:54 🔗 Asparagir has joined #archiveteam-bs
23:00 🔗 BlueMaxim has joined #archiveteam-bs
23:11 🔗 atrocity has joined #archiveteam-bs
23:35 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
