#archiveteam-bs 2017-11-04,Sat

***schbirid has quit IRC (Ping timeout: 256 seconds) [00:05]
schbirid has joined #archiveteam-bs [00:16]
.... (idle for 19mn)
SketchCowgodane: looking good (the tapes) [00:35]
.... (idle for 18mn)
godanei still have a ton to upload [00:53]
i uploaded 603 items so far this month: https://archive.org/details/@chris85?and[]=addeddate%3A2017-11&sort=-publicdate
that collection is like a slow sink drip for government docs
i know for a fact there are over 500,000 IDs used with DTIC
[01:01]
........... (idle for 54mn)
***superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye) [01:58]
.... (idle for 17mn)
SketchCow has quit IRC (Read error: Operation timed out)
SketchCow has joined #archiveteam-bs
swebb sets mode: +o SketchCow
[02:15]
..... (idle for 23mn)
schbirid has quit IRC (Ping timeout: 255 seconds) [02:42]
drumstick has quit IRC (Read error: Operation timed out)
drumstick has joined #archiveteam-bs
schbirid has joined #archiveteam-bs
[02:51]
.......... (idle for 46mn)
Pixi has joined #archiveteam-bs [03:40]
.... (idle for 19mn)
MadArchiv has joined #archiveteam-bs [03:59]
MadArchivI finally managed to get site-grab to work on my computer (somehow), I'll take some much-needed sleep for now and [04:01]
CerynDamn. He must have needed it. :) [04:05]
MadArchiv(Sigh, I *really* need to stop pressing the Enter key by accident) ...and then I'll archive the completed webcomics from Hiveworks tomorrow [04:05]
***Pixi has quit IRC (Quit: Pixi) [04:06]
MadArchivCeryn: You bet it, pal [04:07]
CerynAre you uploading to Archive.org? [04:07]
MadArchivOf course. [04:07]
CerynAre dumps of a site generally just uploaded occasionally, no deduplication or continuation from the last batch or anything? [04:08]
MadArchivThese are webcomic sites, so I suppose there are no dumps at all. I could be wrong though. [04:09]
CerynI mean dumps in the general sense, data dumps, whatever format the data has. [04:09]
MadArchivYou mean the ones I would get after I crawl the site? I'm new to this, and I'm also pretty computer-illiterate, so I'm learning this stuff as I go along.
Wait, are you still talking about webcomic sites?
[04:11]
CerynHeh okay. I'm new to archiving but I'm rather computer-literate.
My question was on how Archive.org handles when a site or something is uploaded, then uploaded again later and so on, in several copies.
[04:13]
***qw3rty4 has joined #archiveteam-bs [04:14]
CerynIs each copy just stored separately? Or is the data deduplicated (i.e. identical parts of the data will only be stored once, then the other copies will refer to the master copy of those parts)? Or do future uploads only add what is new compared to the last upload?
I guess these questions take some knowledge of Archive.org's workings to answer.
[04:16]
MadArchivFrom what I've seen, they just take in different copies of the same thing as if they were different items. [04:17]
CerynYeah. It's by far the easiest to handle, but it's also the least efficient. [04:17]
MadArchivYup [04:17]
CerynThough of course they could be doing things under the hood. [04:17]
MadArchivLike they do with excluded sites. [04:18]
***qw3rty3 has quit IRC (Read error: Operation timed out) [04:19]
BlueMaxim has quit IRC (Quit: Leaving) [04:29]
MadArchivBy the way, I'm downloading
Goddamnit, I did it again, just wait a bit so I can finish writing my post
[04:33]
***MadArchiv has quit IRC (Read error: Operation timed out)
MadArchiv has joined #archiveteam-bs
[04:41]
MadArchivAlright, so, (I kinda forgot what I was going to write in the first place so let's just get to the point) do you know of any *other* webcomics I or, ideally, we should archive? Once I'm done with the Hiveworks ones, I mean. [04:45]
CerynI don't. I've never been into comics. I'm guessing others know some, however. [04:46]
MadArchivHmmm, alright. By the way, do you think (and this is a legitimate question) that if we're gonna make a manual list of comics that should be saved, IRC would be the best place for it? I was thinking about putting it on Reddit since it'd be more accessible. [04:49]
CerynIRC definitely isn't the place for such a list. When you want to paste stuff on IRC you usually paste it to a paste bin service (e.g. pastebin.com) and just post the link.
If you want it to be a joint effort then a reddit thread would probably be good. If the project was rather large in scope it seems maybe you'd organise a group here, make your own IRC channel and start a wiki page or something.
[04:51]
phillipsjCeryn, duplicate uploads can have varying quality as well. In the spring I was considering doing a bunch of YouTube channels in DVD quality: it allows broader coverage for the same space. But it's kinda pointless if the same channel was uploaded in HD already. [04:54]
Cerynphillipsj: Right. Except new videos. They'd make sense to upload. [04:55]
phillipsjYouTube seems to throttle you a lot if you try downloading faster than a human can watch though. [04:55]
Cerynphillipsj: But it is not common practice, then, to extend data you have already uploaded? People don't generally keep up to date mirrors and sync them with the Archive?
Okay. Hm. I guess you can parallelise Youtube videos though? At least to some extent.
[04:56]
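For what it's worth, fanning out with xargs is a common way to parallelise youtube-dl; a minimal sketch, where urls.txt is a hypothetical list of video URLs and the rate cap is a guess at staying under the throttling phillipsj describes:

    # run four youtube-dl processes at once, each capped at 1 MB/s
    xargs -n 1 -P 4 youtube-dl --limit-rate 1M < urls.txt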
phillipsjCeryn, never bothered to upload anything because they were begging for money to store the stuff. [04:56]
CerynOkay.
I mostly plan to archive for my own storage, but it seems I might as well upload a copy to the Archive too.
[04:57]
phillipsjOther things came up as well. I put the machine I was using in storage because YouTube was becoming too distracting for me. (Have IRL stuff to do) [04:57]
CerynOkay.
Archive must have so much data bloat. Stuff they could optimise away.
[04:58]
phillipsjI have a stack of blank DVD+Rs slowly rotting away. [04:58]
CerynHeh. Very slowly. [04:59]
phillipsjWas thinking of using them for my local copy. [04:59]
CerynHow come you want to store on DVDs as opposed to HDDs?
For actual DVD player use?
[05:00]
phillipsjI expect the DVDs to last longer. [05:01]
CerynYeah. It just seems so cumbersome. To be able to store only 4.7 GB or however much it is. [05:01]
phillipsjThey are also slightly cheaper per GB I believe. (But possibly not worth the inconvenience) [05:02]
***BlueMaxim has joined #archiveteam-bs [05:02]
CerynIf you want to scale then no, definitely not worth it. Because you'd have to check them periodically just to know your data was intact. [05:02]
phillipsjI think the sketch of a plan is to try to use my remaining disks, and if it goes well, maybe buy more. [05:03]
CerynHeh. It seems all of our problems can be solved by "buying more disks".
Works every time.
[05:03]
phillipsjI originally bought them for back-ups, but the back-up verification failed at the restore step.
New back-up plan is server with doubly-redundant ZFS and ECC RAM.
[05:04]
Ceryn:P nice.
Do you have any stats on how likely normal RAM is to screw you over?
[05:05]
phillipsjBonus points if I encrypt that data in transit and at rest (by turning the server off).
Not off-hand, but if I am going to the trouble of redundancy in the face of a disk failure, I don't want bad RAM to mess up my data.
[05:06]
CerynYou can lukscrypt each raw drive, open them, and then set up a ZFS pool on the encrypted volumes.
That's what I've done, at least. Seems to work pretty well.
(I'm assuming ZFS on Linux.)
[05:07]
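A minimal sketch of the layering Ceryn describes, assuming ZFS on Linux; the device paths, mapper names, and pool name are hypothetical:

    # encrypt each raw drive, then open the decrypted mappings
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksFormat /dev/sdc
    cryptsetup open /dev/sdb crypt0
    cryptsetup open /dev/sdc crypt1
    # build the ZFS pool on the encrypted volumes
    zpool create tank mirror /dev/mapper/crypt0 /dev/mapper/crypt1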
phillipsjI like FreeBSD, not that I am good at actually configuring it. [05:08]
CerynI suppose. It just seems the entire setup becomes significantly more expensive if you need hardware that supports ECC RAM.
Okay.
[05:08]
phillipsjMy "workstation" currently has mirrored, striped ZFs across 4 disks (non-ECC RAM though). Boots with a simulated controller failure (cable unplugged). [05:10]
hook54321JAA: Are you still grabbing the Catalonia cameras? [05:11]
phillipsjScared to try a scrub without a proper back-up though. [05:12]
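For reference, a scrub only reads the pool and repairs blocks from redundancy where it can; a sketch with a hypothetical pool name:

    zpool scrub tank
    zpool status -v tank    # shows scrub progress and any errors found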
Ceryn, I had to roll back a non-ECC memory upgrade on one of my machines: dropped a module on the carpet, and it started manifesting problems about a month later. ECC is nice in that it tells you when it has a problem. [05:17]
Cerynphillipsj: When does it tell you this? During boot? [05:17]
phillipsjWhen testing my server and forgetting to plug in a CPU fan, I got memory errors/corrections logged to dmesg. Those (fully buffered) modules also log their temperature. [05:19]
CerynOkay. So it would take some work keeping yourself updated with the status. [05:21]
phillipsjphillipsj was planning to install cowling + exhaust fan for better cooling.
Ceryn, an uncorrectable error halts the machine unless you have mirroring enabled.
[05:21]
CerynOkay. [05:22]
phillipsjI cheaped out on the server, so it is taking a lot of my time to make sure it is stable :P [05:23]
CerynHaha yeah. That's a huge trade-off. [05:24]
phillipsjCan't believe I missed the PSU fan grinding before purchase (second hand, obviously). Was able to replace it with a slower speed fan of the same dimensions (but the server runs close to the de-rated (based on the difference in fan power draw) power load). [05:27]
....... (idle for 31mn)
***MadArchiv has quit IRC (Remote host closed the connection) [05:58]
.............. (idle for 1h6mn)
drumstick has quit IRC (Ping timeout: 248 seconds) [07:04]
............ (idle for 56mn)
drumstick has joined #archiveteam-bs [08:00]
................ (idle for 1h19mn)
BlueMaxim has quit IRC (Quit: Leaving) [09:19]
............ (idle for 59mn)
jschwart has quit IRC (Read error: Operation timed out) [10:18]
schbirid has quit IRC (Ping timeout: 255 seconds) [10:32]
schbirid has joined #archiveteam-bs [10:37]
...... (idle for 26mn)
godaneSketchCow: i'm breaking up the So Graham Norton tape cause it's 2 episodes
also what's funny is episode S01E25 is before S01E18
and it's not So Graham Norton but V Graham Norton
[11:03]
i'm doing the BalanceBall Fitness tape [11:13]
so you're getting max bitrate with this BalanceBall Fitness tape [11:25]
***drumstick has quit IRC (Ping timeout: 248 seconds) [11:26]
Stilett0 has quit IRC (Read error: Operation timed out) [11:38]
godanefun fact: the cover of the tape says Beginner's Workout but the title says Total Body Workout
the tape label also says Beginner's Workout
so i'm going with that for the label
[11:52]
***Stilett0 has joined #archiveteam-bs [12:01]
godanei found another tv tape
i'm going to do it at 6000k instead of 10000k cause it's a tv recording
[12:01]
***Stilett0 is now known as Stiletto [12:03]
godaneanyways i made screenshots with 6000k and 10000k
and it looks the same so i think 6000k is ok
SketchCow: here are the images: https://imgur.com/a/5QRap
top one is the 6000k one
bottom one is the 10000k one
[12:06]
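godane doesn't say which encoder he uses; assuming an ffmpeg-style pipeline, a comparison like the one behind those screenshots might come from something like this (file names hypothetical):

    ffmpeg -i capture.mkv -b:v 6000k -c:a copy tape-6000k.mp4
    ffmpeg -i capture.mkv -b:v 10000k -c:a copy tape-10000k.mp4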
................... (idle for 1h31mn)
so i found another duplicate tape
it was a down for love promo tape
[13:39]
***superkuh has joined #archiveteam-bs [13:40]
godaneanyways this tape has the last 2 episodes of Felicity for Season 2 [13:41]
...... (idle for 25mn)
***icedice has joined #archiveteam-bs [14:06]
..... (idle for 23mn)
godaneso i may have a partial Charmed recording on TNT [14:29]
SketchCowgodane: I appreciate your best approach, godane [14:31]
godanethis tape is going to have 8 minutes of black in it
with audio
cause there is a very bad tape section between the end of the Felicity recording and the Charmed recording
also this bad section starts before the end of the Felicity recording
luckily for us there is a bit of over-recording
either way i will break up the Felicity and Charmed parts at the 02:08:00 mark
[14:37]
***Mateon1 has quit IRC (Ping timeout: 255 seconds)
Mateon1 has joined #archiveteam-bs
[14:48]
odemg has quit IRC (Ping timeout: 248 seconds) [14:58]
jtn2_ has quit IRC (Quit: restarting for irssi security update)
jtn2 has joined #archiveteam-bs
[15:03]
...... (idle for 28mn)
dd0a13f37 has joined #archiveteam-bs [15:33]
dd0a13f37Is there anything like Library Genesis but for newspapers?
They link to magzdb.org, but it's in Russian and seems like it's broken
[15:33]
***icedice has quit IRC (Quit: Leaving) [15:46]
schbiridKaz: check out https://pypi.python.org/pypi/fake-useragent anyways ;) [15:51]
Kazschbirid: will probably drop that in when we get blocked again.. UAs we're using are from as far back as Chrome 40 [15:59]
dd0a13f37Could you please add googlebot user agents?
A lot of sites with paywalls give unrestricted access to google
[16:06]
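A hedged sketch of both ideas, assuming the grab shells out to wget (the target URL is a placeholder); fake-useragent's UserAgent().random is its documented way to pull a random real-world UA string:

    pip install fake-useragent
    UA="$(python -c 'from fake_useragent import UserAgent; print(UserAgent().random)')"
    wget --user-agent="$UA" https://example.com/article
    # or present as Googlebot, which some paywalled sites wave through
    wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/article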
***jschwart has joined #archiveteam-bs [16:10]
dd0a13f37Did they remove the old addons from AMO yet?
It looks different
[16:14]
JAAhook54321: Yes, those cam grabs are still running.
dd0a13f37: They will be removed in June.
[16:22]
dd0a13f37Figured out a way to grab them all [16:26]
JAAWe're on it already. [16:28]
dd0a13f37Wiki page is off though
>The total number of addons should be approximately 20,000.
there are 760k .xpi files
[16:28]
JAAThere's an ArchiveBot which has been running for a while (over 2 months), and I think Somebody2 did something as well.
There's a difference between "number of addons" and "number of .xpi files".
The latter includes different platforms and previous versions.
[16:29]
dd0a13f37500k addons though
>459,938 add-ons found
The job for !a https://addons.mozilla.org/ is a slow approach
you can just make a list https://addons.mozilla.org/firefox/downloads/file/760000/ for !ao
[16:30]
JAAI think that's what Somebody2 did (outside of ArchiveBot). [16:32]
dd0a13f37ah okay, doesn't say so on the wiki page [16:32]
JAAHowever, this doesn't grab older versions or different platforms.
Yes, the wiki is often not exactly up to date.
[16:32]
dd0a13f37It does.
look here
https://addons.mozilla.org/en-US/firefox/addon/weather-extension/versions/
Hover over "Add to Firefox" links
[16:33]
JAAHuh, I see.
Do different platforms get individual IDs as well?
[16:33]
dd0a13f371 id = 1 xpi [16:34]
JAAI see. [16:37]
dd0a13f37If the file exists, you get 302, if not, 404
Should I !ao it?
[16:38]
JAAYou're right that !a AMO is not exactly efficient. However, it does also archive various data around the actual addons: descriptions, screenshots, reviews, collections. Plus it provides a browsable interface.
No, let's not.
As mentioned, Somebody2 has done something similar already (not sure what he did *exactly*) and the !a AMO job is also pretty far. But we could do that sometime next year, shortly before they purge all legacy addons.
760k URLs should be pretty quick anyway.
[16:38]
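A sketch of the list-based approach dd0a13f37 proposes: enumerate the numeric file IDs, then rely on the 302/404 distinction he mentions (the ID ceiling comes from the 760k figure above; the output file name is hypothetical):

    seq 1 760000 | sed 's|.*|https://addons.mozilla.org/firefox/downloads/file/&/|' > xpi-urls.txt
    # spot-check one ID: a HEAD request returns 302 if the .xpi exists, 404 if not
    curl -sI -o /dev/null -w '%{http_code}\n' 'https://addons.mozilla.org/firefox/downloads/file/760000/'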
CerynFor reference, you get 86,400 requests per day at one request per second. [16:44]
JAAYeah, obviously. [16:45]
CerynAnd apropos, is there a general crawling rate you prefer to avoid getting rate-limited on sites? I know Reddit only allows a request every 2 seconds.
:)
[16:46]
JAAWe've been hammering them with five connections and a very low delay for weeks now.
They do?
Isn't that for the API?
[16:46]
CerynOh, yes.
Huh. Maybe they don't do that for web scraping. But I'm pretty sure they don't want you to query more often.
[16:46]
JAAThey didn't seem to care when I grabbed a number of subreddits through ArchiveBot a while ago (after Charlottesville). [16:48]
CerynCool. How much did you grab? The entire thing? Did it work out well? [16:48]
dd0a13f37Large sites probably don't care, IMO it's better to start extremely high (e.g. max out your bandwidth) and see if you get blocked [16:49]
CerynHm. Hopefully the block would be very temporary, then. [16:50]
dd0a13f37Just switch IPs [16:51]
JAANo, not the entire thing, just select subreddits, in particular far-right ones.
Stuff like /r/EuropeanNationalism etc.
Some of them got banned recently.
[16:51]
CerynYeah, I meant the entire subreddits. Cool.
Oh? Did you expect that to happen?
[16:51]
JAAWell yeah, as far as it let me.
I think you can only get the last 1000 posts for a particular subreddit the normal way.
For anything older, you have to use the search with a special syntax.
Yeah, I wasn't at all surprised that they finally closed those shitholes.
They've been giving them bad press.
That seems like the only thing they care about.
[16:51]
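The special search syntax JAA refers to is presumably Reddit's cloudsearch timestamp filter (an assumption; the subreddit is the one named above and the epoch range is a placeholder):

    # posts submitted between two Unix timestamps, newest first
    curl 'https://www.reddit.com/r/EuropeanNationalism/search.json?q=timestamp:1483228800..1491004800&syntax=cloudsearch&restrict_sr=on&sort=new&limit=100'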
CerynOh okay. Didn't know that. [16:54]
dd0a13f37What about archiving voat? [16:54]
JAABy the way, there's a full Reddit archive available somewhere also. [16:54]
CerynDo you have any idea how many of the posts you managed to get, then? 90+%? (And how many is that?) [16:55]
JAAThey're continuously grabbing all comments etc.
No clue.
[16:55]
CerynOh sweet! [16:55]
JAANote that these grabs can't get all comments in large threads.
(That full archive I just mentioned should contain those.)
[16:55]
dd0a13f37If it's already fully archived, why bother? [16:56]
CerynWhy not? You don't follow links to further discussion? [16:56]
JAAdd0a13f37: Yeah, I've been thinking about grabbing Voat. I don't have time to set up something proper (i.e. using the API) currently though. [16:56]
dd0a13f37Ceryn: What do you mean? [16:57]
JAAAnd the reason is that that full archive is not easily accessible. I haven't looked at it in detail, but I think it's a database.
You can't browse it in the Wayback Machine, for example.
[16:57]
dd0a13f37Couldn't you generate html pages from the database? [16:57]
JAASure
The data is all there.
But the average user or journalist isn't going to do that.
[16:57]
Ceryndd0a13f37: Asking why full comment trees aren't available in his grab of subreddits. [16:57]
dd0a13f37That seems like most of the projects anyway. What's the point of archiving something, only for it to get darked and public 70+ years later? [16:58]
CerynI'm very interested in having a look at their Reddit database. Maybe it'll be good enough so I won't have to archive what I'm interested in. [16:58]
dd0a13f37Or even better, darking historically important content for political reasons [16:59]
CerynPersonally my archiving interest is just archiving for my own sake. I want to have it. And if I have it, I don't mind sharing it.
Generally, I think having the content available 70+ years later is part of the idea.
Obviously no one here wants data darkened.
[16:59]
JAACeryn: For one, those grabs ignored the per-comment links. But even if you grab those, you still don't handle the "load more comments" stuff. So yeah, it's not easily possible to archive an entire thread (unless you use the API to generate links to each comment in the thread or something like that). [17:01]
CerynOkay. Thanks for the clarification. [17:02]
dd0a13f37Sure, you want it available in 70 years, but if it's not available the 69 years before that, what's the point? To be able to pride yourself in that it's "theoretically" archived, even though you can't do anything useful with it? [17:02]
***Pixi has joined #archiveteam-bs [17:03]
JAAI believe you might be able to get access to it in certain circumstances. Also, laws can change, and if copyright finally gets the reform it so desperately needs, it might be possible for IA to undark it. [17:03]
Ceryndd0a13f37: So, if a data collection is darkened because somebody says you mustn't have it, should you delete it?
dd0a13f37: Or should you preemptively decide not to store anything because it might get darkened?
[17:03]
dd0a13f37JAA: And it can't go the other way around? They have to delete something, and whoops, it's gone since nobody could mirror it [17:04]
CerynDarkening sucks. But I like that the data is there. If someone really needs it, I expect it is possible to get it anyway. [17:04]
dd0a13f37Ceryn: If IA is the only one who has it, that's what happens, in practice. [17:04]
JAAdd0a13f37: I don't think anyone can force them to actually delete it. [17:04]
dd0a13f37Now, no. What about in 30 years? [17:05]
JAAHence Internet Archive Canada and the mirror in Alexandria. [17:05]
CerynHow often is something darkened? Is it really that much of it? [17:05]
dd0a13f37Ceryn: All the IS content, for example
That probably isn't mirrored in many places, since it's so sensitive
[17:05]
CerynAnd if the data is widely desirable, then peer to peer sharing will help keep it alive and distributed too. [17:05]
dd0a13f37Most countries have laws against even touching it [17:06]
CerynWhich IS content? [17:06]
dd0a13f37Their videos [17:06]
CerynOh. Okay. [17:06]
JAAFine to possess in many jurisdictions, as far as I know.
Distributing is a different thing, obviously.
[17:06]
dd0a13f37And since they're too incompetent to use P2P, that won't save it either. [17:07]
Ceryndd0a13f37: For me, in 30 years or whatever, I want to be able to peruse all the things I found interesting or nostalgic or worth saving at any point.
dd0a13f37: So, for me, even data I cannot share is worth storing. Assuming I want it.
[17:08]
dd0a13f37But for the 30 years leading up to that, it basically doesn't exist. [17:08]
CerynWell, whenever I want to see it I can. At any point. [17:09]
JAAYou two are talking about different things. Ceryn means that he can keep his own copy. dd0a13f37 is talking about IA having and distributing it. [17:09]
dd0a13f37Sure, if you store it, that is. But archiving something at IA only for it to sit in a data center for 70 years is utterly pointless [17:09]
CerynIf I am aware that others want it, I can share it most of the time. Sometimes laws don't agree with sharing it. But usually it's doable anyway.
JAA's right.
[17:09]
omglolbahHave to admit, I was somewhat concerned having my pipeline scrape the far-right stuff.... assume I'm in a registry now :p [17:10]
Ceryndd0a13f37: I think it loses much of its value if it's inaccessible to all but IA for 70 years. [17:10]
dd0a13f37Sure, doable, but if IA is the only place that has it, it can get traced back to them if it "leaks" [17:10]
Ceryndd0a13f37: BUT. I think it's very valuable to have it after 70 years and to the end of time. [17:10]
dd0a13f37In a sense, yes, but it's still quite pointless. A darknet archive, boy would that be something [17:11]
omglolbahnot sure why it would be pointless to have copies for future study? [17:11]
JAAIPFS? [17:11]
dd0a13f37IPFS is neither darknet nor production ready [17:11]
JAA*shrug* [17:12]
dd0a13f37bittorrent over i2p is better, just needs a nice frontend
omglolbah: Not entirely pointless, but it's one hell of a delayed gratification
[17:12]
omglolbahomglolbah peers over at the national archives of Norway where shit in runes sits in storage for study
all about time-scales <.<
[17:13]
dd0a13f37https://ia801504.us.archive.org/6/items/asaad2/asaad2.mp4 here is an example of content that will probably not be recovered
only reencodes available in public
Not illegal in the US, just IA randomly deciding to censor it
http://jihadology.net/2017/11/03/new-video-message-from-the-islamic-state-lions-of-the-battle-2-wilayat-%e1%b9%a3ala%e1%b8%a5-al-din/
[17:17]
JAAAre you sure about that? I found an *almost* identical file (just about 60 bytes bigger) within minutes...
Found another one which is 12 bytes bigger.
[17:22]
Ceryndd0a13f37: So, to be clear, the entire argument is "what is the point of IA continuing to store things that have been darkened", right?
Because in all other cases the data is just accessible.
Or not stored and lost.
[17:25]
JAAHmm, found a very ... interesting site while searching for that video. [17:28]
***odemg has joined #archiveteam-bs [17:29]
JAAPretty sure this is run by ISIS. [17:30]
dd0a13f37: The only differences between those video files I found are appended NULs, by the way. Probably fools many simple filters. [17:35]
joepie91dd0a13f37: the primary purpose of IA is preservation; access is just a means to that end
dd0a13f37: from that perspective, it absolutely makes sense to keep something sitting in a datacenter for 70 years if the alternative is total loss
also, periodic reminder that IPFS is neither an archival nor a storage medium, it's a distribution medium
[17:38]
JAASure, but distribution is exactly what was being discussed above. [17:41]
***tuluu has quit IRC (Read error: Operation timed out)
tuluu has joined #archiveteam-bs
[17:48]
pizzaiolo has joined #archiveteam-bs [17:57]
SketchCowWhat's happening here [17:59]
CerynSketchCow: A philosophical discussion on the merits of hoarding: If the data cannot be seen, how do you know it exists? [18:13]
godaneso i found a tape of Empire Falls but it's on dvd: https://www.amazon.com/Empire-Falls-Various/dp/B0009W5IMO
unless there is a reason to digitize the hbo airing, it's been skipped
[18:24]
i'm digitizing tape 1 of Universal vs eric corley
deposition of robert schumann
[18:32]
......... (idle for 41mn)
dd0a13f37JAA: They have an official tool to append NULs, upload to different mirror sites, etc. But the activity died down after Raqqa was liberated.
Ceryn: No, the point I'm trying to make is "what's the point in archiving something if it only gets immediately darked"
[19:16]
Ceryndd0a13f37: You can't know it's going to be darked, can you? [19:17]
dd0a13f37Then we could just shut down newsgrabber etc, wait for them to release their archives in 100 years, yet that's no good solution
Copyrighted content will be, and so will anything that offends their political sensibilities
[19:17]
CerynRight. So IA does not solve availability in the forseeable future for darkened things. It does, however, solve long term data preservation in that case. [19:19]
dd0a13f37Yeah, and then you might as well just bury hard drives in the ground and wait for archeologists to find them.
Libgen is doing a much better job of archiving and distributing knowledge, which seems to be the goal here (a database dump of a site isn't good enough since you need to be able to browse it too)
[19:19]
CerynThe issue doesn't have anything to do with IA, really, does it? It's about other parties disallowing distribution of data.
Sure you could do something, but it probably wouldn't be legal.
[19:20]
dd0a13f37I'm just pointing out that there's a contradiction. [19:21]
CerynI haven't read the IA manifest (yet). joepie91 states they primarily aim to preserve. In which case it makes sense for them to do what they do. [19:21]
dd0a13f37On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips. [19:22]
joepie91dd0a13f37: considering that I am completely unable to download anything from libgen due to country blocks, that might be a premature conclusion
(libgen doing better at distribution)
[19:22]
dd0a13f37What's the point of preservation if the data won't be available within a reasonable time span? [19:22]
joepie91they take a different approach, more legally shaky, with different tradeoffs [19:22]
dd0a13f37joepie91: install torbrowser [19:23]
joepie91you are wholly missing the point here [19:23]
CerynPreservation by its very nature is not about near future needs. [19:23]
joepie91you're stuck on a One True Vision of how you believe archival and distribution should work, without understanding the legal, political, technical, social implications of that approach, and without understanding that a *variety of approaches* is the correct solution here
which is already what we have
[19:23]
dd0a13f37No, I'm not. The content is more available if you need to spend 5 minutes downloading Tor once than if you need to wait 70 years. [19:24]
joepie91which means that different outlets take different approaches with different tradeoffs
dd0a13f37: those two are effectively the same thing for 99.99% of the population
seriously, take a step outside of your own perspective sometimes and understand the wider effects of different approaches
this is getting really tiring
[19:24]
dd0a13f37Just curious, can you access http://93.174.95.27/ ? [19:25]
joepie91nope, empty response [19:25]
dd0a13f37Fair point. But libgen still does make a larger amount of knowledge available to a larger amount of people, and for less resources. [19:25]
joepie91"to a larger amount of people" - this is absolutely false
"a larger amount of knowledge" - this is also very likely false
having to deal with legal complications limits the scalability of an archive
it's no different from how it's more difficult to move to a new house if you have an attic full of stuff you want to keep
the more stuff you need to keep around, the more difficult it is to move and respond to new situations
[19:26]
dd0a13f37If we count in sci-hub, I'm not so sure. It does have a lot of users in academia. [19:27]
joepie91for a reliable, long-term archive - ie. not something that is existing by virtue of currently not being regulated out of existence like libgen - you do not want to create legal problems where there are none
the only reason libgen is still around is because legislation and enforcement haven't been standardized between countries
[19:27]
dd0a13f37archive.org has a larger amount of knowledge, but libgen probably disseminates a larger amount of knowledge/hour [19:27]
joepie91this gap is closing increasingly more
dd0a13f37: what metrics are you basing that on?
[19:28]
dd0a13f37Libgen's issues are only technological. They could easily switch over to the darknet. [19:28]
joepie91no, they couldn't.
let me guess, you're an I2P user?
[19:28]
dd0a13f37Why not? They already have an I2P site. An onion would be trivial
Nope, only Tor.
[19:28]
joepie91so here's the thing, I've had I2P proponents try to argue this with me for literally 7-8 years now
"they could just move to I2P"
[19:29]
dd0a13f37What's wrong with I2P? [19:29]
joepie91the reality is that the barrier to install the necessary software is far too high for the average user, and that moving to a non-clearnet site means you lose 99% of your readership
believing that you can move to a darknet site without being poorly accessible is delusional
[19:29]
Frogging1% availability is still greater than 0% [19:29]
joepie91darknet and clearnet sites absolutely ARE NOT equivalent from an accessibility perspective [19:30]
CerynYou're solving different needs. [19:30]
dd0a13f37Having the main servers behind Tor/similar is not the same thing as only being accessible from Tor/similar. [19:30]
joepie91Frogging: sure, so set up an alternative archive with sketchier data on a darknet site
problem solved
this is what I said about variety of tactics
[19:30]
dd0a13f37pinkapp.io, for example. Or like gettor does. [19:30]
Froggingmakes sense [19:30]
joepie91my problem is with people trying to argue that EVERYTHING needs to make a certain set of tradeoffs, the same one
this includes "the IA shouldn't dark X"
"the IA should make an I2P site"
etc.
[19:30]
Froggingso I don't know what the argument is about if we all agree that multiple methods are viable and each has pitfalls [19:30]
joepie91(yes, I know it's called an eepsite, but not everybody here will)
Frogging: the discussion here started because dd0a13f37 is of the opinion that IA is unnecessarily darking things, it seems
[19:30]
dd0a13f37No, not really. Although IS darking is completely unnecessary, since it doesn't violate US law, that's beside the point. [19:31]
joepie91which seems to be a recurring discussion that never goes anywhere and just produces a lot of noise [19:31]
dd0a13f37The point is that there's a contradiction. On one hand, you're perfectly okay with archiving something even if it gets darked. On the other hand, you're not fine with database dumps, you want siterips. [19:32]
joepie91dd0a13f37: I don't see a contradiction there. [19:32]
FroggingIA will accept either, but database dumps are incompatible with wayback [19:32]
dd0a13f37You're okay with something being inaccessible except for some arcane procedure involving possessing research credentials, sending an e-mail to IA, and having luck. On the other hand, you're not okay with something being inaccessible except for browsing a database dump, despite that "The data is all there", since "the average user or journalist isn't going to do that"
You don't see a minor contradiction there?
[19:34]
Froggingthe IA also benefits greatly from being a recognized legitimate entity [19:34]
dd0a13f37That's true. LG is limited scalability-wise from resources. [19:34]
joepie91dd0a13f37: you're conflating 'temporarily inaccessible' with 'permanently inaccessible'
dd0a13f37: something being darked does not mean it will not ever be public again
it is a temporary measure
[19:34]
dd0a13f37Within my lifetime, yes. [19:35]
joepie91(that is why it's not deleted)
it is temporary nevertheless
[19:35]
dd0a13f37Something in a database dump isn't permanently inaccessible either. [19:35]
joepie91dd0a13f37: if you can actually hear me out for a second [19:35]
dd0a13f37I can browse it just fine, could run a local instance of reddit and make a siterip. [19:36]
joepie91joepie91 sighs [19:36]
dd0a13f37Alright. [19:36]
joepie91the reason siterips are preferred over databases is because that concerns *permanent accessibility* -- you cannot reliably reproduce a website's operation from just a DB dump unless you have literally all of the components and infrastructure involved
nor is there a generic way to, given a database and site source code, make it accessible
this means that there is a cost to accessibility of raw data that many people will not pay
[19:36]
Froggingexact reproducability [19:36]
joepie91this will be perpetually true
not temporarily
this doesn't mean that the raw data *shouldn't* be archived, just that there should be a more accessible option
ie. a siterip
[19:36]
dd0a13f37That is true. However, in the case of reddit, the source code is public. And you could make a siterip from a dump if you're sure that the templates are correct. [19:37]
joepie91dd0a13f37: and that is an immense cost you've described there that many people are literally incapable of paying. [19:37]
Froggingstill not sure what you're arguing because nobody said db dumps weren't allowed [19:37]
dd0a13f37If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
Frogging: sure, but a goal is apparently to archive the pages
If something is darked for 1000 years, is it still "temporary"?
[19:38]
Froggingwhat does dumps vs siterips have to do with darking? [19:39]
joepie91[20:38] <dd0a13f37> If you'd upload the generated siterips to IA though, barring the issue of metadata, wouldn't it be the same thing, only without rate limits?
what?
[19:39]
dd0a13f37Dumps - inaccessible, siterips - accessible; dumps bad, siterips good; darked - inaccessible, siterips - accessible; both acceptable [19:39]
Froggingdumps aren't necessarily inaccessible, and siterips aren't necessarily accessible [19:40]
dd0a13f37Presuming they are, then. [19:40]
Froggingthese are separate issues
why would we presume that? it isn't true
[19:40]
dd0a13f37It's still a contradiction.
In general, it is.
[19:40]
joepie91dd0a13f37: it only looks like a contradiction because you're intentionally ignoring nuance that we've already pointed out
and I am getting tired of this discussion repeatedly clogging up the channel, to be frank
this is absolutely not in the least constructive
[19:40]
dd0a13f37joepie91: If I would generate a siterip from a local issue of reddit and a dump, and then upload the generated pages to WB, if we disregard the fact that the metadata might be off (e.g. page X wasn't fetched from reddit on date X but rather my local instance of reddit on date X, running the same software and DB), don't we in practice have the same thing as a 'proper' siterip, but without being limited by ratelimits? [19:41]
CerynYou could make a wiki page explaining the what and why and refer to that. :) [19:41]
Froggingthere's nothing to explain because the whole argument is based on premises that make no sense and aren't true [19:42]
joepie91dd0a13f37: but we don't "disregard" that, and that still requires work, and I've told you all these things before and you need to stop twisting arguments to make them sound like a contradiction
right, precisely what Frogging said
this is a non-discussion
[19:42]
CerynApparently there's a root of confusion somewhere, if people keep instigating similar discussions. [19:42]
dd0a13f37Well, they're not the exact same thing, and I'm not claiming that either. But they are similar. A siterip (uploaded to WB) is more accessible than a dump though, that's just the way it is. [19:42]
joepie91Ceryn: most people don't... [19:43]
CerynOkay. [19:43]
joepie91dd0a13f37: what are you actually trying to accomplish with this discussion? [19:43]
dd0a13f37I've only 'instigated' this discussion once, please don't conflate me with other people who aren't me.
joepie91: I'm just trying to figure out what's the deal with the contradiction, how accessibility is super important in some cases and utterly unimportant in others, as long as it's "temporary".
[19:43]
joepie91dd0a13f37: except all of these premises are wrong and there is no contradiction, as we have repeatedly told you
so why are you continuing the discussion along the same lines that we've already pointed out are false?
[19:44]
Froggingdd0a13f37: I think IA would like to have everything accessible all the time, but it's not always legally possible. they don't dark things just because they feel like it [19:45]
dd0a13f37I'm not discussing the IA's actions here, I'm discussing what's the point in uploading to IA and not somewhere else if we know IA'll just dark it. However, they do dark things just because they feel like it, the IS videos are a prime example of this. They're constitutionally protected speech.
joepie91: How are the premises wrong? You've said it yourself that browsing a database dump is inconvenient and wayback is convenient.
[19:46]
joepie91oh for fucks sake [19:46]
Froggingyeah, that's already been answered. people can upload to more than one place. there's no policy that says otherwise [19:47]
joepie91dd0a13f37: the premises are wrong because you keep misrepresenting the points being made and/or the reality of things to the point of inaccuracy, and it is utterly pointless to continue discussing any of this with you because you just keep piling on more presumptions and misrepresentations and whatnot
if you don't trust IA to keep something available, upload a copy elsewhere
and with that I hope we can conclude the discussion
[19:47]
dd0a13f37Well okay, that's a fair point.
There was a discussion about google's rate limits earlier. Would it be possible to use startpage or some similar proxy to bypass this? They're not nearly as strict.
[19:48]
FroggingI'm not sure how startpage interfaces with google, but anecdotally I've noticed that the results aren't always the same or as numerous on Startpage
so if that matters, use caution when assuming it's a transparent proxy to google
[19:53]
JAALet me just point you to Searx. [19:53]
dd0a13f37That could be true. Still better than nothing though. [19:54]
JAAAnd also YaCy.
Not sure about the quality of search results on the latter.
[19:54]
dd0a13f37How come searx doesn't get rate limited? [19:54]
Frogginggoogle's ratelimiting is really touchy. I only share an IP with 3 people and I get the captcha at least once a day [19:54]
dd0a13f37Well, then I could just use bing (or friends, yahoo/ddg) [19:55]
JAANot sure about the rate limits on Searx, to be honest.
But it does aggregate results from various engines (including Google, Bing, Yahoo), so that probably helps.
[19:56]
dd0a13f37I'm just wondering. How does it avoid getting rate limited by the backend search engines? [19:57]
***arkhive has joined #archiveteam-bs [20:03]
arkhiveI'm trying to save go90 Original videos. I'm having trouble figuring out how to save web videos. HTML5 or Flash. I'm a bit of a noob. Can someone help me?
go90 is a streaming service (free) from Verizon. they have dumped millions into it and are losing tons of money. laying off employees. So i think it's time to Ctrl-C Ctrl-V
[20:04]
dd0a13f37try youtube-dl
if that fails, open inspect element, network tab, play a video, and see if there's a pattern in the requests it makes and if you can figure them out from the url
>Sorry!
>go90™ Mobile TV Network is only available in the US right now.
[20:08]
godaneyoutube-dl will not work with go90 [20:23]
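Assuming godane is right that youtube-dl can't handle go90, the network-tab route dd0a13f37 outlines usually ends at an HLS manifest; if go90 serves HLS (an assumption), the grab might look like this (URLs and file name hypothetical):

    # find the master .m3u8 in the browser's network tab while a video plays, then:
    ffmpeg -i 'https://cdn.example.com/path/master.m3u8' -c copy episode.mkv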
***TheLovina has quit IRC (Read error: Operation timed out)
TheLovina has joined #archiveteam-bs
TheLovina has quit IRC (Read error: Connection reset by peer)
[20:36]
............. (idle for 1h2mn)
atrocity has quit IRC (Ping timeout: 246 seconds) [21:42]
ranmaarkhive: get a jetpack and sub to VZW for a month?
is there anything of value even on go90?
speaking as a recent employee of VZW
arkhive: maybe download NOX or Bluestacks (or virtualize Android if you can), install the app, sniff the traffic to see if they're calling https or http gets?
oh yeah, go90 isn't bandwidth-free on prepaid
[21:44]
......... (idle for 44mn)
***dd0a13f37 has quit IRC (Quit: Connection closed for inactivity) [22:32]
.... (idle for 19mn)
drumstick has joined #archiveteam-bs
Asparagir has joined #archiveteam-bs
[22:51]
BlueMaxim has joined #archiveteam-bs [23:00]
atrocity has joined #archiveteam-bs [23:11]
..... (idle for 24mn)
jschwart has quit IRC (Quit: Konversation terminated!) [23:35]
