[00:45] *** BlueMax has joined #archiveteam-ot
[02:26] *** godane has joined #archiveteam-ot
[02:26] *** svchfoo1 sets mode: +o godane
[03:44] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:56] *** odemg has joined #archiveteam-ot
[04:09] *** DragonMon has joined #archiveteam-ot
[04:09] hi
[04:10] I hope this is a good place to ask. If I wanted to set up my own archive where should I start?
[04:10] Basically I want a system that's like a bookmark but more comprehensive in the event the content is removed or the internet fails
[04:10] DragonMon: I would start with a Debian running XFS and tools to download everything you want
[04:10] I hope you have money for hard drives and stuff
[04:12] you can go a long way with just slightly-organized filesystems
[04:12] for your backup copy of your archives try something like rsync -a --delete
[04:12] I hope this is slightly in the territory of what you want
[04:14] ivan: right but what 'tools'?
[04:14] if you do not want to buy drives and maintain your storage, you could upload your archived stuff to IA, assuming it's stuff they would want (e.g. because it soon won't be on the web)
[04:14] for archiving websites try my grab-site or something else from https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[04:15] for video/audio probably youtube-dl
[04:15] for "linux isos" probably qbittorrent or rtorrent
[04:16] ivan: do you think it would be horrible to run any of this on a Raspberry Pi 3B+?
[04:16] yes, get a real computer
[04:17] try to maximize built-in SATA ports on the motherboard and room for 3.5" drives
[04:17] (assuming that's where you're storing your data)
[04:18] I mean, unless you think your archive is going to max out at 8TB or something
[04:19] well right now I have around 8.7TB of external drives connected to a Pi 3B+. I also have a few laptops but only the Pi is doing anything server-like
[04:20] if you like playing with fire, i.e. the cloud, Google lets you upload 3750GB/day for $50/mo of gsuite, but who knows how long that will last
[04:20] I did get Nextcloud, Syncthing, and Wallabag working just fine
[04:20] if your threat model is "Internet fails" then that probably won't be satisfying except as a backup or temporary holding area
[04:23] external drives can be iffy when they do transparent encryption that you can't disable (you won't be able to recover the data if the USB controller fails)
[04:24] they're also very unwieldy to power once you have a bunch of them
[04:24] it makes sense to use internal 3.5" drives, it sometimes even makes sense to throw away the warranty on external drives by shucking them
[04:26] *** Arrhenius has joined #archiveteam-ot
[04:26] *** Arrhenius has left
[04:28] ivan: I don't know how things will go, but I don't think I'd have a terribly huge archive
[04:29] I wish you much luck stopping before you become a petabyte-scale hoarder
[04:31] ivan: not a good sign that I'm in #DataHorder on freenode then :p
[04:33] When I bookmark something I'd like to make a one page copy of what I bookmarked. I suspect most of that can be directly processed through Wallabag but for complex pages or those one or two sites I want to go and make a bigger archive
[04:34] I saw grab-site, but does that include some way to view a page directly?
[04:34] near the bottom of the README there are some instructions for viewing the WARCs
[04:35] another simple single-page archiving solution is starting google-chrome with --save-page-as-mhtml and using ctrl-s to save .mhtml files (they include all the DOM after JavaScript execution)
[04:36] another thing is to add a bookmarklet that hits https://web.archive.org/save/URL
[04:36] another is pinboard, which does this as a service. you could write a tool that archives a URL to multiple places. there's a page on gwern.net that describes such a setup (not sure if it still involves crazy Haskell code)
[04:38] if you really want a bookmark-only workflow, you could query your browser's bookmarks frequently and feed them into such an archiver
[04:42] https://webrecorder.io/ may also be useful
[04:48] ivan: I already tried that, might be useful for sites that don't archive well using other means
[04:51] *** MrRadar2 has quit IRC (se.hub irc.efnet.nl)
[04:51] *** Kaz has quit IRC (se.hub irc.efnet.nl)
[04:51] *** hook54321 has quit IRC (se.hub irc.efnet.nl)
[04:51] *** SketchCow has quit IRC (se.hub irc.efnet.nl)
[04:51] *** BnAboyZ has quit IRC (se.hub irc.efnet.nl)
[04:51] *** Tenebrae has quit IRC (se.hub irc.efnet.nl)
[04:51] *** Sue has quit IRC (se.hub irc.efnet.nl)
[04:51] *** BnARobin has quit IRC (se.hub irc.efnet.nl)
[04:51] unless I'm missing something that would make it more automatic
[04:53] https://github.com/PromyLOPh/crocoite can be automated
[04:53] have you seen ArchiveBot and #archivebot?
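The bookmark-to-archiver idea from 04:36 and 04:38 could start from something like the sketch below. It only builds the https://web.archive.org/save/URL request URLs mentioned in the channel; actually sending the requests, rate limiting, and adding further archive targets are left out. The function names are hypothetical and not taken from any existing tool.

```python
from urllib.parse import quote

# Prefix mentioned in the channel: https://web.archive.org/save/URL
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(url: str) -> str:
    """Return the Wayback Machine 'save' URL for a page URL (hypothetical helper)."""
    # Keep the usual URL delimiters intact and escape only unsafe
    # characters such as spaces.
    return WAYBACK_SAVE_PREFIX + quote(url, safe=":/?&=#%+,;@~")


def archive_requests(bookmarks):
    """Yield (bookmark, save_url) pairs, skipping anything that is not http(s)."""
    for url in bookmarks:
        if url.startswith(("http://", "https://")):
            yield url, wayback_save_url(url)


if __name__ == "__main__":
    for original, save_url in archive_requests(["https://example.com/some page"]):
        print(original, "->", save_url)
```

A cron job or a Nextcloud hook could feed the bookmark list in; that part depends on the setup and is not shown.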
[04:55] *** MrRadar2 has joined #archiveteam-ot
[04:55] *** Kaz has joined #archiveteam-ot
[04:55] *** hook54321 has joined #archiveteam-ot
[04:55] *** SketchCow has joined #archiveteam-ot
[04:55] *** irc.efnet.nl sets mode: +oooo MrRadar2 Kaz hook54321 SketchCow
[04:55] *** BnAboyZ has joined #archiveteam-ot
[04:55] *** Tenebrae has joined #archiveteam-ot
[04:55] *** Sue has joined #archiveteam-ot
[04:55] *** BnARobin has joined #archiveteam-ot
[04:55] *** svchfoo3 sets mode: +o MrRadar2
[05:01] ivan: I just stumbled here, I don't know too much
[05:01] first thing that popped up when I searched for "irc internet archive"
[05:02] well, you can use it to archive websites and have the WARCs end up at IA
[05:02] I was originally looking for a chatroom with discussions for archive.org
[05:02] there's #internetarchive but that's for moaning about API issues and such, no IA people there
[05:03] lol
[05:05] I don't mind IA picking up the same stuff I'd like to personally keep a copy of, I still want a personal archive though
[05:30] ivan: thank you for pointing out some solutions. I guess I'll try messing around and seeing what works for me
[05:31] DragonMon: You could occasionally generate a list of URLs in your bookmarks, and then put them through archivebot, which does end up being uploaded to archive.org, but you can download the WARC file from archive.org after it's uploaded
[05:31] *** hook54321 sets mode: +o DrasticAc
[05:32] *** hook54321 sets mode: +o ivan
[05:33] not including stuff like links to go to gmail, facebook, etc
[05:34] hook54321: could I do that headless?
[05:34] What do you mean?
[05:38] hook54321: I mean, can I set up my server to feed links to archivebot?
[05:38] you can upload a list of URLs somewhere and tell ArchiveBot !ao < url-to-url-list
[05:39] I might be able to parse through links synced to nextcloud
[05:40] Upside to this is that it would be available on the wayback machine (potentially a downside in some situations), downside would be that it probably wouldn't be archived immediately when you bookmark it, but if something is urgent then you could always queue it manually.
[05:41] It would probably be every few weeks or every month or something like that I would guess.
[05:42] ivan: through irc only though?
[05:42] correct
[05:42] in #archivebot
[05:43] I was reading that I'd have to get permission. What's the process for getting access?
[05:44] !ao (and I think !ao <) don't require being voiced, but !a does.
[05:44] Asking someone to voice you
[05:47] hook54321: would it be acceptable to have my server send commands every time I save a bookmark?
[05:48] Idk, it would depend I guess.
[05:48] I don't run the server stuff
[05:49] I'm lazy, the more things I can automate in a reasonably secure manner the better for me lol
[05:50] If you automate it like that you'll end up having to track down many WARC files, which might create more work.
[05:50] Unless you found a way to automate that
[05:52] I'm still learning all of this
[05:52] in fact I have to research more into WARC files
[05:53] my current understanding is: WARC files are like archives (zip, tar, etc.) that contain structured file systems that hold a website copy
[05:54] You could also generate an initial bookmarks list, run that through archivebot, and then generate another bookmarks list every once in a while, remove the duplicates that have already been grabbed, and then run it through archivebot again.
[05:57] If you host the text files on your own domain then it would be easier to track down all the WARC files for them on the viewer. http://archive.fart.website/archivebot/viewer/
[05:57] that URL is amazing.
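The periodic workflow suggested at 05:54 (regenerate the bookmark list, drop what was already grabbed, submit the rest) might look roughly like this. It is a minimal sketch: the skip list and function name are hypothetical, URLs are compared as exact strings with no normalization, and the result still has to be uploaded somewhere and handed to ArchiveBot by a person with !ao <.

```python
from urllib.parse import urlsplit

# Hypothetical examples of hosts not worth archiving (login pages etc.),
# per the 05:33 remark about gmail and facebook links.
SKIP_HOSTS = frozenset({"mail.google.com", "www.facebook.com"})


def pending_urls(bookmarks, already_submitted, skip_hosts=SKIP_HOSTS):
    """Return bookmarks not yet submitted, in order, without duplicates."""
    seen = set(already_submitted)
    pending = []
    for url in bookmarks:
        host = urlsplit(url).hostname or ""
        if url in seen or host in skip_hosts:
            continue
        seen.add(url)  # also drops repeats within the new list itself
        pending.append(url)
    return pending


if __name__ == "__main__":
    previous = ["https://b.example/"]
    current = [
        "https://a.example/",
        "https://mail.google.com/",
        "https://b.example/",
        "https://a.example/",
    ]
    # One URL per line, the plain-text shape a URL list for !ao < would take.
    print("\n".join(pending_urls(current, previous)))
```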
[05:59] There's also http://dashboard.at.ninjawedding.org/
[06:00] Not as new of a TLD though
[06:18] god you could really go down a rabbit hole of archiving everything
[06:19] *** Despatche has quit IRC (Read error: Connection reset by peer)
[06:24] hook54321: everything is going to IA?
[06:24] how does that happen exactly?
[06:45] one sec
[06:47] DragonMon: https://www.archiveteam.org/index.php/ArchiveBot#Components
[06:47] Part where that happens specifically is under the staging server section
[06:48] https://archive.org/details/archivebot
[06:48] so the team here got permission? Or can anyone upload their own self-hosted archive?
[06:48] ah
[06:48] hm
[06:49] Got permission. There are a couple of employees here. (at least 2 that I know of)
[06:51] Technically anyone can upload WARC files, but they might or might not be accepted into the wayback machine.
[06:52] hook54321: but this project is trusted so it's more likely to make it into the official archive?
[06:55] DragonMon: I guess that's a way to put it, also if they accepted random people's WARC files into the wayback machine then there's a good chance of someone trying to modify stuff before uploading it.
[06:56] exactly what I was thinking
[06:56] how do they verify it?
[06:58] As far as I know, there's not a way to verify that it hasn't been tampered with. I think JAA looked into whether something like that would be possible a while ago.
[06:59] I mean, there are things that could be an indicator of it potentially being tampered with I guess.
[07:03] hook54321: hmm, scary. I have a domain but it's just for personal nextcloud crap. But I'd hate to find out someone made it look like I once hosted a porno site or something
[07:09] *** Despatche has joined #archiveteam-ot
[10:08] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:36] Yeah, there is no way to verify that WARCs haven't been modified.
[11:36] I looked a bit into whether it's possible to store raw TLS data to create verifiable archives (at least for the payload part), but I didn't get far there.
[11:39] Specifically: TLS is symmetric encryption at its core, with the key negotiated through public-key cryptography and related algorithms (DHE, ECDHE, etc.). So a recording of the handshake could be verifiable, but the payload could just be anything since the client could simply encrypt their malicious data using the symmetric key, I believe.
[11:39] I might be wrong though.
[11:41] On a higher level, you can sort-of verify that a WARC hasn't been tampered with by comparing the payload to an independent, known good capture. But obviously, that's not always realistic, and it also doesn't detect all modifications.
[11:42] DragonMon, hook54321: ^
[12:02] *** jeekl has joined #archiveteam-ot
[12:05] JAA: people are dumb, I might know I never hosted some content but if someone was brazen enough to try and succeed I wouldn't know how to prove otherwise
[12:06] I mean succeed in uploading a modified archive
[12:07] JAA: do you have any idea what someone could do to prove an archive wasn't proper?
[12:09] DragonMon: Pretty much nothing if the "attacker" is competent, I think. No matter what you do, it would always be "he says, she says" at that point.
[12:10] JAA: For example, an agency comes to me saying: "archive.org shows your website had illegal content between X and Y dates. Records show you had ownership of that domain between those dates" If the agency trusted the archive over my word, there isn't much I could do?
[12:11] The more and more I mess around with code, software, and tech the more I see how one could pin all sorts of crap on someone else.
[12:12] It may be possible to fight that legally. No idea how well the archive would hold up in court.
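The comparison described at 11:41 (checking a payload against an independent, known good capture) comes down to comparing digests of the two payloads. Below is a minimal sketch, assuming the response bodies have already been extracted from both WARCs as bytes. The "sha1:" plus Base32 form follows the convention commonly seen in WARC-Payload-Digest headers; the function names are hypothetical. As noted in the channel, a match only shows the two captures agree, not that either one is authentic.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' digest string for a payload (hypothetical helper)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def payloads_match(suspect: bytes, known_good: bytes) -> bool:
    """True if both captures carry byte-identical payloads."""
    return payload_digest(suspect) == payload_digest(known_good)


if __name__ == "__main__":
    original = b"<html><body>hello</body></html>"
    altered = b"<html><body>hello, edited</body></html>"
    print(payloads_match(original, original))
    print(payloads_match(altered, original))
```

Dynamic pages legitimately differ between two captures, so a mismatch on its own is not proof of tampering either.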
[12:12] I don't have the money to fight it or pay for any possible fines
[12:13] I know that snapshots from IA have been used in lawsuits in the past, but I believe those were snapshots retrieved by IA itself (either through their crawlers or through the "save now" thingy).
[12:13] I would have to hope any pro bono lawyer I get is competent otherwise I'm screwed
[12:14] That's probably correct.
[12:14] JAA: god I hope IA keeps good track of what archive came from where
[12:14] I'm sure that IA would help you with something like this though.
[12:14] Yes, they do.
[12:14] You can see some information as a normal user, even.
[12:14] It's a scary thought no matter how it goes down.
[12:15] I don't want to spend the time fighting it. I don't have the money to travel to an out-of-city court if I had to.
[12:18] JAA: "The Archive does not endorse or sponsor any content in the Collections, nor does it guarantee or warrant that the content available in the Collections is accurate, complete, noninfringing, or legally accessible in your jurisdiction" I just caught this in the TOS, this calms my fears a bit
[12:19] I'd probably still have to find a lawyer but that line *should* help out big time
[12:26] DragonMon: https://www.techdirt.com/articles/20160518/08175934474/federal-judge-says-internet-archives-wayback-machine-perfectly-legitimate-source-evidence.shtml
[12:26] JAA: damn
[12:27] But I wouldn't worry about it for two reasons: it's very unlikely that anything like this will ever happen to you, and I'm sure that IA would help you in case it does happen (e.g. provide testimony that the relevant archives were uploaded by a third party or similar).
[12:32] JAA: it could ruin a job though. Do you think IA would help me get a job back if I was fired over content in a bad archive of my site or social media?
[12:35] well I suppose that would lead to a lawsuit and a new job elsewhere.
[12:36] but jeez
[13:27] few third party warcs make their way to the wayback
[13:47] *** wp494 has quit IRC (Read error: Operation timed out)
[13:47] *** wp494 has joined #archiveteam-ot
[13:48] *** svchfoo3 sets mode: +o wp494
[15:22] *** medowar has quit IRC (Ping timeout: 252 seconds)
[15:26] *** DragonMon has quit IRC (Read error: Operation timed out)
[15:28] *** medowar has joined #archiveteam-ot
[18:49] *** schbirid has joined #archiveteam-ot
[19:32] *** SketchCow has quit IRC (Read error: Connection reset by peer)
[20:07] *** xmc is now known as astrid
[20:24] *** ola_norsk has joined #archiveteam-ot
[20:33] *** ola_norsk has quit IRC (leaving)
[20:55] *** schbirid has quit IRC (Quit: Leaving)
[21:39] *** SketchCow has joined #archiveteam-ot
[21:40] *** svchfoo1 sets mode: +o SketchCow
[21:56] *** Gfy has quit IRC (se.hub efnet.portlane.se)
[21:56] *** svchfoo1 has quit IRC (se.hub efnet.portlane.se)
[21:56] *** dxrt_ has quit IRC (se.hub efnet.portlane.se)
[23:17] *** BlueMax has joined #archiveteam-ot