#archiveteam-ot 2018-05-16,Wed

Time Nickname Message
00:45 🔗 BlueMax has joined #archiveteam-ot
02:26 🔗 godane has joined #archiveteam-ot
02:26 🔗 svchfoo1 sets mode: +o godane
03:44 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:56 🔗 odemg has joined #archiveteam-ot
04:09 🔗 DragonMon has joined #archiveteam-ot
04:09 🔗 DragonMon hi
04:10 🔗 DragonMon I hope this is a good place to ask. If I wanted to set up my own archive where should I start?
04:10 🔗 DragonMon Basically I want a system that's like a bookmark but more comprehensive in the event the content is removed or the internet fails
04:10 🔗 ivan DragonMon: I would start with a Debian running XFS and tools to download everything you want
04:10 🔗 ivan I hope you have money for hard drives and stuff
04:12 🔗 ivan you can go a long way with just slightly-organized filesystems
04:12 🔗 ivan for your backup copy of your archives try something like rsync -a --delete
04:12 🔗 ivan I hope this is slightly in the territory of what you want
04:14 🔗 DragonMon ivan: right but what 'tools'?
04:14 🔗 ivan if you do not want to buy drives and maintain your storage, you could upload your archived stuff to IA, assuming it's stuff they would want (e.g. because it's not going to be on the web soon)
04:14 🔗 ivan for archiving websites try my grab-site or something else from https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
04:15 🔗 ivan for video/audio probably youtube-dl
04:15 🔗 ivan for "linux isos" probably qbittorrent or rtorrent
04:16 🔗 DragonMon ivan: do you think it would be horrible to run any of this on a Raspberry Pi 3B+?
04:16 🔗 ivan yes, get a real computer
04:17 🔗 ivan try to maximize built-in SATA ports on the motherboard and room for 3.5" drives
04:17 🔗 ivan (assuming that's where you're storing your data)
04:18 🔗 ivan I mean, unless you think your archive is going to max out at 8TB or something
04:19 🔗 DragonMon well right now I have around 8.7TB of external drives connected to a Pi 3B+. I also have a few laptops, but only the Pi is doing anything server-like
04:20 🔗 ivan if you like playing with fire, i.e. the cloud, Google lets you upload 3750GB/day for $50/mo of gsuite, but who knows how long that will last
04:20 🔗 DragonMon I did get Nextcloud, Syncthing, and Wallabag working just fine
04:20 🔗 ivan if your threat model is "Internet fails" then that probably won't be satisfying except as a backup or temporary holding area
04:23 🔗 ivan external drives can be iffy when they do transparent encryption that you can't disable (you won't be able to recover the data if the USB controller fails)
04:24 🔗 ivan they're also very unwieldy to power-supply after you have a bunch of them
04:24 🔗 ivan it makes sense to use internal 3.5" drives, it sometimes even makes sense to throw away the warranty on external drives by shucking them
04:26 🔗 Arrhenius has joined #archiveteam-ot
04:26 🔗 Arrhenius has left
04:28 🔗 DragonMon ivan: I don't know how things will go, but I don't think I'd have a terribly huge archive
04:29 🔗 ivan I wish you much luck stopping before you become a petabyte-scale hoarder
04:31 🔗 DragonMon ivan: not a good sign that I'm in #DataHorder on freenode then :p
04:33 🔗 DragonMon When I bookmark something I'd like to make a one page copy of what I bookmarked. I suspect most of that can be directly processed through Wallabag but for complex pages or those one or two sites I want to go and make a bigger archive
04:34 🔗 DragonMon I saw grab-site, but does that include some way to view a page directly?
04:34 🔗 ivan near the bottom of the README there are some instructions for viewing the WARCs
04:35 🔗 ivan another simple single-page archiving solution is starting google-chrome with --save-page-as-mhtml and using ctrl-s to save .mhtml files (they include all the DOM after JavaScript execution)
04:36 🔗 ivan another thing is to add a bookmarklet that hits https://web.archive.org/save/URL
04:36 🔗 ivan another is pinboard, which does this as a service. you could write a tool that archives a URL to multiple places. there's a page on gwern.net that describes such a setup (not sure if it still involves crazy Haskell code)
04:38 🔗 ivan if you really want a bookmark-only workflow, you could query your browser's bookmarks frequently and feed them into such an archiver
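A minimal Python sketch of the bookmark-feeding idea above. It only builds the https://web.archive.org/save/URL addresses ivan mentions and makes no network requests; the function name `wayback_save_urls` and the example bookmarks are hypothetical, and how the bookmarks are exported from a browser is left out.

```python
def wayback_save_urls(bookmarks):
    """Return one Wayback Machine save URL per distinct bookmark, in order."""
    prefix = "https://web.archive.org/save/"
    seen = set()
    save_urls = []
    for url in bookmarks:
        url = url.strip()
        # Skip blank lines and bookmarks that were already handled.
        if not url or url in seen:
            continue
        seen.add(url)
        save_urls.append(prefix + url)
    return save_urls


if __name__ == "__main__":
    # Hypothetical bookmark list; a real one would come from a browser export.
    for save_url in wayback_save_urls(["https://example.com/", "https://example.org/page"]):
        print(save_url)
```

Requesting each printed address (from a bookmarklet, a browser, or a script) is what would trigger the capture; that step is deliberately not shown here.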
04:42 🔗 ivan https://webrecorder.io/ may also be useful
04:48 🔗 DragonMon ivan: I already tried that, might be useful for sites that don't archive well using other means
04:51 🔗 MrRadar2 has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 Kaz has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 hook54321 has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 SketchCow has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 BnAboyZ has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 Tenebrae has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 Sue has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 BnARobin has quit IRC (se.hub irc.efnet.nl)
04:51 🔗 DragonMon unless I'm missing something that would make it more automatic
04:53 🔗 ivan https://github.com/PromyLOPh/crocoite can be automated
04:53 🔗 ivan have you seen ArchiveBot and #archivebot?
04:55 🔗 MrRadar2 has joined #archiveteam-ot
04:55 🔗 Kaz has joined #archiveteam-ot
04:55 🔗 hook54321 has joined #archiveteam-ot
04:55 🔗 SketchCow has joined #archiveteam-ot
04:55 🔗 irc.efnet.nl sets mode: +oooo MrRadar2 Kaz hook54321 SketchCow
04:55 🔗 BnAboyZ has joined #archiveteam-ot
04:55 🔗 Tenebrae has joined #archiveteam-ot
04:55 🔗 Sue has joined #archiveteam-ot
04:55 🔗 BnARobin has joined #archiveteam-ot
04:55 🔗 svchfoo3 sets mode: +o MrRadar2
05:01 🔗 DragonMon ivan: I just stumbled here, I don't know too much
05:01 🔗 DragonMon first thing that popped up when I searched for "irc internet archive"
05:02 🔗 ivan well, you can use it to archive websites and have the WARCs end up at IA
05:02 🔗 DragonMon I was originally looking for a chatroom with discussions for archive.org
05:02 🔗 ivan there's #internetarchive but that's for moaning about API issues and such, no IA people there
05:03 🔗 DragonMon lol
05:05 🔗 DragonMon I don't mind IA picking up the same stuff I'd like to personally keep a copy of, I still want a personal archive though
05:30 🔗 DragonMon ivan: thank you for pointing out some solutions. I guess I'll try messing around and seeing what works for me
05:31 🔗 hook54321 DragonMon: You could occasionally generate a list of URLs in your bookmarks, and then put them through archivebot, which does end up being uploaded to archive.org, but you can download the WARC file from archive.org after it's uploaded
05:31 🔗 hook54321 sets mode: +o DrasticAc
05:32 🔗 hook54321 sets mode: +o ivan
05:33 🔗 hook54321 not including stuff like links to go to gmail, facebook, etc
05:34 🔗 DragonMon hook54321: could I do that headless?
05:34 🔗 hook54321 What do you mean?
05:38 🔗 DragonMon hook54321: I mean, can I set up my server to feed links to archivebot?
05:38 🔗 ivan you can upload a list of URLs somewhere and tell ArchiveBot !ao < url-to-url-list
05:39 🔗 DragonMon I might be able to parse through links synced to nextcloud
05:40 🔗 hook54321 Upside to this is that it would be available on the wayback machine (potentially a downside in some situations), downside would be that it probably wouldn't be immediately when you bookmark it, but if something is urgent then you could always queue it manually.
05:41 🔗 hook54321 It would probably be every few weeks or every month or something like that I would guess.
05:42 🔗 DragonMon ivan: through irc only though?
05:42 🔗 hook54321 correct
05:42 🔗 hook54321 in #archivebot
05:43 🔗 DragonMon I was reading that I'd have to get permission. What's the process for getting access?
05:44 🔗 hook54321 !ao (and I think !ao <) don't require being voiced, but !a does.
05:44 🔗 hook54321 Asking someone to voice you
05:47 🔗 DragonMon hook54321: would it be acceptable to have my server send commands every time I save a bookmark?
05:48 🔗 hook54321 Idk, it would depend I guess.
05:48 🔗 hook54321 I don't run the server stuff
05:49 🔗 DragonMon I'm lazy, the more things I can automate in a reasonably secure manner the better for me lol
05:50 🔗 hook54321 If you automate it like that you'll end up having to track down many WARC files, which might create more work.
05:50 🔗 hook54321 Unless you found a way to automate that
05:52 🔗 DragonMon I'm still learning all of this
05:52 🔗 DragonMon in fact I have to research more into WARC files
05:53 🔗 DragonMon my current understanding is: WARC files are like archives (zip, tar, etc.) that contain structured file systems that hold a website copy
05:54 🔗 hook54321 You could also generate an initial bookmarks list, run that through archivebot, and then generate another bookmarks list every once in a while, remove the duplicates that have already been grabbed, and then run it through archivebot again.
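The periodic "remove the duplicates that have already been grabbed" step hook54321 describes could be sketched in Python roughly as follows. The function names and example URLs are hypothetical, and the one-URL-per-line output is an assumption about what a list passed to `!ao <` should look like, not something confirmed in this log.

```python
def urls_to_queue(current_bookmarks, already_grabbed):
    """Return bookmarks not yet grabbed, in order, without repeats."""
    grabbed = set(already_grabbed)
    queue = []
    for url in current_bookmarks:
        if url not in grabbed:
            grabbed.add(url)  # also suppresses repeats within this run
            queue.append(url)
    return queue


def as_url_list(urls):
    """Format URLs as plain text, one per line (assumed list format)."""
    return "".join(url + "\n" for url in urls)


if __name__ == "__main__":
    previous = ["https://example.com/"]
    current = ["https://example.com/", "https://example.org/new"]
    print(as_url_list(urls_to_queue(current, previous)), end="")
```

Keeping the set of already-grabbed URLs in a file between runs would be one way to make each new list contain only bookmarks added since the last run.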
05:57 🔗 hook54321 If you host the text files on your own domain then it would be easier to track down all the WARC files for them on the viewer. http://archive.fart.website/archivebot/viewer/
05:57 🔗 BlueMax that URL is amazing.
05:59 🔗 hook54321 There's also http://dashboard.at.ninjawedding.org/
06:00 🔗 hook54321 Not as new of a TLD though
06:18 🔗 DragonMon god you could really go down a rabbit hole of archiving everything
06:19 🔗 Despatche has quit IRC (Read error: Connection reset by peer)
06:24 🔗 DragonMon hook54321: everything is going to IA?
06:24 🔗 DragonMon how does that happen exactly?
06:45 🔗 hook54321 one sec
06:47 🔗 hook54321 DragonMon: https://www.archiveteam.org/index.php/ArchiveBot#Components
06:47 🔗 hook54321 Part where that happens specifically is under the staging server section
06:48 🔗 hook54321 https://archive.org/details/archivebot
06:48 🔗 DragonMon so the team here got permission? Or can anyone upload their own self-hosted archive
06:48 🔗 DragonMon ah
06:48 🔗 DragonMon hm
06:49 🔗 hook54321 Got permission. There are a couple of employees here. (at least 2 that I know of)
06:51 🔗 hook54321 Technically anyone can upload WARC files, but they might or might not be accepted into the wayback machine.
06:52 🔗 DragonMon hook54321: but this project is trusted so it's more likely to make it into the official archive?
06:55 🔗 hook54321 DragonMon: I guess that's a way to put it, also if they accepted random people's WARC files into the wayback machine then there's a good chance of someone trying to modify stuff before uploading it.
06:56 🔗 DragonMon exactly what I was thinking
06:56 🔗 DragonMon how do they verify it?
06:58 🔗 hook54321 As far as I know, there's not a way to verify that it hasn't been tampered with. I think JAA looked into whether something like that would be possible a while ago.
06:59 🔗 hook54321 I mean, there are things that could be an indicator of it potentially being tampered with I guess.
07:03 🔗 DragonMon hook54321: hmm, scary. I have a domain but it's just for personal nextcloud crap. But I'd hate to find out someone made it look like I once hosted a porno site or something
07:09 🔗 Despatche has joined #archiveteam-ot
10:08 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:36 🔗 JAA Yeah, there is no way to verify that WARCs haven't been modified.
11:36 🔗 JAA I looked a bit into whether it's possible to store raw TLS data to create verifiable archives (at least for the payload part), but I didn't get far there.
11:39 🔗 JAA Specifically: TLS is symmetric encryption at its core, with the key negotiated through public-key cryptography and related algorithms (DHE, ECDHE, etc.). So a recording of the handshake could be verifiable, but the payload could just be anything since the client could simply encrypt their malicious data using the symmetric key, I believe.
11:39 🔗 JAA I might be wrong though.
11:41 🔗 JAA On a higher level, you can sort-of verify that a WARC hasn't been tampered with by comparing the payload to an independent, known good capture. But obviously, that's not always realistic, and it also doesn't detect all modifications.
11:42 🔗 JAA DragonMon, hook54321: ^
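JAA's point about comparing a payload to an independent, known-good capture can be illustrated with a small Python sketch. It assumes both payloads have already been extracted from their WARCs as bytes (extraction is not shown), and the function names are hypothetical. As JAA notes, a match only shows that the two captures agree; it does not detect every modification, and a mismatch can simply mean the page changed between captures.

```python
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a response payload."""
    return hashlib.sha256(payload).hexdigest()


def payloads_match(capture: bytes, reference: bytes) -> bool:
    """True if the capture's payload is byte-identical to the reference."""
    return payload_digest(capture) == payload_digest(reference)


if __name__ == "__main__":
    # Hypothetical payloads standing in for two captures of the same URL.
    reference = b"<html><body>original page</body></html>"
    suspect = b"<html><body>modified page</body></html>"
    print(payloads_match(reference, reference))
    print(payloads_match(suspect, reference))
```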
12:02 🔗 jeekl has joined #archiveteam-ot
12:05 🔗 DragonMon JAA: people are dumb, I might know I never hosted some content but if someone was brazen enough to try and succeed I wouldn't know how to prove otherwise
12:06 🔗 DragonMon I mean succeed in uploading a modified archive
12:07 🔗 DragonMon JAA: do you have any idea what someone could do to prove an archive wasn't proper?
12:09 🔗 JAA DragonMon: Pretty much nothing if the "attacker" is competent, I think. No matter what you do, it would always be "he says, she says" at that point.
12:10 🔗 DragonMon JAA: For example, an agency comes to me saying: "archive.org shows your website had illegal content between X and Y dates. Records show you had ownership of that domain between those dates" If the agency trusted the archive over my word, there isn't much I could do?
12:11 🔗 DragonMon The more and more I mess around with code, software, and tech the more I see how one could pin all sorts of crap on someone else.
12:12 🔗 JAA It may be possible to fight that legally. No idea how well the archive would hold up in court.
12:12 🔗 DragonMon I don't have the money to fight it or pay for any possible fines
12:13 🔗 JAA I know that snapshots from IA have been used in lawsuits in the past, but I believe those were snapshots retrieved by IA itself (either through their crawlers or through the "save now" thingy).
12:13 🔗 DragonMon I would have to hope any pro bono lawyer I get is competent otherwise I'm screwed
12:14 🔗 JAA That's probably correct.
12:14 🔗 DragonMon JAA: god I hope IA keeps good track of what archive came from where
12:14 🔗 JAA I'm sure that IA would help you with something like this though.
12:14 🔗 JAA Yes, they do.
12:14 🔗 JAA You can see some information as a normal user, even.
12:14 🔗 DragonMon It's a scary thought no matter how it goes down.
12:15 🔗 DragonMon I don't want to spend the time fighting it. I don't have the money to travel to an out-of-city court if I had to.
12:18 🔗 DragonMon JAA: "The Archive does not endorse or sponsor any content in the Collections, nor does it guarantee or warrant that the content available in the Collections is accurate, complete, noninfringing, or legally accessible in your jurisdiction" I just caught this in the TOS, this calms my fears a bit
12:19 🔗 DragonMon I'd probably still have to find a lawyer but that line *should* help out big time
12:26 🔗 JAA DragonMon: https://www.techdirt.com/articles/20160518/08175934474/federal-judge-says-internet-archives-wayback-machine-perfectly-legitimate-source-evidence.shtml
12:26 🔗 DragonMon JAA: damn
12:27 🔗 JAA But I wouldn't worry about it for two reasons: it's very unlikely that anything like this will ever happen to you, and I'm sure that IA would help you in case it does happen (e.g. provide testimony that the relevant archives were uploaded by a third party or similar).
12:32 🔗 DragonMon JAA: it could ruin a job though. Do you think IA would help me get a job back if I was fired over content in a bad archive of my site or social media?
12:35 🔗 DragonMon well I suppose that would lead to a lawsuit and a new job elsewhere.
12:36 🔗 DragonMon but jeez
13:27 🔗 Meroje few third party warcs make their way to the wayback
13:47 🔗 wp494 has quit IRC (Read error: Operation timed out)
13:47 🔗 wp494 has joined #archiveteam-ot
13:48 🔗 svchfoo3 sets mode: +o wp494
15:22 🔗 medowar has quit IRC (Ping timeout: 252 seconds)
15:26 🔗 DragonMon has quit IRC (Read error: Operation timed out)
15:28 🔗 medowar has joined #archiveteam-ot
18:49 🔗 schbirid has joined #archiveteam-ot
19:32 🔗 SketchCow has quit IRC (Read error: Connection reset by peer)
20:07 🔗 xmc is now known as astrid
20:24 🔗 ola_norsk has joined #archiveteam-ot
20:33 🔗 ola_norsk has quit IRC (leaving)
20:55 🔗 schbirid has quit IRC (Quit: Leaving)
21:39 🔗 SketchCow has joined #archiveteam-ot
21:40 🔗 svchfoo1 sets mode: +o SketchCow
21:56 🔗 Gfy has quit IRC (se.hub efnet.portlane.se)
21:56 🔗 svchfoo1 has quit IRC (se.hub efnet.portlane.se)
21:56 🔗 dxrt_ has quit IRC (se.hub efnet.portlane.se)
23:17 🔗 BlueMax has joined #archiveteam-ot
