#archiveteam-bs 2019-11-12,Tue


Time Nickname Message
00:36 🔗 JAA Average FBO response time is now close to 30 seconds. Sigh...
00:36 🔗 JAA At least the FTP data is safe.
00:37 🔗 JAA https://archive.org/details/ftp.fbo.gov_20191111
00:42 🔗 JAA The first pass should be done in 4 hours or so, then there'll be another round that will probably take another 15-ish hours. And then obviously the actual entries. I hope they don't shut down too early in the morning on the 12th.
00:43 🔗 JAA Though I'm not actually sure when fbo.gov will really be shut down. The notice at the top isn't very clear.
00:51 🔗 robogoat has quit IRC (Ping timeout: 258 seconds)
01:01 🔗 JAA My SuperiorPics forums grab is done. Now I just need to deal with the ~26 million images and outlinks it found. (Cc ibachandl)
01:03 🔗 JAA (That includes lots of duplicates most likely. Proper numbers soon.)
01:03 🔗 ibachandl nice!
01:03 🔗 ibachandl was it with archivebot or a warrior?
01:04 🔗 ibachandl or something else
01:04 🔗 JAA Neither. I use qwarc for many of my independent archivals nowadays.
01:05 🔗 JAA If there are no rate limits and the site can handle it, that lets me easily do 10k to 20k requests per minute with a single CPU core.
01:06 🔗 JAA I love it when I'm limited by disk or network I/O. :-)
01:09 🔗 robogoat has joined #archiveteam-bs
01:10 🔗 JAA Welp, the FBO pagination just died.
01:11 🔗 JAA That's a shame. No easy way to resume it either because of how shitty that site is.
01:12 🔗 JAA Looks like their search also just broke.
01:15 🔗 JAA "Your search resulted in an error. Please try again or modify your search criteria"
01:16 🔗 JAA I did discover about 2705269 entries though, which should be about 86 %.
01:17 🔗 JAA s/about //
01:18 🔗 JAA I wonder how much will break when I start retrieving those...
01:22 🔗 JAA That lovely site also has permalinks that aren't permanent.
01:23 🔗 JAA Ah no, it's just more madness: https://www.fbo.gov/spg/DON/NAVSUP/N000104/N0010419RK167/listing.html -> "The requested url/solicitation number is found in multiple base notices." + the same search error as above
01:38 🔗 Ivy has joined #archiveteam-bs
02:06 🔗 HP_Archiv has joined #archiveteam-bs
02:06 🔗 HP_Archiv Hey guys. I forgot which one of you was trying to help me the other night for archiving HP-Games.net and then associated/linked out files in a GDrive account. Is that person here?
02:12 🔗 JAA So I'm on track to grab those 2.7M entries on FBO in about 2 days now. Which is too slow, but I can't go much faster as the response time is already elevated. Shitty government sites be shitty.
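A quick back-of-the-envelope check of the figures quoted above, as a sketch only: it uses the numbers from the log (2705269 discovered entries, roughly 86 % of the total, about 2 days to fetch them), and the variable names are illustrative rather than anything JAA used.

```python
# Numbers quoted in the log; everything derived below is plain arithmetic.
discovered = 2_705_269   # entries found before the pagination died
fraction_found = 0.86    # "about 86 %" per the log
days_to_fetch = 2        # "about 2 days" per the log

# Implied size of the full listing, and the sustained request rate
# needed to fetch every discovered entry in the stated time.
estimated_total = discovered / fraction_found
requests_per_minute = discovered / (days_to_fetch * 24 * 60)

print(f"estimated total entries: {estimated_total:,.0f}")
print(f"required rate: {requests_per_minute:,.0f} requests/minute")
```

The required rate comes out at a little under a thousand requests per minute, well below the 10k to 20k per minute mentioned earlier for sites without rate limits, which is consistent with the site's response time being the bottleneck.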
02:26 🔗 HP_Archiv @JAA, do you think you can help with a workaround for what I'm trying to do?
02:27 🔗 JAA HP_Archiv: Sorry, no experience with Google Drive downloads. There should be tools for that out there though.
02:31 🔗 HP_Archiv @JAA, no worries. I forget the handle of the person that was helping me with this the other night. He was going to see if a script might pull them down and re-upload into archivebot. Not sure how to do it though, or where to look for said tools
03:08 🔗 manjaro-u has quit IRC (Read error: Operation timed out)
03:25 🔗 markedL Looks like you were talking with betamax and Igloo
03:35 🔗 HP_Archiv @markedL thank you, couldn't remember their names..
03:36 🔗 HP_Archiv @betamax and Igloo, do either of you think you can assist further with archiving HP-Games.net?
04:12 🔗 HP_Archiv @markedL, by the way, how were you able to find previous chat history?
04:14 🔗 markedL I didn't note what kind of computer you're on, but most IRC clients will have an option to keep a log of your prior communications
04:19 🔗 Raccoon (and short of power outages, i bet more than a few can simply scroll up the last 5000, 20,000 lines.)
04:20 🔗 astrid i have in fact closed my irc client at least once this year
04:29 🔗 synm0nger has joined #archiveteam-bs
04:30 🔗 odemgi has joined #archiveteam-bs
04:30 🔗 SynMonger has quit IRC (Ping timeout: 246 seconds)
04:34 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
04:36 🔗 qw3rty has joined #archiveteam-bs
04:43 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:53 🔗 icedice2 has joined #archiveteam-bs
04:53 🔗 fredgido_ has joined #archiveteam-bs
04:53 🔗 Damme_ has joined #archiveteam-bs
04:54 🔗 yano_ has joined #archiveteam-bs
04:55 🔗 benjinsmi has joined #archiveteam-bs
04:56 🔗 TC01_ has joined #archiveteam-bs
04:56 🔗 af10b3e5e has joined #archiveteam-bs
04:56 🔗 girst_ has joined #archiveteam-bs
04:57 🔗 odemgi_ has joined #archiveteam-bs
04:57 🔗 Maylay_ has joined #archiveteam-bs
04:57 🔗 Maylay_ has quit IRC (Remote host closed the connection!)
04:57 🔗 thejsa_ has joined #archiveteam-bs
04:58 🔗 Maylay_ has joined #archiveteam-bs
04:58 🔗 Dark_Star has joined #archiveteam-bs
04:58 🔗 tuluu_ has joined #archiveteam-bs
05:00 🔗 chfoo_ has joined #archiveteam-bs
05:00 🔗 Fusl__ sets mode: +o chfoo_
05:00 🔗 Fusl sets mode: +o chfoo_
05:00 🔗 Fusl_ sets mode: +o chfoo_
05:00 🔗 omglolba- has joined #archiveteam-bs
05:05 🔗 ibachandl has quit IRC (Quit: Page closed)
05:12 🔗 odemgi has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 tuluu has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 icedice has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 omglolbah has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 Damme has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 halt_ has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 d5f4a3622 has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 dashcloud has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 benjins has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 girst has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 thejsa has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 fredgido has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 ctrl has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 Maylay has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 nepeat has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 fuzzy8021 has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 chfoo has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 ndiddy has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 wp494 has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 zerkalo has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 TC01 has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 Dark-Star has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 yuitimoth has quit IRC (irc.efnet.nl efnet.deic.eu)
05:12 🔗 yano has quit IRC (irc.efnet.nl efnet.deic.eu)
05:15 🔗 zerkalo_ has joined #archiveteam-bs
05:18 🔗 IAmbience has quit IRC (Quit: Connection closed for inactivity)
05:28 🔗 yuitimoth has joined #archiveteam-bs
05:29 🔗 fuzzy8021 has joined #archiveteam-bs
05:31 🔗 HP_Archiv Odd, unless I'm missing something, I can't scroll up
05:32 🔗 HP_Archiv I've left the IRC and come back a few times since then. So maybe it's current-session only?
05:33 🔗 astrid yes probably. depending on your client. some load history from before; some do not.
05:35 🔗 HP_Archiv Hm, okay. No matter in this chat or in #archivebot, I can't go past my actual login point to view history.
05:35 🔗 markedL What client are you using now, and what type of computer are you on? People here will set you up with something
05:35 🔗 HP_Archiv I'm using Chrome, on a Windows 10 machine
05:38 🔗 jake_test Are you using the basic web client? That wouldn't store history at all.
05:38 🔗 ndiddy has joined #archiveteam-bs
05:39 🔗 HP_Archiv Yeah I am
05:39 🔗 HP_Archiv And I figured as much ^^
05:39 🔗 HP_Archiv I didn't know there were other ways to sign into the chat other than through the web client
05:42 🔗 HP_Archiv How else do I sign in?
05:42 🔗 jake_test You would have to grab an IRC client. There are so many, for practically every operating system; the people here may have some better suggestions.
05:45 🔗 HP_Archiv I'm seeing HexChat as one option, also mIRC is another
05:45 🔗 HP_Archiv I didn't realize I'd have to use my own chat client, but it's fine. @jake_text, which one do you use?
05:48 🔗 HP_Archiv @jake_test*
05:48 🔗 jodizzle Can we take this to #archiveteam-ot?
05:50 🔗 HP_Archiv Done ^^ Thanks @jodizzle
05:56 🔗 HP_Archiv Also, I'd like to get privileges for proper site-wide archiving and ingestion into archivebot. I was told previously that voice and something else commands are required. How do I become authorized to do that myself?
06:21 🔗 ctrl has joined #archiveteam-bs
06:56 🔗 jodizzle It's usually up to someone with ops (an '@' next to their name, at least on my IRC client) in the #archivebot channel to decide whether to give you the necessary permissions.
06:56 🔗 nepeat has joined #archiveteam-bs
06:57 🔗 HP_Archiv @jodizzle, yeah, as I'm finding out ^^
06:57 🔗 jodizzle Which you usually get by hanging around and wanting to archive things.
07:41 🔗 Deewiant has quit IRC (Ping timeout: 186 seconds)
08:19 🔗 markedH has joined #archiveteam-bs
09:58 🔗 godane SketchCow: so i looked at Success Magazine and i may be able to get a back from july 2011 to now
10:03 🔗 BlueMax has quit IRC (Remote host closed the connection)
10:03 🔗 BlueMax has joined #archiveteam-bs
10:04 🔗 BlueMax has quit IRC (Remote host closed the connection)
10:05 🔗 BlueMax has joined #archiveteam-bs
10:19 🔗 tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.)
10:19 🔗 tuluu has joined #archiveteam-bs
10:46 🔗 icedice2 has quit IRC (Leaving)
11:35 🔗 Deewiant has joined #archiveteam-bs
11:38 🔗 BlueMax has quit IRC (Remote host closed the connection)
11:38 🔗 BlueMax has joined #archiveteam-bs
11:54 🔗 BlueMax has quit IRC (Remote host closed the connection)
11:54 🔗 BlueMax has joined #archiveteam-bs
11:55 🔗 BlueMax has quit IRC (Remote host closed the connection)
11:56 🔗 BlueMax has joined #archiveteam-bs
12:04 🔗 BlueMax has quit IRC (Remote host closed the connection)
12:05 🔗 BlueMax has joined #archiveteam-bs
12:09 🔗 BlueMax has quit IRC (Remote host closed the connection)
12:09 🔗 BlueMax has joined #archiveteam-bs
12:10 🔗 BlueMax has quit IRC (Remote host closed the connection)
12:11 🔗 BlueMax has joined #archiveteam-bs
12:47 🔗 mls_ has quit IRC (Remote host closed the connection)
13:06 🔗 mtntmnky_ has quit IRC (Remote host closed the connection)
13:06 🔗 mtntmnky_ has joined #archiveteam-bs
13:33 🔗 synm0nger has quit IRC (Quit: Wait, what?)
13:34 🔗 SynMonger has joined #archiveteam-bs
13:39 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
13:55 🔗 yano_ is now known as yano
13:59 🔗 Damme_ has quit IRC (Read error: Connection reset by peer)
13:59 🔗 Damme_ has joined #archiveteam-bs
14:08 🔗 synm0nger has joined #archiveteam-bs
14:08 🔗 SynMonger has quit IRC (Read error: Operation timed out)
14:08 🔗 HP_Archiv has quit IRC (Quit: Page closed)
14:51 🔗 phillipsj has quit IRC (Remote host closed the connection)
14:51 🔗 phillipsj has joined #archiveteam-bs
15:12 🔗 omglolba- has quit IRC (Read error: No route to host)
15:16 🔗 omglolbah has joined #archiveteam-bs
15:28 🔗 jc86035 has joined #archiveteam-bs
15:31 🔗 JAA FBO just shut down a few minutes ago.
15:33 🔗 JAA Looks like I was able to retrieve only about 390k entries of the over 3 million total/2.7 million discovered.
15:33 🔗 JAA And none of the downloads.
15:34 🔗 prq has joined #archiveteam-bs
15:39 🔗 JAA Also, I have a list of about 8.8 million images and 9.2 million outlinks (deduped) from the SuperiorPics forums. That's going to take a while...
15:41 🔗 akierig has joined #archiveteam-bs
15:41 🔗 akierig_ has joined #archiveteam-bs
15:42 🔗 markedH has quit IRC (Read error: Operation timed out)
15:42 🔗 markedH has joined #archiveteam-bs
15:46 🔗 akierig has quit IRC (Read error: Operation timed out)
16:04 🔗 prq I'm coming up to speed on the various archiveteam projects. Which piece of software are you using for a project that large? Is this being handled by the warrior distributed archive thing that people can run?
16:05 🔗 JAA prq: For those images and outlinks from SuperiorPics? I'll probably throw them into ArchiveBot since they're not really urgent, so it doesn't matter much if it takes a month or two.
16:07 🔗 prq http://dashboard.at.ninjawedding.org/3 - this is the dashboard for the irc archivebot you're referring to, right?
16:07 🔗 astrid yea
16:07 🔗 prq neat.
16:08 🔗 prq https://www.archiveteam.org/index.php?title=ArchiveBot this says that it'll eventually be injected to the wayback machine-- is that possible because the archive.org folks trust the archiveteam? I had been looking for a way to inject a warc to the wayback machine and it seems that any random joe isn't able to.
16:08 🔗 JAA Correct
16:09 🔗 prq cool cool.
16:11 🔗 prq my story is that I'm going through the process of leaving a high control religion, and I've come across tons of dead links on older resources-- wayback machine is helpful, but doesn't have everything of course. I'm here trying to learn about all the tools available for site archival. I may be up for running my own mini-archive for my special interest, but I'm one person with a homelab freenas server.
16:11 🔗 akierig has joined #archiveteam-bs
16:12 🔗 astrid oofh
16:13 🔗 jc86035 has quit IRC (Quit: Leaving.)
16:13 🔗 jc86035 has joined #archiveteam-bs
16:14 🔗 jc86035 has quit IRC (Client Quit)
16:14 🔗 prq (high control religion is a nice way to say cult-- those groups like to control information, which turns into editing stuff they publish online (1984 style))
16:16 🔗 astrid yea
16:16 🔗 astrid im aware of the concept, thankfully haven't gotten tangled up in any of that
16:16 🔗 astrid sounds like a hell of a thing
16:17 🔗 akierig_ has quit IRC (Read error: Operation timed out)
16:21 🔗 prq one interesting thing in this exittor community is there are lots of podcasts. Those tend to not be in the wayback machine.
16:22 🔗 prq listening to old episodes of ones that are still around, they'll promote some other podcast and it's just gone off the face of the internet. :/
16:22 🔗 prq this is happening more and more as I dig deeper, hence my interest in archiving.
16:26 🔗 prq it is looking more and more like I'll end up needing to do some coding to get podcast archival to be a thing. wouldn't take too much to make it happen.
16:29 🔗 astrid there's a lot of podcasts uploaded via https://archive.org/upload/ : https://archive.org/search.php?query=podcast
16:36 🔗 prq those seem to be original content producers who have opted to on purpose host their podcast via archive.org instead of doing a self-host or a paid podcast hoster.
16:36 🔗 prq I didn't see a way to drop a podcast rss file into the wayback machine to be indexed though.
16:36 🔗 prq I did do a little experiment, and I can request the individual rss file, and even the linked mp3 files be indexed by the wayback machine.
16:37 🔗 prq but I can't turn around and point a podcast app at the wayback copy. it takes a little more doing to rehost a podcast.
17:05 🔗 astrid hm yeah
17:05 🔗 astrid youd have to edit the urls in the rss file
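A minimal sketch of the URL edit astrid describes, assuming the feed is plain RSS 2.0 with `<enclosure url="...">` elements and that each audio file already has a Wayback Machine capture. The function name `rewrite_enclosures` and the timestamp argument are illustrative; the `id_` suffix asks the Wayback Machine for the original bytes rather than a rewritten page.

```python
import xml.etree.ElementTree as ET

WAYBACK_PREFIX = "https://web.archive.org/web/"


def rewrite_enclosures(rss_text: str, timestamp: str) -> str:
    """Point every <enclosure> at a Wayback capture of the same URL.

    `timestamp` is a Wayback timestamp such as "20191112000000"; the
    Wayback Machine redirects to the nearest capture it actually holds.
    """
    root = ET.fromstring(rss_text)
    for enclosure in root.iter("enclosure"):
        original = enclosure.get("url")
        # Skip enclosures that are missing a URL or already rewritten.
        if original and not original.startswith(WAYBACK_PREFIX):
            enclosure.set("url", f"{WAYBACK_PREFIX}{timestamp}id_/{original}")
    # Note: ElementTree renames unregistered namespace prefixes (e.g. itunes:)
    # on output; call ET.register_namespace first if the feed uses them.
    return ET.tostring(root, encoding="unicode")
```

The rewritten feed still has to be hosted somewhere a podcast app can fetch it, which is the "little more doing" prq mentions.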
17:10 🔗 prq I have managed to "rescue" one podcast. its audio was still in stitcher and its show notes were in player.fm and stitcher. I host it in my homelab for just me. I would like to make a tool for podcast authors to help them preserve their content way past the day they stop paying for libsyn. maybe a tool they can run that will grab all the episodes and put them in the archive.org free podcast hosting.
17:18 🔗 manjaro-u has joined #archiveteam-bs
17:46 🔗 hook54321 prq: depending on the specific group, you might get lots of pushback, and possibly legal threats.
17:47 🔗 prq right-- that's a major concern in all of this for sure.
17:47 🔗 prq the podcast I "rescued" I cannot rehost due to those concerns.
17:48 🔗 prq I have talked to a few podcasters about this, and it is a fairly common concern that they have. I'm hoping to get some tools and resources to make it much much easier for them to help preserve their content.
18:06 🔗 Damme_ has quit IRC (Read error: Connection reset by peer)
18:06 🔗 Damme_ has joined #archiveteam-bs
18:09 🔗 DogsRNice has joined #archiveteam-bs
18:23 🔗 zhongfu has quit IRC (Ping timeout: 745 seconds)
18:24 🔗 zhongfu has joined #archiveteam-bs
18:35 🔗 X-Scale has quit IRC (Ping timeout: 252 seconds)
18:35 🔗 X-Scale` has joined #archiveteam-bs
18:36 🔗 X-Scale` is now known as X-Scale
18:41 🔗 akierig has quit IRC (Remote host closed the connection)
18:58 🔗 katocala has quit IRC ()
19:00 🔗 katocala has joined #archiveteam-bs
19:16 🔗 katocala has quit IRC ()
19:21 🔗 katocala has joined #archiveteam-bs
19:25 🔗 prq hook54321▸ is there a good resource to be intelligent about archiving content with regard to archiving different sites like that? I imagine that's a pretty regular concern for archiveteam and archive.org.
19:34 🔗 JAA prq: There is no "one-size-fits-all" archival approach. Every site is different, and while you can usually get quite far just doing a recursive crawl of all links (which is what we do for example with ArchiveBot and what the Internet Archive does in their web-wide crawls), that won't necessarily result in a complete archive and might have all sorts of other issues. Basically, web archival is somewhere
19:34 🔗 JAA between computer science and art. There is no way really to learn how to do it; you do it, you learn from your inevitable mistakes, and eventually you get an intuition on how you need to proceed with a particular site.
19:35 🔗 prq sorry, I should have included more context in my question. I was building on a comment about content owners pushing back on archival efforts and legal concerns.
19:35 🔗 JAA Ah
19:36 🔗 prq and of course there's no one-size-fits-all on that either.
19:36 🔗 prq but I'd like to understand the issue better
19:36 🔗 JAA One good approach is "archive it, keep copies safe in private, make it public at some point in the future when the copyright owners no longer care".
19:37 🔗 prq that's definitely something I've considered for some content.
19:39 🔗 katocala has quit IRC ()
19:44 🔗 hook54321 prq: If it's content owned by, for example, the church of scientology and it's stuff they don't want public, I wouldn't re-publish it unless you want there to be a potential you'll have to deal with serious legal stuff.
19:44 🔗 hook54321 It depends on who owns it and what it is for the most part.
19:45 🔗 prq fortunately, it isn't the church of scientology. The organization I have in mind is fortunately a lot less aggressive in those regards. They do put copyright notices on stuff, but don't seem to request takedowns from the wayback machine.
19:47 🔗 hook54321 There's some that afaik for the most part don't care as long as the content was public in the first place.
19:47 🔗 prq they rely more on manipulation tactics of the followers (like gaslighting) rather than trying to take on the whole internet. it does seem they don't quite know what to do in the internet age (there are well documented cases of collecting books, burning them, and distributing edited ones back in the 19th century)
19:47 🔗 prq that tactic just won't work as well these days
19:47 🔗 prq and if I have to sit on a collection privately to hedge against it, I'm prepared to, but I don't think that needs to be the only thing I'd do.
19:51 🔗 prq The fact that I'm only interested in their publicly available stuff is helpful. there are people that do work with insiders to leak private information. I might archive *their* site too, but I'm not going to do any leaking myself.
19:53 🔗 katocala has joined #archiveteam-bs
20:09 🔗 jc86035 has joined #archiveteam-bs
20:11 🔗 akierig has joined #archiveteam-bs
20:12 🔗 jc86035 hi. I've been archiving stuff to the wayback machine on my own for a while and was wondering if archive team would want to host some of those things. currently some of it runs on the wikimedia servers and pings web.archive.org directly, the rest is on my laptop (also mostly pinging web.archive.org directly) and I run it at irregular intervals.
20:12 🔗 jc86035 some of them are fairly small scale (e.g. apple music chart playlists), some of them are quite a bit bigger
20:14 🔗 jc86035 most of it is stuff that's ephemeral (i.e. it changes daily/hourly and isn't archived by the host website), not stuff that's very likely to disappear soon, so I'm not sure if it would fit within the archive team's scope
20:16 🔗 markedL you work at wikimedia?
20:16 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
20:17 🔗 ShellyRol has joined #archiveteam-bs
20:18 🔗 X-Scale` has joined #archiveteam-bs
20:19 🔗 X-Scale has quit IRC (Ping timeout: 252 seconds)
20:19 🔗 X-Scale` is now known as X-Scale
20:23 🔗 jc86035 has quit IRC (Quit: Leaving.)
20:23 🔗 jc86035 has joined #archiveteam-bs
20:24 🔗 jc86035 markedL: no, I host things on Wikimedia Toolforge, which is technically something that anyone can sign up for but it's supposed to be only used for Wikimedia-related stuff
20:24 🔗 jc86035 https://wikitech.wikimedia.org/, https://tools.wmflabs.org/
20:29 🔗 jc86035 has quit IRC (Quit: Leaving.)
20:31 🔗 X-Scale has quit IRC (Ping timeout: 252 seconds)
20:33 🔗 jc86035 has joined #archiveteam-bs
20:38 🔗 jc86035 has quit IRC (Client Quit)
20:38 🔗 jc86035 has joined #archiveteam-bs
20:40 🔗 jc86035 has quit IRC (Client Quit)
20:41 🔗 jc86035 has joined #archiveteam-bs
20:43 🔗 jc86035 has quit IRC (Client Quit)
20:43 🔗 jc86035 has joined #archiveteam-bs
20:44 🔗 mls_ has joined #archiveteam-bs
20:46 🔗 jc86035 has quit IRC (Client Quit)
20:46 🔗 jc86035 has joined #archiveteam-bs
20:48 🔗 jc86035 has quit IRC (Client Quit)
20:49 🔗 X-Scale has joined #archiveteam-bs
20:49 🔗 jc86035 has joined #archiveteam-bs
20:54 🔗 jc86035 has quit IRC (Client Quit)
20:54 🔗 jc86035 has joined #archiveteam-bs
20:56 🔗 akierig_ has joined #archiveteam-bs
20:57 🔗 akierig has quit IRC (Read error: Operation timed out)
20:58 🔗 jc86035 has quit IRC (Client Quit)
20:58 🔗 jc86035 has joined #archiveteam-bs
21:11 🔗 jc86035 did anyone respond to me in the last hour? I can't see because I kept disconnecting from the server and I don't know where to find the logs
21:12 🔗 prq nope
21:15 🔗 jc86035 anyway, I guess the main question is whether archiveteam does stuff like scheduled periodic archiving (e.g. once an hour/week/month etc)
21:16 🔗 jc86035 since this is primarily what I've been doing but I'm aware it might not really fall into the scope (since iirc the archivebot instructions sort of discouraged that sort of thing)
21:17 🔗 astrid we don't have any tooling around that; it keeps coming up so maybe we should. want to work on a project? :)
21:18 🔗 jc86035 I'd like to, though I'm constantly being distracted by other projects in completely unrelated areas so I'm not going to guarantee anything
21:19 🔗 jc86035 (and also have irl commitments other than those, ofc)
21:30 🔗 jc86035 astrid: how would this sort of project work, exactly? would one just, like, make a github and start putting code into it?
21:31 🔗 astrid pretty much
21:31 🔗 jc86035 I've pretty much done archival alone excluding my ArchiveTeam Warrior usage so idk how this actually works
21:31 🔗 astrid github.com/archiveteam has a bunch of examples of stuff
21:32 🔗 jc86035 I imagine we wouldn't just directly use the warrior architecture? I suppose you could just distribute it but it might be overkill
21:33 🔗 astrid it really depends on what you're doing!
21:33 🔗 jc86035 For the once-an-hour stuff I just used a cron job and wget/xargs/url list so most of that stuff wasn't very complicated at all
21:33 🔗 astrid warrior is a great way to run the same archiving tool on a bunch of different machines and collect all the results
21:34 🔗 astrid you could run an archivebot pipeline and restrict it to your periodic jobs, and then set up some kind of cron job to feed that
21:34 🔗 jc86035 On the other hand stuff like musescore.com actually requires getting multiple url fragments out of the page source (there aren't any direct links), the scale is a lot smaller than some other websites (there's only about 500k–600k public scores) so I'd probably favour something more akin to a bash script for that
21:35 🔗 astrid sounds like you might need several different things :)
21:35 🔗 jc86035 Yeah it would, I've had to write several different scripts for different things (e.g. the Spotify website is probably bigger than archivebot can handle, so I had to do outlinks one round at a time)
21:36 🔗 jc86035 [yes, Spotify is ephemeral, sometimes even artists get deleted, not to mention the constantly changing playlists and such]
21:39 🔗 prq this wget with delay command I started back on friday is still going, so I'm starting to read up on the archivebot/warrior pipeline stuff a bit.
21:40 🔗 Kaz that's kinda what we did with newsgrabber, but that's very out-of-action for the time being
21:41 🔗 jc86035 unfortunately I did almost everything in bash so some of the stuff actually ended up breaking my computer's maximum process limit, I eventually had to build in retrying sets of urls from the parent script just to work around the issue
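One way to avoid the process-limit problem described above is a fixed-size worker pool with retry rounds. This is a sketch under assumptions, not jc86035's script: `run_with_retries` and its parameters are hypothetical names, and `fetch` stands in for whatever actually retrieves or submits a URL.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(urls, fetch, max_workers=8, max_rounds=3):
    """Fetch URLs with bounded concurrency, retrying failures in later rounds.

    Returns (results, still_failed). At most `max_workers` fetches run at
    once, so the number of threads never grows with the size of the URL list.
    """
    def attempt(url):
        try:
            return url, fetch(url), None
        except Exception as exc:  # any failure just means "retry next round"
            return url, None, exc

    pending = list(urls)
    results = {}
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            outcomes = list(pool.map(attempt, pending))
        pending = []
        for url, value, error in outcomes:
            if error is None:
                results[url] = value
            else:
                pending.append(url)
    return results, pending
```

Retrying the failed set as a whole mirrors the "retrying sets of urls from the parent script" workaround, but the cap on workers removes the need to hit the limit in the first place.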
21:41 🔗 JAA Spotify probably uses a bunch of JS and wouldn't work in ArchiveBot at all.
21:42 🔗 jc86035 On the contrary, if you use a different user agent you end up on their old site, which still has all the metadata and outlinks and such
21:42 🔗 jc86035 It's definitely not an accurate picture of the Spotify interface, but it gives a very good view of the Spotify library
21:42 🔗 jc86035 (* user agent: basically anything that the web player won't work with, wget for example)
21:43 🔗 markedL were you using savepagenow?
21:43 🔗 jc86035 yes
21:43 🔗 jc86035 for pretty much everything, I did automate archive.is for a while (for a very small number of tasks) but toolforge wasn't cooperative so it just stopped working
21:46 🔗 markedL so what kind of jobs did you find make sense? ignoring the technical questions
21:47 🔗 jc86035 make sense in terms of what?
21:47 🔗 jc86035 like, as in, what was I archiving with what process?
21:48 🔗 markedL yeah
21:50 🔗 markedL or what content at what frequency and scope
21:52 🔗 jc86035 txt/xargs/wget/cron: youtube trending (e.g. https://web.archive.org/web/*/youtube.com/feed/trending?gl=AE, 91 national plus gaming/movies/etc), apple music charts (all once a day), youtube music playlists (don't remember), socialblade/youtube data (once a day), wikipedia music charts (https://en.wikipedia.org/wiki/Wikipedia:Record_charts/List), youtube most viewed (from wikipedia articles) and some others I think
21:53 🔗 jc86035 I personally think some of it was overkill and I didn't really do it properly so I would definitely do something less intensive if I started those again
21:53 🔗 jc86035 they all stopped working at the end of August because IA introduced the 15/min rate limit
21:54 🔗 jc86035 also possibly they banned toolforge's IP addresses because I didn't figure out that the rate limit was being hit for a while and didn't fix it for a few weeks
21:54 🔗 jc86035 only apple music and wikipedia stuff are running right now
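The URL-list-plus-cron approach, paced to stay under the 15 per minute limit mentioned above, could look roughly like this. It is a sketch with assumptions: it uses the plain GET form of Save Page Now (`https://web.archive.org/save/<url>`), the function names are illustrative, and the real service may answer with redirects or errors that this does not handle beyond recording a failure.

```python
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def submit(url: str) -> int:
    """Ask Save Page Now to capture one URL; returns the HTTP status code."""
    with urllib.request.urlopen(SAVE_ENDPOINT + url, timeout=120) as resp:
        return resp.status


def archive_all(urls, max_per_minute=15, fetch=submit, sleep=time.sleep):
    """Submit each URL in turn, spacing requests to respect a rate limit.

    `fetch` and `sleep` are injectable so the pacing can be tested offline.
    Failed submissions are recorded as None instead of aborting the run.
    """
    interval = 60.0 / max_per_minute
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(interval)  # wait between requests, not before the first
        try:
            results[url] = fetch(url)
        except OSError:  # urllib's URLError/HTTPError and timeouts are OSErrors
            results[url] = None
    return results
```

A cron entry would then run a small script that reads the URL list from a text file and calls `archive_all` on it once per hour or day.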
21:56 🔗 jc86035 specialized scripts: musescore (particularly successful, it's not even supposed to work but they made their image server URL structure predictable so I managed to archive every single public score in July),
21:57 🔗 jc86035 new alexa.com website (recursive from page links and also from other url lists, once every few months, in April or so I also fed a few million urls in from external sources and used their image server to test if they were worth archiving, no images need to be archived because there aren't any unique images on any pages)
21:58 🔗 jc86035 (musescore runs every hour and basically hoovers up any new scores, unfortunately they limited the public score indexes to 101 pages but the scores have sequential identifiers)
22:01 🔗 jc86035 (technically speaking the socialblade/youtube stuff also used some specialized scripts to select youtube channel IDs but it's still a cron job, and I've done musescore runs without cron separate to the cron job)
22:01 🔗 jc86035 (also now that we have 11 months' worth of trending data we could potentially get all the youtube channel ids out of that data instead, it would be a lot of downloading though)
22:02 🔗 JAA That sounds like something ivan and #youtubearchive might be interested in.
22:02 🔗 jc86035 and more recently I've also tried to script the new save page now on wayback, primarily for alexa and youtube (screenshots are nice to have, also the outlinks function is very useful)
22:04 🔗 jc86035 so if you couldn't use archivebot and wanted to upload to web.archive.org you could technically just put a few urls in a list, open firefox and make spn go to town. not totally sure how reliable it is but it definitely seems to work
22:06 🔗 jc86035 I think Jason disapproves (he sent me here from the IA discord server) but I haven't really done much of it anyway because it's a lot more energy intensive than wget, and I kind of took a break after the server change in September
22:06 🔗 jc86035 has quit IRC (Quit: Leaving.)
22:07 🔗 jc86035 has joined #archiveteam-bs
22:08 🔗 jc86035 my internet connection stopped working a few minutes ago so here's what I last tried to send
22:08 🔗 jc86035 > and more recently I've also tried to script the new save page now on wayback, primarily for alexa and youtube (screenshots are nice to have, also the outlinks function is very useful)
22:08 🔗 jc86035 > so if you couldn't use archivebot and wanted to upload to web.archive.org you could technically just put a few urls in a list, open firefox and make spn go to town. not totally sure how reliable it is but it definitely seems to work
22:08 🔗 jc86035 > I think Jason disapproves (he sent me here from the IA discord server) but I haven't really done much of it anyway because it's a lot more energy intensive than wget, and I kind of took a break after the server change in September
22:09 🔗 astrid aye
22:11 🔗 jc86035 I don't use IRC a lot so would it be better if I give out my Discord ID or something? if this stuff is something that warrants further discussion
22:12 🔗 JAA Well, ArchiveTeam is on IRC, not on Discord, fortunately.
22:13 🔗 jc86035 I know, but I'm not really familiar with it (I've only sent PMs once IIRC)
22:13 🔗 hook54321 there's some clients that are pretty easy to use.
22:14 🔗 jc86035 should I just continue to discuss it here? I might go soon so I might just pop in later and then keep discussing it I guess
22:14 🔗 jc86035 I'm using Adium right now
22:14 🔗 astrid you're doing just fine
22:14 🔗 JAA ^
22:14 🔗 jc86035 astrid: thanks for the validation lol
22:15 🔗 astrid :)
22:24 🔗 jc86035 (also if anyone was wondering I scripted spn by using a bash script to create a temporary file and then send a POST form via a JS one-liner, credit where credit is due to https://unix.stackexchange.com/questions/375857/)
22:28 🔗 BlueMax has joined #archiveteam-bs
22:42 🔗 DogsRNice has quit IRC (Ping timeout: 252 seconds)
22:42 🔗 akierig_ has quit IRC (Quit: later_gator)
22:57 🔗 jc86035 has quit IRC (Quit: Leaving.)
23:08 🔗 BartoCH has quit IRC (Remote host closed the connection)
23:09 🔗 BartoCH has joined #archiveteam-bs
23:23 🔗 wp494 has joined #archiveteam-bs
23:38 🔗 Raccoon What are some handy tools for parsing grabbed pages, such that I can create a template to scrape metadata into columns
23:41 🔗 BartoCH has quit IRC (Ping timeout: 615 seconds)
23:42 🔗 foureyes has quit IRC (Quit: brb)
23:44 🔗 foureyes has joined #archiveteam-bs
23:55 🔗 markedL https://www.import.io/
23:56 🔗 JAA lol
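For Raccoon's question, one dependency-free option is Python's standard library. The sketch below is an illustration, not a tool anyone in the channel recommended: `MetaExtractor` and `pages_to_csv` are hypothetical names, and it only collects the `<title>` text and `<meta name=... content=...>` (or `property=`) pairs into CSV columns.

```python
import csv
import io
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the page title and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.fields[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def pages_to_csv(pages: dict, columns: list) -> str:
    """Turn {source: html} into CSV text with one row per page.

    Missing fields become empty cells; fields not listed in `columns`
    are dropped.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["source"] + columns,
                            extrasaction="ignore")
    writer.writeheader()
    for source, html in pages.items():
        parser = MetaExtractor()
        parser.feed(html)
        parser.close()
        writer.writerow({**parser.fields, "source": source})
    return out.getvalue()
```

The `columns` list plays the role of the template: it fixes which metadata keys become columns and in what order.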
