#archiveteam-bs 2019-01-16,Wed


Time	Nickname	Message
00:29 ^🔗		Oddly has quit IRC (Ping timeout: 255 seconds)
00:52 ^🔗		VerfiedJ has quit IRC (Quit: Leaving)
01:14 ^🔗		omarroth has quit IRC (Quit: Konversation terminated!)
01:47 ^🔗	t3	PurpleSym: So I'm trying to upload WARCs to the Internet Archive, but I get the HTTP status code 503. Am I being throttled?
01:48 ^🔗	Kaz	IA usually sends 429s, I think
02:15 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
02:43 ^🔗		qw3rty113 has joined #archiveteam-bs
02:44 ^🔗		qw3rty112 has quit IRC (Ping timeout: 600 seconds)
02:46 ^🔗	Flashfire	How long does the savenow page usually take to grab all the outlinks? One of mine has been trying for an hour now.
02:48 ^🔗	t3	Kaz: Should I contact Internet Archive?
02:49 ^🔗		benjinsmi has quit IRC (Quit: Leaving)
02:49 ^🔗	Kaz	Try again later/tomorrow, then email
02:49 ^🔗	Flashfire	http://playerthree.com/ Saving... I don't think it should have been doing this for an hour, do you?
02:50 ^🔗		benjins has joined #archiveteam-bs
02:52 ^🔗	Flashfire	kaz sorry to bother you but do you have any opinion on the savenow outlinks capture taking so long
02:53 ^🔗	Kaz	Not a clue
02:53 ^🔗	Kaz	I didn't even realise it did outlinks
02:53 ^🔗	Flashfire	https://web.archive.org/save/ you can choose to grab outlinks
03:11 ^🔗		wp494 has quit IRC (Ping timeout: 260 seconds)
03:11 ^🔗		wp494 has joined #archiveteam-bs
03:15 ^🔗	t3	Flashfire: Well maybe they're running low on space or bandwidth or something.
03:15 ^🔗	t3	Because they're throttling me too.
03:15 ^🔗	t3	I think.
03:22 ^🔗	Kaz	oh actually, looks like IA returns a 503 for slowdown
03:22 ^🔗	Kaz	see: http://monitor.archive.org/stats/s3.php
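A rough sketch of Kaz's "try again later" advice as a retry loop, assuming the upload command exits non-zero when the server answers 503. The function name `retry_with_backoff`, the attempt count, and the delays are illustrative assumptions, not anything IA documents.

```shell
# retry_with_backoff MAX_TRIES INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed,
# doubling the sleep (in seconds) between attempts.
# Returns 0 on success, 1 if every attempt failed.
retry_with_backoff() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt is still to come.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, `retry_with_backoff 5 60 curl --fail --upload-file example.warc.gz "$upload_url"` would wait 60, 120, 240, then 480 seconds between its five attempts; `$upload_url` is a placeholder, and `--fail` is what makes curl exit non-zero on an HTTP error status.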
03:33 ^🔗		jianaran has joined #archiveteam-bs
03:33 ^🔗	jianaran	c&p from #archiveteam, as it should probably have been here instead: " hi all, I'm looking to archive some twitter accounts of niche importance with a habit of regularly deleting their tweets. This is all fairly new to me, so could someone please point me in a good direction to learn how to do this?"
03:34 ^🔗	jianaran	I'm reasonably familiar with Python and can do a bit of bash scripting, but am far from an expert.
03:52 ^🔗	jodizzle	jianaran: I think a decent workflow at the moment is using snscrape to produce a file full of tweet URLs and then using some other tool to request and save the URLs.
03:53 ^🔗	jodizzle	Here's the link to snscrape: https://github.com/JustAnotherArchivist/snscrape
03:53 ^🔗	jianaran	That's pretty much what I've got to, but I'm struggling with turning the list of tweet URLs into a usefully-formatted series of tweets (plus attached media!)
03:54 ^🔗		exoire has quit IRC (Read error: Operation timed out)
03:54 ^🔗	jodizzle	What tool are you using to download the tweets? And what do you mean by usefully-formatted?
03:55 ^🔗	jianaran	Well, I've only really tried wget
03:56 ^🔗	jianaran	Usefully formatted: anything that preserves the link between text and media
03:57 ^🔗	jianaran	As I said, I'm brand new to this so I don't really have any good idea of what the best solution would be
04:05 ^🔗	Flashfire	put them in a text file and !ao < with archivebot
04:05 ^🔗	Flashfire	single line per link
04:06 ^🔗	jianaran	I'd like to automate and run the job daily (or weekly, at least). Is that acceptable with archivebot?
04:08 ^🔗	Kaz	https://github.com/ludios/grab-site
04:08 ^🔗	Kaz	probably better if you want to do it automatically at that rate
04:08 ^🔗	jodizzle	jianaran: Hmm I actually don't know what options would be required to make wget grab the tweets appropriately, but you could check out grab-site: https://github.com/ludios/grab-site#twittercomuser
04:08 ^🔗	jodizzle	oops beat me to it
04:11 ^🔗	jodizzle	But yeah you could set up snscrape and grab-site as a cron job or something in order to grab a user's tweets periodically
04:11 ^🔗	jianaran	That looks great, thanks. I'll see if I can get it working
04:11 ^🔗		jianaran has quit IRC (Read error: Connection reset by peer)
04:11 ^🔗	jodizzle	Using archivebot has the benefit of the links you archive being visible in the wayback machine, but you would have to have Flashfire or someone do it for you each time.
04:12 ^🔗	Flashfire	not me lol I was naughty
04:12 ^🔗	Flashfire	No voice for me
04:12 ^🔗	Flashfire	someone else
04:13 ^🔗	Flashfire	actually i think you can do !ao < Without voice
04:13 ^🔗	jodizzle	Does that archive without recursion?
04:13 ^🔗	Flashfire	Yes
04:18 ^🔗		newbie85 has joined #archiveteam-bs
04:18 ^🔗		newbie85 is now known as jianaran
04:24 ^🔗	jodizzle	omarroth: Is the 'archived_video_ids.csv' really a CSV file, or is it just a list of links? I think I downloaded it correctly but it doesn't seem to have any columns.
04:36 ^🔗		Sk1d has joined #archiveteam-bs
04:45 ^🔗	Fusl	i keep forgetting that this channel exists lol
04:46 ^🔗		qw3rty114 has joined #archiveteam-bs
04:48 ^🔗	jodizzle	Actually, does grab-site handle things like embedded video on tweets?
04:48 ^🔗	jodizzle	I don't think it does right?
04:49 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
04:49 ^🔗		qw3rty113 has quit IRC (Read error: Operation timed out)
04:53 ^🔗		Sk1d has joined #archiveteam-bs
04:55 ^🔗		odemgi has joined #archiveteam-bs
04:56 ^🔗		odemg has quit IRC (Ping timeout: 265 seconds)
04:57 ^🔗		odemgi_ has quit IRC (Read error: Operation timed out)
05:06 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
05:06 ^🔗		Frogging has quit IRC (Ping timeout: 252 seconds)
05:08 ^🔗		odemg has joined #archiveteam-bs
05:08 ^🔗		Frogging has joined #archiveteam-bs
05:09 ^🔗		Sk1d has joined #archiveteam-bs
05:19 ^🔗	Fusl	have an example url?
05:34 ^🔗	jianaran	OK! Been off playing with that for a while, and I've now got grab-site working and making a WARC
05:35 ^🔗	jianaran	However, it doesn't seem to have pulled the embedded images (at least, nowhere that I can find). I'm using Webrecorder player to view the WARC. Am I viewing it wrong, or does grab-site not save images for the tweets?
06:04 ^🔗	jodizzle	Fusl: Here's a random tweet with a video that I found: https://twitter.com/9GAG/status/1085416357049524224
06:10 ^🔗	Fusl	nope
06:10 ^🔗	Fusl	root@archiveteam:/data# fgrep video.twimg.com twitter.com-9GAG-status-1085416357049524224-2019-01-16-bd5a242a/wpull.log
06:10 ^🔗	Fusl	root@archiveteam:/data#
06:10 ^🔗	Fusl	yeah, that's because it gets the video URL from an XHR request JSON file
06:11 ^🔗	jianaran	does grab-site get embedded images in tweets? I feel I've heard it does, but I can't seem to get it working
06:13 ^🔗	Fusl	yep it does
06:13 ^🔗	Fusl	root@archiveteam:/data# fgrep DvexosMXcAEvDTk.jpg twitter.com-OhNoItsFusl-status-1078526120318898176-2019-01-16-69f0b5b4/wpull.log
06:13 ^🔗	Fusl	2019-01-16 06:12:20,744 - wpull.processor.web - INFO - Fetching ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’.
06:13 ^🔗	Fusl	2019-01-16 06:12:20,844 - wpull.processor.web - INFO - Fetched ‘https://pbs.twimg.com/media/DvexosMXcAEvDTk.jpg’: 200 OK. Length: 96506 [image/jpeg].
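The check Fusl runs by hand can be wrapped up as below. This is only a sketch that assumes the wpull.log line format shown in the two log lines above; `fetched_ok` is a made-up helper name, not part of grab-site or wpull.

```shell
# fetched_ok LOG PATTERN
# Prints the log lines in LOG that record a completed fetch with
# "200 OK" and whose text contains PATTERN (matched as a fixed string).
# "Fetching" lines are skipped; only "Fetched" lines are reported.
fetched_ok() {
    grep -F -- "$2" "$1" \
        | grep -F ' - INFO - Fetched ' \
        | grep -F ': 200 OK.'
}
```

An empty result for `fetched_ok wpull.log video.twimg.com` would match what Fusl saw for the video tweet, while a media filename such as `DvexosMXcAEvDTk.jpg` should show up if the image was grabbed.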
06:17 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
06:21 ^🔗		Sk1d has joined #archiveteam-bs
06:29 ^🔗	psi	JAA: https://mastodon.mit.edu/@dukhovni/101424659386821523 they seem to have more of these issues so might be worth backing up in case it goes down permanently
06:34 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
06:36 ^🔗	jodizzle	jianaran: grab-site definitely gets at least some images, not sure why they wouldn't show up in Webrecorder though
06:37 ^🔗	jianaran	sorry, to clarify: it seems to get the embedded level of detail, but doesn't seem to be fetching the full-size image (what you get when clicking on images embedded into tweets)
06:39 ^🔗		Sk1d has joined #archiveteam-bs
06:53 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
06:56 ^🔗	jodizzle	jianaran: Ah okay. I'm not sure if there's a way to get all that in a "usefully-formatted" way without simulating a browser (or at least the clicking-behavior).
06:56 ^🔗		Sk1d has joined #archiveteam-bs
06:56 ^🔗	jodizzle	But someone else here might know better
06:56 ^🔗	jianaran	jodizzle: simulating a browser isn't the worst thing in the world, if necessary. But do you know how to scrape the original size embedded media in the first place?
06:57 ^🔗	Flashfire	use chromebot in #archivebot to simulate clicking
06:57 ^🔗	jodizzle	IIRC you can get it by just appending either ':orig' or ':large' to the end of a twitter image
06:57 ^🔗	Flashfire	alternatively build crocoite yourself
06:58 ^🔗	jodizzle	e.g., 'https://pbs.twimg.com/media/whateverlongstring.jpg:orig'
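Going by jodizzle's recollection, a list of image URLs could be rewritten to the ':orig' variant roughly like this. `to_orig_urls` is a hypothetical helper, and whether Twitter honours the suffix for every image is an assumption.

```shell
# to_orig_urls
# Reads URLs on stdin, one per line, and prints them with every
# pbs.twimg.com media URL rewritten to its ':orig' variant. An existing
# size suffix (':large', for example) is replaced; all other lines are
# passed through unchanged.
to_orig_urls() {
    sed -E 's#^(https://pbs\.twimg\.com/media/[^:?]+)(:[a-z0-9]+)?$#\1:orig#'
}
```

The output is still one URL per line, so it can be fed to whichever downloader is fetching the rest of the list.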
06:59 ^🔗	jodizzle	Oh I was actually wondering about chromebot Flashfire
07:00 ^🔗	Flashfire	https://github.com/PromyLOPh/crocoite
07:00 ^🔗	jodizzle	Yeah just found it. This seems neat.
07:03 ^🔗	jodizzle	Though IIRC chromebot is pretty slow right?
07:06 ^🔗	Flashfire	only when recursion is enabled, mainly because I don't think you can add ignores
07:39 ^🔗		jianaran has quit IRC (Read error: Operation timed out)
08:11 ^🔗	SketchCow	So, I've been spending months, and am now in high gear, moving hundreds of thousands of files around ARCHIVE.ORG
08:45 ^🔗		turnkit_ has joined #archiveteam-bs
08:46 ^🔗		turnkit has quit IRC (Read error: Operation timed out)
08:48 ^🔗		turnkit has joined #archiveteam-bs
08:53 ^🔗		turnkit_ has quit IRC (Ping timeout: 360 seconds)
09:01 ^🔗		BlueMax has quit IRC (Quit: Leaving)
09:41 ^🔗		BlueMax has joined #archiveteam-bs
09:41 ^🔗		m007a83_ has joined #archiveteam-bs
09:46 ^🔗		m007a83 has quit IRC (Read error: Operation timed out)
09:49 ^🔗		m007a83_ has quit IRC (Read error: Operation timed out)
10:03 ^🔗		Oddly has joined #archiveteam-bs
10:18 ^🔗		Mateon1 has quit IRC (Quit: Mateon1)
10:20 ^🔗		Mateon1 has joined #archiveteam-bs
10:30 ^🔗		Oddly has quit IRC (Ping timeout: 255 seconds)
10:42 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
10:46 ^🔗		Sk1d has joined #archiveteam-bs
10:48 ^🔗		LFlare has quit IRC (Quit: Ping timeout (120 seconds))
10:49 ^🔗		LFlare has joined #archiveteam-bs
10:53 ^🔗		Wigser has joined #archiveteam-bs
10:54 ^🔗	Wigser	Hi
10:55 ^🔗		Wigser has quit IRC (Client Quit)
11:00 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
11:02 ^🔗		Sk1d has joined #archiveteam-bs
11:24 ^🔗		BlueMax has quit IRC (Quit: Leaving)
12:10 ^🔗		fredgido has quit IRC (Ping timeout: 633 seconds)
12:13 ^🔗		wp494 has quit IRC (Read error: Operation timed out)
12:13 ^🔗		wp494 has joined #archiveteam-bs
12:40 ^🔗	JAA	psi: Sounds good to me. I'll throw it into ArchiveBot.
12:41 ^🔗	psi	Lovely thanks
12:46 ^🔗		fredgido has joined #archiveteam-bs
12:56 ^🔗		mistym has quit IRC (Ping timeout: 506 seconds)
12:56 ^🔗		mistym has joined #archiveteam-bs
13:30 ^🔗		Oddly has joined #archiveteam-bs
13:35 ^🔗		VerfiedJ has joined #archiveteam-bs
13:42 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
13:48 ^🔗		Sk1d has joined #archiveteam-bs
14:00 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
14:05 ^🔗		Sk1d has joined #archiveteam-bs
14:17 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
14:20 ^🔗		Sk1d has joined #archiveteam-bs
16:38 ^🔗	yano	I ended up finding these bittorrents of/for ArchiveTeam:
16:39 ^🔗	yano	magnet:?xt=urn:btih:7a318721571616333b993dd6172597deaa748083&dn=urlteam_2016-05-19-18-17-02
16:39 ^🔗	yano	magnet:?xt=urn:btih:1a00e5a54aa599d63cd5a3dc084760228d90f407&dn=archiveteam_newssites_20180217081616
16:40 ^🔗	yano	magnet:?xt=urn:btih:4cf5896b507f3ca6f50819a2788e99dfa5bcb58b&dn=urlteam
16:41 ^🔗	yano	magnet:?xt=urn:btih:82667bfe6bbeb2e928f583687071543552a59225&dn=astrid_archivebot_www_robotsandcomputers_com_20180708
16:47 ^🔗	Kaz	they sound like they're just IA items
17:02 ^🔗		Mateon1 has quit IRC (Ping timeout: 360 seconds)
17:02 ^🔗		Mateon1 has joined #archiveteam-bs
17:18 ^🔗		Arctic has joined #archiveteam-bs
18:07 ^🔗		Oddly has quit IRC (Ping timeout: 255 seconds)
18:21 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
18:24 ^🔗		Sk1d has joined #archiveteam-bs
18:32 ^🔗		Arctic has quit IRC (Quit: Page closed)
18:59 ^🔗		achip has quit IRC (Read error: Operation timed out)
19:02 ^🔗		achip has joined #archiveteam-bs
19:51 ^🔗		m007a83 has joined #archiveteam-bs
20:01 ^🔗		second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
20:04 ^🔗		second has joined #archiveteam-bs
21:08 ^🔗		BlueMax has joined #archiveteam-bs
21:11 ^🔗		wp494 has quit IRC (Read error: Operation timed out)
21:11 ^🔗		wp494 has joined #archiveteam-bs
21:21 ^🔗		Despatche has joined #archiveteam-bs
21:23 ^🔗		schbirid has quit IRC (Remote host closed the connection)
21:24 ^🔗		Despatche has quit IRC (Remote host closed the connection)
21:28 ^🔗		Despatche has joined #archiveteam-bs
21:29 ^🔗		Despatche has quit IRC (Read error: Connection reset by peer)
21:29 ^🔗		Despatche has joined #archiveteam-bs
21:52 ^🔗		Despatche has quit IRC (Read error: Operation timed out)
21:56 ^🔗		Despatche has joined #archiveteam-bs
22:14 ^🔗		omarroth has joined #archiveteam-bs
22:16 ^🔗	omarroth	jodizzle: Sorry for the delay. It's a list of newline-separated video IDs, postgres accepts it as a valid CSV file. It may need a column name for the first line to work for you
22:18 ^🔗	jodizzle	omarroth: Oh no I think I can read it fine, I just wanted to make sure I wasn't missing anything.
22:19 ^🔗	jodizzle	So are the IDs ones that definitely had annotations, or just IDs that you checked at all?
22:19 ^🔗	omarroth	Those are the IDs that we checked. I'm still going through everything but I expect most of those had some form of annotation data
22:25 ^🔗	jodizzle	Okay. When I get a chance I'll check the IDs against videos I have downloaded.
22:25 ^🔗	omarroth	Thank you!
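Checking the ID list against a local collection might look like the sketch below. It assumes both sides are plain text files with one video ID per line; `ids_in_both` and the file names in the usage note are hypothetical.

```shell
# ids_in_both CHECKED_IDS DOWNLOADED_IDS
# Prints every line of DOWNLOADED_IDS that also appears, as a whole
# line, in CHECKED_IDS. IDs are matched as fixed strings, so IDs that
# start with '-' or contain regex characters are handled literally.
ids_in_both() {
    grep -F -x -f "$1" -- "$2"
}
```

For instance, `ids_in_both archived_video_ids.csv my_downloaded_ids.txt` would list the downloaded videos whose IDs are in omarroth's file, in the order they appear in the second file.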
22:25 ^🔗	jodizzle	I also know ivan_ was a big youtube hoarder
22:26 ^🔗	ivan_	odemg is even bigger
22:27 ^🔗	ivan_	also had the foresight to grab h264 instead of vp9 (f'ing iOS devices)
22:27 ^🔗	omarroth	Please send any annotation data my way!
22:30 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
22:35 ^🔗		Sk1d has joined #archiveteam-bs
22:35 ^🔗		BlueMax has quit IRC (Quit: Leaving)
22:37 ^🔗		Hani has quit IRC (Read error: Operation timed out)
22:45 ^🔗		Hani has joined #archiveteam-bs
22:48 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
22:51 ^🔗		Sk1d has joined #archiveteam-bs
23:00 ^🔗		erinmoon has quit IRC (Quit: WeeChat 2.1)
23:03 ^🔗		newbie96 has joined #archiveteam-bs
23:05 ^🔗		marked has quit IRC (Read error: Connection reset by peer)
23:05 ^🔗	newbie96	Hi all. Following on from yesterday's discussion re: archiving a twitter account that regularly deletes old postings, I've now got a nightly cronjob running to snscrape -> web-grab the tweets into a WARC, and also download all images at original resolution with ripme.
23:05 ^🔗		newbie96 is now known as jianaran
23:05 ^🔗	jianaran	(whoops)
23:08 ^🔗	jianaran	This seems to work, but I'm generating a complete WARC of all currently-available tweets every day. This is obviously pretty inefficient. Two possible solutions:
23:08 ^🔗	jianaran	-Maintain a file with all previously scraped tweets, and have the daily job diff snscrape's output against this file, web-grab the diff, then update the file
23:08 ^🔗	jianaran	-Regularly merge the obtained WARCs to produce a single growing WARC containing all grabbed tweets
23:09 ^🔗	jianaran	Any thoughts? I'm not very familiar with the WARC format: how easy would it be to merge WARCs to produce a single archive of each URL, keeping the oldest version (so as not to overwrite content with the inevitable 404 page)
23:09 ^🔗		marked has joined #archiveteam-bs
23:10 ^🔗	JAA	I would suggest going with the first idea. Pure merging of WARCs is extremely easy (just concatenate them), but deduplication is a different beast. It can be done, but it's definitely more complex than keeping a list of grabbed tweets and using 'comm' to filter them out from the current snscrape output.
23:11 ^🔗	jodizzle	Isn't there an option in snscrape to get tweets newer than some date?
23:12 ^🔗	jodizzle	Only get tweets newer than some date, I mean
23:12 ^🔗	JAA	There is: snscrape twitter-search 'from:username since:2019-01-10'
23:12 ^🔗	jodizzle	Ah okay, so I think that would be another solution right? Just only get tweets newer than the date you last scraped.
23:12 ^🔗	jianaran	That could be useful, but I feel that getting all new tweets would be safer (if a nightly run doesn't execute for whatever reason, the next day should grab everything that was missed)
23:13 ^🔗	jodizzle	You could configure your cron job to write the date to a file or something when it runs.
23:13 ^🔗	jodizzle	Then read it back at the beginning of the job.
23:13 ^🔗	JAA	when it completes successfully*
23:13 ^🔗	jodizzle	Right
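The "write the date when it completes successfully" idea could be sketched like this. `run_since_last_success`, the state file, and the 1970-01-01 fallback are assumptions for illustration; the date is taken before the command runs so that anything posted during the run is picked up again next time.

```shell
# run_since_last_success STATE_FILE COMMAND [ARGS...]
# Calls COMMAND with the date stored in STATE_FILE appended as its last
# argument (1970-01-01 if the file is missing or empty). Today's UTC
# date is written to STATE_FILE only if COMMAND exits successfully, so
# a failed run leaves the old date in place for the next attempt.
run_since_last_success() {
    state_file=$1
    shift
    if [ -s "$state_file" ]; then
        since=$(cat "$state_file")
    else
        since=1970-01-01
    fi
    today=$(date -u +%Y-%m-%d)
    if "$@" "$since"; then
        printf '%s\n' "$today" > "$state_file"
    else
        return 1
    fi
}
```

A wrapper function that takes the date as its only argument and runs JAA's `snscrape twitter-search "from:username since:$1"` line followed by the grab would be the natural COMMAND here.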
23:14 ^🔗	JAA	That's exactly what I did with my ArchiveBot item listing thingy.
23:14 ^🔗	jianaran	that's true
23:15 ^🔗	jianaran	OK, so that approach would give me lots of small but non-overlapping WARCs, which should be easy enough to concatenate
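JAA's point that merging is plain concatenation can be sketched as follows. `merge_warcs` is a hypothetical helper; it assumes the inputs are either all uncompressed WARCs or all .warc.gz files (concatenated gzip members still decompress as one stream), and it does no deduplication.

```shell
# merge_warcs OUTPUT INPUT...
# Concatenates the INPUT files byte for byte into OUTPUT, in the order
# given. OUTPUT must not be one of the inputs.
merge_warcs() {
    output=$1
    shift
    cat -- "$@" > "$output"
}
```

Something like `merge_warcs all-tweets.warc.gz day1.warc.gz day2.warc.gz` would then produce the single growing archive; the file names are placeholders.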
23:18 ^🔗	jodizzle	JAA: What is the ArchiveBot thingy?
23:19 ^🔗	JAA	jodizzle: https://github.com/JustAnotherArchivist/archivebot-archives
23:20 ^🔗	JAA	It's a mediocre replacement for the ArchiveBot viewer, which doesn't always list all data. I stopped the automatic updates late last year though because it needs a rewrite.
23:23 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
23:24 ^🔗		BlueMax has joined #archiveteam-bs
23:27 ^🔗	jianaran	OK! How does this look, for a .sh script to be run nightly as a cronjob: https://pastebin.com/5VSBRm77
23:27 ^🔗		Sk1d has joined #archiveteam-bs
23:27 ^🔗	jianaran	($username is hardcoded: if I want to start scraping more than one account, it shouldn't be too hard to loop the whole thing and read from a list of usernames)
23:32 ^🔗	JAA	jianaran: comm requires sorted input, so you'll have to do something like: comm -23 <(sort /twitter-archival/$username-snscrape) <(sort /twitter-archival/$username-snscrape-archive)
23:34 ^🔗		SmileyG has joined #archiveteam-bs
23:34 ^🔗	JAA	snscrape produces output sorted by decreasing tweet ID, but comm needs lexicographically ascending sorted files (e.g. tweet ID 100 would come before 19).
23:34 ^🔗	JAA	(snscrape's output order is actually not guaranteed. It just prints whatever Twitter's servers return.)
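A version of JAA's comm suggestion that avoids process substitution, so it also runs under plain sh, might look like this. `new_urls` is a hypothetical helper name; forcing the C locale for both sort and comm is an added assumption, so that the two tools agree on what "sorted" means.

```shell
# new_urls CURRENT SEEN
# Prints the lines of CURRENT that do not appear in SEEN. Both files
# are sorted (and de-duplicated) into temporary copies first, because
# comm only works on sorted input.
new_urls() {
    current_sorted=$(mktemp)
    seen_sorted=$(mktemp)
    LC_ALL=C sort -u "$1" > "$current_sorted"
    LC_ALL=C sort -u "$2" > "$seen_sorted"
    # -2 drops lines only in SEEN, -3 drops lines found in both.
    LC_ALL=C comm -23 "$current_sorted" "$seen_sorted"
    rm -f "$current_sorted" "$seen_sorted"
}
```

The output comes back in lexicographic order, so a tweet ID of 100 is listed before 19, as JAA notes; that does not matter when the result is only used as a list of URLs to grab.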
23:36 ^🔗		Smiley has quit IRC (Ping timeout: 265 seconds)
23:40 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
23:44 ^🔗		Sk1d has joined #archiveteam-bs
23:44 ^🔗		exoire has joined #archiveteam-bs
23:46 ^🔗	jianaran	JAA: thanks, that's really helpful.
23:59 ^🔗		omarroth has quit IRC (Ping timeout: 268 seconds)
